Author

Daniel Olson

Year of Award

2025

Document Type

Dissertation

Degree Type

Doctor of Philosophy (PhD)

Degree Name

Computer Science

Department or School/College

Department of Computer Science

Committee Chair

Travis Wheeler

Committee Members

Doug Brinkerhoff, Lucia Williams, Bruce Bowler, Ben Langmead, Mihai Surdeanu

Keywords

Hidden Markov model, Homology Search, Machine Learning, NEAR, Tandem Repeat, ULTRA

Abstract

The rapid growth of biological sequence data presents significant challenges for bioinformatic analysis. A cornerstone of this analysis, sequence similarity search, is hampered by three main obstacles: sensitivity in detecting distant evolutionary relationships, selectivity in distinguishing true homology from spurious similarity caused by features like tandem repeats, and scalability to handle massive databases. This dissertation introduces two novel computational tools, ULTRA and NEAR, that address these challenges to improve the efficiency, sensitivity, and selectivity of sequence similarity search.
Part I of this dissertation addresses the problem of selectivity by focusing on tandem repeats, a common source of false-positive results in homology search. We present ULTRA (ULTRA Locates Tandemly Repetitive Areas), a tool for the accurate de novo annotation of tandemly repetitive DNA. ULTRA employs a hidden Markov model (HMM) that explicitly models insertion and deletion events within repetitive regions, allowing it to identify decayed repeats with greater accuracy than industry-standard repeat annotation tools. Accurate annotation of repeats allows for more effective masking, significantly reducing a major source of error in downstream homology searches.
Part II addresses the challenges of sensitivity and scalability by developing a novel deep learning framework for protein homology search. We present NEAR (Neural Embeddings for Amino-acid Relationships), which transforms protein sequences into high-dimensional vector embeddings for rapid similarity search. NEAR utilizes a lightweight 1D Residual Convolutional Neural Network (ResNet) trained specifically for homology detection using a contrastive learning objective guided by trusted sequence alignments. This targeted approach results in a model that is significantly faster than larger protein language models while also producing embeddings that are better suited for homology search.
This work also develops new methods for inferring meaningful alignment similarity from the results of fast approximate nearest neighbor search. Benchmarks show that NEAR achieves sensitivity and selectivity competitive with HMMER3’s phmmer while running significantly faster.

© Copyright 2025 Daniel Olson