Year of Award
2025
Document Type
Dissertation
Degree Type
Doctor of Philosophy (PhD)
Degree Name
Computer Science
Department or School/College
Department of Computer Science
Committee Chair
Travis Wheeler
Committee Members
Doug Brinkerhoff, Lucia Williams, Bruce Bowler, Ben Langmead, Mihai Surdeanu
Keywords
Hidden Markov model, Homology Search, Machine Learning, NEAR, Tandem Repeat, ULTRA
Abstract
The rapid growth of biological sequence data presents significant challenges for bioinformatic analysis. A cornerstone of this analysis, sequence similarity search, is hampered by three main obstacles: sensitivity in detecting distant evolutionary relationships, selectivity in distinguishing true homology from spurious similarity caused by features like tandem repeats, and scalability to handle massive databases. This dissertation introduces two novel computational tools, ULTRA and NEAR, that address these challenges to improve the efficiency, sensitivity, and selectivity of sequence similarity search.
Part I of this dissertation addresses the problem of selectivity by focusing on tandem repeats, a common source of false-positive results in homology search. We present ULTRA (ULTRA Locates Tandemly Repetitive Areas), a tool for the accurate de novo annotation of tandemly repetitive DNA. ULTRA employs a hidden Markov model (HMM) that explicitly models insertion and deletion events within repetitive regions, allowing it to identify decayed repeats with greater accuracy than industry-standard repeat annotation tools. Accurate annotation of repeats allows for more effective masking, significantly reducing a major source of error in downstream homology searches.
Part II addresses the challenges of sensitivity and scalability by developing a novel deep learning framework for protein homology search. We present NEAR (Neural Embeddings for Amino-acid Relationships), which transforms protein sequences into high-dimensional vector embeddings for rapid similarity search. NEAR utilizes a lightweight 1D Residual Convolutional Neural Network (ResNet) trained specifically for homology detection using a contrastive learning objective guided by trusted sequence alignments. This targeted approach results in a model that is significantly faster than larger protein language models while also producing embeddings that are better suited for homology search.
This work goes on to develop new methods for inferring meaningful alignment similarity from the results of fast approximate nearest neighbor search. Benchmarking shows that NEAR achieves sensitivity and selectivity competitive with HMMER3’s phmmer while also being significantly faster.
Recommended Citation
Olson, Daniel, "METHODS TO IMPROVE EFFICIENCY, SENSITIVITY, AND SELECTIVITY OF SEQUENCE SIMILARITY SEARCH" (2025). Graduate Student Theses, Dissertations, & Professional Papers. 12554.
https://scholarworks.umt.edu/etd/12554
© Copyright 2025 Daniel Olson