Year of Award
2024
Document Type
Dissertation
Degree Type
Doctor of Philosophy (PhD)
Degree Name
Computer Science
Department or School/College
Department of Computer Science
Committee Chair
Travis Wheeler
Committee Members
Doug Brinkerhoff, Erin Landguth, Ross Snider, Scott Beamer
Keywords
Field Programmable Gate Arrays, Profile Hidden Markov Models, Sequence Similarity Search, SIMD
Abstract
Sequence similarity search is one of the most important tasks in the field of bioinformatics. The identification of regions of similarity between one protein or DNA sequence and a separate collection of sequences or models of sequence families has wide-ranging applications. These range from mapping of RNA sequence reads to a given genome to the annotation of gene-encoding regions and the reconstruction of evolutionary relationships of species. As genetic sequencing technologies have improved over the last few decades, the size of genetic sequence databases has grown exponentially. With this growth of sequence databases, more efficient algorithmic approaches are needed to analyze these data.
While software tools for sequence similarity search have become more accurate and sensitive over time, the most sensitive algorithms are often too slow to be effective alone. To keep up with the demands of large sequence databases, similarity search tools employ prefiltering strategies, in which fast heuristic approximation algorithms screen the input data so that expensive calculations are not performed on sequences that are unlikely to yield useful alignments in the final, slower alignment stages.
This dissertation focuses on three tools, largely aimed at the prefiltering stages of sequence similarity search pipelines that achieve high sensitivity through the use of probabilistic models of sequence families called profile hidden Markov models (pHMMs). First, a configurable hardware accelerator is presented that implements an ungapped form of the pHMM Viterbi algorithm. In this tool, areas of likely similarity are quickly identified on the hardware accelerator and returned to the host device. Through a custom architecture based on systolic arrays, the hardware accelerator performs over 60x faster than CPU-based implementations of the same algorithm. Second, a novel implementation of the FM-index algorithm is presented. This implementation achieves state-of-the-art performance on exact-match search over nucleotide and protein sequence data by using Single Instruction Multiple Data (SIMD) operations on sequence data represented as strided bit vectors. Third, a full sequence similarity search prefilter is presented that leverages this FM-index implementation to efficiently explore the space of all strings, up to a maximum length, that could pass a given score threshold when aligned to a pHMM. The overall space of potential alignments is heavily culled by considering which strings appear in the sequence database and which strings lose the ability to pass the score threshold.
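For readers unfamiliar with the exact-match search mentioned above, the following is a minimal textbook sketch of FM-index backward search. It is not the dissertation's implementation: it is plain scalar Python with no SIMD and no strided bit vectors, it builds the suffix array by naive sorting, and the names `build_fm_index` and `locate` are hypothetical, chosen only for illustration. It also assumes the input text does not contain the sentinel character `$`.

```python
def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, occurrence table)."""
    text += "$"  # sentinel; assumed absent from the input and smaller than all symbols
    # Naive suffix array: fine for a demo, far too slow for real databases.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch] = number of characters in text strictly smaller than ch.
    c_table = {}
    total = 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def locate(pattern, sa, c_table, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open suffix array interval [lo, hi)
    # Backward search: extend the match one character at a time, right to left.
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []  # interval is empty, so the pattern does not occur
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, c_table, occ = build_fm_index("ACGTACGTTACG")
    print(locate("ACG", sa, c_table, occ))  # prints [0, 4, 9]
```

Each step of the loop narrows the suffix array interval using only two table lookups, which is why the search cost depends on the pattern length rather than the database size.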
Recommended Citation
Anderson, Tim, "SEQUENCE SIMILARITY SEARCH OPTIMIZATION: PREFILTERING STRATEGIES AND EXACT MATCHING TECHNIQUES FOR BIOLOGICAL SEQUENCE DATA" (2024). Graduate Student Theses, Dissertations, & Professional Papers. 12384.
https://scholarworks.umt.edu/etd/12384
© Copyright 2024 Tim Anderson