Author

Tim Anderson

Year of Award

2024

Document Type

Dissertation

Degree Type

Doctor of Philosophy (PhD)

Degree Name

Computer Science

Department or School/College

Department of Computer Science

Committee Chair

Travis Wheeler

Committee Members

Doug Brinkerhoff, Erin Landguth, Ross Snider, Scott Beamer

Keywords

Field Programmable Gate Arrays, Profile Hidden Markov Models, Sequence Similarity Search, SIMD

Abstract

Sequence similarity search is one of the most important tasks in the field of bioinformatics. The identification of regions of similarity between one protein or DNA sequence and a separate collection of sequences or models of sequence families has wide-ranging applications. These range from mapping of RNA sequence reads to a given genome to the annotation of gene-encoding regions and the reconstruction of evolutionary relationships of species. As genetic sequencing technologies have improved over the last few decades, the size of genetic sequence databases has grown exponentially. With this growth of sequence databases, more efficient algorithmic approaches are needed to analyze these data.

While software tools for sequence similarity search have become more accurate and sensitive over time, the most sensitive algorithms are often too slow to be effective alone. To keep up with the demands of large sequence databases, similarity search software tools employ prefiltering strategies in which fast, heuristic approximation algorithms act as prefilters on the input data. These prefilters avoid performing expensive calculations on sequences that are unlikely to lead to useful sequence alignments in the final, slower alignment stages.

This dissertation focuses on three tools, largely aimed at the prefiltering stages of sequence similarity search pipelines, that achieve high sensitivity through the use of probabilistic models of sequence families called profile hidden Markov models (pHMMs). First, a configurable hardware accelerator is presented that implements an ungapped form of the pHMM Viterbi algorithm. In this tool, areas of likely similarity are quickly identified on the hardware accelerator and returned to the host device. Through a custom architecture based on systolic arrays, the hardware accelerator runs over 60x faster than CPU-based implementations of the same algorithm. Second, a novel implementation of the FM-index algorithm is presented. This implementation achieves state-of-the-art performance on exact-match search on nucleotide and protein sequence data using Single Instruction Multiple Data (SIMD) operations on sequence data represented as strided bit vectors. Third, a full sequence similarity search prefilter is presented that leverages this FM-index implementation to efficiently explore the space of all strings up to a maximum length that could pass a given score threshold when aligned to a pHMM. The overall space of potential alignments is heavily culled by considering which strings appear in the sequence database and which strings lose the ability to pass the score threshold.
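
For readers unfamiliar with the FM-index, the exact-match search it supports can be illustrated with a minimal, unoptimized Python sketch of textbook backward search. This is not the dissertation's implementation: it uses plain scalar code and a full occurrence table rather than SIMD operations on strided bit vectors, and the function names `build_fm_index` and `locate` are hypothetical. It also assumes the input text does not contain the sentinel character `$`.

```python
def build_fm_index(text):
    """Build a naive FM-index (suffix array, C table, occurrence table)."""
    text += "$"  # sentinel; assumed absent from the input and smaller than all symbols
    # Suffix array by direct sorting: fine for a demo, far too slow for real data.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def locate(pattern, sa, C, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open suffix array interval [lo, hi)
    for ch in reversed(pattern):  # backward search: extend the match leftward
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:  # interval is empty: pattern does not occur
            return []
    return sorted(sa[lo:hi])


sa, C, occ = build_fm_index("ACGTACGTTACG")
print(locate("ACG", sa, C, occ))
```

Each pattern character costs two table lookups regardless of database size, which is what makes the index attractive for exploring many candidate strings; an empty interval signals that a string does not occur in the database and can be discarded.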

© Copyright 2024 Tim Anderson