Year of Award
2021
Document Type
Thesis
Degree Type
Master of Science (MS)
Degree Name
Computer Science
Department or School/College
Computer Science
Committee Chair
Travis Wheeler
Committee Members
Travis Wheeler, Jesse Johnson, Mark Grimes
Keywords
Sequence Alignment, Protein Search, Genetic Database Search, Heuristic Algorithms
Subject Categories
Bioinformatics | Computational Biology
Abstract
Sequence annotation is typically performed by aligning an unlabeled sequence to a collection of known sequences, with the aim of identifying non-random similarities. Given the broad diversity of new sequences and the considerable scale of modern sequence databases, there is significant tension between the competing needs for sensitivity and speed, with multiple tools displacing the venerable BLAST software suite on one axis or another. In recent years, alignment based on profile hidden Markov models (pHMMs) and associated probabilistic inference methods has demonstrated increased sensitivity due in part to consideration of the ensemble of all possible alignments between a query and target using the Forward/Backward algorithm, rather than simply relying on the single highest-probability (Viterbi) alignment. Modern implementations of pHMM search achieve their speed by avoiding computation of the expensive Forward/Backward algorithm for most (HMMER3) or all (MMseqs2) candidate sequence alignments. Here, we describe a heuristic Forward/Backward algorithm that avoids filling in the entire quadratic dynamic programming (DP) matrix, by identifying a sparse cloud of DP cells containing most of the probability mass. The method produces an accurate approximation of the Forward/Backward alignment with high speed and small memory requirements. We demonstrate the utility of this sparse Forward/Backward approach in a tool that we call MMOREseqs; the name is a reference to the fact that our tool utilizes the MMseqs2 software suite to rapidly identify promising seed alignments to serve as a basis for sparse Forward/Backward. MMOREseqs demonstrates improved annotation sensitivity with a modest increase in run time over MMseqs2 and is released under the open BSD-3-clause license. Source code and a Docker image are available for download at https://github.com/TravisWheelerLab/MMOREseqs.
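The contrast the abstract draws between a single best (Viterbi) alignment, the sum over all alignments (Forward), and a Forward pass restricted to a sparse set of DP cells can be illustrated with a toy sketch. This is not the MMOREseqs implementation: the scoring weights, the function names `path_sum` and `diagonal_band`, and the use of a fixed diagonal band as a stand-in for the "cloud" are all assumptions made for illustration, and the model is a plain pairwise path-weight DP rather than a profile HMM.

```python
from typing import Optional, Set, Tuple

Cell = Tuple[int, int]


def diagonal_band(n: int, m: int, width: int) -> Set[Cell]:
    """Hypothetical stand-in for a sparse cloud: cells within `width` of the main diagonal."""
    cells = set()
    for i in range(n + 1):
        for j in range(max(0, i - width), min(m, i + width) + 1):
            cells.add((i, j))
    return cells


def path_sum(query: str, target: str, cells: Optional[Set[Cell]] = None,
             use_max: bool = False, match: float = 0.6,
             mismatch: float = 0.1, gap: float = 0.1) -> float:
    """Toy alignment DP over path weights (illustrative weights, not a pHMM).

    use_max=False sums over all paths (Forward-like);
    use_max=True keeps only the best path (Viterbi-like).
    If `cells` is given, only those DP cells are filled (sparse variant).
    """
    n, m = len(query), len(target)
    if cells is None:
        order = [(i, j) for i in range(n + 1) for j in range(m + 1)]
    else:
        order = sorted(cells)  # row-major order, so predecessors come first
    combine = max if use_max else (lambda a, b: a + b)
    dp = {(0, 0): 1.0}
    for i, j in order:
        if (i, j) == (0, 0):
            continue
        total = 0.0
        if i > 0 and j > 0 and (i - 1, j - 1) in dp:
            emit = match if query[i - 1] == target[j - 1] else mismatch
            total = combine(total, dp[(i - 1, j - 1)] * emit)
        if i > 0 and (i - 1, j) in dp:
            total = combine(total, dp[(i - 1, j)] * gap)
        if j > 0 and (i, j - 1) in dp:
            total = combine(total, dp[(i, j - 1)] * gap)
        dp[(i, j)] = total
    return dp.get((n, m), 0.0)


if __name__ == "__main__":
    q, t = "HEAGAWGHEE", "PAWHEAE"
    band = diagonal_band(len(q), len(t), width=3)
    print("best path :", path_sum(q, t, use_max=True))
    print("all paths :", path_sum(q, t))
    print("sparse sum:", path_sum(q, t, cells=band))
    print("cells used:", len(band), "of", (len(q) + 1) * (len(t) + 1))
```

In this sketch the sparse sum can never exceed the full sum, because it drops every path that leaves the chosen cells; how close it comes depends on how much of the total weight those cells capture, which is the property the abstract attributes to its cloud of DP cells.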
Recommended Citation
Rich, David H., "SPARSE FORWARD-BACKWARD ALIGNMENT FOR SENSITIVE DATABASE SEARCH WITH SMALL MEMORY AND TIME REQUIREMENTS" (2021). Graduate Student Theses, Dissertations, & Professional Papers. 11763.
https://scholarworks.umt.edu/etd/11763
© Copyright 2021 David H. Rich