Year of Award
2017
Document Type
Thesis
Degree Type
Master of Science (MS)
Degree Name
Computer Science
Department or School/College
Computer Science
Committee Chair
Douglas W. Raiford Ph.D.
Committee Members
Alden H. Wright Ph.D., William E. Holben Ph.D.
Keywords
metagenomics, computer science, machine learning
Subject Categories
Artificial Intelligence and Robotics | Bioinformatics | Computational Biology | Numerical Analysis and Scientific Computing | Other Ecology and Evolutionary Biology | Other Genetics and Genomics | Software Engineering
Abstract
Biological sequence datasets are increasing at a prodigious rate. The volume of data in these datasets surpasses what is observed in many other fields of science. New developments wherein metagenomic DNA from complex bacterial communities is recovered and sequenced are producing a new kind of data known as metagenomic data, which consists of DNA fragments from many genomes. Developing a utility to analyze such metagenomic data and predict the sample class from which it originated has many possible implications for ecological and medical applications. Within this document is a description of a series of analytical techniques used to process metagenomic data in such a way that it is transformed from the raw sequence information into a reusable data structure that can be processed by feature selection techniques and machine learning algorithms. Analysis and transformation of the data from the raw sequences to a reusable structure is done by extracting substrings of DNA of length k, known as k-mers, and storing the counts of these observed strings in a Numeric Summarization Vector (NSV). The technique described herein is offered as a proof of concept for research into analyzing metagenomic data without identifying individual organisms contained within the sample. It is tested using leave-one-out and Monte Carlo cross-validation, while varying numerous parameters and verifying the results by using a large pool of independent experiments initiated with the same starting parameters. The pipeline is validated against multiple datasets using two- and three-class problems. Results are presented showing the accuracy as a function of multiple parameters that can be selected by a user of the pipeline. This work shows that there may be a way to process metagenomic data in near real time to analyze and predict the environmental class of a sample with reasonable accuracy.
Consider the difficulty of distinguishing a healthy gut microbiome from a diseased one: this approach can classify sample data as belonging to one of those states.
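To make the k-mer counting step concrete, the following is a minimal sketch of how reads might be reduced to a fixed-length count vector. It is an illustration under stated assumptions, not the thesis's implementation: the function names `kmer_index` and `count_kmers` are hypothetical, the alphabet is assumed to be A, C, G, T, and k-mers containing any other character (such as N) are simply skipped. The actual NSV construction in the thesis may differ, for example in how it handles reverse complements or normalization.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous bases are counted


def kmer_index(k):
    """Map every possible k-mer over ACGT to a fixed vector position.

    Positions follow lexicographic order, so the layout is identical
    for every sample and vectors can be compared feature by feature.
    """
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def count_kmers(reads, k):
    """Return a list of length 4**k holding k-mer counts across all reads.

    Hypothetical helper: slides a window of width k over each read and
    increments the slot for each k-mer found in the index. K-mers with
    characters outside ACGT are ignored.
    """
    index = kmer_index(k)
    vector = [0] * len(index)
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            pos = index.get(read[start:start + k])
            if pos is not None:
                vector[pos] += 1
    return vector


if __name__ == "__main__":
    sample_reads = ["ACGTAC", "ggta", "ACNGT"]
    nsv = count_kmers(sample_reads, k=2)
    print(len(nsv), sum(nsv))
```

Because every sample maps to a vector of the same length (4 to the power k), vectors built this way can be passed directly to feature selection and classification routines, which is the property the abstract relies on.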
Recommended Citation
Kaehler, Russell, "K-MER ANALYSIS PIPELINE FOR CLASSIFICATION OF DNA SEQUENCES FROM METAGENOMIC SAMPLES" (2017). Graduate Student Theses, Dissertations, & Professional Papers. 10967.
https://scholarworks.umt.edu/etd/10967
Included in
Artificial Intelligence and Robotics Commons, Bioinformatics Commons, Computational Biology Commons, Numerical Analysis and Scientific Computing Commons, Other Ecology and Evolutionary Biology Commons, Other Genetics and Genomics Commons, Software Engineering Commons
© Copyright 2017 Russell Kaehler