Year of Award

2017

Document Type

Thesis

Degree Type

Master of Science (MS)

Degree Name

Computer Science

Department or School/College

Department of Computer Science

Committee Chair

Douglas W. Raiford Ph.D.

Committee Members

Alden H. Wright Ph.D., William E. Holben Ph.D.

Keywords

metagenomics, computer science, machine learning

Subject Categories

Abstract

Biological sequence datasets are growing at a prodigious rate, and their volume surpasses what is observed in many other fields of science. New developments in which DNA from complex bacterial communities is recovered and sequenced are producing a new kind of data, known as metagenomic data, composed of DNA fragments from many genomes. A utility that analyzes such metagenomic data and predicts the sample class from which it originated has many possible implications for ecological and medical applications. This document describes a series of analytical techniques that transform metagenomic data from raw sequence information into a reusable data structure that can be processed by feature selection techniques and machine learning algorithms. The transformation uses length-k substrings of DNA, known as k-mers, and stores the counts of the observed k-mers in a Numeric Summarization Vector (NSV). The technique is offered as a proof of concept for research into analyzing metagenomic data without identifying the individual organisms contained in the sample. It is tested using leave-one-out and Monte Carlo cross-validation while varying numerous parameters, and the results are verified using a large pool of independent experiments initiated with the same starting parameters. The pipeline is validated against multiple data sets using two- and three-class problems. Results are presented showing accuracy as a function of multiple parameters that a user of the pipeline can select. This work shows that there may be a way to process metagenomic data in near real time to analyze and predict the environmental class of a sample with reasonable accuracy. For example, given the difficulty of distinguishing a healthy gut microbiome from a diseased one, this approach can classify sample data as belonging to one of those states.
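
The k-mer counting step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `build_nsv`, the `ALPHABET` constant, the fixed lexicographic ordering of k-mers, and the choice to skip k-mers containing ambiguous bases are all assumptions made for illustration, and the actual pipeline may normalize counts or treat such cases differently.

```python
from itertools import product

# Assumed alphabet: the four unambiguous DNA bases.
ALPHABET = "ACGT"


def build_nsv(sequences, k):
    """Count every length-k substring across the reads into one vector.

    The vector has 4**k slots, one per possible k-mer, in lexicographic
    order (an illustrative choice, not necessarily the thesis's layout).
    """
    # Map each possible k-mer to a fixed position in the vector.
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    nsv = [0] * len(index)
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k across the read.
        for start in range(len(seq) - k + 1):
            pos = index.get(seq[start:start + k])
            if pos is not None:  # skip k-mers with ambiguous bases such as N
                nsv[pos] += 1
    return nsv


# Two toy reads standing in for the DNA fragments of one sample.
reads = ["ACGTAC", "GTACN"]
nsv = build_nsv(reads, k=2)
```

Because every sample maps to a vector of the same length regardless of how many reads it contains, vectors built this way can be passed directly to feature selection or a classifier.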

Recommended Citation

Kaehler, Russell, "K-MER ANALYSIS PIPELINE FOR CLASSIFICATION OF DNA SEQUENCES FROM METAGENOMIC SAMPLES" (2017). Graduate Student Theses, Dissertations, & Professional Papers. 10967.
https://scholarworks.umt.edu/etd/10967

Included in

Artificial Intelligence and Robotics Commons, Bioinformatics Commons, Computational Biology Commons, Numerical Analysis and Scientific Computing Commons, Other Ecology and Evolutionary Biology Commons, Other Genetics and Genomics Commons, Software Engineering Commons

ScholarWorks at University of Montana

Graduate Student Theses, Dissertations, & Professional Papers

K-MER ANALYSIS PIPELINE FOR CLASSIFICATION OF DNA SEQUENCES FROM METAGENOMIC SAMPLES
