Year of Award


Document Type


Degree Type

Master of Science (MS)

Degree Name

Computer Science

Department or School/College

Computer Science

Committee Chair

Douglas W. Raiford Ph.D.

Committee Members

Alden H. Wright Ph.D., William E. Holben Ph.D.


Keywords

metagenomics, computer science, machine learning


University of Montana

Subject Categories

Artificial Intelligence and Robotics | Bioinformatics | Computational Biology | Numerical Analysis and Scientific Computing | Other Ecology and Evolutionary Biology | Other Genetics and Genomics | Software Engineering


Abstract

Biological sequence datasets are growing at a prodigious rate, and their volume surpasses what is observed in many other fields of science. New techniques that recover and sequence metagenomic DNA from complex bacterial communities are producing a new kind of data, known as metagenomic data, which is composed of DNA fragments from many genomes. A utility that analyzes such metagenomic data and predicts the sample class from which it originated has many potential ecological and medical applications. This document describes a series of analytical techniques that transform metagenomic data from raw sequence information into a reusable data structure that can be processed by feature selection techniques and machine learning algorithms. The transformation is based on length-k substrings of DNA, known as k-mers: the counts of the observed k-mers are stored in a Numeric Summarization Vector (NSV). The technique is offered as a proof of concept for analyzing metagenomic data without identifying the individual organisms contained within the sample. It is tested using leave-one-out and Monte Carlo cross-validation while varying numerous parameters, and the results are verified using a large pool of independent experiments initiated with the same starting parameters. The pipeline is validated against multiple data sets using two- and three-class problems. Results are presented showing accuracy as a function of multiple parameters that can be selected by a user of the pipeline. This work shows that it may be possible to process metagenomic data in near real time to analyze and predict the environmental class of a sample with reasonable accuracy. For example, given the difficulty of distinguishing a healthy gut microbiome from a diseased one, this approach can classify sample data as belonging to one of those states.
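
The k-mer counting step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name build_nsv, the fixed lexicographic ordering of k-mers, and the choice to skip windows containing ambiguous bases such as N are all assumptions made for illustration, and reverse complements are not merged.

```python
from itertools import product


def build_nsv(sequences, k):
    """Count k-mers over the alphabet ACGT across all reads of one sample.

    Hypothetical helper: the name, the ordering, and the handling of
    ambiguous bases are illustrative assumptions, not the thesis's code.
    """
    # Fixed ordering of all 4**k possible k-mers so vectors from
    # different samples share the same layout.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    nsv = [0] * len(kmers)
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k across the read, one base at a time.
        for start in range(len(seq) - k + 1):
            pos = index.get(seq[start:start + k])
            # Assumption: windows with non-ACGT characters are skipped.
            if pos is not None:
                nsv[pos] += 1
    return kmers, nsv


# Two toy reads standing in for one metagenomic sample.
kmers, nsv = build_nsv(["ACGTAC", "GTNAC"], k=2)
```

Each sample then yields one fixed-length vector of 4**k counts, which is the kind of input that feature selection and a classifier could consume without identifying the organisms that contributed the reads.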



© Copyright 2017 Russell Kaehler