Year of Award

2021

Document Type

Thesis

Degree Type

Master of Science (MS)

Degree Name

Computer Science

Department or School/College

Computer Science

Committee Chair

Rob Smith

Committee Members

Michael Cassens, Doug Brinkerhoff

Keywords

mass spectrometry, feature detection, noise reduction, algorithm, data mining, bioinformatics

Publisher

University of Montana

Subject Categories

Data Science

Abstract

Mass spectrometry (MS) is used in the analysis of chemical samples to identify the molecules present and their quantities. This analytical technique has applications in many fields, from pharmacology to space exploration. Its impact on medicine is particularly significant, since MS aids in the identification of molecules associated with disease; for instance, in proteomics, MS allows researchers to identify proteins that are associated with autoimmune disorders, cancers, and other conditions. Because the applications are so wide-ranging and the tool is ubiquitous across so many fields, it is critical that the analytical methods used to collect data are sound.

Data analysis in MS is challenging. Experiments produce massive amounts of raw data that must be processed algorithmically to generate interpretable results. This process, known as feature detection, distinguishes signals associated with the chemical sample being analyzed from signals associated with background noise. The experimentally meaningful signals, also known as features or extracted ion chromatograms (XIC), are the fundamental signal unit in mass spectrometry. Many algorithms exist for distinguishing real isotopic signals from noise in raw mass spectrometry data. While the available algorithms are typically chained together for end-to-end mass spectrometry analysis, analyzing each algorithm in isolation provides a specific measurement of its strengths and weaknesses without the confounding effects that can occur when multiple algorithmic tasks are chained together. Though qualitative opinions on extraction algorithm performance abound, quantitative performance has never been publicly ascertained, partly because no quantitative ground truth MS1 data set has been available.

Because XIC must be distinguished from noise, quality algorithms for this purpose are essential. Background noise enters through the mobile phase of the chemical matrix in which the sample of interest is introduced to the MS instrument, and as a result, MS data is full of signals representing low-abundance molecules (i.e., low-intensity signals). Noise generally presents in one of two ways: as very low-intensity signals that comprise a majority of the data from an MS experiment, and as noise features that are moderately low-intensity and can resemble signals from low-abundance molecules deriving from the actual sample of interest. To our knowledge, noise reduction algorithms, like XIC algorithms, have yet to be quantitatively evaluated; their performance is generally evaluated through consensus with other noise reduction algorithms.

Using a recently published, manually extracted XIC dataset as ground truth data, we evaluate the quality of popular XIC algorithms, including MaxQuant, MZMine2, and several methods from XCMS. XIC algorithms were applied to the manually extracted data using a grid search of possible parameters. Performance varied greatly between different parameter settings, though nearly all algorithms, when run with the parameter settings that maximized the number of true positives, recovered over 10,000 XIC. We also examine two popular algorithms for reducing background noise, the COmponent Detection Algorithm (CODA) and adaptive iteratively reweighted Penalized Least Squares (airPLS), and compare their performance to the results of feature detection alone using algorithms that achieved the best performance in a previous evaluation. Due to weaknesses inherent in the implementation of these algorithms, both noise reduction algorithms eliminate data identified by feature detection as significant.
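
The grid-search evaluation described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: `toy_detector` is a hypothetical stand-in for a real XIC algorithm such as MaxQuant, MZMine2, or XCMS, its parameters `mz_tol` and `min_len` are invented for illustration, and the matching rule (a Jaccard overlap of at least 0.5 between a detected feature and a ground-truth XIC) is an assumption, since the abstract does not state how true positives were counted.

```python
from itertools import product


def count_true_positives(detected, truth, min_overlap=0.5):
    """Count ground-truth XIC matched by at least one detected feature.

    Each XIC is a frozenset of point identifiers. The Jaccard threshold
    is an assumed matching criterion, not one taken from the thesis.
    """
    matched = 0
    for t in truth:
        for d in detected:
            union = len(t | d)
            if union and len(t & d) / union >= min_overlap:
                matched += 1
                break  # count each ground-truth XIC at most once
    return matched


def toy_detector(points, mz_tol, min_len):
    """Hypothetical detector: group points whose neighbouring m/z values
    differ by at most mz_tol, and keep groups with >= min_len points."""
    features, current = [], []
    for idx, mz in sorted(points.items(), key=lambda kv: kv[1]):
        if current and mz - current[-1][1] > mz_tol:
            if len(current) >= min_len:
                features.append(frozenset(i for i, _ in current))
            current = []
        current.append((idx, mz))
    if len(current) >= min_len:
        features.append(frozenset(i for i, _ in current))
    return features


def grid_search(points, truth, grid):
    """Try every parameter combination; return (params, true_positives)
    for the first combination that reaches the highest count."""
    best = None
    names = sorted(grid)
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        tp = count_true_positives(toy_detector(points, **params), truth)
        if best is None or tp > best[1]:
            best = (params, tp)
    return best


# Made-up data: point id -> m/z value, with two hand-labelled XIC.
points = {0: 100.000, 1: 100.001, 2: 100.002,
          3: 200.000, 4: 200.001, 5: 200.002, 6: 350.5}
truth = [frozenset({0, 1, 2}), frozenset({3, 4, 5})]
grid = {"mz_tol": [0.0001, 0.005, 200.0], "min_len": [2, 5]}

best_params, best_tp = grid_search(points, truth, grid)
print(best_params, best_tp)
```

On this toy data, a tolerance that is too tight splits every XIC into single points and one that is too loose merges everything into a single feature, which mirrors the abstract's observation that performance varies greatly between parameter settings.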

Included in

Data Science Commons

© Copyright 2021 Annika R. Tostengard and Rob Smith