Year of Award

2019

Document Type

Thesis

Degree Type

Master of Science (MS)

Degree Name

Computer Science

Department or School/College

Computer Science

Committee Chair

Oliver Serang

Committee Members

Oliver Serang, Rob Smith, J. Stephen Lodmell

Keywords

de novo, small molecules, algorithms, mass spectrometry, graph isomorphism, glycomics

Subject Categories

Computer Sciences

Abstract

In the analysis of mass spectra, if a superset of the molecules thought to be in a sample is known a priori, then there are well-established techniques for identifying the molecules, such as database search and spectral libraries. Linear molecules are chains of subunits. For example, a peptide is a linear molecule with an “alphabet” of 20 possible amino acid subunits. A peptide of length six has $20^6 = 64{,}000{,}000$ different possible outcomes. Small molecules, such as sugars and metabolites, are not constrained to linear structures and may branch. These molecules are encoded as undirected graphs rather than simply linear chains. An undirected graph with six subunits (each of which has 20 possible outcomes) has $20^6 \cdot 2^{\binom{6}{2}} = 2{,}097{,}152{,}000{,}000$ possible outcomes. The vast number of complex graphs that small molecules can form can render databases and spectral libraries impossibly large to use, or incomplete, as many metabolites may still be unidentified. In the absence of a usable database or spectral library, an alphabet of subunits may be used to connect peaks in the fragmentation spectra; each connection represents a neutral loss of an alphabet mass. This technique is called “de novo sequencing” and relies on the alphabet being known in advance. Often the alphabet of m/z difference values allowed by de novo analysis is not known or is incomplete. A method is proposed that, given a collection of fragmentation mass spectra, identifies an alphabet of m/z differences that can build large connected graphs from many intense peaks in each spectrum. Once an alphabet is obtained, it is informative to find common substructures among the peaks connected by the alphabet. This is the same as finding the largest isomorphic subgraphs on the de novo graphs from all pairs of fragmentation spectra.
This maximal subgraph isomorphism problem is a generalization of the subgraph isomorphism problem, which asks whether a graph $G_1$ has a subgraph isomorphic to a graph $G_2$. Subgraph isomorphism is NP-complete. A novel method of efficiently finding common substructures among the subspectra induced by the alphabet is proposed. This method is then combined with a novel form of hashing, eschewing evaluation of all pairs of fragmentation spectra. These methods are generalized to Euclidean graphs embedded in $\mathbb{Z}^n$.
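
As a rough illustration of the counting argument and of the de novo step described in the abstract, the following Python sketch reproduces the two search-space sizes and connects peaks whose m/z difference matches an alphabet mass. It is not the method developed in the thesis: the function names, the tolerance `tol`, the peak list, and the two-entry alphabet (approximate hexose and HexNAc residue masses) are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import comb


def count_linear(alphabet_size, length):
    """Number of linear chains: one alphabet choice per position."""
    return alphabet_size ** length


def count_labeled_graphs(alphabet_size, n_nodes):
    """Node labelings times every subset of the C(n, 2) possible edges."""
    return alphabet_size ** n_nodes * 2 ** comb(n_nodes, 2)


def de_novo_edges(peaks, alphabet, tol=0.01):
    """Connect pairs of peaks whose m/z difference matches an alphabet mass.

    `peaks` is a list of m/z values, `alphabet` maps a (hypothetical) subunit
    name to its mass, and `tol` is an assumed absolute tolerance.
    """
    edges = []
    for low, high in combinations(sorted(peaks), 2):
        diff = high - low
        for name, mass in alphabet.items():
            if abs(diff - mass) <= tol:
                edges.append((low, high, name))
    return edges


# Illustrative values only: approximate residue masses and made-up peaks.
alphabet = {"Hex": 162.0528, "HexNAc": 203.0794}
peaks = [204.0867, 366.1395, 528.1923, 731.2717]

print(count_linear(20, 6))          # 64000000
print(count_labeled_graphs(20, 6))  # 2097152000000
for low, high, name in de_novo_edges(peaks, alphabet):
    print(f"{low} -> {high}: {name}")
```

The pairwise loop is quadratic in the number of peaks, which is acceptable for a sketch restricted to the most intense peaks of a single spectrum; comparing the resulting graphs across all pairs of spectra is the expensive step that the abstract proposes to avoid with hashing.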

© Copyright 2019 Patrick Anthony Kreitzberg