## Year of Award

2019

## Document Type

Thesis

## Degree Type

Master of Science (MS)

## Degree Name

Computer Science

## Department or School/College

Computer Science

## Committee Chair

Oliver Serang

## Committee Members

Oliver Serang, Rob Smith, J. Stephen Lodmell

## Keywords

de novo, small molecules, algorithms, mass spectrometry, graph isomorphism, glycomics

## Publisher

University of Montana

## Subject Categories

Computer Sciences

## Abstract

In the analysis of mass spectra, if a superset of the molecules thought to be in a sample is known *a priori*, then there are well-established techniques for identifying the molecules, such as database search and spectral libraries. Linear molecules are chains of subunits. For example, a peptide is a linear molecule with an “alphabet” of 20 possible amino acid subunits. A peptide of length six will have $20^{6} = 64{,}000{,}000$ different possible outcomes. Small molecules, such as sugars and metabolites, are not constrained to linear structures and may branch. These molecules are encoded as undirected graphs rather than simply linear chains. An undirected graph with six subunits (each of which has 20 possible outcomes) will have $20^{6} \cdot 2^{\binom{6}{2}} = 2{,}097{,}152{,}000{,}000$ possible outcomes. The vast number of complex graphs that small molecules can form can render databases and spectral libraries impossibly large to use, or incomplete, as many metabolites may still be unidentified. In the absence of a usable database or spectral library, the alphabet of subunits may be used to connect peaks in the fragmentation spectra; each connection represents a neutral loss of an alphabet mass. This technique is called “*de novo* sequencing” and relies on the alphabet being known in advance. Often the alphabet of m/z difference values allowed by *de novo* analysis is not known or is incomplete. A method is proposed that, given fragmentation mass spectra, identifies an alphabet of m/z differences that can build large connected graphs from many intense peaks in each spectrum from a collection. Once an alphabet is obtained, it is informative to find common substructures among the peaks connected by the alphabet. This is the same as finding the largest isomorphic subgraphs on the *de novo* graphs from all pairs of fragmentation spectra.
This maximal subgraph isomorphism problem is a generalization of the subgraph isomorphism problem, which asks whether a graph $G_{1}$ has a subgraph isomorphic to a graph $G_{2}$. Subgraph isomorphism is NP-complete. A novel method of efficiently finding common substructures among the subspectra induced by the alphabet is proposed. This method is then combined with a novel form of hashing, eschewing evaluation of all pairs of fragmentation spectra. These methods are generalized to Euclidean graphs embedded in $\mathbb{Z}^{n}$.
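The counting argument and the peak-connection step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's method: the function names, the tolerance `tol`, and the example peak and alphabet values are all hypothetical, chosen only to show how an alphabet of m/z differences induces edges between peaks.

```python
from itertools import combinations
from math import comb


def count_linear(alphabet_size, length):
    """Number of linear chains of `length` subunits drawn from the alphabet."""
    return alphabet_size ** length


def count_labeled_graphs(alphabet_size, n):
    """Vertex labelings times edge subsets of an undirected graph on n vertices."""
    return alphabet_size ** n * 2 ** comb(n, 2)


def de_novo_edges(peaks, alphabet, tol=0.01):
    """Connect pairs of peaks whose m/z difference matches an alphabet mass.

    Returns (i, j, mass) triples, where i < j index the peaks in sorted
    order. `tol` is an assumed absolute m/z tolerance.
    """
    ordered = sorted(peaks)
    edges = []
    for (i, low), (j, high) in combinations(enumerate(ordered), 2):
        diff = high - low
        for mass in alphabet:
            if abs(diff - mass) <= tol:
                edges.append((i, j, mass))
    return edges


if __name__ == "__main__":
    print(count_linear(20, 6))          # peptides of length six
    print(count_labeled_graphs(20, 6))  # labeled undirected graphs on six subunits
    # Hypothetical spectrum and a one-element alphabet.
    print(de_novo_edges([100.0, 262.05, 424.10, 500.0], [162.05]))
```

The pairwise loop is quadratic in the number of peaks, which is acceptable for a sketch; the hashing approach mentioned in the abstract is aimed at avoiding all-pairs comparisons between spectra and is not reproduced here.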

## Recommended Citation

Kreitzberg, Patrick Anthony, "ZERO-KNOWLEDGE DE NOVO ALGORITHMS FOR ANALYZING SMALL MOLECULES USING MASS SPECTROMETRY" (2019). *Graduate Student Theses, Dissertations, & Professional Papers*. 11396.

https://scholarworks.umt.edu/etd/11396

© Copyright 2019 Patrick Anthony Kreitzberg