Poster Session #2

Author Information

Jack Roddy

Presentation Type

Poster

Faculty Mentor’s Full Name

Travis Wheeler

Faculty Mentor’s Department

Computer Science

Abstract / Artist's Statement

A key component of modern molecular biology is sequence annotation: labeling the contents of biological sequences. Annotation largely depends on identifying relationships between sequences through sequence alignment. Modern alignment methods are remarkably good at recognizing when a substring of one sequence is related (aligns) to a substring of another, but they are also prone to a form of error known as alignment overextension, in which the alignment extends beyond the true bounds of relatedness. The impact of overextension is substantial. For example, in the annotation of transposable elements in the human genome, we have estimated that 2% of the annotated genome (~30 million nucleotides) is the result of overextension. Current methods for combating overextension are only somewhat effective and can have the unintended consequences of reducing search sensitivity and over-trimming the alignment. We developed machine learning approaches to identify and trim overextended regions in sequence alignments. We benchmark the trimming using an artificial sequence dataset that mimics transposable elements inserted into simulated sequence alignment. Our results demonstrate a dramatic decrease in overextension with a minimal amount of over-trimming.
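
The two error types named in the abstract, overextension and over-trimming, can be made concrete when the true bounds of relatedness are known, as in a simulated benchmark. The sketch below is illustrative only and is not the authors' benchmark code: the `Interval` type, the `boundary_errors` function, and the example coordinates are all hypothetical, and it assumes half-open coordinates on a single sequence.

```python
from typing import NamedTuple, Tuple


class Interval(NamedTuple):
    """Half-open coordinate range [start, end) on one sequence (hypothetical type)."""
    start: int
    end: int

    def length(self) -> int:
        return max(0, self.end - self.start)


def overlap(a: Interval, b: Interval) -> int:
    """Number of positions shared by two intervals."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def boundary_errors(true_region: Interval, reported: Interval) -> Tuple[int, int]:
    """Return (overextended, over_trimmed) position counts.

    overextended: reported positions that fall outside the true region.
    over_trimmed: true positions that the reported alignment fails to cover.
    """
    shared = overlap(true_region, reported)
    overextended = reported.length() - shared
    over_trimmed = true_region.length() - shared
    return overextended, over_trimmed


# Made-up coordinates: a truly related region and two reported alignments.
true_region = Interval(100, 200)
raw_alignment = Interval(90, 215)       # extends past both true ends
trimmed_alignment = Interval(102, 201)  # after a hypothetical trimming step

print(boundary_errors(true_region, raw_alignment))
print(boundary_errors(true_region, trimmed_alignment))
```

Under these assumptions, a good trimming method would drive the first count toward zero while keeping the second count small.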

Category

Life Sciences

Apr 17th, 3:00 PM - Apr 17th, 4:00 PM

A Convolutional Neural Network to Trim Sequence Alignment Overextension

UC South Ballroom
