Presentation Type

Poster

Faculty Mentor’s Full Name

Travis Wheeler

Faculty Mentor’s Department

Computer Science

Abstract

Sequence comparison is fundamental to modern molecular biology. The primary focus in the field is on methods that increase the speed of comparison and the sensitivity required to recognize relationships between highly divergent sequences. Our work addresses another important aspect of sequence comparison: avoidance of incorrect sequence annotation. The primary source of such incorrect annotation arises when software correctly identifies that a substring of one sequence is related to (aligns to) a substring of another sequence, but incorrectly claims that flanking regions of the two sequences are also related; this is often called alignment overextension. The impact of overextension is substantial: for example, in the annotation of transposable elements in the human genome, we have estimated that 2% of the annotated genome is the result of overextension. Current methods used to combat overextension are only somewhat effective, and can have the unintended consequences of reducing search sensitivity and under-extending the alignment. In our research, we develop a prototype method for mitigating overextension that uses hidden Markov models (HMMs) to recognize the point at which overextension begins in an alignment. We benchmark this approach using an artificial sequence dataset that mimics transposable elements inserted into simulated genomic sequence. We expect that the results of this pilot study will lead to dramatic improvement in the annotation of genomic sequences.
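
The abstract does not describe the model's structure, so the following is only a minimal illustrative sketch of the general idea, not the authors' prototype. It assumes a two-state HMM (a "related" state with a high match probability and an "overextended" state with a chance-level match probability), an alignment encoded as a string of "M" (match) and "X" (mismatch or gap) columns, and that overextension occurs only at the right-hand end. The function name `overextension_start` and all probability values are hypothetical choices made for illustration.

```python
import math


def overextension_start(columns, p_match_related=0.8,
                        p_match_random=0.25, p_switch=0.01):
    """Return the index of the first column that Viterbi decoding labels as
    overextended, or len(columns) if no column is labelled that way.

    columns: string of 'M' (match) and 'X' (mismatch/gap) alignment columns.
    All probabilities are illustrative assumptions, not values from the source.
    """
    n = len(columns)
    if n == 0:
        return 0

    # Log emission probabilities for the two hypothetical states.
    emit_related = {"M": math.log(p_match_related),
                    "X": math.log(1.0 - p_match_related)}
    emit_over = {"M": math.log(p_match_random),
                 "X": math.log(1.0 - p_match_random)}
    log_stay = math.log(1.0 - p_switch)   # related -> related
    log_switch = math.log(p_switch)       # related -> overextended
    # The overextended state is absorbing (self-transition probability 1).

    # The alignment is assumed to begin in the related state.
    score_related = emit_related[columns[0]]
    score_over = -math.inf
    # entered_here[i] is True when the best path ending in the overextended
    # state at column i arrived from the related state at column i - 1.
    entered_here = [False] * n

    for i in range(1, n):
        c = columns[i]
        from_related = score_related + log_switch
        from_over = score_over  # + log(1) for the absorbing self-transition
        if from_related >= from_over:
            entered_here[i] = True
            best_prev = from_related
        else:
            best_prev = from_over
        new_over = best_prev + emit_over[c]
        new_related = score_related + log_stay + emit_related[c]
        score_related, score_over = new_related, new_over

    # If the best full path never leaves the related state, nothing is trimmed.
    if score_related >= score_over:
        return n

    # Trace back through the overextended state to the column where it began.
    i = n - 1
    while not entered_here[i]:
        i -= 1
    return i
```

For example, `overextension_start("M" * 20 + "X" * 20)` returns 20, marking the first column of the poorly matching flank, while `overextension_start("M" * 30)` returns 30, meaning nothing would be trimmed. Under the same assumptions, the left-hand flank could be handled by running the function on the reversed column string.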

Category

Life Sciences

Apr 27th, 3:00 PM Apr 27th, 4:00 PM

Reducing False Sequence Annotation Due to Alignment Overextension

UC South Ballroom
