Presentation Type

Oral Presentation

Abstract/Artist Statement

Modern technology has produced a wealth of raw scientific data awaiting analysis. More than 18,000 genomes have been sequenced to date, and the number is growing faster each year. Scientists can no longer analyze this volume of data by hand. Efficient, accurate computational tools are necessary for scientific progress to keep pace with data growth. Computational biology seeks to produce the algorithms and tools needed to analyze biological data.

Repetitive sequences are common in genomes, comprising more than 3% of the genome. Sequences such as the three-letter repeat CAGCAGCAG are called short tandem repeats and are an important feature of biological sequences. Tandem repeats are often associated with binding domains in proteins and with human diseases such as Huntington’s disease. In computational biology, short tandem repeats are a significant source of false positive matches in sequence comparison. For example, two DNA sequences may show a highly significant level of similarity simply because both are repetitive, rather than because they share evolutionary history. Although detection of tandem repeats is important to bioinformatics and to biology in general, current tandem repeat annotation tools miss many repetitive regions that a human expert can easily identify.
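To make the idea of a short tandem repeat concrete, the following is a minimal Python sketch of a naive exact-match scan. It is only an illustration and is not how TRUCE works: the function name `find_exact_tandem_repeats` and its parameters `unit_len` and `min_copies` are hypothetical, chosen for this example.

```python
def find_exact_tandem_repeats(seq, unit_len=3, min_copies=3):
    """Return (start, unit, copies) for runs of exact back-to-back copies.

    Illustrative only: a naive exact-match scan, not the TRUCE model.
    """
    hits = []
    i = 0
    n = len(seq)
    while i + unit_len <= n:
        unit = seq[i:i + unit_len]
        copies = 1
        j = i + unit_len
        # Extend the run while the next window repeats the same unit exactly.
        while seq[j:j + unit_len] == unit:
            copies += 1
            j += unit_len
        if copies >= min_copies:
            hits.append((i, unit, copies))
            i = j  # continue scanning after the reported run
        else:
            i += 1
    return hits


print(find_exact_tandem_repeats("TTCAGCAGCAGTT"))  # [(2, 'CAG', 3)]
print(find_exact_tandem_repeats("CAGCAACAG"))      # []
```

The second call returns nothing because a single substitution (CAA in place of CAG) breaks the exact match. This is the kind of degenerate repeat that a simple scan misses and that motivates a probabilistic model.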

We present a new tool, TRUCE, the Tandem Repeat Unifying Discoverer, which is built on a robust hidden Markov model (HMM) that enables it to produce accurate tandem repeat annotations even for highly degenerate repetitive sequences. Unlike the current industry-standard software for detecting tandem repeats, TRUCE is built on a probabilistic model, allowing it to produce probability scores that correspond to confidence in each repeat annotation. Because of this, it can be directly incorporated into homology search tools such as BLAST and HMMER, reducing false positive matches caused by tandem repeats. TRUCE is also nearly 100x faster than the current industry-standard tool.

Mentor Name

Travis Wheeler

Apr 20th, 9:45 AM to 10:00 AM

TRUCE: A Hidden Markov Model for Annotation of Tandem Repeats

UC North Ballroom, Presentation Pod 2
