Search technique finds DNA sequences in minutes
PITTSBURGH, Pennsylvania: Database searches for DNA sequences that can take biologists and medical researchers days can now be completed in a matter of minutes, thanks to a new search method developed by computer scientists at Carnegie Mellon University.
The method, developed by Carl Kingsford, associate professor of computational biology, and Brad Solomon, a doctoral student, is designed for searching so-called “short reads”.
These are DNA and RNA sequences generated by high-throughput sequencing techniques.
The method relies on a new indexing data structure, called Sequence Bloom Trees, or SBTs, that the researchers describe in a report published online by the journal Nature Biotechnology.
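The article does not spell out how a Sequence Bloom Tree works, so the following is only a rough sketch of the general idea, not the researchers' implementation. It assumes each experiment is summarised by a Bloom filter of its k-mers (length-k substrings), that each internal tree node holds the union of its children's filters, and that a search descends only into subtrees containing at least a fraction theta of the query's k-mers. The class and function names, the filter size, the hash scheme and the toy sequences are all illustrative assumptions.

```python
import hashlib


class Bloom:
    """Tiny Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits >> p & 1 for p in self._positions(item))

    def union(self, other):
        merged = Bloom(self.size, self.hashes)
        merged.bits = self.bits | other.bits
        return merged


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class Node:
    def __init__(self, bloom, name=None, left=None, right=None):
        self.bloom = bloom
        self.name = name  # set only on leaves (one leaf per experiment)
        self.left = left
        self.right = right


def build(experiments, k):
    """Build a tree from a non-empty dict of experiment name -> list of reads."""
    nodes = []
    for name, reads in experiments.items():
        bloom = Bloom()
        for read in reads:
            for km in kmers(read, k):
                bloom.add(km)
        nodes.append(Node(bloom, name=name))
    # Pair nodes level by level; each parent stores the union of its children.
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes) - 1, 2):
            left, right = nodes[i], nodes[i + 1]
            paired.append(Node(left.bloom.union(right.bloom), left=left, right=right))
        if len(nodes) % 2:
            paired.append(nodes[-1])  # odd node out moves up unchanged
        nodes = paired
    return nodes[0]


def search(node, query_kmers, theta):
    """Return names of experiments holding at least theta of the query k-mers."""
    if not query_kmers:
        return []
    hits = sum(1 for km in query_kmers if km in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: nothing below this node can match
    if node.name is not None:
        return [node.name]
    return search(node.left, query_kmers, theta) + search(node.right, query_kmers, theta)


# Toy data, invented purely for illustration.
experiments = {"blood": ["ACGTACGTAC"], "brain": ["TTTTGGGGCC"]}
tree = build(experiments, k=4)
print(search(tree, kmers("ACGTACGT", 4), theta=0.8))
```

In this sketch the saving comes from pruning: a query that fails the threshold at an internal node skips every experiment beneath it, so most of the collection is never examined. Because Bloom filters can give false positives, a reported match is a candidate for closer inspection rather than a guarantee.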
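Short reads from many experiments are held in the Sequence Read Archive, a public database maintained by the US National Institutes of Health.
“The database contains untold numbers of as-yet undiscovered insights and is heavily used,” Kingsford said.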
“Its main problem is that it’s very difficult to search.”
Thousands of hard drives would be needed to store these sequences. Searching through the short reads, which are typically 50 to 200 base-pairs each, to see which ones could be assembled to form a target gene of perhaps 10,000 base-pairs, is cumbersome and can take days in some cases, he noted.
Kingsford and Solomon tested their technique using a database of 2,652 human blood, breast and brain experiments, each of which often contains over a billion base-pairs of RNA sequences.
They found that most searches of that database could be completed in an average of 20 minutes.