ICML 2019 Expo Talk
July 18, 2021
Mapping high-throughput DNA sequencing reads with learned index structures
Mike Lin (doc.ai)
Read mapping is a fundamental problem in high-throughput genome sequencing informatics. Many short segments of sequenced DNA (hundreds or thousands of nucleotides each) must be matched quickly with similar sequences in a large reference genome (billions of nucleotides). Trade-offs among speed, sensitivity, and specificity arise not only from the design and parameters of the mapping algorithm, but also from the DNA sequencing protocol and the extent of repetitive content in the species genome. Modern read mapping tools use advanced string indexing techniques, such as compressed suffix arrays and minimizer sketches, in combination with manually tuned heuristics and parameters that control these trade-offs. We explore how these approaches can be augmented with “Learned Index Structures” and other machine learning techniques to tune the trade-offs systematically. We frame read mapping as a learning problem: predicting the reference genome loci that exhibit small edit distance to a given sequence read. We train on simulated data customized to recapitulate properties of the species genome and sequencing protocol of interest. The resulting index and prediction models provide learned control over how much exhaustive dynamic programming to invest in each read, streamlining the processing of repetitive and low-complexity regions without sacrificing sensitivity elsewhere. Our early results suggest that fusing standard indexing and sketching techniques with learned models might accelerate progress in many genome search and clustering tasks, where algorithmic innovation and parameter tuning are continuously pressed to keep pace with sequencing technology development and the ever-increasing scale and diversity of key databases.
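
To make the idea of a learned index concrete, the following is a minimal Python sketch, not the method presented in the talk. It assumes the simplest possible setup: k-mers from a toy random reference are packed into integers and sorted, a least-squares line predicts each key's rank, and the worst-case prediction error bounds a short binary search at query time. The names `encode_kmer` and `LinearLearnedIndex`, the toy reference, and the choice of k = 8 are all hypothetical and chosen only for illustration.

```python
import bisect
import random

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per nucleotide."""
    value = 0
    for base in kmer:
        value = value * 4 + ENC[base]
    return value


class LinearLearnedIndex:
    """Sorted k-mer keys plus a least-squares line predicting each key's rank.

    Hypothetical illustration; assumes at least one key is supplied.
    """

    def __init__(self, keys):
        self.keys = sorted(set(keys))
        n = len(self.keys)
        mean_x = sum(self.keys) / n
        mean_y = (n - 1) / 2
        var_x = sum((x - mean_x) ** 2 for x in self.keys)
        cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(self.keys))
        self.slope = cov / var_x if var_x else 0.0
        self.intercept = mean_y - self.slope * mean_x
        # The worst-case prediction error bounds the search window at query time.
        self.max_err = max(
            abs(self._predict(x) - i) for i, x in enumerate(self.keys)
        )

    def _predict(self, key):
        """Model's guess of the rank of `key` in the sorted key array."""
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the rank of `key` in the sorted key array, or -1 if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = max(0, min(n, guess - self.max_err))
        hi = max(lo, min(n, guess + self.max_err + 1))
        # Binary search only inside the window the model points at.
        pos = bisect.bisect_left(self.keys, key, lo, hi)
        if pos < n and self.keys[pos] == key:
            return pos
        return -1


# Toy reference sequence (random, for illustration only).
random.seed(0)
reference = "".join(random.choice("ACGT") for _ in range(2000))
k = 8
index = LinearLearnedIndex(
    encode_kmer(reference[i:i + k]) for i in range(len(reference) - k + 1)
)
```

Calling `index.lookup(encode_kmer(reference[:k]))` returns the rank of that k-mer among the sorted keys, and a k-mer absent from the reference returns -1. A real genome's k-mer distribution is far from uniform, so a single line would likely give a wide error window; this sketch only shows the mechanism of replacing part of a search with a model prediction plus a bounded local search.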