CATS-Oct-19-2012

Title

Efficient communication and storage vs accurate variant calls in high throughput sequencing: two sides of the same coin

Speaker

S. Cenk Sahinalp is a Professor of Computing Science at Simon Fraser University, Canada. He received his Ph.D. in Computer Science from the University of Maryland at College Park in early 1997. Sahinalp is an NSF Career Awardee, a Canada Research Chair, a Michael Smith Foundation for Health Research Scholar and an NSERC Discovery Accelerator Awardee. His papers on genomics and bioinformatics have been highlighted by scientific journals and magazines (e.g. Genome Research, Nature Biotech, Genome Technology) and won several awards (e.g. ISMB'12 Best Student Paper, ISMB'11 HitSeq Best Paper). He was/is the conference general chair of RECOMB'11, PC chair of APBC'13, and has served as the sequence analysis area chair for ISMB and CSHL Genome Informatics Conferences. He has also co-founded the RECOMB-Seq conference series on Massively Parallel Sequencing. He is an area co-editor for BMC Bioinformatics and is on the editorial board of Bioinformatics and several other journals. He co-directs the SFU undergraduate program in Bioinformatics and the SFU Bioinformatics for Combating Infectious Diseases Research Program. His research interests include computational genomics, in particular algorithms for high throughput sequence data, network biology, RNA structure and interaction prediction and chemoinformatics algorithms.

Abstract

Given two strings A and B from the DNA alphabet, the Levenshtein edit distance between A and B, $LED(A,B)$, is defined to be the minimum number of single character insertions, deletions and replacements to transform A to B (equivalently B to A). If in addition to the single character edits, one is permitted to perform segmental (block) edits in the form of (i) moving a block from any location to another, (ii) copying a block to any location, and (iii) uncopying (i.e. deleting one of the two occurrences of) a block, the resulting "block edit distance", $BED(A,B)$, captures much of our current understanding of the relation between individual genome sequences. If among two communicating parties, Alice (holding genome sequence A) and Bob (holding genome sequence B), wants to compute $BED(A,B)$ then, theoretically, the total number of bits Bob needs to send to Alice is $O^{\tilde}(BED(A,B)\cdot\log BED(A,B))$ [Cormode et al., SODA 2000]. Considering that between a typical donor genome B and a reference genome A, the number of single character differences are on the order of a few million and the number of structural (i.e., blockwise) differences are in the order of tens of thousands, it should be possible to communicate genomes by exchanging only a few million bytes! Yet, today, the most effective way of communicating genome sequence data involves physically exchanging hard disks.