CATS-Oct-19-2012

From Theory
Revision as of 14:53, 27 August 2012 by Hmahini (talk | contribs)

Title[edit]

Efficient communication and storage vs accurate variant calls in high throughput sequencing: two sides of the same coin

Speaker[edit]

S. Cenk Sahinalp, Simon Fraser University

Abstract[edit]

Given two strings A and B from the DNA alphabet, the Levenshtein edit distance between A and B, $LED(A,B)$, is defined to be the minimum number of single character insertions, deletions and replacements to transform A to B (equivalently B to A). If in addition to the single character edits, one is permitted to perform segmental (block) edits in the form of (i) moving a block from any location to another, (ii) copying a block to any location, and (iii) uncopying (i.e. deleting one of the two occurrences of) a block, the resulting "block edit distance", $BED(A,B)$, captures much of our current understanding of the relation between individual genome sequences. If among two communicating parties, Alice (holding genome sequence A) and Bob (holding genome sequence B), wants to compute $BED(A,B)$ then, theoretically, the total number of bits Bob needs to send to Alice is $O^{\tilde}(BED(A,B)\cdot\log BED(A,B))$ [Cormode et al., SODA 2000]. Considering that between a typical donor genome B and a reference genome A, the number of single character differences are on the order of a few million and the number of structural (i.e., blockwise) differences are in the order of tens of thousands, it should be possible to communicate genomes by exchanging only a few million bytes! Yet, today, the most effective way of communicating genome sequence data involves physically exchanging hard disks.