CATS-Feb-19-2016

From Theory

Title[edit]

Succinct Data Structures and Text Indexing

Speaker[edit]

Rahul Shah

Abstract[edit]

In the era of big data, data sizes are ever increasing, however the information content in the data is often low. Many such data sets can be compressed to much under their original size. The field of succinct data structure attempts to achieve two goals at one shot (1) keeping the space usage of the data structure as low as the information theoretic optimum space required for the underlying data storage and (2) achieving fast query time (polylog in data size) as if the data structure was space unaware.

Classic example where succinct data structures have had a huge impact is the Suffix Trees which are used for full-text indexing. Suffix trees on text data would typically take 15 to 50 times the size of the text. Moreover, many text datasets can be compressed even further. In terms of the complexity, suffix trees take O(n log n) bits of space while the raw text takes O(n log s) bits, where s is the size of the alphabet. Compressed indexes based on Burrows-Wheeler Transform(BWT) have made it possible to have text searching data structures in less than the original text's space.

In this talk, I will cover some of our recent developments in this area, where the BWT based index can be made as powerful as suffix tree for pattern matching variants where suffix trees need augmenting information as well as those variants which need different notions of suffix trees.

Bio: Dr. Rahul Shah is a Roy Paul Daniel Distinguished Associate professor at Louisiana State University (LSU). He did is undergraduate degree from Indian Institute of Technology, Bombay and his Doctoral degree in computer science from Rutgers University. His main areas of interest are data structures and algorithms. He particularly specializes in string matching and external memory algorithms. Prior to joining LSU, he was an assistant research professor at Purdue University. He also spent a year working as a research staff member at IBM India Research Lab. He is currently serving as a program director in algorithmic foundations at NSF.