CATS-Mar-25-2016

Title

DATA SUMMARIZATION AT SCALE

Speaker

Amin Karbasi

Abstract

For the last several years, we have witnessed the emergence of datasets of an unprecedented scale across different scientific disciplines. The large volume of such datasets presents new computational challenges, as the diverse, feature-rich, and usually high-resolution data does not allow for effective data-intensive inference. In this regard, data summarization is a compelling (and sometimes the only) approach that aims at both exploiting the richness of large-scale data and remaining computationally tractable: instead of operating on complex and large data directly, carefully constructed summaries not only enable the execution of various data analytics tasks but also improve their efficiency and scalability.

A systematic way to perform data summarization is to turn the problem into selecting a subset of data elements that optimizes a utility function quantifying the "representativeness" of the selected set. Oftentimes, these objective functions satisfy submodularity, an intuitive notion of diminishing returns stating that selecting any given element earlier helps more than selecting it later. Thus, many problems in data summarization require maximizing submodular set functions subject to cardinality constraints, and massive data means we have to solve these problems at scale.
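To make the setup concrete, here is a minimal sketch (not from the talk) of the classic greedy algorithm for maximizing a monotone submodular function under a cardinality constraint. The coverage utility and the toy data are hypothetical stand-ins for a "representativeness" objective.

```python
def coverage(selected, sets):
    """Submodular utility: how many distinct items the chosen sets cover."""
    covered = set()
    for i in selected:
        covered |= sets[i]
    return len(covered)

def greedy(sets, k):
    """At each of k steps, add the element with the largest marginal gain.
    For monotone submodular objectives this achieves a (1 - 1/e) fraction
    of the optimal value (Nemhauser, Wolsey, and Fisher, 1978)."""
    selected = []
    for _ in range(k):
        base = coverage(selected, sets)
        best_i, best_gain = None, 0
        for i in range(len(sets)):
            if i in selected:
                continue
            gain = coverage(selected + [i], sets) - base
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # no remaining element adds value; stop early
            break
        selected.append(best_i)
    return selected

# Toy example: four candidate summaries over seven items.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
print(greedy(sets, 2))  # → [2, 0], together covering all 7 items
```

Note that each greedy step scans every remaining element, so the plain algorithm needs O(nk) utility evaluations; this is exactly the cost that the faster schemes in the talk attack.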

In this talk, I will present our recent efforts in developing practical schemes for data summarization. In particular, I will first discuss Stochastic-Greedy, currently the fastest centralized algorithm for data summarization, whose query complexity is only linear in the data size. However, to truly summarize data at scale we need to opt for streaming or distributed solutions. To this end, I will then present Sieve-Streaming, the first streaming algorithm for data summarization with a constant-factor approximation guarantee. Finally, I will talk about GreeDi, the first distributed approach to data summarization. In extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar-based clustering, on tens of millions of data points using Hadoop.
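The idea behind Stochastic-Greedy can be sketched as follows: instead of scanning all remaining elements at every step, each step draws a random sample of roughly (n/k)·ln(1/ε) candidates and takes the best one in the sample, bringing the total number of utility evaluations down to linear in n. The code below is a hedged illustration, reusing a hypothetical coverage utility as the submodular objective; it is not the authors' implementation.

```python
import math
import random

def coverage(selected, sets):
    """Submodular utility: how many distinct items the chosen sets cover."""
    covered = set()
    for i in selected:
        covered |= sets[i]
    return len(covered)

def stochastic_greedy(sets, k, eps=0.1, seed=0):
    """Sketch of the Stochastic-Greedy idea: each of the k steps evaluates
    marginal gains only on a random sample of about (n/k) * ln(1/eps)
    candidates, so the total query count is linear in n, while the expected
    approximation factor is (1 - 1/e - eps) for monotone submodular f."""
    rng = random.Random(seed)
    n = len(sets)
    sample_size = max(1, math.ceil((n / k) * math.log(1 / eps)))
    selected = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in selected]
        sample = rng.sample(candidates, min(sample_size, len(candidates)))
        base = coverage(selected, sets)
        # Best marginal gain within the sample only.
        best_i = max(sample, key=lambda i: coverage(selected + [i], sets) - base)
        selected.append(best_i)
    return selected

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
print(stochastic_greedy(sets, 2))  # on this tiny n the sample covers everything
```

On a toy instance the sample size exceeds n and the method coincides with plain greedy; the savings appear when n is in the millions and k is small, which is exactly the data-summarization regime the abstract describes.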