CATS-Feb-26-2016

Title

New Streaming Methods for Heavy Hitters and Norms

Speaker

Vladimir Braverman

Abstract

In this talk we will discuss two recent results.

(1) Given a stream $p_1, \ldots, p_m$ of items from a universe $\mathcal{U}$, which, without loss of generality, we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq \epsilon \sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream and $F_2 = \sum_{i \in [n]} f_i^2$. In 2002, Charikar, Chen, and Farach-Colton introduced the CountSketch data structure, which finds all such $j$ using $\Theta(\log^2 n)$ bits of space (for constant $\epsilon > 0$). The only known lower bound is $\Omega(\log n)$ bits of space, which comes from the need to specify the identities of the items found. We show it is possible to achieve $O(\log n \log \log n)$ bits of space for this problem. This is joint work with Stephen R. Chestnut, Nikita Ivkin, and David P. Woodruff.

(2) If time permits, we will discuss a recent characterization of the streaming space complexity of every symmetric norm $l$ (a norm on $\mathbb{R}^n$ invariant under sign-flips and coordinate permutations). Specifically, we provide upper and lower bounds on the space complexity of approximating the norm of the stream, where both bounds depend on the median of $l(x)$ when $x$ is drawn uniformly from the $\ell_2$ unit sphere. The same quantity governs many phenomena in high-dimensional spaces, such as large-deviation bounds and the critical dimension in Dvoretzky's Theorem. The family of symmetric norms contains several well-studied norms, including all $\ell_p$ norms, and indeed we provide a new explanation for the disparity in space complexity between $p \leq 2$ and $p > 2$. In addition, we apply our general results to easily derive bounds for several norms that were not studied before in the streaming model, including, for example, the top-$k$ norm and the $k$-support norm, which was recently shown to be effective for machine learning tasks. This is joint work with Stephen R. Chestnut, Robert Krauthgamer, and Lin F. Yang.
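The key quantity above, the median of $l(x)$ for $x$ uniform on the $\ell_2$ unit sphere, is easy to estimate numerically: sample a standard Gaussian vector and normalize it. The sketch below does this by Monte Carlo for the top-$k$ norm mentioned in the abstract; the function names and the trial counts are illustrative assumptions, and this computes the quantity itself, not the space bounds derived from it.

```python
import math
import random
from statistics import median

def top_k_norm(x, k):
    """Top-k norm: sum of the k largest coordinates in absolute value
    (k = 1 gives the l_infinity norm; k = n gives the l_1 norm)."""
    return sum(sorted((abs(v) for v in x), reverse=True)[:k])

def median_on_sphere(norm, n, trials=2000, seed=0):
    """Monte Carlo estimate of the median of norm(x) for x drawn
    uniformly from the l_2 unit sphere in R^n, via normalized Gaussians."""
    rng = random.Random(seed)
    vals = []
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        r = math.sqrt(sum(v * v for v in g))
        vals.append(norm([v / r for v in g]))
    return median(vals)
```

As a sanity check, for the $\ell_2$ norm the median is exactly 1 (every point on the sphere has unit $\ell_2$ norm), while for the top-1 norm ($\ell_\infty$) on a high-dimensional sphere the median is much smaller than 1, reflecting concentration of measure; the paper's upper and lower bounds on streaming space are stated in terms of this median.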