
Pointers and Readings

"Required" readings and software

"Might-be-interesting" readings


  1. Given K (n*n), compute svd(center(K)) using Hebbian algorithm
  2. Given X (n*d), compute svd(center(X)) using Hebbian algorithm
  3. Given A (n*d), compute center(A)*Omega, for Omega random d*p using exact matrix multiplication
  4. Repeat (1) using randomized algorithm
  5. Repeat (2) using randomized algorithm
  6. Given A,B (n*d), compute svd(inv(CC') * DD'), where C=center(A) and D=center(B), if this can be done more efficiently than the naive way of doing it
  7. Given X (n*d) and l, compute svd(A'A + lI), where A=center(X)
  8. Given X (n*d), compute svd(H DD' H'), where D = center(X) and H=inv(I - 11'/n), if this can be done more efficiently than the naive way of doing it
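For items 1 and 2, the Hebbian approach can be sketched roughly as follows. This is a numpy sketch, not the final implementation: `hebbian_topk` and its defaults are made-up names, the data is assumed to be pre-centered, and rows are visited one at a time so only the small (d, k) state lives in memory.

```python
import numpy as np

def hebbian_topk(X, k, lr=0.05, passes=50, seed=0):
    """Streaming estimate of the top-k right singular vectors of a
    centered X via the generalized Hebbian algorithm (Sanger's rule).
    The left singular vectors then follow in one extra pass:
    sigma_j = ||X q_j|| and u_j = X q_j / sigma_j."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((d, k))   # small random start
    for _ in range(passes):
        for x in X:                          # one row at a time
            y = W.T @ x                      # current k projections
            # Sanger's rule: component j trains on the residual left
            # after removing components 1..j (hence the triangle).
            W += lr * (np.outer(x, y) - W @ np.triu(np.outer(y, y)))
    Q, _ = np.linalg.qr(W)                   # orthonormalize the estimate
    return Q
```

The learning rate and pass count are placeholders; in practice a decaying schedule would be used.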

In all the above, "center(X)" means X shifted so that it has mean zero: if X_{i,:} is the ith row of X, then center(X)_{i,:} = X_{i,:} - mean(X), where mean(X) is the mean of the rows. Similarly, by svd(A) we mean the top k singular values and left singular vectors of A, for some user-specified k.
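The centering step itself is one line in numpy. This sketch materializes X for clarity; in the streaming setting the mean would instead be accumulated on a first pass over the rows.

```python
import numpy as np

def center(X):
    """Subtract the mean row, so every column of the result has mean zero."""
    return X - X.mean(axis=0, keepdims=True)
```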

In all cases, we should have the option of either handing over data in a dense format or a sparse format, and the implementation should use appropriately modified routines depending on the data type.

From a timing perspective, I would love to get #1 implemented by mid February. I would be thrilled to get #1-#6 done by the end of the semester, but would still be very happy to get #1-#5 done in that time frame.
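One observation relevant to items 3-5: center(A)*Omega can be formed in a single pass without ever materializing center(A), because center(A)*Omega = A*Omega - 1*(mean(A)*Omega), so it suffices to accumulate the per-row sketches and the running mean together. A sketch under that identity (`centered_sketch` is a hypothetical name):

```python
import numpy as np

def centered_sketch(rows, Omega):
    """One pass over the rows of A computes center(A) @ Omega, using
    center(A) @ Omega = A @ Omega - 1 * (mean(A) @ Omega)."""
    sketches, mean_acc, n = [], np.zeros(Omega.shape[0]), 0
    for x in rows:
        sketches.append(x @ Omega)   # this row's p-dimensional sketch
        mean_acc += x                # running sum for the mean
        n += 1
    # subtract the sketch of the mean row from every row's sketch
    return np.vstack(sketches) - (mean_acc / n) @ Omega
```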

Desired Interface

The goal would be to have a program, say "svd", that would let me select from the following options, depending on my application:

  • Do I want to run the Hebbian version or the randomized version (and if the latter, what oversampling value "p" should I use)?
  • Have I pre-centered the data, or should the algorithm do that for me?
  • Do I want to use a sparse matrix representation or a dense matrix representation?
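A sketch of what that command-line surface might look like; all flag names here are hypothetical, just to make the options above concrete.

```python
import argparse

def make_parser():
    # Hypothetical flags mirroring the three choices above.
    p = argparse.ArgumentParser(prog="svd")
    p.add_argument("--method", choices=["hebbian", "randomized"],
                   default="hebbian", help="which algorithm to run")
    p.add_argument("--oversample", type=int, default=10,
                   help='oversampling "p" for the randomized method')
    p.add_argument("--centered", action="store_true",
                   help="the input has already been mean-centered")
    p.add_argument("--sparse", action="store_true",
                   help="rows are in sparse index:value format")
    p.add_argument("-k", type=int, required=True,
                   help="number of singular values/vectors to compute")
    return p
```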

The input for a sparse representation should be one line per row of the input matrix, which looks like "3:9.1 10:0.03 62:-5.2", meaning that all columns are zero except the 3rd, 10th and 62nd, which have the given values. (This is the same file format as used by SVM-light, libSVM and Vowpal Wabbit.)
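A minimal parser for one such sparse row might look like the following, assuming (as in SVM-light) that the column indices in the file are 1-based; `parse_sparse_row` is a hypothetical name, and a real implementation would fill a sparse structure rather than a dense list.

```python
def parse_sparse_row(line, d):
    """Parse one SVM-light style sparse row ("index:value" pairs,
    1-based column indices) into a dense length-d list of floats."""
    x = [0.0] * d
    for tok in line.split():
        idx, val = tok.split(":")
        x[int(idx) - 1] = float(val)
    return x
```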

The input for a dense representation should just be the obvious one: one line per row, with space-separated columns.

Implementation Notes

It seems worthwhile to try to implement things on top of the basic infrastructure of Vowpal Wabbit. The two cool things this buys us are:

  • Spawn two threads, one for doing file IO and one for doing computation. (Note that the VW file format is basically the same as our file format, though a bit more flexible.) Note: we would need to remove feature hashing, or at least turn it into a user-level option
  • All these algorithms will require multiple passes over the data (since they're not allowed to store it in memory). On the first pass over the data, VW will create a "cache" file, which is essentially a very quick-to-read compressed version of the input. Subsequent passes will read this file to reduce the IO bottleneck

Progress Reports




Questions for the Crew