Document Tracking

Overview of the current page identification process

  1. Compute SIFT features for documents in the database.
  2. Build a k-d tree from the features of all training images. Each feature is tagged with an identifier recording which training image it came from, and all features are inserted into a single k-d tree.
  3. Compute features for the video frame/image to be identified. For each feature in the query image, find the 2 nearest neighbors among all features in the database using a Best-Bin-First (BBF) approximate search through the k-d tree. If the nearest neighbor scores significantly better than the next-best match, add it to the list of matches.
  4. Each feature match is a hypothesis about the presence and pose of a particular object. Go through the matches and cluster those that agree on an object and a pose using a Generalized Hough Transform (GHT dimensions: x, y, orientation, scale, object id). A rough code sketch of this clustering and the subsequent verification (steps 4-7) follows the list.
  5. Sort the clusters from most votes to least. For each cluster with > 3 matches, solve for the affine transform that best maps the positions of the training points in the cluster onto their corresponding points in the query (least-squares error over the transformed points).
  6. Remove outliers and repeat the fit.
  7. If the cluster still contains > 3 points at the end, consider it a positive ID. Add the solved transform to a list and remove the cluster's matches from future consideration.
  • To do: Some type of probabilistic evaluation to check if a positive ID makes sense.
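For reference, here is a minimal sketch of the clustering and verification stage (steps 4-7) in Python/NumPy. The Match record, the bin widths, and the reprojection-error threshold are illustrative assumptions, not values taken from the current implementation.

 import numpy as np
 from collections import defaultdict, namedtuple

 # One hypothesis per feature match: query/train positions, the relative
 # orientation and scale implied by the match, and the id of the training page.
 Match = namedtuple("Match", "qx qy tx ty d_theta d_scale obj_id")

 def hough_cluster(matches, pos_bin=32.0, theta_bin=30.0):
     """Vote each match into a coarse (object, x, y, orientation, scale) bin."""
     clusters = defaultdict(list)
     for m in matches:
         key = (m.obj_id,
                int(m.qx // pos_bin),
                int(m.qy // pos_bin),
                int((m.d_theta % 360.0) // theta_bin),
                int(round(np.log2(max(m.d_scale, 1e-6)))))
         clusters[key].append(m)
     # Most-voted clusters first, as in step 5.
     return sorted(clusters.values(), key=len, reverse=True)

 def fit_affine(cluster):
     """Least-squares affine transform mapping training points onto query points."""
     A, b = [], []
     for m in cluster:
         A.append([m.tx, m.ty, 0, 0, 1, 0])
         A.append([0, 0, m.tx, m.ty, 0, 1])
         b.extend([m.qx, m.qy])
     params, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
     return params                          # [a, b, c, d, tx, ty]

 def verify(cluster, err_thresh=20.0):
     """Steps 5-7: fit, drop outliers, refit; accept only if > 3 matches survive."""
     while len(cluster) > 3:
         p = fit_affine(cluster)
         kept = [m for m in cluster
                 if np.hypot(p[0] * m.tx + p[1] * m.ty + p[4] - m.qx,
                             p[2] * m.tx + p[3] * m.ty + p[5] - m.qy) < err_thresh]
         if len(kept) == len(cluster):
             return p, cluster              # converged: positive ID
         cluster = kept
     return None, cluster                   # too few inliers: reject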

Design Choices, Etc.

  • Best-bin first (BBF) vs. exhaustive search: exhaustive search costs O(n) comparisons per query feature, where n is the number of features in the training set.
  • Doing the BBF search on a single k-d tree of all training features scales much better (see the matching sketch after this list).
    • BBF is capped at 200 comparisons per feature, with good performance reported for databases of 40,000 points. Training thumbnails yield roughly 100 features each, so nearest-neighbor matching should be nearly independent of database size.
  • At the moment, about half of the analysis time is spent finding the features in the frame, and this cost increases quickly with image size.
    • Runs in reasonable time at 640x480.
    • At 1280x960, it takes over 5 seconds to compute the features.
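Below is a minimal sketch of the approximate nearest-neighbor matching (steps 2-3), using OpenCV's FLANN k-d tree as a stand-in for the BBF search described above. The checks=200 cap mirrors the per-feature comparison cap noted above; the 0.8 ratio threshold is an assumption (Lowe's suggested value), and none of this is the project's actual code.

 import cv2
 import numpy as np

 sift = cv2.SIFT_create()

 def build_index(training_images):
     """Compute SIFT features for every training page and pool them in one index."""
     descs, owners = [], []
     for page_id, img in enumerate(training_images):
         _, ds = sift.detectAndCompute(img, None)
         if ds is None:
             continue
         descs.append(ds)
         owners.extend([page_id] * len(ds))              # tag each feature with its page
     flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=4),   # algorithm=1 -> k-d trees
                                   dict(checks=200))             # cap comparisons per query
     flann.add([np.vstack(descs).astype(np.float32)])
     flann.train()
     return flann, owners

 def match_frame(frame, flann, owners):
     """Ratio-test matching of a query frame against the pooled training features."""
     kps, ds = sift.detectAndCompute(frame, None)
     if ds is None:
         return []
     good = []
     for pair in flann.knnMatch(ds.astype(np.float32), k=2):
         if len(pair) < 2:
             continue
         m, n = pair
         if m.distance < 0.8 * n.distance:               # keep only clearly-best matches
             good.append((kps[m.queryIdx], m.trainIdx, owners[m.trainIdx]))
     return good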

Results and Issues

  • Pages with high text density are less reliably matched.
  • In general, very few false positives.
  • Camera quality is extremely important.
    • Using a consumer DV camcorder, we can find matches when 2 pages occupy a frame (see #Videos and Photos below).
    • With good optics, 6+ pages per frame should be possible at 640x480 (photo examples).
    • Would a black-and-white camera produce better results?

Topics for discussion

  • In the Lowe paper: "Top-down matching to add additional points." Does top-down mean something special here?
  • Trade off between resolution / quality / desk coverage / speed
  • Log post-processing: probably necessary, since pages sometimes blink in and out of the log.
  • Other algorithms that might increase speed or accuracy, e.g. processing only regions or frames of interest. Francois suggests optical flow or frame differencing, perhaps (a minimal differencing sketch follows below).
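As an illustration of the differencing idea, here is a minimal OpenCV sketch that restricts work to regions that changed since the previous frame; the threshold and minimum region size are arbitrary assumptions.

 import cv2

 def changed_regions(prev_gray, curr_gray, thresh=25, min_area=2000):
     """Return bounding boxes of regions that changed between consecutive frames."""
     diff = cv2.absdiff(prev_gray, curr_gray)
     _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
     mask = cv2.dilate(mask, None, iterations=2)          # merge nearby changed pixels
     contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
     return [cv2.boundingRect(c) for c in contours
             if cv2.contourArea(c) > min_area]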

Next Steps

Once we get a rig set up and have more reliable matching, generate a text log and compare it to manual annotation to get hard numbers on recognition accuracy.
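A minimal sketch of that comparison, assuming both the generated log and the manual annotation can be reduced to a per-frame set of page ids (that format is an assumption for illustration):

 def score_log(predicted, ground_truth):
     """predicted / ground_truth: dict mapping frame index -> set of page ids."""
     tp = fp = fn = 0
     for frame, truth in ground_truth.items():
         found = predicted.get(frame, set())
         tp += len(found & truth)        # pages correctly reported in this frame
         fp += len(found - truth)        # pages reported but not actually present
         fn += len(truth - found)        # pages present but missed
     precision = tp / (tp + fp) if tp + fp else 0.0
     recall = tp / (tp + fn) if tp + fn else 0.0
     return precision, recall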

Videos and Photos

GPU implementation

    • [1] Link to SIFTGPU