Overview of the current page identification process
- Compute SIFT features for documents in the database.
- Build a k-d tree of features from the training images. Each feature is tagged with an identifier recording which training image it came from, and all features are inserted into a single k-d tree.
- Compute features for the video frame/image to be identified. For each feature in the query image, find its 2 nearest neighbors among all features in the database, using a Best-Bin-First (BBF) approximate search through the k-d tree. If the nearest neighbor scores significantly better than the second-nearest, add it to the list of matches.
- Each feature match is a hypothesis about the identity and pose of a particular object. Go through the matches and cluster those that agree on an object and pose using a Generalized Hough Transform. (GHT dimensions: x, y, orientation, scale, object id.)
- Sort clusters from most votes to least. For each cluster with > 3 matches, solve for the affine transform that best maps the positions of the training points in the cluster to their corresponding query points (least-squares error over the transformed points).
- Remove outliers and repeat the fit.
- If the cluster still contains > 3 points at the end, count it as a positive identification. Add the solved transform to a list and remove the cluster's matches from future consideration.
- To do: Some type of probabilistic evaluation to check if a positive ID makes sense.
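The pipeline above can be sketched roughly as follows. This is a simplified stand-in, not the actual implementation: SciPy's cKDTree replaces the BBF search, the Hough binning is a coarse single-bin quantization (Lowe also votes into neighboring bins), and the descriptors, keypoint positions, and bin widths are all hypothetical inputs.

```python
import numpy as np
from scipy.spatial import cKDTree


def match_features(train_desc, query_desc, ratio=0.8):
    """Ratio-test matching: keep a query feature only when its nearest
    training neighbor is significantly closer than the second-nearest."""
    tree = cKDTree(train_desc)
    dist, idx = tree.query(query_desc, k=2)   # 2 nearest neighbors per query
    keep = dist[:, 0] < ratio * dist[:, 1]    # distinctiveness check
    return np.flatnonzero(keep), idx[keep, 0]


def pose_bin(obj_id, x, y, theta, scale, xy_bin=32.0, theta_bin=30.0):
    """Quantize one match's pose hypothesis into a Hough bin key; matches
    that agree on object and pose land in the same bin. (Bin widths are
    illustrative only.)"""
    return (obj_id,
            int(round(x / xy_bin)), int(round(y / xy_bin)),
            int(round(theta / theta_bin)) % int(360 / theta_bin),
            int(round(np.log2(scale))))


def fit_affine(train_xy, query_xy):
    """Least-squares affine transform mapping training points to query
    points: for each point, x' = a11*x + a12*y + tx, y' = a21*x + a22*y + ty."""
    n = len(train_xy)
    A = np.zeros((2 * n, 6))
    A[0::2, 0:2] = train_xy   # rows for the x' equations
    A[0::2, 4] = 1.0
    A[1::2, 2:4] = train_xy   # rows for the y' equations
    A[1::2, 5] = 1.0
    b = query_xy.reshape(-1)
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params             # [a11, a12, a21, a22, tx, ty]
```

Each cluster of binned matches would then be passed to `fit_affine`, outliers dropped against the solved transform, and the fit repeated, as in the steps above.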
Design Choices, Etc.
- Best-Bin-First (BBF) vs. exhaustive search: exhaustive search costs O(n) comparisons per query feature, where n is the number of features in the training set.
- A BBF search over the k-d tree of all training features scales much better.
- BBF is capped at 200 comparisons per feature, with good performance reported for databases of 40,000 points. Each training thumbnail contributes ~100 features, so nearest-neighbor matching should be close to independent of database size.
- At the moment, about half of the analysis time goes to finding the features in the frame; this cost grows quickly with image size.
- Runs in reasonable time at 640x480.
- At 1280x960, it takes over 5 seconds to compute the features.
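SciPy's cKDTree does not implement BBF, but its `eps` parameter offers an analogous speed-for-accuracy trade: with `eps > 0` the search prunes more branches, and the returned neighbor is guaranteed to be within a factor (1 + eps) of the true nearest distance. A rough sketch on mock descriptor data (the 40,000-point size mirrors the figure quoted above; the descriptors themselves are random):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
db = rng.standard_normal((40_000, 16))   # mock descriptor database
queries = rng.standard_normal((5, 16))

tree = cKDTree(db)
d_exact, _ = tree.query(queries, k=1)            # exact search
d_approx, _ = tree.query(queries, k=1, eps=0.5)  # approximate: prunes harder

# Guarantee: approximate distance <= (1 + eps) * true nearest distance.
assert np.all(d_approx <= 1.5 * d_exact + 1e-12)
```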
Results and Issues
- Pages with high text density are less reliably matched.
- In general, very few false positives.
- Camera quality is extremely important.
- Using a consumer DV camcorder, we can find matches when 2 pages occupy a frame. (see #Videos and Photos below).
- With good optics 6+ pages per frame should be possible at 640x480. (photo examples).
- Would a black-and-white camera produce better results?
Topics for discussion
- In Lowe paper: "Top-down matching to add additional points." Does top-down mean something special?
- Trade off between resolution / quality / desk coverage / speed
- Log post-processing: probably necessary, it seems; pages sometimes blink in and out.
- Other algorithms that might increase speed or accuracy, e.g. restricting processing to regions or frames of interest. Francois suggests optical flow or differencing, perhaps.
- Once we get a rig up and have more reliable matching: generate a text log and compare it to manual annotation to get hard numbers on recognition accuracy.
Videos and Photos
- Camcorder video with recognition annotation. http://www.cs.umd.edu/alandaluz/nchen/tracking/coffee-out.avi
- Webcam setup. Approximates the area we would like to track. Quite noisy. http://www.cs.umd.edu/alandaluz/nchen/tracking/fout.avi
- 4MP Still frame resized down to 640x480. Gives an idea about how the system will perform with a good camera.
- Camera rotation w/ moire pattern. (Full Quality DV 50MB) http://www.cs.umd.edu/alandaluz/nchen/tracking/camrotate.dv
- Link to SIFTGPU