Cvss summer2011

Schedule Summer 2011

Date          Speaker              Title
June 9        Vlad Morariu         Multi-Agent Event Recognition in Structured Scenarios
June 16       Ajay Mishra          A Vision System to Extract "Simple" Objects in a Purely Bottom-Up Fashion
June 23       (no meeting, CVPR)
June 30       Dikpal Reddy         Fast Imaging with Slow Cameras
July 7        Raghuraman Gopalan   Exploring Context in Unsupervised Object Identification Scenarios
July 14       Behjat Siddiquie     Utilizing Contextual Information for Scene Understanding and Image Retrieval
July 21       Kaushik Mitra        Robust Regression Using Sparse Learning
July 28       Zhuolin Jiang        Discriminative Dictionary Learning for Sparse Representation
August 4      Carlos Castillo      Dense Wide-Baseline Stereo Matching and its Application to Face Recognition
August 11     Qiang Qiu            Learning an Attribute Dictionary for Human Action Classification
August 18     Yezhou Yang          Corpus-Guided Sentence Generation of Natural Images
August 25     Nazre Batool         Random Field Models for Applications in Computer Vision
September 1   (no meeting)


Talk Abstracts Summer 2011

Multi-Agent Event Recognition in Structured Scenarios

Speaker: Vlad Morariu -- Date: June 9, 2011

I will present a framework for the automatic recognition of complex multi-agent events in settings where structure is imposed by rules that agents must follow while performing activities. Given semantic spatio-temporal descriptions of what generally happens (i.e., rules, event descriptions, physical constraints), and based on video analysis, the framework determines the events that occurred. Knowledge about spatio-temporal structure is encoded in first-order logic, using an approach based on Allen's Interval Logic, and robustness to low-level observation uncertainty is provided by Markov Logic Networks (MLN). The main contribution is that the framework integrates interval-based temporal reasoning with probabilistic logical inference, relying on an efficient bottom-up grounding scheme to avoid combinatorial explosion. Applied to one-on-one basketball, the framework detects and tracks players, their hands and feet, and the ball, generates event observations from the resulting trajectories, and performs probabilistic logical inference to determine the most consistent sequence of events.
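For readers unfamiliar with Allen's Interval Logic, the following Python sketch classifies the relation between two time intervals into one of the thirteen Allen relations. It is only an illustration of the interval vocabulary, not the framework from the talk: the function name `allen_relation`, the `(start, end)` tuple representation, and the example event times are all assumptions made here.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a with respect to interval b.

    Intervals are (start, end) tuples with start < end (an assumed encoding).
    """
    (a0, a1), (b0, b1) = a, b
    if a0 >= a1 or b0 >= b1:
        raise ValueError("intervals must satisfy start < end")
    # Disjoint or touching intervals.
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "meets"
    if b1 < a0:
        return "after"
    if b1 == a0:
        return "met-by"
    # From here on the two intervals share more than a single point.
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    return "overlaps" if a0 < b0 else "overlapped-by"


# Hypothetical frame ranges for two basketball events.
dribble = (0, 40)
shot = (40, 55)
print(allen_relation(dribble, shot))
```

A rule such as "a shot must be preceded by possession" could then be phrased in terms of these relation names; how such rules are weighted and grounded is specific to the Markov Logic Network formulation and is not shown here.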

A Vision System to Extract "Simple" Objects in a Purely Bottom-Up Fashion

Speaker: Ajay Mishra -- Date: June 16, 2011

Human perception, being active, is inextricably linked to visual fixation. Despite the obvious importance of fixation, it has not become an integral part of computer vision/robotics algorithms so far. To incorporate fixation and attention in a computer vision framework, we have proposed a new segmentation framework that takes a fixation point (i.e., a single point) inside a "simple" object as its input and outputs the region corresponding to that object. We have also designed a new attentional mechanism that utilizes the concept of neural border-ownership to automatically select the fixation points inside different "simple" objects in the scene. All of this together creates a fully automatic system that outputs only the regions corresponding to the "simple" objects without knowing the actual number or the size of the objects in the scene.

Using these regions, instead of rectangular patches of fixed sizes, to analyze the content of a scene will result in better performance (in terms of accuracy and robustness to noise) for high-level vision algorithms such as object recognition, object manipulation, and action analysis. A variety of experimental results will conclude the talk.

Also, to understand the role of fixation in perception, Ajay recommends taking the psychophysical test available at http://www.umiacs.umd.edu/~mishraka/fixationExperiment.php

Fast Imaging with Slow Cameras

Speaker: Dikpal Reddy -- Date: June 30, 2011

Over the years, the spatial resolution of cameras has steadily increased but the temporal resolution has remained the same. In this talk, I will present my work on converting a regular slow camera into a faster one. We capture and accurately reconstruct fast events using our slower prototype camera by exploiting the temporal redundancy in videos. First, I will show how, by fluttering the shutter during the exposure duration of a slow 25fps camera, we can capture and reconstruct a fast periodic video at 2000fps. Next, I will present its generalization, where we show that per-pixel modulation during exposure, in combination with brightness constancy constraints, allows us to capture a broad class of motions at 200fps using a 25fps camera. In both these techniques we borrow ideas from compressive sensing theory for acquisition and recovery.
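As a rough illustration of the acquisition side only, a fluttered shutter can be modelled as summing the fast sub-frames during which the shutter is open into one slow-camera frame. The Python sketch below assumes this simple forward model; the function name `coded_exposure`, the binary code, and the toy two-pixel frames are hypothetical, and the reconstruction step is not shown. At the rates quoted above, one 25fps exposure would span 2000 / 25 = 80 sub-frames; the example uses 4 for brevity.

```python
def coded_exposure(frames, code):
    """Integrate fast sub-frames into one slow-camera frame.

    frames: list of sub-frames, each a list of pixel intensities
    code:   list of 0/1 shutter states, one per sub-frame (assumed encoding)
    """
    if len(frames) != len(code):
        raise ValueError("need exactly one shutter state per sub-frame")
    observed = [0.0] * len(frames[0])
    for frame, shutter_open in zip(frames, code):
        if shutter_open:
            # Light only accumulates on the sensor while the shutter is open.
            for i, value in enumerate(frame):
                observed[i] += value
    return observed


# Toy example: four sub-frames of a two-pixel scene.
frames = [[1, 0], [0, 1], [1, 1], [0, 0]]
code = [1, 0, 1, 1]
observed = coded_exposure(frames, code)
print(observed)
```

Recovering the individual sub-frames from `observed` is underdetermined without further assumptions, which is where priors such as periodicity or sparsity come in.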

Exploring Context in Unsupervised Object Identification Scenarios

Speaker: Raghuraman Gopalan -- Date: July 7, 2011

The utility of context for supervised object recognition has been well acknowledged since the early seventies, and has been practically demonstrated by many systems in the last few years. The goal of this talk is to understand the role of context in unsupervised pattern identification scenarios. We consider two problems: clustering a set of unlabelled data points using maximum-margin principles, and adapting a classifier trained on a specific domain to identify instances across novel domain-shifting transformations. For both, we propose contextual sources that provide pertinent information on the identity of the unlabelled data.

Utilizing Contextual Information for Scene Understanding and Image Retrieval

Speaker: Behjat Siddiquie -- Date: July 14, 2011

In many vision tasks, contextual information can often help disambiguate confusions arising from appearance information. In this talk, I will discuss two different works, which deal with effective utilization of contextual information to improve the performance of active learning for scene understanding and multi-attribute based image retrieval.

First, I will propose an active learning framework to simultaneously learn appearance and contextual models for scene understanding tasks (multi-class classification). Current multi-class active learning approaches ignore the contextual interactions between different regions of an image and the fact that knowing the label for one region provides information about the labels of other regions. We explicitly model the contextual interactions between regions and select the question which leads to the maximum reduction in the combined entropy of all the regions in the image (image entropy).

Next, I will present a novel approach for ranking and retrieval of images based on multi-attribute queries. Existing image retrieval methods train separate classifiers for each word and heuristically combine their outputs for retrieving multi-word queries. Moreover, these approaches ignore the interdependencies among the query words. In contrast, we propose a principled approach for multi-attribute retrieval which explicitly models the correlations that are present between the attributes. Given a multi-attribute query, we also utilize other attributes in the vocabulary which are not present in the query, for ranking/retrieval.

Robust Regression Using Sparse Learning

Speaker: Kaushik Mitra -- Date: July 21, 2011

Robust regression is a combinatorial optimization problem. Hence, algorithms such as RANSAC and least median of squares (LMedS), which are successful in solving low-dimensional problems, cannot be used for solving high-dimensional problems. We show that under certain conditions the robust linear regression problem can be solved accurately using polynomial-time algorithms such as a modified version of basis pursuit and a sparse Bayesian algorithm. We then extend our robust formulation to the case of kernel regression, specifically to propose a robust version of relevance vector machine (RVM) regression.

Discriminative Dictionary Learning for Sparse Representation

Speaker: Zhuolin Jiang -- Date: July 28, 2011

Sparse coding approximates an input signal by a sparse linear combination of items from an over-complete dictionary. The sparse coding-based approaches lead to state-of-the-art results for many signal or image processing tasks and advances in computer vision tasks such as object recognition. However, the performance of sparse coding relies on the quality of the dictionary. How to design or learn the best dictionary adapted to natural signals has been the topic of much research in the past. In this talk I will first introduce some recent techniques that learn the dictionary from training data. Next I will present a label consistent K-SVD (LC-KSVD) algorithm to learn a discriminative dictionary for sparse representation. It yields dictionaries so that feature points with the same class labels have similar sparse codes.
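To make "a sparse linear combination of items from an over-complete dictionary" concrete, here is a minimal Python sketch of plain matching pursuit, a standard greedy sparse coding method. It is not K-SVD or LC-KSVD and does not learn the dictionary; the function name `matching_pursuit`, the toy three-atom dictionary in two dimensions, and the stopping tolerance are assumptions for illustration. Atoms are assumed to have unit norm.

```python
import math


def matching_pursuit(signal, dictionary, n_nonzero):
    """Greedily approximate `signal` with at most `n_nonzero` atom selections.

    dictionary: list of unit-norm atoms, each the same length as `signal`.
    Returns (codes, residual), where codes[j] is the coefficient of atom j.
    """
    residual = list(signal)
    codes = [0.0] * len(dictionary)
    for _ in range(n_nonzero):
        # Correlate the current residual with every atom.
        correlations = [
            sum(r * d for r, d in zip(residual, atom)) for atom in dictionary
        ]
        best = max(range(len(dictionary)), key=lambda j: abs(correlations[j]))
        if abs(correlations[best]) < 1e-12:
            break  # residual is (numerically) orthogonal to all atoms
        # Add the best atom's contribution and remove it from the residual.
        codes[best] += correlations[best]
        residual = [
            r - correlations[best] * d
            for r, d in zip(residual, dictionary[best])
        ]
    return codes, residual


# Over-complete toy dictionary: three unit-norm atoms in two dimensions.
s = 1.0 / math.sqrt(2.0)
dictionary = [[1.0, 0.0], [0.0, 1.0], [s, s]]
codes, residual = matching_pursuit([2.0, 2.0], dictionary, n_nonzero=2)
print(codes, residual)
```

In this toy case the diagonal atom alone explains the signal, so only one coefficient is non-zero. A discriminative method would additionally shape the dictionary so that signals from the same class receive similar codes.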

Dense Wide-Baseline Stereo Matching and its Application to Face Recognition

Speaker: Carlos Castillo -- Date: August 4, 2011

We study the problem of dense wide baseline stereo with varying illumination. We are motivated by the problem of face recognition across pose. Stereo matching allows us to compare face images based on physically valid, dense correspondences. We show that the stereo matching cost provides a very robust measure of similarity of faces that is insensitive to pose variations. We build on the observation that most illumination insensitive local comparisons require the use of relatively large windows. The size of these windows is affected by foreshortening. If we do not account for this effect, we incur misalignments that are systematic and significant and are exacerbated by wide baseline conditions.

We present a general formulation of dense wide baseline stereo with varying illumination and provide two methods to solve it. The first method is based on dynamic programming (DP) and fully accounts for the effect of slant. The second method is based on graph cuts (GC) and fully accounts for the effect of slant and tilt. The GC method finds a global solution using the unary function from the general formulation and a novel smoothness term that encodes surface orientation.

Our experiments show that the DP dense wide baseline stereo method outperforms existing methods in face recognition across pose. The experiments with the GC method show that accounting for slant and tilt can improve performance in situations with wide baselines and lighting variation. Our formulation can be applied to other, more sophisticated window-based image comparison methods for stereo.

Learning an Attribute Dictionary for Human Action Classification

Speaker: Qiang Qiu -- Date: August 11, 2011

We present an approach for dictionary learning of action attributes via information maximization. We unify the class distribution and appearance information into an objective function for learning a sparse dictionary of action attributes. The objective function maximizes the mutual information between what has been learned and what remains to be learned in terms of appearance information and class distribution for each dictionary item. We propose a Gaussian Process (GP) model for sparse representation to optimize the dictionary objective function. The sparse coding property allows a kernel with a compact support in GP to realize a very efficient dictionary learning process. Hence we can describe an action video by a set of compact and discriminative action attributes. More importantly, we can recognize modeled action categories in a sparse feature space, which can be generalized to unseen and unmodeled action categories. Experimental results demonstrate the effectiveness of our approach in action recognition applications.

Corpus-Guided Sentence Generation of Natural Images

Speaker: Yezhou Yang -- Date: August 18, 2011

We propose a sentence generation strategy that describes images by predicting the most likely nouns, verbs, scenes and prepositions that make up the core sentence structure. The inputs are initial noisy estimates of the objects and scenes detected in the image using state-of-the-art trained detectors. As predicting actions from still images directly is unreliable, we use a language model trained on the English Gigaword corpus to obtain their estimates, together with probabilities of co-located nouns, scenes and prepositions. We use these estimates as parameters of an HMM that models the sentence generation process, with hidden nodes as sentence components and image detections as the emissions. Experimental results show that our strategy of combining vision and language produces readable and descriptive sentences compared to naive strategies that use vision alone.

Random Field Models for Applications in Computer Vision

Speaker: Nazre Batool -- Date: August 25, 2011

This talk will present a brief overview of random field models for computer vision. Markov Random Field (MRF) models have been the most popular class of models for computer vision applications. Recently, a new class of models, Conditional Random Fields (CRF), has been introduced. Although CRFs were first introduced for labeling 1D sequences, they have also been applied to 2D images for applications such as labeling and object recognition. Another model, the Discriminative Random Field (DRF) model, inspired by CRF, has been applied successfully for image denoising and labeling. In this talk, the key differences between MRF and CRF/DRF will be highlighted. The main difference between the two classes of models can be best understood on the basis of generative vs. discriminative probabilistic models based on graphs. Hence, graphical models will also be briefly discussed in the talk.