cvss:Spring2015

Schedule Spring 2015

All talks take place on Thursdays at 3:30pm in AVW 3450.

Date | Speaker | Title
February 19 | Bharat Singh | PSPGC: Part-Based Seeds for Parametric Graph-Cuts
February 26 | Jingjing Zheng | Submodular Attribute Selection for Action Recognition in Video
March 5 | | Snow Break
March 13 | Yezhou Yang | Grasp Type Revisited: A Modern Perspective on A Classical Feature for Vision and Robotics
March 20 | | Spring Break, no meeting
March 27 | Sravanthi Bondugula and Varun Manjunatha | SHOE: Supervised Hashing with Output Embeddings
April 3 | Bahadir Ozdemir | A Probabilistic Framework for Multimodal Retrieval using Integrative Indian Buffet Process
April 10 | Ching-Hui Chen | Matrix Completion for Resolving Label Ambiguity
April 17 | | ICCV deadline, no meeting
April 24 | Sameh Khamis | Learning an Efficient Model of Hand Shape Variation from Depth Images
May 1 | Joe Ng | Beyond Short Snippets: Deep Networks for Video Classification
May 8 | Ching Lik Teo | Fast 2D Border Ownership Assignment
May 15 | | Final Exam, no meeting

Talk Abstracts Spring 2015

PSPGC: Part-Based Seeds for Parametric Graph-Cuts

Speaker: Bharat Singh -- Date: February 19, 2015

Abstract: PSPGC is a detection-based parametric graph-cut method for accurate image segmentation. Experiments show that seed positioning plays an important role in graph-cut based methods, so we propose three seed generation strategies that incorporate information about the location and color of object parts, along with size and shape. Combined with low-level regular grid seeds, PSPGC can leverage both low-level and high-level cues about objects present in the image. Multiple parametric graph-cuts using these seeding strategies are solved to obtain a pool of segments that has a high rate of producing the ground truth segments. Experiments on the challenging PASCAL2010 and 2012 segmentation datasets show that the segmentation hypotheses generated by PSPGC outperform other state-of-the-art methods by up to 3.5% when measured by three different metrics (average overlap, recall and covering). We also obtain the best average overlap score in 15 out of 20 categories on PASCAL2010. Further, we provide a quantitative evaluation of the efficacy of each seed generation strategy introduced.

Submodular Attribute Selection for Action Recognition in Video

Speaker: Jingjing Zheng -- Date: February 26, 2015

Abstract: We present an approach to jointly learn a set of view-specific dictionaries and a common dictionary for cross-view action recognition. The set of view-specific dictionaries is learned for specific views while the common dictionary is shared across different views. Our approach represents videos in each view using both the corresponding view-specific dictionary and the common dictionary. More importantly, it encourages the set of videos taken from different views of the same action to have similar sparse representations. In this way, we can align view-specific features in the sparse feature spaces spanned by the view-specific dictionary set and transfer the view-shared features in the sparse feature space spanned by the common dictionary. Meanwhile, the incoherence between the common dictionary and the view-specific dictionary set enables us to exploit the discrimination information encoded in view-specific features and view-shared features separately. In addition, the learned common dictionary not only has the capability to represent actions from unseen views, but also makes our approach effective in a semi-supervised setting where no correspondence videos exist and only a few labels exist in the target view. Extensive experiments using the multi-view IXMAS dataset demonstrate that our approach outperforms many recent approaches for cross-view action recognition.

Grasp Type Revisited: A Modern Perspective on A Classical Feature for Vision and Robotics

Speaker: Yezhou Yang -- Date: March 13, 2015

Abstract: Our ability to interpret other people's actions hinges crucially on predictions about their intentionality. The grasp type provides crucial information about human action. However, recognizing the grasp type from unconstrained scenes is challenging because of the large variations in appearance, occlusions and geometric distortions. In this paper, we first present a convolutional neural network to classify functional hand grasp types. Experiments on a public static scene hand data set validate the good performance of the presented method. We then present two applications utilizing grasp type classification: (a) inference of human action intention and (b) fine level manipulation action segmentation. Experiments on both tasks demonstrate the usefulness of grasp type as a cognitive feature for computer vision. Furthermore, we will present a system that learns manipulation action plans by processing YouTube cooking instructional videos with the grasp type feature. Its goal is to robustly generate the sequence of atomic actions that make up the longer actions seen in a video, in order to acquire knowledge for robots and to guide them in executing the task.

Related Papers:

SHOE: Supervised Hashing with Output Embeddings

Speaker: Sravanthi Bondugula and Varun Manjunatha -- Date: March 27, 2015

Abstract: We present a supervised binary encoding scheme for image retrieval that learns projections by taking into account similarity between classes obtained from output embeddings. Our motivation is that binary hash codes learned in this way improve both the visual quality of retrieval results and existing supervised hashing schemes. We employ a sequential greedy optimization that learns relationship-aware projections by minimizing the difference between inner products of binary codes and output embedding vectors. We develop a joint optimization framework to learn projections which improve the accuracy of supervised hashing over the current state of the art with respect to standard and sibling evaluation metrics. We further boost performance by applying the supervised dimensionality reduction technique on kernelized input CNN features. Experiments are performed on three datasets: CUB-2011, SUN-Attribute and ImageNet ILSVRC 2010. As a by-product of our method, we show that using a simple k-NN pooling classifier with our discriminative codes improves over the complex classification models on fine-grained datasets like CUB and offers an impressive compression ratio of 1024 on CNN features.

Related paper: SHOE
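
As a rough illustration of how binary hash codes are used for retrieval in general, the sketch below encodes a feature vector with one bit per projection and ranks database items by Hamming distance. It is not the SHOE method: the function names and the sign-thresholding rule are assumptions made for this example, and the projections are supplied by the caller rather than learned from output embeddings.

```python
from typing import List, Sequence


def encode(x: Sequence[float], projections: Sequence[Sequence[float]]) -> List[int]:
    """Return one bit per projection: 1 if dot(w, x) > 0, else 0 (assumed rule)."""
    return [
        1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) > 0 else 0
        for w in projections
    ]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length binary codes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))


def rank_by_hamming(query: Sequence[int], database: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices ordered from nearest to farthest code."""
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
```

Because sorted is stable, items at the same Hamming distance keep their database order.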

A Probabilistic Framework for Multimodal Retrieval using Integrative Indian Buffet Process

Speaker: Bahadir Ozdemir -- Date: April 3, 2015

Abstract: Integrating information from multiple input sources is critical to achieving several key tasks in machine learning. Discovering hidden common causes that explain the dependency among modalities helps improve performance in these tasks compared to single-view approaches. We propose a multimodal retrieval procedure based on latent feature models. The procedure consists of a nonparametric Bayesian framework for learning underlying semantically meaningful abstract features in a multimodal dataset, a probabilistic retrieval model that allows cross-modal queries and an extension model for relevance feedback. Experiments on two multimodal datasets, PASCAL-Sentence and SUN-Attribute, demonstrate the effectiveness of the proposed retrieval procedure in comparison to the state-of-the-art algorithms for learning binary codes.

Related paper: A Probabilistic Framework for Multimodal Retrieval using Integrative Indian Buffet Process, NIPS 2014

Matrix Completion for Resolving Label Ambiguity

Speaker: Ching-Hui Chen -- Date: April 10, 2015

Abstract: In real applications, data is not always explicitly labeled. For instance, label ambiguity exists when we associate two persons appearing in a news photo with two names provided in the caption. We propose a matrix completion-based method that resolves this ambiguity by predicting the actual labels from the ambiguously labeled instances, so that a standard supervised classifier can learn from the disambiguated labels to classify new data. We further generalize the method to handle labeling constraints between instances when such prior knowledge is available. Compared to the state of the art, our proposed framework achieves a 2.9% improvement in labeling accuracy on the Lost dataset and comparable performance on the Labeled Yahoo! News dataset.

Related paper: Matrix Completion for Resolving Label Ambiguity, To appear in CVPR 2015

Learning an Efficient Model of Hand Shape Variation from Depth Images

Speaker: Sameh Khamis -- Date: April 24, 2015

Abstract: We describe how to learn a compact and efficient model of the surface deformation of human hands. The model is built from a set of noisy depth images of a diverse set of subjects performing different poses with their hands. We represent the observed surface using Loop subdivision of a control mesh that is deformed by our learned parametric shape and pose model. The model simultaneously accounts for variation in subject-specific shape and subject-agnostic pose. Specifically, hand shape is parameterized as a linear combination of a mean mesh in a neutral pose with a small number of offset vectors. This mesh is then articulated using standard linear blend skinning (LBS) to generate the control mesh of a subdivision surface. We define an energy that encourages each depth pixel to be explained by our model, and the use of a smooth subdivision surface allows us to optimize for all parameters jointly from a rough initialization. The efficacy of our method is demonstrated using both synthetic and real data, where it is shown that hand shape variation can be represented using only a small number of basis components. We compare with other approaches including PCA and show a substantial improvement in the representational power of our model, while maintaining the efficiency of a linear shape basis.

Link: Learning an Efficient Model of Hand Shape Variation from Depth Images (To appear in CVPR 2015)
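
The two modeling steps named in the abstract, a linear shape basis followed by linear blend skinning, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 3x4 [R | t] transform layout, and the plain-list mesh representation are all chosen for this example, and the subdivision surface and energy minimization are omitted.

```python
from typing import List, Sequence

Vec3 = List[float]
# Assumed layout: a bone transform is 3 rows of 4 numbers, [R | t].
Transform = Sequence[Sequence[float]]


def shape_mesh(mean: Sequence[Sequence[float]],
               offsets: Sequence[Sequence[Sequence[float]]],
               betas: Sequence[float]) -> List[Vec3]:
    """Neutral-pose mesh = mean mesh + sum over k of betas[k] * offsets[k]."""
    mesh = [list(vertex) for vertex in mean]
    for beta, basis in zip(betas, offsets):
        for i, delta in enumerate(basis):
            for axis in range(3):
                mesh[i][axis] += beta * delta[axis]
    return mesh


def apply_transform(t: Transform, v: Sequence[float]) -> Vec3:
    """Apply a 3x4 transform [R | t] to a 3D point."""
    return [sum(t[r][c] * v[c] for c in range(3)) + t[r][3] for r in range(3)]


def linear_blend_skinning(mesh: Sequence[Sequence[float]],
                          weights: Sequence[Sequence[float]],
                          bones: Sequence[Transform]) -> List[Vec3]:
    """Pose each vertex as the weighted blend of every bone's transform of it."""
    posed = []
    for vertex, vertex_weights in zip(mesh, weights):
        out = [0.0, 0.0, 0.0]
        for w, bone in zip(vertex_weights, bones):
            moved = apply_transform(bone, vertex)
            for axis in range(3):
                out[axis] += w * moved[axis]
        posed.append(out)
    return posed
```

Here betas holds the subject-specific shape coefficients and bones holds the pose, so the same skinning weights can be reused across subjects.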

Beyond Short Snippets: Deep Networks for Video Classification

Speaker: Joe Ng -- Date: May 1, 2015

Abstract: Convolutional neural networks (CNNs) have been extensively applied to image recognition problems, giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 dataset with (88.6% vs. 87.9%) and without additional optical flow information (82.6% vs. 72.8%).

Link: Preliminary Version (To appear in CVPR 2015)
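
To make the idea of temporal feature pooling concrete, the sketch below collapses a sequence of per-frame feature vectors into one video-level vector by taking an element-wise maximum over time. The abstract does not say which pooling operation or layer placement works best, so max pooling and the function name here are assumptions chosen for illustration.

```python
from typing import List, Sequence


def temporal_max_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Collapse per-frame feature vectors into one video-level vector.

    Takes the element-wise maximum over time, so the output length equals
    the per-frame feature dimension regardless of how many frames there are.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    dim = len(frame_features[0])
    if any(len(frame) != dim for frame in frame_features):
        raise ValueError("all frames must share the same feature dimension")
    return [max(frame[d] for frame in frame_features) for d in range(dim)]
```

Pooling of this kind discards frame order, which is the contrast with the second method in the abstract, where an LSTM models the frames as an ordered sequence.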

Fast 2D Border Ownership Assignment

Speaker: Ching Lik Teo -- Date: May 8, 2015

Abstract: A method for efficient border ownership assignment in 2D images is proposed. Leveraging recent advances using Structured Random Forests (SRF) for boundary detection, we impose a novel border ownership structure that detects both boundaries and border ownership at the same time. Key to this work are features that predict ownership cues from 2D images. To this end, we use several different local cues: shape, spectral properties of boundary patches, and semi-global grouping cues that are indicative of perceived depth. For shape, we use HoG-like descriptors that encode local curvature (convexity and concavity). For spectral properties, such as extremal edges (EE), we first learn an orthonormal basis spanned by the top K eigenvectors via PCA over common types of contour tokens, onto which we reproject the patches to extract the most important spectral features. For grouping, we introduce a novel mid-level descriptor that captures patterns near edges and indicates ownership information of the boundary. Experimental results over a subset of the Berkeley Segmentation Dataset (BSDS) and the NYU Depth V2 dataset show that our method’s performance exceeds current state-of-the-art multi-stage approaches that use more complex features.

Related paper: C.L. Teo, C. Fermüller, Y. Aloimonos. Fast 2D Border Ownership Assignment. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), to appear, 2015.