| Humanoid robots will need to learn the actions that humans perform. They will need to recognize these actions when they see them and they will need to perform these actions themselves. In this presentation I will introduce a manipulation grammar to perform this learning task. Context-free grammars in linguistics provide a simple and precise mechanism for describing the methods by which phrases in some natural language are built from smaller blocks. Also, the basic recursive structure of natural languages is described exactly. Similarly, for manipulation actions, every complex activity is built from smaller blocks involving hands and their movements, as well as objects, tools and the monitoring of their state. Thus, interpreting a seen action is like understanding language, and executing an action from knowledge in memory is like producing language. Associated with the grammar, a parsing algorithm is proposed, which can be used bottom-up to interpret videos by dynamically creating a semantic tree structure, and top-down to create the motor commands for a robot to execute manipulation actions. Experiments on both tasks, i.e. a robot observing people performing manipulation actions, and a robot executing manipulation actions on a simulation platform, validate the proposed formalism. | | Humanoid robots will need to learn the actions that humans perform. They will need to recognize these actions when they see them and they will need to perform these actions themselves. In this presentation I will introduce a manipulation grammar to perform this learning task. Context-free grammars in linguistics provide a simple and precise mechanism for describing the methods by which phrases in some natural language are built from smaller blocks. Also, the basic recursive structure of natural languages is described exactly. Similarly, for manipulation actions, every complex activity is built from smaller blocks involving hands and their movements, as well as objects, tools and the monitoring of their state. Thus, interpreting a seen action is like understanding language, and executing an action from knowledge in memory is like producing language. Associated with the grammar, a parsing algorithm is proposed, which can be used bottom-up to interpret videos by dynamically creating a semantic tree structure, and top-down to create the motor commands for a robot to execute manipulation actions. Experiments on both tasks, i.e. a robot observing people performing manipulation actions, and a robot executing manipulation actions on a simulation platform, validate the proposed formalism. |
| + | We extend the classical notion of visual saliency to multi-image data collected using a stationary pan-tilt-zoom (PTZ) camera. We show why existing saliency methods are not effective for this type of data, and propose ray saliency: a modified notion of visual saliency that utilizes knowledge of the imaging process in order to appropriately incorporate the context provided by multiple images. We present a practical, mosaic-free method by which to quantify and calculate ray saliency, and demonstrate its usefulness on PTZ imagery. |