Abstract: The talk will briefly touch upon the multi-scale CNN of LeCun and Farabet for extracting pixel-wise features for semantic segmentation, and then move on to the work we did to enhance that model into a real-time, accurate pixel-wise labeling pipeline. I will describe a deep feed-forward neural network architecture for pixel-wise semantic scene labeling. It uses a novel recursive neural network architecture for context propagation, referred to as rCPN. The network first maps the local features into a semantic space and then aggregates the local information bottom-up into a global feature of the entire image. A top-down propagation of the aggregated information then enhances the contextual information of each local feature. As a result, information from every location in the image is propagated to every other location. Experimental results on the Stanford Background and SIFT Flow datasets show that the proposed method outperforms previous approaches in terms of accuracy. It is also orders of magnitude faster than previous methods, taking only 0.07 seconds on a GPU for pixel-wise labeling of a 256 by 256 image starting from raw RGB pixel values, given the super-pixel mask, which takes an additional 0.3 seconds to compute using an off-the-shelf implementation.
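
To make the bottom-up and top-down passes concrete, here is a minimal NumPy sketch of context propagation over a binary tree of local regions. It is an illustration under stated assumptions, not the rCPN implementation from the talk: the names (`W_sem`, `W_com`, `W_dec`, `W_lab`, `propagate`), the feature sizes, the `tanh` nonlinearity, the random untrained weights, and the encoding of the tree as a list of merges are all hypothetical choices made for this example.

```python
import numpy as np

# Illustrative sizes (assumptions, not values from the talk).
D_FEAT, D_SEM, N_CLASSES = 8, 4, 3

_rng = np.random.default_rng(0)

def _layer(d_in, d_out):
    """Random, untrained weight matrix standing in for a learned layer."""
    return _rng.standard_normal((d_out, d_in)) * 0.5

W_sem = _layer(D_FEAT, D_SEM)      # local feature -> semantic space
W_com = _layer(2 * D_SEM, D_SEM)   # two children -> parent (bottom-up)
W_dec = _layer(2 * D_SEM, D_SEM)   # (parent context, child) -> child context
W_lab = _layer(D_SEM, N_CLASSES)   # leaf context -> class scores

def propagate(features, merges):
    """Run one bottom-up and one top-down pass over a binary tree.

    features: array of shape (n_leaves, D_FEAT), one row per local region.
    merges:   list of (left, right) node indices; the k-th merge creates
              node n_leaves + k, and the last merge creates the root.
    Returns (labels, ctx): a label per leaf and the context of every node.
    """
    n_leaves = len(features)

    # Step 1: map each local feature into the semantic space.
    sem = [np.tanh(W_sem @ f) for f in features]

    # Step 2: bottom-up aggregation; the last node appended is the
    # global feature of the whole image.
    for left, right in merges:
        sem.append(np.tanh(W_com @ np.concatenate([sem[left], sem[right]])))

    # Step 3: top-down propagation; each child receives its parent's
    # context combined with its own semantic feature.
    ctx = [None] * len(sem)
    ctx[-1] = sem[-1]
    for k in range(len(merges) - 1, -1, -1):
        parent = n_leaves + k
        for child in merges[k]:
            ctx[child] = np.tanh(
                W_dec @ np.concatenate([ctx[parent], sem[child]])
            )

    # Step 4: label each leaf from its context-enhanced feature.
    scores = np.stack([W_lab @ ctx[i] for i in range(n_leaves)])
    return scores.argmax(axis=1), ctx
```

For example, with four leaves and `merges = [(0, 1), (2, 3), (4, 5)]`, the context of leaf 0 depends on the features of leaf 3 through the root, which is the sense in which information from every location reaches every other location. With untrained weights the predicted labels are of course meaningless; only the data flow is being illustrated.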