Event Recognition Using Object-Scene Convolutional Neural Networks


Limin Wang, Zhe Wang, Wenbin Du, and Yu Qiao

Abstract

Event recognition from still images is of great importance for image understanding. However, compared with event recognition in videos, there are far fewer studies on event recognition in images. This paper addresses the problem of event recognition from images and proposes an effective method based on deep neural networks. Specifically, we propose a new architecture, called Object-Scene Convolutional Neural Network (OS-CNN). This architecture is decomposed into an object net and a scene net, which extract useful information for event understanding from the perspective of objects and scene context, respectively. In addition, we investigate different network architectures for OS-CNN design, and adapt the deep (AlexNet) and very-deep (GoogLeNet) networks to the task of event recognition. Furthermore, we find that the deep and very-deep networks are complementary to each other. Finally, based on the proposed OS-CNN and a comparative study of different network architectures, we arrive at a five-stream CNN solution for the cultural event recognition track of the ChaLearn Looking at People (LAP) challenge 2015. Our method achieves a performance of 85.5% and ranks first in this challenge.

Method

We use two separate components for event recognition. The object stream, pre-trained on a large object dataset (ImageNet), carries information about the objects depicted in the image. The scene stream, pre-trained on a large scene dataset (Places), captures the scene context of the image.

  • Object Net: We first choose the Clarifai network architecture and use the pre-trained model from the VGG group. We then fine-tune the model parameters for the task of event recognition on the training dataset provided by the challenge organizers.

  • Scene Net: We first use the model pre-trained on the Places dataset, which adopts the well-known AlexNet architecture. As with the object net, we then fine-tune the model parameters on the training dataset from the cultural event recognition challenge.

  • Ensemble of Multiple CNNs: Several successful deep CNN architectures have been designed for the task of object recognition at the ImageNet Large Scale Visual Recognition Challenge. These architectures can be roughly classified into two categories: (i) deep CNNs, including AlexNet and Clarifai, and (ii) very-deep CNNs, including GoogLeNet and VGGNet. We exploit these very-deep networks in our proposed Object-Scene CNN architecture and aim to verify the superior performance of deeper structures.
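The way the object and scene streams are combined is not spelled out on this page. One common option is late fusion, where each stream's class scores are turned into probabilities and then averaged with per-stream weights. The sketch below illustrates that idea under this assumption; the function names (`softmax`, `fuse_streams`), the stream labels, the weights, and the toy scores are all hypothetical and are not taken from the released models.

```python
import math
from typing import Dict, List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(
    stream_logits: Dict[str, Sequence[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of per-stream softmax scores for one image.

    Assumption: late fusion by weighted averaging. The weights are
    hypothetical and would normally be tuned on validation data.
    """
    if not stream_logits:
        raise ValueError("need at least one stream")
    if weights is None:
        # Default to equal weighting of all streams.
        weights = {name: 1.0 for name in stream_logits}
    total_weight = sum(weights[name] for name in stream_logits)
    num_classes = len(next(iter(stream_logits.values())))
    fused = [0.0] * num_classes
    for name, logits in stream_logits.items():
        if len(logits) != num_classes:
            raise ValueError(f"stream {name!r} has a different class count")
        for k, prob in enumerate(softmax(logits)):
            fused[k] += weights[name] * prob / total_weight
    return fused


# Made-up scores for a 3-class toy problem (not data from the paper).
scores = {
    "object_net": [3.0, 0.5, 0.1],
    "scene_net": [0.3, 1.8, 0.2],
}
fused = fuse_streams(scores, weights={"object_net": 1.0, "scene_net": 1.0})
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because `fuse_streams` accepts any number of named streams, the same sketch extends to an ensemble with more than two networks by adding entries to the dictionary.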

Results

  • Effectiveness of OS-CNN


  • Evaluation of different architectures


  • Challenge approach and results

    Our challenge solution is a five-stream CNN whose streams are pre-trained on different datasets (ImageNet or Places) and use different network architectures. The challenge results are as follows:

Downloads

Models and code will be available soon.

References

L. Wang, Z. Wang, W. Du, and Y. Qiao, "Event Recognition Using Object-Scene Convolutional Neural Networks," in ChaLearn Looking at People (LAP) Workshop, CVPR, 2015.