Actionness Estimation Using Hybrid Fully Convolutional Networks


Limin Wang, Yu Qiao, Xiaoou Tang, and Luc Van Gool

Abstract

Actionness was introduced to quantify the likelihood of containing a generic action instance at a specific location. Accurate and efficient estimation of actionness is important in video analysis and may benefit other relevant tasks such as action recognition and action detection. This paper presents a new deep architecture for actionness estimation, called hybrid fully convolutional network (H-FCN), which is composed of appearance FCN (A-FCN) and motion FCN (M-FCN). These two FCNs leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion, respectively. In addition, the fully convolutional nature of H-FCN allows it to efficiently process videos with arbitrary sizes. Experiments are conducted on the challenging datasets of Stanford40, UCF Sports, and JHMDB to verify the effectiveness of H-FCN on actionness estimation, which demonstrate that our method achieves superior performance to previous ones. Moreover, we apply the estimated actionness maps on action proposal generation and action detection. Our actionness maps advance the current state-of-the-art performance of these tasks substantially.

Method

As shown in the figure above, we propose the hybrid fully convolutional network (H-FCN) architecture for actionness estimation in videos. Based on the estimated actionness maps, we further develop an action proposal generation method and present a unified action detection framework.

  • Hybrid fully convolutional networks: H-FCN is composed of appearance fully convolutional network (A-FCN) and motion fully convolutional network (M-FCN). These two FCNs capture visual information for actionness estimation from the perspectives of static appearance and dynamic motion, respectively.

  • Actionness estimation: A-FCN takes a single RGB image as input, while M-FCN operates on two consecutive optical flow fields. To handle scale variations, we construct pyramid representations of RGB frames and stacked optical flow fields. The actionness maps from different scales are first up-sampled to the size of the original image and then averaged (a minimal sketch of this multi-scale inference is given after this list).

  • Action proposal generation: Actionness maps are generic visual cues and can be exploited for different vision problems. Here, we propose a sampling method that generates action proposals from the estimated actionness maps, sampling boxes according to their actionness scores and spatial overlaps (see the proposal-sampling sketch after this list).

  • Action detection: Following R-CNN, we train an action classifier with two-stream CNNs by cropping positive examples and mining negative examples. At test time, we directly use the output of the two-stream CNNs as the detection score for each action proposal (see the scoring sketch after this list).
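
The following Python sketch illustrates the multi-scale, two-stream inference described above. It is a minimal illustration, not the released implementation: a_fcn and m_fcn are placeholders for the trained appearance and motion networks (each mapping an input array to a 2-D actionness map), and the scale set and equal-weight fusion of the two streams are assumptions made for the example.

    import cv2
    import numpy as np

    def estimate_actionness(frame, flow, a_fcn, m_fcn,
                            scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
        # frame: HxWx3 RGB image; flow: HxWx4 stack of two consecutive
        # optical flow fields (x/y components of each field).
        h, w = frame.shape[:2]
        fused = np.zeros((h, w), np.float32)
        for s in scales:
            # Rescale both the RGB frame and the stacked flow fields.
            frame_s = cv2.resize(frame, None, fx=s, fy=s)
            flow_s = cv2.resize(flow, None, fx=s, fy=s)
            # Per-scale actionness from the appearance and motion streams,
            # fused with equal weights (an assumption of this sketch).
            score = 0.5 * (a_fcn(frame_s) + m_fcn(flow_s))
            # Up-sample each map back to the original size, then average.
            fused += cv2.resize(score, (w, h))
        return fused / len(scales)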
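
Likewise, a hedged sketch of proposal sampling, continuing the NumPy import from the sketch above: candidate windows are scored by their average actionness (computed in constant time with an integral image) and then greedily retained so that kept boxes stay high-scoring yet spatially diverse. The greedy IoU rule, the candidate source, and the thresholds are illustrative assumptions, not the paper's exact sampling scheme.

    def iou(a, b):
        # Intersection over union of two [x1, y1, x2, y2] boxes.
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def generate_proposals(actionness, candidates, max_keep=100, iou_thresh=0.7):
        # candidates: (N, 4) integer array of [x1, y1, x2, y2] boxes lying
        # inside the image, e.g. from sliding windows at multiple scales.
        # Integral image: average actionness inside any box in O(1).
        integral = np.pad(actionness, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

        def box_score(b):
            x1, y1, x2, y2 = b
            area = max((x2 - x1) * (y2 - y1), 1)
            return (integral[y2, x2] - integral[y1, x2]
                    - integral[y2, x1] + integral[y1, x1]) / area

        scores = np.array([box_score(b) for b in candidates])
        keep = []
        for i in np.argsort(-scores):  # highest actionness first
            # Keep a box only if it does not overlap kept boxes too much.
            if all(iou(candidates[i], candidates[j]) < iou_thresh for j in keep):
                keep.append(i)
            if len(keep) == max_keep:
                break
        return candidates[keep], scores[keep]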
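
Finally, a sketch of proposal scoring at detection time. spatial_cnn and temporal_cnn stand in for the two trained streams, each returning per-class softmax probabilities for a cropped proposal; averaging the streams with equal weights is a common fusion choice assumed here, not necessarily the exact weighting used in the paper.

    def detection_score(rgb_crop, flow_crop, spatial_cnn, temporal_cnn, cls):
        # Two-stream scoring of one cropped action proposal: the spatial
        # stream sees the RGB crop, the temporal stream the flow crop.
        p_spatial = spatial_cnn(rgb_crop)      # per-class probabilities
        p_temporal = temporal_cnn(flow_crop)   # per-class probabilities
        # Equal-weight fusion of the two streams (an assumption).
        return 0.5 * (p_spatial[cls] + p_temporal[cls])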

Results

  • Examples of Actionness Maps and Action Proposals

  • Evaluation on Actionness Estimation (UCF Sports, Stanford40, and JHMDB)

  • Evaluation on Action Proposal Generation (Stanford40 and JHMDB)

  • Evaluation on Action Detection (JHMDB)

Downloads

  • Code for H-FCN actionness estimation on GitHub [ Link ]

References

If you use our trained model or code, please cite the following paper:

Limin Wang, Yu Qiao, Xiaoou Tang, and Luc Van Gool. Actionness Estimation Using Hybrid Fully Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Last Updated on 20th July, 2016