MoFAP: A Multi-Level Representation for Action Recognition


Limin Wang, Yu Qiao, and Xiaoou Tang

Abstract

This paper proposes a multi-level video representation by stacking the activations of motion features, atoms, and phrases (MoFAP). Motion features refer to low-level local descriptors, while motion atoms and phrases can be viewed as mid-level "temporal parts". A motion atom is defined as an atomic part of an action and captures the motion information of a video on a short temporal scale. A motion phrase is a temporal composite of multiple motion atoms with an AND/OR structure; it further enhances the discriminative capacity of motion atoms by incorporating temporal structure on a longer temporal scale. Specifically, we first design a discriminative clustering method to automatically discover a set of representative motion atoms. Then, we mine effective motion phrases with high discriminative and representative capacity using a bottom-up construction algorithm. Based on these basic units of motion features, atoms, and phrases, we construct a MoFAP network by stacking them layer by layer. This network enables us to extract effective representations of video data at different levels and scales by conducting a pooling operation in each layer. The separate representations from motion features, motion atoms, and motion phrases are concatenated into a single representation, called the Activation of MoFAP. The effectiveness of this representation is demonstrated on four challenging datasets: Olympic Sports, UCF50, HMDB51, and UCF101. Experimental results show that our representation achieves state-of-the-art performance on these datasets.

Method

As shown in the figure above, the whole process consists of three steps: 1) extracting motion-salient regions, 2) finding motionlet candidates, and 3) ranking motionlets. Illustrative code sketches for each step follow the list below.

  • Extraction of Motion-Salient Regions: we extract 3D video regions with high motion saliency as seeds for constructing motionlets. Specifically, we use spatiotemporal orientation energy (SOE) for motion saliency detection and low-level motion description (see the saliency sketch after this list).

  • Finding Motionlet Candidates: we identify representative regions among all 3D regions using a clustering method. We first group the 3D regions according to their spatial sizes; then, within each group, we cluster the 3D regions according to motion and appearance information (see the clustering sketch below).

  • Ranking Motionlets: our goal is to find a subset of motionlets satisfying two requirements: (1) the sum of representative and discriminative power should be as large as possible, and (2) the coverage of training samples should be as high as possible. We design a greedy algorithm that selects motionlets sequentially (see the greedy-selection sketch below).

  • For video representation, we resort to a scanning-pooling scheme: for each motionlet, we conduct window scanning over the video and use max pooling to obtain its maximum response. We call the resulting representation the motionlet activation vector (see the pooling sketch below).
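The sketch below illustrates the saliency-seeding step from the first bullet. The original work uses spatiotemporal orientation energy (SOE) from steerable 3D filters; as a simplification, the code approximates motion energy with the squared temporal derivative of a spatially smoothed volume, so it is a proxy rather than the paper's exact filter bank.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_saliency(video, sigma_s=2.0, sigma_t=1.0):
    """Simplified motion-saliency proxy for a grayscale video volume.

    video: float array of shape (T, H, W).
    Returns a per-voxel saliency map of the same shape. NOTE: this is
    an assumption-laden stand-in for SOE, not the paper's filter bank.
    """
    smoothed = gaussian_filter(video, sigma=(sigma_t, sigma_s, sigma_s))
    dt = np.gradient(smoothed, axis=0)              # temporal derivative
    # squared temporal change, smoothed again to get a stable energy map
    return gaussian_filter(dt ** 2, sigma=(sigma_t, sigma_s, sigma_s))

def salient_region_mask(energy, thresh_quantile=0.9):
    """Binary mask of high-motion voxels used to seed 3D regions."""
    return energy > np.quantile(energy, thresh_quantile)
```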
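A minimal sketch of the candidate-finding step in the second bullet: regions are grouped by spatial size and then clustered on their motion/appearance descriptors. The data layout (one descriptor row per region) and the use of k-means are assumptions; the paper only specifies that clustering is performed per size group.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def find_motionlet_candidates(region_sizes, descriptors, n_clusters=10):
    """Group 3D regions by spatial size, then cluster within each group.

    region_sizes: list of (height, width) spatial sizes, one per region.
    descriptors: array (N, D) of motion/appearance features per region.
    Returns an array of cluster centres (candidate motionlets).
    """
    groups = defaultdict(list)
    for idx, size in enumerate(region_sizes):
        groups[size].append(idx)            # one group per spatial size

    candidates = []
    for indices in groups.values():
        feats = descriptors[indices]
        k = min(n_clusters, len(indices))   # never ask for more clusters
        km = KMeans(n_clusters=k, n_init=10).fit(feats)
        candidates.extend(km.cluster_centers_)
    return np.asarray(candidates)
```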
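A sketch of the greedy ranking step from the third bullet. The exact objective is not reproduced here; as one plausible criterion, each iteration picks the candidate whose score (representative plus discriminative power) weighted by the number of newly covered training samples is largest, stopping once the requested coverage is reached.

```python
import numpy as np

def rank_motionlets(scores, coverage, target_coverage=0.95):
    """Greedy selection of motionlets.

    scores: array (M,) -- representative plus discriminative power.
    coverage: boolean array (M, N) -- coverage[i, j] is True if
        motionlet i fires on training sample j.
    Returns indices of selected motionlets in selection order.
    """
    M, N = coverage.shape
    covered = np.zeros(N, dtype=bool)
    remaining = set(range(M))
    selected = []
    while remaining and covered.mean() < target_coverage:
        # gain = score weighted by the number of newly covered samples
        best = max(remaining,
                   key=lambda i: scores[i] * (coverage[i] & ~covered).sum())
        selected.append(best)
        covered |= coverage[best]
        remaining.remove(best)
    return selected
```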
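Finally, a sketch of the scanning-pooling scheme from the last bullet. The per-frame feature layout, the mean-pooled window descriptor, and the dot-product matching function are assumptions for illustration; the key point is the max pooling over scanned windows that yields the motionlet activation vector.

```python
import numpy as np

def motionlet_activation_vector(video_features, motionlets,
                                window=16, stride=8):
    """Scanning-pooling sketch producing the motionlet activation vector.

    video_features: array (T, D) of per-frame features.
    motionlets: array (K, D) of motionlet templates.
    Returns a (K,) vector: the maximum response of each motionlet
    over all scanned temporal windows.
    """
    T = len(video_features)
    starts = range(0, max(T - window, 0) + 1, stride)
    # mean-pool each temporal window into a single descriptor
    windows = np.stack([video_features[s:s + window].mean(axis=0)
                        for s in starts])
    responses = windows @ motionlets.T      # (num_windows, K) scores
    return responses.max(axis=0)            # max pooling over windows
```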