Motionlets: Mid-Level 3D Parts for Human Motion Recognition


Limin Wang, Yu Qiao, and Xiaoou Tang

Abstract

This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.

Method

As shown in the figure above, the whole process consists of three steps: 1) extracting motion salient regions, 2) finding motionlet candidates, and 3) ranking motionlets.

  • Extraction of Motion Salient Regions: we extract 3D video regions with high motion saliency as seeds for constructing motionlets. Specifically, we use spatiotemporal orientation energy (SOE) for motion saliency detection and low-level motion description.

  • Finding Motionlet Candidates: we identify representative regions among all 3D regions using a clustering method. We first group the 3D regions according to their spatial sizes. Then, for each group, we cluster the 3D regions according to motion and appearance information.

  • Ranking Motionlets: our goal is to find a subset of motionlets satisfying two requirements: the sum of representative and discriminative power should be as large as possible, and the coverage percentage of training samples should be as high as possible. We design a greedy algorithm to select motionlets sequentially.

  • For video representation, we resort to a scanning-pooling scheme. For each motionlet, we conduct window scanning over the video data and use max pooling to obtain its maximum response. We call this representation the motionlet activation vector.
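The scanning-pooling representation and the greedy ranking step above can be sketched roughly as follows. This is a minimal NumPy illustration, not the released code: the function names, the plain correlation response, and the score-plus-coverage gain are assumptions made for the sake of the example.

```python
import numpy as np

def motionlet_activation_vector(video_feat, motionlets, step=4):
    """Scan each motionlet template over a video feature volume of
    shape (T, H, W, D) and max-pool its responses into one activation
    per motionlet. A plain correlation response is assumed here."""
    T, H, W, D = video_feat.shape
    mav = np.full(len(motionlets), -np.inf)
    for i, tpl in enumerate(motionlets):
        t, h, w, d = tpl.shape
        for t0 in range(0, T - t + 1, step):
            for y0 in range(0, H - h + 1, step):
                for x0 in range(0, W - w + 1, step):
                    patch = video_feat[t0:t0 + t, y0:y0 + h, x0:x0 + w]
                    # max pooling over all window positions
                    mav[i] = max(mav[i], float((patch * tpl).sum()))
    return mav

def greedy_select(scores, coverage, k):
    """Greedily pick k candidates, each time maximizing a gain that adds
    the candidate's representative-discriminative score to the number of
    newly covered training samples (an assumed trade-off).
    scores: (n_cand,) floats; coverage: (n_cand, n_samples) booleans."""
    chosen = []
    covered = np.zeros(coverage.shape[1], dtype=bool)
    for _ in range(k):
        gains = scores + (coverage & ~covered).sum(axis=1)
        if chosen:
            gains[chosen] = -np.inf  # never pick a candidate twice
        best = int(np.argmax(gains))
        chosen.append(best)
        covered |= coverage[best]
    return chosen
```

Stacking the activations of all selected motionlets yields the fixed-length motionlet activation vector used as the video-level representation.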

Results

  • Examples of motionlets

  • Recognition Results on Small-Scale Dataset:

  • Recognition Results on Large-Scale Datasets:

  • Combination with Other Representations and Varying Number of Motionlets

Downloads

We release the code for extracting low-level features from a video. It contains the spatiotemporal orientation energy (SOE), dense HOG, and HOE. Matlab code ([zip files (258 KB)])

References

L. Wang, Y. Qiao, and X. Tang, "Motionlets: Mid-Level 3D Parts for Human Motion Recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.