==================== From Zhihu ====================
链接:http://www.zhihu.com/question/33272629/answer/60279003
Link: http://www.zhihu.com/question/33272629/answer/60279003
On action recognition in videos: I've been working on this recently. The field runs deep, but the mainstream really comes down to a few tricks, so let me summarize the main ones for video. Before deep learning, the best-performing method was the INRIA group's Improved Dense Trajectories (IDT) with Fisher vector encoding (paper and code available; pretty much everything from INRIA works well). On the deep learning side, the representative work is the VGG group's two-stream network. Its results are actually not much better than IDT's, and many people complain that the reported numbers are hard to reproduce; it took me quite a while of trying before I got comparable figures. Many improved methods build on these two lines of work. The current state of the art, an obvious combination, is Xiaoou Tang's group's IDT + two-stream. Google's LSTM + two-stream was very popular a while ago and still gets a lot of attention. Let me also plug zhongwen's paper. In the end you'll find that every paper has to compare against IDT.
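The two-stream idea above (one CNN on RGB frames, one on stacked optical flow, with the two streams' class scores fused at the end) can be sketched as late fusion of per-class softmax scores. A minimal sketch with hypothetical logits; the 3-class setup and equal weighting are illustrative, not taken from the paper:

```python
# Late fusion of the two streams: average the per-class softmax scores
# from the spatial (RGB) and temporal (optical-flow) networks.
# The logits below are hypothetical; in the paper each stream is a CNN.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_two_stream(rgb_logits, flow_logits, w_rgb=0.5):
    """Weighted average of the two streams' softmax outputs."""
    p_rgb = softmax(rgb_logits)
    p_flow = softmax(flow_logits)
    return [w_rgb * a + (1 - w_rgb) * b for a, b in zip(p_rgb, p_flow)]

# Hypothetical 3-class logits for one video clip.
scores = fuse_two_stream([2.0, 0.5, 0.1], [1.5, 2.5, 0.0])
predicted = max(range(len(scores)), key=scores.__getitem__)
```

Averaging scores (rather than features) is why the two streams can be trained independently.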
==================== Also from Zhihu ====================
链接:http://www.zhihu.com/question/33272629/answer/60163859
Link: http://www.zhihu.com/question/33272629/answer/60163859
I'm not familiar with the video side, but I can talk about action recognition in still images.
[1] Action Recognition from a Distributed Representation of Pose and Appearance, CVPR, 2010
[2] Combining Randomization and Discrimination for Fine-Grained Image Categorization, CVPR, 2011
[3] Object and Action Classification with Latent Variables, BMVC, 2011
[4] Human Action Recognition by Learning Bases of Action Attributes and Parts, ICCV, 2011
[5] Learning person-object interactions for action recognition in still images, NIPS, 2011
[6] Weakly Supervised Learning of Interactions between Humans and Objects, PAMI, 2012
[7] Discriminative Spatial Saliency for Image Classification, CVPR, 2012
[8] Expanded Parts Model for Human Attribute and Action Recognition in Still Images, CVPR, 2013
[9] Coloring Action Recognition in Still Images, IJCV, 2013
[10] Semantic Pyramids for Gender and Action Recognition, TIP, 2014
[11] Actions and Attributes from Wholes and Parts, arXiv, 2015
[12] Contextual Action Recognition with R*CNN, arXiv, 2015
[13] Recognizing Actions Through Action-Specific Person Detection, TIP, 2015
I haven't read work from before 2010. In the years around 2010 (mainly 2011-2012) there were three main lines of thought: (1) use the object being interacted with as a cue (person-object interaction) and model that interaction, as in [5, 6]; (2) build models of pose, classifying by the statistics of poses (or, more generally, parts), as in [1, 4] (poselets, not listed above, are also widely used here); (3) find discriminative regions and suppress meaningless ones, as in [2, 7]; [10] and [11] also use this idea. [9, 10] exploit a feature other than SIFT, color names, and describe how to fuse several different features for action classification. [12] explores how to use context (since in action classification the person's bounding box is given). The newest work replaces SIFT features with CNN features ([11, 12, 13]); result-wise, [12] is the most recent. Work on still images is mostly classification; detection papers are rarer, though [4] and [13] both include detection work. Perhaps before 2015 classification results were not promising enough. Classification mAP on PASCAL VOC 2012 has now reached 89%, so attention will probably shift more toward detection.
[1] (plenty of substantive material, worth browsing)
[2]
- Tian, YingLi, et al. "Hierarchical filtered motion for action recognition in crowded videos." Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 42.3 (2012): 313-323.
- A new 3D interest point detector based on 2D Harris corners and the Motion History Image (MHI). Essentially, 2D Harris points with recent motion are selected as interest points.
- New descriptors based on HOG computed over image intensity and over the MHI. Filtering is performed to remove cluttered motion and to normalize the descriptors.
- KTH and MSR Action dataset
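The MHI used by this detector follows the standard Bobick-Davis update rule: a pixel seen moving in the current frame is set to the maximal duration tau, and otherwise decays by one per frame, so large MHI values flag recent motion. A minimal sketch on a toy grid (the grid and tau are made up):

```python
# Motion History Image (MHI) update rule: a pixel that moved in the
# current frame is set to the duration tau; otherwise its value decays
# by one per frame. The note's detector keeps 2D Harris corners whose
# MHI value indicates recent motion.
def update_mhi(mhi, motion_mask, tau=10):
    """One MHI timestep. mhi and motion_mask are 2D lists of equal shape."""
    return [[tau if moving else max(0, old - 1)
             for old, moving in zip(row_m, row_d)]
            for row_m, row_d in zip(mhi, motion_mask)]

# Two frames of motion on a 2x3 grid: pixel (0,0) moves in frame 1 only,
# pixel (1,1) moves in frame 2 only.
mhi = [[0, 0, 0], [0, 0, 0]]
mhi = update_mhi(mhi, [[1, 0, 0], [0, 0, 0]], tau=3)   # (0,0) -> 3
mhi = update_mhi(mhi, [[0, 0, 0], [0, 1, 0]], tau=3)   # (0,0) decays to 2
```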
- Yuan, Junsong, Zicheng Liu, and Ying Wu. "Discriminative subvolume search for efficient action detection." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
- A discriminative matching technique based on mutual information and a nearest-neighbor algorithm
- A tighter upper bound for branch-and-bound, used to locate the matched action that maximizes mutual information
- The key idea is to decompose the search space into its spatial and temporal parts.
- Lampert, Christoph H., Matthew B. Blaschko, and Thomas Hofmann. "Beyond sliding windows: Object localization by efficient subwindow search." Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
- Code online: (Efficient Subwindow Search)
- Reduces the complexity of the sliding-window search from O(n^4) to O(n^2) on average
- Branch-and-bound technique
- Relies on a bounding function that gives an upper bound of the scoring function over a set of candidate boxes
- Works well with linear classifiers and BoW features.
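A minimal sketch of the branch-and-bound idea, assuming the linear-classifier + BoW case the paper highlights: each feature point carries a classifier weight, a box's score is the sum of weights inside it, and a set of boxes (intervals over left/top/right/bottom) is bounded from above by the positive weights inside the largest box plus the negative weights inside the smallest. The points and weights below are made up:

```python
# Efficient Subwindow Search sketch: best-first branch-and-bound over
# SETS of rectangles, each set given by intervals for (l, t, r, b).
import heapq

# Hypothetical (x, y, weight) points; a box's score sums weights inside it.
points = [(2, 2, 1.0), (3, 3, 1.0), (8, 8, 1.0), (5, 5, -2.0)]

def score(l, t, r, b):
    return sum(w for x, y, w in points if l <= x <= r and t <= y <= b)

def bound(L, T, R, B):
    """Upper bound over the box set: positives in the largest box in the
    set plus negatives in the smallest (every box in the set contains the
    smallest and is contained in the largest)."""
    big = (L[0], T[0], R[1], B[1])
    small = (L[1], T[1], R[0], B[0])
    pos = sum(w for x, y, w in points
              if w > 0 and big[0] <= x <= big[2] and big[1] <= y <= big[3])
    neg = 0.0
    if small[0] <= small[2] and small[1] <= small[3]:
        neg = sum(w for x, y, w in points
                  if w < 0 and small[0] <= x <= small[2] and small[1] <= y <= small[3])
    return pos + neg

def ess(size=10):
    full = (0, size)
    heap = [(-bound(full, full, full, full), (full, full, full, full))]
    while heap:
        _, ivals = heapq.heappop(heap)
        if all(lo == hi for lo, hi in ivals):      # a single box: bound is exact
            L, T, R, B = ivals
            return (L[0], T[0], R[0], B[0]), score(L[0], T[0], R[0], B[0])
        # Split the widest interval in half and push both children.
        i, (lo, hi) = max(enumerate(ivals), key=lambda p: p[1][1] - p[1][0])
        mid = (lo + hi) // 2
        for half in ((lo, mid), (mid + 1, hi)):
            cand = ivals[:i] + (half,) + ivals[i + 1:]
            heapq.heappush(heap, (-bound(*cand), cand))
    return None, 0.0

box, best = ess()
```

Because the bound is admissible and exact on singleton sets, the first single box popped from the priority queue is globally optimal; here the best box captures the two nearby positive points while excluding the negative one.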
- Li, Li-Jia, et al. "Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification." NIPS. Vol. 2. No. 3. 2010.
- Images are represented as a scale-invariant map of object detector response
- Detectors are applied to novel images in multiple scales. At each scale, a 3 level spatial pyramid is applied. Responses are concatenated to form the descriptors for the image.
- 200 objects are selected from a pool of 1000 objects
- Evaluated on a scene classification task
- L1- and L1/L2-regularized logistic regression are applied to induce sparsity. For the L1/L2 group sparsity, a group is defined for each object, giving object-level sparsity. Bear in mind that each object contributes multiple entries to the descriptor. (Marginal improvements.)
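The L1/L2 ("group lasso") penalty behind the object-level sparsity can be written as Omega(w) = lambda * sum over groups g of ||w_g||_2, with one group per object so that all of an object's pyramid entries are switched off together. A toy sketch (the weights and group layout are hypothetical):

```python
# Group-lasso penalty: one group per object, covering all of that
# object's descriptor entries, so whole objects are zeroed out together.
import math

def group_l1l2(w, groups, lam=1.0):
    """groups maps an object name to the index list of its entries in w."""
    return lam * sum(math.sqrt(sum(w[i] ** 2 for i in idx))
                     for idx in groups.values())

# Toy 4-entry descriptor weight vector: two entries per object.
w = [3.0, 4.0, 0.0, 0.0]
groups = {"dog": [0, 1], "car": [2, 3]}
penalty = group_l1l2(w, groups)   # ||(3,4)||_2 = 5, ||(0,0)||_2 = 0
```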
- Wang, Heng, et al. "Dense trajectories and motion boundary descriptors for action recognition." International journal of computer vision 103.1 (2013): 60-79.
- Tracks densely sampled points to obtain trajectories, in contrast with local representations. Sampling is not fully dense: grid points are filtered by the minimum-eigenvalue criterion (Shi and Tomasi)
- Motion boundary descriptors (derivatives of the optical flow field) to overcome camera motion
- Code online:
- The optical flow field is median-filtered; the implementation is based on OpenCV
- Trajectory length is limited to overcome drift; static points and erroneous trajectories are filtered out
- Trajectory shape, HOG, HOF, and MBH descriptors computed along the trajectory
- KTH (94.2%), Youtube (84.1%), Hollywood2 (58.2%), UCF Sports (88.0%), IXMAS (93.5%), UIUC (98.4%), Olympic Sports (74.1%), UCF50 (84.5%), HMDB51 (46.6%)
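The tracking step amounts to P_{t+1} = P_t + (M * w)(P_t), where w is the dense optical flow field and M a median filter, with trajectory length capped (15 frames in the paper) to limit drift. A toy sketch with a synthetic constant flow field; the real pipeline computes flow with OpenCV rather than taking it as given:

```python
# Dense-trajectory tracking sketch: propagate a point through a sequence
# of (u, v) flow fields, smoothing each lookup with a 3x3 median filter.
def median3(field, x, y):
    """Median of a flow component over a 3x3 neighbourhood (clamped)."""
    h, w = len(field), len(field[0])
    vals = sorted(field[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                  for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return vals[4]

def track(x, y, flows_u, flows_v, max_len=15):
    """Follow a point through flow fields; cap length to limit drift."""
    traj = [(x, y)]
    for u, v in zip(flows_u, flows_v):
        xi, yi = int(round(x)), int(round(y))
        x = x + median3(u, xi, yi)
        y = y + median3(v, xi, yi)
        traj.append((x, y))
        if len(traj) > max_len:
            break
    return traj

# Synthetic flow: constant 1 px/frame rightward motion on a 5x5 grid.
u = [[1.0] * 5 for _ in range(5)]
v = [[0.0] * 5 for _ in range(5)]
traj = track(2.0, 2.0, [u] * 3, [v] * 3)
```

The trajectory-shape descriptor is then just the sequence of normalized displacements along `traj`.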
- Liang, Xiaodan, Liang Lin, and Liangliang Cao. "Learning latent spatio-temporal compositional model for human action recognition." Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013.
- Laptev STIP with HOF and HOG, with BOW quantization
- Leaf node for detecting action parts
- Or node to account for intra-class variability
- And node to aggregate action in a frame
- Root node to identify temporal composition
- Contextual interaction (connecting leaf nodes)
- Everything is formulated in a latent SVM framework and solved by CCCP
- Since a leaf node can move from one Or-node to another, a reconfiguration step rearranges the feature vector
- UCF Youtube and Olympic Sports dataset
- Sadanand, Sreemanananth, and Jason J. Corso. "Action bank: A high-level representation of activity in video." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
- 98.2% KTH, 95.0% UCF Sports, 57.9% UCF50, 26.9% HMDB51
- 205 video clips are used as templates to detect actions in novel videos.
- Detectors are sampled from multiple viewpoints and run at multiple scales
- Detector outputs are max-pooled over the spatio-temporal volume through various pooling units
- "Action spotting" is the name for the template detection scheme
- Code online:
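The pooling of a template detector's response volume can be sketched as max-pooling over octants of the space-time volume, in the spirit of a spatio-temporal pyramid; the single-level split below is a simplification of the paper's pooling units, and the toy volume is made up:

```python
# Action-bank pooling sketch: each template detector yields a 3D response
# volume indexed (t, y, x); split it into its 8 octants and take the max
# response in each, concatenating the maxima into a feature vector.
def max_pool_octants(vol):
    """Max over each of the 8 octants of a (t, y, x) response volume."""
    T, H, W = len(vol), len(vol[0]), len(vol[0][0])
    feats = []
    for ts, te in ((0, T // 2), (T // 2, T)):
        for ys, ye in ((0, H // 2), (H // 2, H)):
            for xs, xe in ((0, W // 2), (W // 2, W)):
                feats.append(max(vol[t][y][x]
                                 for t in range(ts, te)
                                 for y in range(ys, ye)
                                 for x in range(xs, xe)))
    return feats

# 2x2x2 toy volume: each octant is a single voxel, so pooling returns it.
vol = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
feats = max_pool_octants(vol)
```

Concatenating these pooled features over all 205 detectors gives the "action bank" representation fed to the classifier.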
- Liu, Jingen, Benjamin Kuipers, and Silvio Savarese. "Recognizing human actions by attributes." Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
- 22 manually selected action attributes as semantic representation
- Data Driven attributes as complementary information
- Attributes as latent variables, just like the parts in the DPM model
- Accounts for class matching, attribute matching, and attribute co-occurrence.
- STIPs from a 1D Gabor detector; gradient-based descriptors + BoW over the spatio-temporal volume
- UIUC dataset, KTH, Olympic Sports Dataset
- Niebles, Juan Carlos, Hongcheng Wang, and Li Fei-Fei. "Unsupervised learning of human action categories using spatial-temporal words." International Journal of Computer Vision 79.3 (2008): 299-318.
- Unsupervised video categorization using pLSA and LDA
- Action localization
- Laptev's STIPs are too sparse compared with Dollar's
- Simple gradient-based descriptors, with PCA applied to reduce dimensionality; relies on the codebook to deal with invariance
- K-means with a Euclidean distance metric
- pLSA or LDA on top of BoW (# topics equals the number of categories to be recognized)
- Each STIP is associated with a codeword, hence a topic distribution, so localization is trivial
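The BoW step common to these pipelines (k-means codebook, Euclidean assignment) reduces to nearest-codeword quantization plus a histogram over the video's descriptors. A toy sketch with a hypothetical 2-word codebook in 2-D:

```python
# BoW quantization: assign each descriptor to its nearest codeword
# (squared Euclidean distance) and count assignments into a histogram.
def quantize(desc, codebook):
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2
                                 for a, b in zip(desc, codebook[k])))

def bow_histogram(descs, codebook):
    hist = [0] * len(codebook)
    for d in descs:
        hist[quantize(d, codebook)] += 1
    return hist

codebook = [(0.0, 0.0), (1.0, 1.0)]            # hypothetical 2 codewords
descs = [(0.1, 0.2), (0.9, 1.1), (1.2, 0.8)]   # hypothetical STIP descriptors
hist = bow_histogram(descs, codebook)
```

pLSA/LDA then treats `hist` as a document's word counts, with one topic per action category.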
- Laptev, Ivan, et al. "Learning realistic human actions from movies." Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
- Annotates videos by aligning movie transcripts
- A movie dataset
- Space-Time interest points + HOG + HOF around a ST volume
- Spatio-temporal BoW. Given a video sequence, there are multiple ways to segment it, each of which is called a channel
- Multi-channel \chi^2 kernel classification, with channel selection by greedy shrinking
- KTH (91.8%) and Movie (18.2% ~ 53.3%) dataset
- STIP + HOG and HOF code:
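The multi-channel \chi^2 kernel is K(x, y) = exp(-sum_c D_c(x_c, y_c) / A_c), where D_c is the \chi^2 distance between the channel-c histograms and A_c a per-channel normalizer (the mean \chi^2 distance over training pairs). A sketch with hypothetical HOG/HOF histograms and normalizers:

```python
# Multi-channel chi^2 kernel: chi^2 distance per channel (e.g. HOG, HOF),
# normalized per channel and combined inside an exponential.
import math

def chi2(h1, h2):
    """chi^2 distance: 0.5 * sum (a-b)^2 / (a+b) over non-empty bins."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(h1, h2) if a + b > 0)

def multichannel_kernel(x, y, A):
    """x, y: dicts channel -> histogram; A: dict channel -> normalizer."""
    return math.exp(-sum(chi2(x[c], y[c]) / A[c] for c in A))

# Hypothetical 2-bin histograms for two videos; A_c = 1 for simplicity.
x = {"hog": [0.5, 0.5], "hof": [1.0, 0.0]}
y = {"hog": [0.5, 0.5], "hof": [0.0, 1.0]}
k = multichannel_kernel(x, y, {"hog": 1.0, "hof": 1.0})
```

The greedy channel selection mentioned above just drops channels from A while the validation score keeps improving.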
Links to Datasets:
- D. Weinland, R. Ronfard, E. Boyer
- M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri
- C. Schuldt, I. Laptev and B. Caputo
- Y. Shi, Y. Huang, D. Minnen, A. Bobick, I. Essa
- Y. Huang, I. Essa
Recent Action Recognition Papers:
- D. Weinland, R. Ronfard, E. Boyer (CVIU Nov./Dec. '06) 11 actors each performing 3 times 13 actions: Check Watch, Cross Arms, Scratch Head, Sit Down, Get Up, Turn Around, Walk, Wave, Punch, Kick, Point, Pick Up, Throw. Multiple views of 5 synchronized and calibrated cameras are provided.
- A. Yilmaz, M. Shah (ICCV '05) 18 Sequences, 8 Actions: 3 x Running, 3 x Bicycling, 3 x Sitting-down, 2 x Walking, 2 x Picking-up, 1 x Waving Hands, 1 x Forehand Stroke, 1 x Backhand Stroke
- Y. Sheikh, M. Shah (ICCV '05) 6 Actions: Sitting, Standing, Falling, Walking, Dancing, Running
- M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri (ICCV '05) 81 Sequences, 9 Actions, 9 People: Running, Walking, Bending, Jumping-Jack, Jumping-Forward-On-Two-Legs, Jumping-In-Place-On-Two-Legs, Galloping-Sideways, Waving-Two-Hands, Waving-One-Hand
- A. Yilmaz, M. Shah (CVPR '05) 28 Sequences, 12 Actions: 7 x Walking, 4 x Aerobics, 2 x Dancing, 2 x Sit-down, 2 x Stand-up, 2 x Kicking, 2 x Surrender, 2 x Hands-down, 2 x Tennis, 1 x Falling
- E. Shechtman, M. Irani (CVPR '05) Walking, Diving, Jumping, Waving Arms, Waving Hands, Ballet Figure, Water Fountain
- Y. Shi, Y. Huang, D. Minnen, A. Bobick, I. Essa (CVPR '04) Glucose Monitor Calibration
- C. Schuldt, I. Laptev and B. Caputo (ICPR '04) 6 Actions x 25 Subjects x 4 Scenarios
- V. Parameswaran, R. Chellappa (CVPR '03) 25 x Walk, 6 x Run, 18 x Sit-down
- D. Minnen, I. Essa, T. Starner (CVPR '03) Towers of Hanoi (only hands)
- A. Efros, A. Berg, G. Mori, J. Malik (ICCV '03) Soccer, Tennis, Ballet
[4]
[5]
[6]
Sample sequences for each action (DivX-compressed)
Action database in zip-archives (DivX-compressed)
Note: The database is publicly available for non-commercial use. Please refer to the publications below if you use this database in your publications.
Related publications:
- "Recognizing Human Actions: A Local SVM Approach", Christian Schuldt, Ivan Laptev and Barbara Caputo; in Proc. ICPR'04, Cambridge, UK.
- "Local Spatio-Temporal Image Features for Motion Interpretation", Ivan Laptev; PhD Thesis, 2004, Computational Vision and Active Perception Laboratory (CVAP), NADA, KTH, Stockholm
- "Local Descriptors for Spatio-Temporal Recognition", Ivan Laptev and Tony Lindeberg; ECCV Workshop "Spatial Coherence for Visual Motion Analysis"
- "Velocity adaptation of space-time interest points", Ivan Laptev and Tony Lindeberg; in Proc. ICPR'04, Cambridge, UK.
- "Space-Time Interest Points", I. Laptev and T. Lindeberg; in Proc. ICCV'03, Nice, France, pp. I:432-439.