Special video classification based on multitask learning and multimodal feature fusion

WU Xiao-yu; GU Chao-nan; WANG Sheng-jin

doi:10.3788/ope.20202805.1177

Optics and Precision Engineering, 2020, 28(5): 1177. Published online: 2020-11-06


Special video classification based on multitask learning and multimodal feature fusion

WU Xiao-yu ¹,*, GU Chao-nan ¹, WANG Sheng-jin ²

Author affiliations

¹ School of Information and Communication Engineering, Communication University of China, Beijing 100024

² Department of Electronic Engineering, Tsinghua University, Beijing 100084

Cite this paper

WU Xiao-yu, GU Chao-nan, WANG Sheng-jin. Special video classification based on multitask learning and multimodal feature fusion[J]. Optics and Precision Engineering, 2020, 28(5): 1177. (in Chinese)
