Chinese Sign Language Recognition Based on Spatial-Temporal Attention Network

Abstract

Sign language recognition is widely used in communication between deaf-mute people and hearing people. Inadequate extraction of spatial-temporal features in sign language recognition tasks leads to low recognition rates. To address this problem, this paper proposes a novel sign language recognition model based on spatial-temporal attention, which learns more discriminative spatial-temporal features. Specifically, a spatial attention module based on a residual 3D convolutional neural network (Res3DCNN) is proposed to automatically focus on salient regions in space. Then, a temporal attention module based on convolutional long short-term memory (ConvLSTM) is introduced to measure the importance of video frames. The key idea of the proposed model is to focus on salient regions spatially and to select key frames automatically in time. Finally, experiments on the Chinese sign language (CSL) dataset verify the effectiveness of the proposed method.
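The two attention steps named in the abstract can be illustrated with a minimal sketch. It is not the paper's Res3DCNN or ConvLSTM modules: as an assumption, each attention score is a plain dot product with a hypothetical scoring vector (`w_spatial`, `w_temporal`), and the function names are invented for illustration. It shows only the general pattern of weighting spatial locations within each frame and then weighting the frames themselves.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(features, query):
    """Weight feature vectors by softmax(dot(feature, query)) and sum them.

    features: list of equal-length vectors (one per location or per frame)
    query:    hypothetical scoring vector (an assumption of this sketch)
    Returns (weights, pooled_vector).
    """
    weights = softmax([dot(f, query) for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(dim)]
    return weights, pooled


def spatial_temporal_attention(video, w_spatial, w_temporal):
    """video[t][p] is the feature vector at frame t and spatial location p."""
    spatial_maps = []   # one attention map over locations per frame
    frame_vectors = []  # one pooled feature vector per frame
    for frame in video:
        # Spatial attention: emphasize salient locations within the frame.
        weights, pooled = attend(frame, w_spatial)
        spatial_maps.append(weights)
        frame_vectors.append(pooled)
    # Temporal attention: emphasize the most informative frames.
    frame_weights, clip_vector = attend(frame_vectors, w_temporal)
    return spatial_maps, frame_weights, clip_vector
```

In this sketch, `frame_weights` plays the role of the frame-importance measure, and `clip_vector` is the attention-pooled feature that a classifier would take as input.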


LUO Yuan, LI Dan, ZHANG Yi. Chinese Sign Language Recognition Based on Spatial-Temporal Attention Network[J]. Semiconductor Optoelectronics, 2020, 41(3): 414.
