
Speaker Identification Based on Multimodal Long Short-Term Memory with Depth-Gate



Abstract

To effectively fuse audio and visual features for speaker identification, a multimodal long short-term memory (LSTM) network with a depth gate is proposed. First, a multi-layer LSTM model is built for each feature modality, and a depth gate connects the memory cells of adjacent layers, strengthening the inter-layer connections and improving the classification performance of that modality on its own. At the same time, the relationships among the per-modality models are learned by sharing, across models, the weights that connect the hidden-layer outputs to each gate unit. Experimental results show that the proposed method effectively fuses audio and visual features, improves speaker identification accuracy, and exhibits a degree of robustness to interference.
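The depth-gate mechanism described above can be sketched as follows. This is a minimal single-step NumPy implementation in which the memory cell of layer l+1 additionally receives the same-time memory cell of layer l, scaled by a learned gate; the parameterization follows the depth-gated LSTM of Yao et al. (ref. 【9】) rather than the authors' exact multimodal model, and all class and variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DepthGatedLSTMCell:
    """Single-step LSTM cell with a depth gate: the cell state of layer l+1
    additionally receives the cell state of layer l, weighted by a learned
    gate d_t (a sketch after Yao et al., ref. 【9】)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        k = input_size + hidden_size
        # One weight matrix and bias per gate:
        # input i, forget f, output o, candidate g.
        self.W = {name: rng.standard_normal((hidden_size, k)) * 0.1
                  for name in "ifog"}
        self.b = {name: np.zeros(hidden_size) for name in "ifog"}
        # Depth-gate parameters: input weights plus peephole-style vectors
        # coupling this layer's previous cell and the lower layer's cell.
        self.W_d = rng.standard_normal((hidden_size, input_size)) * 0.1
        self.w_cd = np.zeros(hidden_size)   # couples c_{t-1} of this layer
        self.w_ld = np.zeros(hidden_size)   # couples c_t of the layer below
        self.b_d = np.zeros(hidden_size)

    def step(self, x, h_prev, c_prev, c_lower=None):
        """One time step. c_lower is the same-time cell state of the layer
        below (None at the bottom layer, which reduces to a plain LSTM)."""
        z = np.concatenate([x, h_prev])
        i = sigmoid(self.W["i"] @ z + self.b["i"])
        f = sigmoid(self.W["f"] @ z + self.b["f"])
        o = sigmoid(self.W["o"] @ z + self.b["o"])
        g = np.tanh(self.W["g"] @ z + self.b["g"])
        c = f * c_prev + i * g
        if c_lower is not None:
            # Depth gate: decides how much of the lower layer's memory
            # flows directly into this layer's memory cell.
            d = sigmoid(self.W_d @ x + self.w_cd * c_prev
                        + self.w_ld * c_lower + self.b_d)
            c = c + d * c_lower
        h = o * np.tanh(c)
        return h, c
```

In the multimodal setting, one stack of such cells would be built per feature modality (audio, visual), with the lower layer's cell state passed as `c_lower` to the layer above at each time step.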

Supplementary Information

CLC number: TP391

DOI: 10.3788/lop56.031007

Category: Image Processing

Funding: National Natural Science Foundation of China (61573168)

Received: 2018-06-13

Revised: 2018-08-21

Published online: 2018-08-31

Author Affiliations

Chen Huangkang: Key Laboratory of Advanced Process Control for Light Industry, Ministry of Education, Jiangnan University, Wuxi 214122, Jiangsu, China
Chen Ying: Key Laboratory of Advanced Process Control for Light Industry, Ministry of Education, Jiangnan University, Wuxi 214122, Jiangsu, China

Corresponding authors: Chen Huangkang (6161918009@vip.jiangnan.edu.cn); Chen Ying (chenying@jiangnan.edu.cn)

【1】Kanagasundaram A, Vogt R, Dean D, et al. I-vector based speaker recognition on short utterances[C]∥Proceedings of the 12th Annual Conference of the International Speech Communication Association (ISCA), 2011: 2341-2344.

【2】Matějka P, Glembek O, Castaldo F, et al. Full-covariance UBM and heavy-tailed PLDA in I-vector speaker verification[C]∥2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011: 4828-4831.

【3】Alam M R, Bennamoun M, Togneri R, et al. A confidence-based late fusion framework for audio-visual biometric identification[J]. Pattern Recognition Letters, 2015, 52: 65-71.

【4】Wu Z Y, Cai L H. Audio-visual bimodal speaker identification using dynamic Bayesian networks[J]. Journal of Computer Research and Development, 2006, 43(3): 470-475.

【5】Hu Y T, Ren J S, Dai J W, et al. Deep multimodal speaker naming[C]∥Proceedings of the 23rd ACM International Conference on Multimedia-MM′15, 2015: 1107-1110.

【6】Geng J J, Liu X, Cheung Y M. Audio-visual speaker recognition via multi-modal correlated neural networks[C]∥2016 IEEE/WIC/ACM International Conference on Web Intelligence Workshops (WIW), 2016: 123-128.

【7】Wen M F, Hu C, Liu W R. Heterogeneous multimodal object recognition method based on deep learning[J]. Journal of Central South University (Science and Technology), 2016, 47(5): 1580-1587.

【8】Ren J, Hu Y, Tai Y W, et al. Look, listen and learn: a multimodal LSTM for speaker identification[C]∥Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), 2016: 3581-3587.

【9】Yao K, Cohn T, Vylomova K, et al. Depth-gated recurrent neural networks[J]. arXiv: 1508.03790, 2015.

【10】Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.

【11】Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model[C]∥Proceedings of the 11th Annual Conference of the International Speech Communication Association (ISCA), 2010: 1045-1048.

【12】Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[J]. arXiv: 1409.3215v3, 2014.

【13】Kalchbrenner N, Danihelka I, Graves A. Grid long short-term memory[J]. arXiv: 1507.01526, 2015.

【14】Hinton G E, Srivastava N, Krizhevsky A, et al. Improving neural networks by preventing co-adaptation of feature detectors[J]. arXiv: 1207.0580, 2012.

【15】Li Y J, Huang J J, Wang H Y, et al. Study of emotion recognition based on fusion multi-modal bio-signal with SAE and LSTM recurrent neural network[J]. Journal on Communications, 2017, 38(12): 109-120.

【16】Liu Y H, Liu X, Fan W T, et al. Efficient audio-visual speaker recognition via deep heterogeneous feature fusion[C]∥Chinese Conference on Biometric Recognition. Springer, Cham, 2017: 575-583.

【17】Azab M, Wang M Z, Smith M, et al. Speaker naming in movies[C]∥Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, 2018: 2206-2216.

【18】Yang H X, Chen Y, Zhang F, et al. Face recognition based on improved gradient local binary pattern[J]. Laser & Optoelectronics Progress, 2018, 55(6): 061004.

Cite This Paper

Chen Huangkang, Chen Ying. Speaker Identification Based on Multimodal Long Short-Term Memory with Depth-Gate[J]. Laser & Optoelectronics Progress, 2019, 56(3): 031007.

