Laser & Optoelectronics Progress, Vol. 57, Issue 18: 181702 (2020)

Speaker-Dependent Speech Recognition Algorithm for Laparoscopic Supporter Control



Abstract

A long short-term memory (LSTM) recurrent neural network model fused with i-vector features is presented for the speech control of a laparoscopic supporter, enabling the recognition of short-duration, isolated-word commands from a specific doctor's speech with small training samples. The model uses an LSTM recurrent neural network as its base and Mel-frequency cepstral coefficients (MFCC) as the input feature parameters. The i-vector feature is fed into the network as deep input information and concatenated with the deep feature information after the LSTM layer to achieve parameter fusion. This enables accurate recognition of the operating surgeon's voice commands and rejection of voice commands from non-operating speakers, providing a safe and intelligent speech recognition scheme for laparoscopic operations. Experiments on a self-built speech database verify both the recognition performance of the proposed algorithm on speech within the training library and its rejection performance on speech outside the training library. The results show that, compared with dynamic time warping (DTW) and the Gaussian mixture model-hidden Markov model (GMM-HMM), the proposed model achieves a 99.6% recognition accuracy for the specific speaker's voice commands in the training library while maintaining a 0% false acceptance rate, with an average false acceptance rate of 2.5% for speech outside the training library, meeting the accuracy and safety requirements of laparoscopic supporter control.
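The fusion described above — splicing the speaker i-vector onto the deep feature produced after the LSTM layer, then classifying the fused vector — can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the single-layer LSTM, the weight shapes, and the `reject_threshold` posterior cut-off are hypothetical placeholders (the paper's network depth and rejection mechanism may differ).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(x_seq, params, hidden_size):
    """Run a single-layer LSTM over an MFCC frame sequence; return the last hidden state."""
    W, U, b = params  # input weights, recurrent weights, bias for all 4 gates stacked
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    for x in x_seq:
        z = W @ x + U @ h + b                 # pre-activations for all 4 gates
        i, f, g, o = np.split(z, 4)           # input, forget, cell, output gates
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
    return h

def classify_command(mfcc_seq, ivector, lstm_params, W_out, b_out,
                     hidden_size, reject_threshold=0.5):
    """Fuse the LSTM deep feature with the speaker i-vector, then classify.

    Returns the command index, or -1 (reject) when the top posterior falls
    below the threshold — one simple way to refuse out-of-set speech.
    """
    deep_feat = lstm_last_hidden(mfcc_seq, lstm_params, hidden_size)
    fused = np.concatenate([deep_feat, ivector])   # parameter fusion by concatenation
    logits = W_out @ fused + b_out
    probs = np.exp(logits - logits.max())          # numerically stable softmax
    probs /= probs.sum()
    top = int(np.argmax(probs))
    return top if probs[top] >= reject_threshold else -1
```

In this sketch the i-vector enters only at the fusion point after the LSTM layer, so the recurrent part models the command content while the i-vector contributes a fixed-length speaker representation to the final decision.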

Supplementary Information

CLC Number: TN912

DOI: 10.3788/LOP57.181702

Category: Medical Optics and Biotechnology

Received: 2020-02-05

Revised: 2020-03-19

Published Online: 2020-09-01

Author Affiliations

Ren Kailong: School of Precision Instrument and Opto-electronics Engineering, Tianjin University, Tianjin 300072, China
Wang Yi: School of Precision Instrument and Opto-electronics Engineering, Tianjin University, Tianjin 300072, China
Chen Xiaodong: School of Precision Instrument and Opto-electronics Engineering, Tianjin University, Tianjin 300072, China
Cai Huaiyu: School of Precision Instrument and Opto-electronics Engineering, Tianjin University, Tianjin 300072, China

Corresponding Author: Wang Yi (koala_wy@tju.edu.cn)


Cite This Paper

Ren Kailong, Wang Yi, Chen Xiaodong, Cai Huaiyu. Speaker-Dependent Speech Recognition Algorithm for Laparoscopic Supporter Control[J]. Laser & Optoelectronics Progress, 2020, 57(18): 181702
