Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture
Document Type | Journal Article
Authors | Miao HR (缪浩然); Cheng GF (程高峰); Zhang PY (张鹏远); Yan YH (颜永红) |
Journal | IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication Date | 2020 |
Issue | 1 |
Pages | 1452 |
ISSN | 2329-9290 |
DOI | 10.1109/TASLP.2020.2987752 |
Abstract | Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architecture, which transcribes speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture, which utilizes the advantages of both CTC and attention. The hybrid CTC/attention ASR systems exhibit performance comparable to that of the conventional deep neural network (DNN)/hidden Markov model (HMM) ASR systems. However, how to deploy hybrid CTC/attention systems for online speech recognition is still a non-trivial problem. This article describes our proposed online hybrid CTC/attention end-to-end ASR architecture, which replaces all the offline components of the conventional CTC/attention ASR architecture with their corresponding streaming components. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve the training-and-decoding mismatch problem of sMoChA. Secondly, we propose the truncated CTC (T-CTC) prefix score to stream CTC prefix score calculation. Thirdly, we design the dynamic waiting joint decoding (DWJD) algorithm to dynamically collect the predictions of CTC and attention in an online manner. Finally, we use latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream the widely-used offline bidirectional encoder network. Experiments on the LibriSpeech English and HKUST Mandarin tasks demonstrate that, compared with the offline CTC/attention model, our proposed online CTC/attention model improves the real time factor in human-computer interaction services and maintains its performance with moderate degradation. To the best of our knowledge, this is the first work to provide a full-stack online solution for the CTC/attention end-to-end ASR architecture. |
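To illustrate the streaming-encoder idea the abstract mentions, the sketch below shows the chunking scheme that LC-BLSTM-style encoders rely on: the utterance is split into fixed-size chunks, and each chunk carries a few look-ahead frames whose outputs are discarded, so latency is bounded by the chunk size plus the right context. This is a minimal illustration under assumed names and parameters (`lc_chunks`, `chunk`, `right_ctx` are not from the paper), not the authors' implementation.

```python
def lc_chunks(frames, chunk, right_ctx):
    """Split a frame sequence into LC-BLSTM-style chunks (sketch).

    Each element of the result is (main, ctx):
      - main: frames whose encoder outputs would be kept, and
      - ctx:  right-context (look-ahead) frames processed only to give
              the backward LSTM limited future context, then discarded.
    Latency per chunk is thus bounded by chunk + right_ctx frames.
    """
    chunks = []
    t = 0
    while t < len(frames):
        main = frames[t:t + chunk]                      # outputs kept
        ctx = frames[t + chunk:t + chunk + right_ctx]   # look-ahead only
        chunks.append((main, ctx))
        t += chunk                                      # no overlap of kept frames
    return chunks


# Example: 10 frames, chunk size 4, right context 2.
print(lc_chunks(list(range(10)), chunk=4, right_ctx=2))
# → [([0, 1, 2, 3], [4, 5]), ([4, 5, 6, 7], [8, 9]), ([8, 9], [])]
```

In a real system each `(main, ctx)` pair would be fed through the BLSTM encoder as soon as its frames arrive, which is what bounds the encoder's latency relative to a fully bidirectional offline pass.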
Source URL | http://159.226.59.140/handle/311008/9487 |
Collection | Journal Papers by Year: 2020
Affiliation | Institute of Acoustics, Chinese Academy of Sciences
Recommended Citation (GB/T 7714) | 缪浩然;程高峰;张鹏远;颜永红. Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020(1): 1452.
APA | 缪浩然;程高峰;张鹏远;颜永红.(2020).Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture.IEEE/ACM Transactions on Audio, Speech, and Language Processing(1),1452. |
MLA | 缪浩然;程高峰;张鹏远;颜永红."Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture".IEEE/ACM Transactions on Audio, Speech, and Language Processing .1(2020):1452. |
Ingest method: OAI harvesting
Source: Institute of Acoustics