Chinese Academy of Sciences Institutional Repositories Grid
A Public Chinese Dataset for Language Model Adaptation

Document type: Journal article

Authors: Bai, Ye [1,2]; Yi, Jiangyan [1]; Tao, Jianhua [1,2,3]; Wen, Zhengqi [1]; Fan, Cunhang [1,2]
Journal: JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY
Publication date: 2019-10-16
Pages: 13
Keywords: Chinese dataset; Language model adaptation; Speech recognition; N-gram; RNNLM
ISSN: 1939-8018
DOI: 10.1007/s11265-019-01482-5
Corresponding author: Yi, Jiangyan (jiangyan.yi@nlpr.ia.ac.cn)
Abstract: A language model (LM) is an important part of a speech recognition system. LM performance degrades when the training data and test data come from different domains. Language model adaptation compensates for this mismatch. However, no public Chinese dataset exists for evaluating language model adaptation. In this paper, we present CLMAD, a public Chinese dataset for language model adaptation. The dataset covers four domains: sport, stock, fashion, and finance, and we evaluate the differences among them. We present baselines for two commonly used adaptation techniques: interpolation for n-gram LMs and fine-tuning for recurrent neural network language models (RNNLMs). For n-gram interpolation, the adapted model improves when the source and target domains are relatively similar, but interpolating LMs from very different domains yields no improvement. For RNNLMs, fine-tuning the whole network achieves a larger improvement than fine-tuning only the softmax layer or only the embedding layer. When the domain difference is large, the improvement of the adapted RNNLM is significant. We also provide speech recognition results on AISHELL-1 with LMs trained on CLMAD. CLMAD can be freely downloaded at http://www.openslr.org/55/.
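Funding project: National Key R&D Program of China [2017YFB1002802]
WOS research areas: Computer Science; Engineering
Language: English
WOS accession number: WOS:000490530600001
Publisher: SPRINGER
Funding agency: National Key R&D Program of China
Source URL: http://ir.ia.ac.cn/handle/173211/26603
Collection: National Laboratory of Pattern Recognition, Intelligent Interaction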
Corresponding author: Yi, Jiangyan
Author affiliations:
1. Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
2. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
3. Chinese Acad Sci, CAS Ctr Excellence Brain Sci & Intelligence Techn, Inst Automat, Beijing, Peoples R China
Recommended citation:
GB/T 7714: Bai, Ye, Yi, Jiangyan, Tao, Jianhua, et al. A Public Chinese Dataset for Language Model Adaptation[J]. JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2019: 13.
APA: Bai, Ye, Yi, Jiangyan, Tao, Jianhua, Wen, Zhengqi, & Fan, Cunhang. (2019). A Public Chinese Dataset for Language Model Adaptation. JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 13.
MLA: Bai, Ye, et al. "A Public Chinese Dataset for Language Model Adaptation". JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY (2019): 13.

Ingest method: OAI harvesting

Source: Institute of Automation
