Domain-specific Chinese word segmentation using suffix tree and mutual information
Document type: Journal article
Author | Zeng, Daniel (1,2) |
Journal | INFORMATION SYSTEMS FRONTIERS |
Publication date | 2011-03-01 |
Volume | 13; Issue: 1; Pages: 115-125 |
Keywords | Mutual information ; Chinese segmentation ; N-gram ; Suffix tree ; Ukkonen algorithm ; Heuristic rules |
Corresponding author | Zeng, Daniel |
Abstract (English) | As the amount of online Chinese content grows, there is a critical need for effective Chinese word segmentation approaches to facilitate Web computing applications in a range of domains, including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. The pure statistical method has lower precision, while the pure dictionary-based method cannot deal with new words beyond the dictionary. In this paper, we propose a hybrid method that is able to avoid the limitations of both types of approaches. Through the use of a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method, IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experimental results show that IASeg performs better than the benchmarks in both precision and recall for the domain-specific corpus and achieves comparable performance for the general corpus. |
WOS title terms | Science & Technology ; Technology |
Category [WOS] | Computer Science, Information Systems ; Computer Science, Theory & Methods |
Research area [WOS] | Computer Science |
Keywords [WOS] | SECURITY INFORMATICS ; LINEAR-TIME ; ALGORITHM ; CONSTRUCTION ; INTELLIGENCE ; SYSTEMS |
Indexed in | SCI ; SSCI |
Language | English |
WOS accession number | WOS:000288220000010 |
Source URL | [http://ir.ia.ac.cn/handle/173211/3557] |
Collection | Institute of Automation_State Key Laboratory of Management and Control for Complex Systems_Advanced Control and Automation Team |
Author affiliations | 1. Chinese Acad Sci, Inst Automat, Intelligent Control & Syst Engn Ctr, Beijing, Peoples R China ; 2. Univ Arizona, Dept Management Informat Syst, Tucson, AZ 85721 USA ; 3. Univ Hong Kong, Sch Business, Hong Kong, Hong Kong, Peoples R China ; 4. Chinese Acad Sci, Key Lab Complex Syst & Intelligence Sci, Beijing, Peoples R China |
Recommended citation GB/T 7714 | Zeng, Daniel, Wei, Donghua, Chau, Michael, et al. Domain-specific Chinese word segmentation using suffix tree and mutual information[J]. INFORMATION SYSTEMS FRONTIERS, 2011, 13(1): 115-125. |
APA | Zeng, Daniel, Wei, Donghua, Chau, Michael, & Wang, Feiyue. (2011). Domain-specific Chinese word segmentation using suffix tree and mutual information. INFORMATION SYSTEMS FRONTIERS, 13(1), 115-125. |
MLA | Zeng, Daniel, et al. "Domain-specific Chinese word segmentation using suffix tree and mutual information." INFORMATION SYSTEMS FRONTIERS 13.1 (2011): 115-125. |
Ingest method: OAI harvesting
Source: Institute of Automation
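
The abstract mentions MI-based token merging but this record does not give the formula or thresholds the paper uses. As a rough illustration only, the sketch below scores adjacent token pairs with pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), and greedily merges pairs whose score reaches a cutoff. The function names `pmi_scores` and `merge_tokens`, the greedy left-to-right strategy, and the `threshold` parameter are assumptions made for this example, not IASeg's actual procedure.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Return a dict mapping each adjacent token pair to its PMI (log base 2).

    `sentences` is a list of token lists. Probabilities are plain relative
    frequencies with no smoothing (an assumption of this sketch).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def merge_tokens(tokens, scores, threshold):
    """Greedily merge adjacent tokens, left to right, when PMI >= threshold.

    Pairs never seen in training get a score of -infinity and are not merged.
    `threshold` is a hypothetical tuning parameter.
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if pair is not None and scores.get(pair, float("-inf")) >= threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip the token that was absorbed into the merge
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

In a setting like the one the abstract describes, the input tokens would come from a first-pass dictionary segmentation, and frequently merged pairs would be candidates for addition to the dictionary; that wiring is left out here.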