DISTRIBUTION OF MULTI-WORDS IN CHINESE AND ENGLISH DOCUMENTS
Document type: Journal article
Authors | Zhang, Wen1,2; Yoshida, Taketoshi1; Tang, Xijin3 |
Journal | INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING
Publication date | 2009-06-01 |
Volume | 8; Issue: 2; Pages: 249-265 |
Keywords | Multi-word; term distribution; Poisson distribution; zero-inflated distribution; G-distribution |
ISSN | 0219-6220 |
Abstract | As a hybrid of N-gram in natural language processing and collocation in statistical linguistics, multi-word is becoming a hot topic in area of text mining and information retrieval. In this paper, a study concerning distribution of multi-words is carried out to explore a theoretical basis for probabilistic term-weighting scheme. Specifically, the Poisson distribution, zero-inflated binomial distribution, and G-distribution are comparatively studied on a task of predicting probabilities of multi-words' occurrences using these distributions, for both technical multi-words and nontechnical multi-words. In addition, a rule-based multi-word extraction algorithm is proposed to extract multi-words from texts based on words' occurring patterns and syntactical structures. Our experimental results demonstrate that G-distribution has the best capability to predict probabilities of frequency of multi-words' occurrence and the Poisson distribution is comparable to zero-inflated binomial distribution in estimation of multi-word distribution. The outcome of this study validates that burstiness is a universal phenomenon in linguistic count data, which is applicable not only for individual content words but also for multi-words. |
WOS research areas | Computer Science ; Operations Research & Management Science |
Language | English |
WOS accession number | WOS:000267703000004 |
Publisher | WORLD SCIENTIFIC PUBL CO PTE LTD |
Source URL | [http://ir.amss.ac.cn/handle/2S8OKBNM/8641] |
Collection | Academy of Mathematics and Systems Science, Chinese Academy of Sciences |
Corresponding author | Zhang, Wen |
Author affiliations | 1. Japan Adv Inst Sci & Technol, Sch Knowledge Sci, Tatsunokuchi, Ishikawa 9231292, Japan; 2. Chinese Acad Sci, Inst Software, Lab Internet Software Technol, Beijing 100190, Peoples R China; 3. Chinese Acad Sci, Inst Syst Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China |
Recommended citation, GB/T 7714 | Zhang, Wen,Yoshida, Taketoshi,Tang, Xijin. DISTRIBUTION OF MULTI-WORDS IN CHINESE AND ENGLISH DOCUMENTS[J]. INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING,2009,8(2):249-265. |
APA | Zhang, Wen,Yoshida, Taketoshi,&Tang, Xijin.(2009).DISTRIBUTION OF MULTI-WORDS IN CHINESE AND ENGLISH DOCUMENTS.INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING,8(2),249-265. |
MLA | Zhang, Wen,et al."DISTRIBUTION OF MULTI-WORDS IN CHINESE AND ENGLISH DOCUMENTS".INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING 8.2(2009):249-265. |
Ingest method: OAI harvesting
Source: Academy of Mathematics and Systems Science
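
The abstract compares how well different count distributions predict how often a multi-word occurs in a document. As a rough illustration only, the sketch below fits a Poisson model and a G-distribution to document-level counts and compares their log-likelihoods. This record does not give the paper's formulas, so the G-distribution here is assumed to be Katz's three-parameter form (alpha: share of documents containing the term; gamma: share of those with a repeat; burst: mean count among documents with at least two occurrences). The function names and `toy_counts` are hypothetical and are not data from the paper.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing k occurrences with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def fit_poisson(counts):
    """Maximum-likelihood Poisson rate: the sample mean."""
    return sum(counts) / len(counts)


def fit_g(counts):
    """Estimate (alpha, gamma, burst) for the assumed Katz-style G model.

    Assumes at least one document has two or more occurrences.
    """
    containing = [c for c in counts if c >= 1]
    repeated = [c for c in counts if c >= 2]
    alpha = len(containing) / len(counts)    # P(term appears at all)
    gamma = len(repeated) / len(containing)  # P(repeat | term appears)
    burst = sum(repeated) / len(repeated)    # mean count given a repeat
    return alpha, gamma, burst


def g_pmf(k, alpha, gamma, burst):
    """Assumed G model: point masses at 0 and 1, geometric tail from 2."""
    if k == 0:
        return 1.0 - alpha
    if k == 1:
        return alpha * (1.0 - gamma)
    p = 1.0 / (burst - 1.0)
    return alpha * gamma * p * (1.0 - p) ** (k - 2)


def log_likelihood(counts, pmf):
    """Sum of log-probabilities of the observed counts under pmf."""
    return sum(math.log(pmf(c)) for c in counts)


# Hypothetical per-document counts of one multi-word (illustration only):
# many zeros plus a few documents with repeated use, i.e. bursty data.
toy_counts = [0] * 80 + [1] * 8 + [2] * 5 + [3] * 4 + [5] * 2 + [8] * 1

lam = fit_poisson(toy_counts)
alpha, gamma, burst = fit_g(toy_counts)

ll_poisson = log_likelihood(toy_counts, lambda k: poisson_pmf(k, lam))
ll_g = log_likelihood(toy_counts, lambda k: g_pmf(k, alpha, gamma, burst))

print("Poisson log-likelihood:", ll_poisson)
print("G-model log-likelihood:", ll_g)
```

On bursty toy data like this, a single-rate Poisson model tends to underestimate both the share of zero-count documents and the heavy tail, which is the kind of gap the abstract attributes to burstiness. The zero-inflated binomial model is left out because this record does not state its parameterization.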