Chinese Academy of Sciences Institutional Repositories Grid
DISTRIBUTION OF MULTI-WORDS IN CHINESE AND ENGLISH DOCUMENTS

Document type: Journal article

Authors: Zhang, Wen [1,2]; Yoshida, Taketoshi [1]; Tang, Xijin [3]
Journal: INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING
Publication date: 2009-06-01
Volume: 8; Issue: 2; Pages: 249-265
Keywords: Multi-word; term distribution; Poisson distribution; zero-inflated distribution; G-distribution
ISSN: 0219-6220
Abstract: As a hybrid of N-gram in natural language processing and collocation in statistical linguistics, multi-word is becoming a hot topic in area of text mining and information retrieval. In this paper, a study concerning distribution of multi-words is carried out to explore a theoretical basis for probabilistic term-weighting scheme. Specifically, the Poisson distribution, zero-inflated binomial distribution, and G-distribution are comparatively studied on a task of predicting probabilities of multi-words' occurrences using these distributions, for both technical multi-words and nontechnical multi-words. In addition, a rule-based multi-word extraction algorithm is proposed to extract multi-words from texts based on words' occurring patterns and syntactical structures. Our experimental results demonstrate that G-distribution has the best capability to predict probabilities of frequency of multi-words' occurrence and the Poisson distribution is comparable to zero-inflated binomial distribution in estimation of multi-word distribution. The outcome of this study validates that burstiness is a universal phenomenon in linguistic count data, which is applicable not only for individual content words but also for multi-words.
WOS research areas: Computer Science; Operations Research & Management Science
Language: English
WOS accession number: WOS:000267703000004
Publisher: WORLD SCIENTIFIC PUBL CO PTE LTD
Source URL: [http://ir.amss.ac.cn/handle/2S8OKBNM/8641]
Collection: Academy of Mathematics and Systems Science, Chinese Academy of Sciences
Corresponding author: Zhang, Wen
Author affiliations:
1. Japan Adv Inst Sci & Technol, Sch Knowledge Sci, Tatsunokuchi, Ishikawa 9231292, Japan
2. Chinese Acad Sci, Inst Software, Lab Internet Software Technol, Beijing 100190, Peoples R China
3. Chinese Acad Sci, Inst Syst Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
Recommended citation:
GB/T 7714: Zhang, Wen, Yoshida, Taketoshi, Tang, Xijin. DISTRIBUTION OF MULTI-WORDS IN CHINESE AND ENGLISH DOCUMENTS[J]. INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING, 2009, 8(2): 249-265.
APA: Zhang, Wen, Yoshida, Taketoshi, & Tang, Xijin. (2009). DISTRIBUTION OF MULTI-WORDS IN CHINESE AND ENGLISH DOCUMENTS. INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING, 8(2), 249-265.
MLA: Zhang, Wen, et al. "DISTRIBUTION OF MULTI-WORDS IN CHINESE AND ENGLISH DOCUMENTS". INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING 8.2 (2009): 249-265.

Ingest method: OAI harvesting

Source: Academy of Mathematics and Systems Science
