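Chinese Academy of Sciences Institutional Repositories Grid: Research and Application of Text Categorization Based on Manifold Learning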

Chinese Academy of Sciences Institutional Repositories Grid

Research and Application of Text Categorization Based on Manifold Learning

Document type: Thesis


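Author	Xu Hairui (徐海瑞)
Degree type	Master of Engineering
Defense date	2011-05-25
Degree-granting institution	Graduate University of Chinese Academy of Sciences
Place of conferral	Institute of Automation, Chinese Academy of Sciences
Supervisor	Zhang Wensheng (张文生)
Keywords	manifold learning; neighborhood preserving embedding (NPE); linear discriminant embedding (LDE); text categorization; k-nearest neighbor algorithm (kNN)
Alternative title	Research and application of Text Categorization Based on Manifold Learning
Degree discipline	Software Engineering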
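Chinese abstract (translated)	Text classification is a key technology in information retrieval and data mining. Its main task is to automatically determine, under a given classification scheme, the categories associated with a text according to its content. In recent years, with the rapid development of networks and information technology, text resources on the Web have grown explosively; they provide richer and more detailed information but also create a great deal of information redundancy. By labeling Web texts with categories, text classification can effectively help people organize and manage information, and it has become a research hotspot in information retrieval. The main difficulties of text classification lie in text feature extraction and classification efficiency. The key to the whole study is how to reduce the dimensionality of the original documents effectively and improve classification accuracy, so as to decide quickly whether a target document is relevant to a topic. Traditional text classification algorithms suffer from high computational complexity and low classification efficiency and accuracy when processing large-scale, high-dimensional data sets. This thesis applies manifold learning algorithms to model the text collection. This can effectively discover the nonlinear low-dimensional manifold structure hidden in high-dimensional data, overcome the loss of classification performance caused by feature selection or linear mapping, avoid the "curse of dimensionality", and improve the performance and computational efficiency of the classifier. The main work and contributions of this thesis are: 1. To address the limitations of traditional dimensionality reduction algorithms on high-dimensional, large-scale text data, a text classification algorithm based on NPE is proposed. The method projects the high-dimensional data nonlinearly into an optimal low-dimensional space and then performs classification in that space. Classification experiments on standard data sets demonstrate the effectiveness and superiority of the NPE-based text classification algorithm. 2. To make full use of the class information of the training samples and better characterize the sample space, a new classification algorithm (MLD-RLSC) is proposed using a manifold regularization framework based on linear regression. Experimental results show that the algorithm substantially improves on traditional classifiers in both classification performance and running speed.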
English abstract	Automatic text classification is a core technology in information retrieval and data mining. Its main task is to determine the class of a text based on its content. Recently, with the development of the Internet and information technology, the amount of information has been increasing rapidly. While this provides more detailed information, it also produces a large amount of redundant information. Text classification can help people organize and manage information effectively, and it has become a research focus in information retrieval. For text classification, the key research question is how to reduce dimensionality effectively and improve classification efficiency. Traditional algorithms have many weaknesses when dealing with large, high-dimensional data sets. In this thesis, manifold learning is used to model the text corpus and uncover the nonlinear low-dimensional manifold structure hidden in the high-dimensional data. It avoids the loss of classification performance caused by feature selection or linear mapping, overcomes the "curse of dimensionality", and improves classification performance and computational efficiency. The main contributions are as follows: 1. When text corpora are high-dimensional and large-scale, traditional dimensionality reduction algorithms expose their limitations. In this thesis, a text categorization algorithm based on NPE is presented. The algorithm explores and preserves the inherent structure of the high-dimensional Web text space and performs classification in the reduced space. Experiments on standard data sets showed that the algorithm achieves higher classification accuracy and stability. 2. To use the labels of the training samples efficiently and better describe the characteristics of the sample space, a novel text classification algorithm (MLD-RLSC) based on a manifold regularization framework is proposed. Experimental results demonstrated that the proposed algorithm has higher classification accuracy and faster running speed.
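Illustrative note: the record does not include the thesis's implementation, so the following is only a minimal sketch of the general idea in contribution 1, under stated assumptions. It uses the standard textbook form of neighborhood preserving embedding (LLE-style reconstruction weights followed by a generalized eigenproblem) and a plain kNN vote in the reduced space. The function names `npe_fit`, `knn_predict`, and `run_demo`, all parameter values, and the synthetic two-class data (a stand-in for real document vectors) are hypothetical and are not taken from the thesis.

```python
import numpy as np
from scipy.linalg import eigh


def npe_fit(X, n_neighbors=5, n_components=2, reg=1e-3):
    """Fit a standard NPE projection; X has shape (n_samples, n_features)."""
    n, d = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean

    # Squared Euclidean distances; the diagonal is masked so a point
    # is never chosen as its own neighbor.
    sq = np.sum(Xc ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (Xc @ Xc.T)
    np.fill_diagonal(dist2, np.inf)
    nbrs = np.argsort(dist2, axis=1)[:, :n_neighbors]

    # LLE-style weights: reconstruct each point from its neighbors,
    # with the weights constrained to sum to one.
    W = np.zeros((n, n))
    ones = np.ones(n_neighbors)
    for i in range(n):
        Z = Xc[nbrs[i]] - Xc[i]
        G = Z @ Z.T
        G += (reg * np.trace(G) + 1e-9) * np.eye(n_neighbors)  # regularize
        w = np.linalg.solve(G, ones)
        W[i, nbrs[i]] = w / w.sum()

    # Generalized eigenproblem (X^T M X) a = lambda (X^T X) a; the
    # eigenvectors with the smallest eigenvalues form the projection.
    R = np.eye(n) - W
    M = R.T @ R
    A_mat = Xc.T @ M @ Xc
    A_mat = (A_mat + A_mat.T) / 2.0  # remove floating-point asymmetry
    B_mat = Xc.T @ Xc
    B_mat += reg * np.trace(B_mat) / d * np.eye(d)  # keep B positive definite
    _, vecs = eigh(A_mat, B_mat)  # eigenvalues come back in ascending order
    return mean, vecs[:, :n_components]


def knn_predict(Y_train, labels, Y_test, k=3):
    """Majority vote among the k nearest training points (integer labels)."""
    d2 = ((Y_test[:, None, :] - Y_train[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(row).argmax() for row in labels[idx]])


def run_demo(seed=0):
    """Synthetic two-class data standing in for document vectors (assumption)."""
    rng = np.random.default_rng(seed)
    d, n_per_class = 10, 60
    X = np.vstack([rng.normal(0.0, 1.0, size=(n_per_class, d)),
                   rng.normal(3.0, 1.0, size=(n_per_class, d))])
    y = np.repeat([0, 1], n_per_class)
    order = rng.permutation(len(y))
    X, y = X[order], y[order]
    X_train, y_train, X_test, y_test = X[:80], y[:80], X[80:], y[80:]

    mean, A = npe_fit(X_train)
    pred = knn_predict((X_train - mean) @ A, y_train, (X_test - mean) @ A)
    return A, float(np.mean(pred == y_test))


if __name__ == "__main__":
    print("test accuracy:", run_demo()[1])
```

For real text data the input would typically be TF-IDF vectors, and because the dense distance and weight matrices here scale quadratically with the number of documents, this sketch is suitable only for small collections.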
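Language	Chinese
Other identifier	200828009029078
Source URL	[http://ir.ia.ac.cn/handle/173211/7587]
Collection	Graduates_Master's Theses
Recommended citation (GB/T 7714)	Xu Hairui. Research and Application of Text Categorization Based on Manifold Learning [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences. 2011. (In Chinese.)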

Ingest method: OAI harvesting

Source: Institute of Automation

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.