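Chinese Academy of Sciences Institutional Repositories Grid: Study on Some Issues of Information-Theoretic Classification Learning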

信息论分类学习的若干问题研究 (Study on Some Issues of Information-Theoretic Classification Learning)

Document type: Thesis


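Author	刘灿涛 (Liu Cantao)
Degree type	Doctor of Engineering
Defense date	2010-06-03
Degree-granting institution	Graduate University of Chinese Academy of Sciences
Place conferred	Institute of Automation, Chinese Academy of Sciences
Advisor	胡包钢 (Hu Baogang)
Keywords	information-theoretic learning; classifier evaluating criterion; direction mutual information; Renyi’s entropy; feature selection; ecological data transparency
Alternative title	Study on Some Issues of Information-Theoretic Classification Learning
Major	Computer Application Technology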
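Chinese abstract (translated)	Information-theoretic learning (ITL) is a branch of machine learning that has emerged in recent years. It uses entropy and divergence, obtained directly from the data, as descriptors in place of the variance and covariance of conventional statistics, and it can be used in both supervised and unsupervised learning. The two most important parts of the basic ITL framework are evaluating learning algorithms with information criteria and proposing new learning algorithms based on the evaluation results. This thesis focuses on classification learning within ITL and addresses three issues: first, how to evaluate classification models with information-theoretic criteria; second, a feature selection algorithm, motivated by the mutual information criterion, that uses mutual information based on Renyi's entropy; third, building on the feature selection algorithm, how information-theoretic classification learning can increase the transparency of ecological data. The details are as follows.
① Information-theoretic evaluation criteria for classification models
The thesis studies the convexity of empirical mutual information with respect to its free parameters in binary classification, and analyzes how its optimal solution relates to the condition that the target class and the predicted class are independent. On this basis it introduces the concept of direction mutual information and shows theoretically that, with the other free parameters held fixed, direction mutual information increases monotonically with classification accuracy. This avoids the possibility that different classification models yield the same empirical mutual information value. Building on the use of direction mutual information for binary classifiers, the thesis also makes a preliminary exploration of binary classifiers with a reject option; the advantage of the proposed method is a better balance between rejection and error reduction. For multi-class problems, the thesis decomposes the task into binary classification problems according to the relations between different classes (rather than between one class and all other classes), which extends direction mutual information to the evaluation of multi-class classifiers.
② Feature selection via mutual information based on Renyi's entropy
Because speed is the bottleneck when mining large-scale data sets, the thesis proposes a feature selection method that uses mutual information based on Renyi's entropy. Starting from existing Renyi's entropy estimators, and combining the large sample size of such data sets with the law of large numbers from probability theory, it derives an approximate estimate of Renyi's entropy that reduces the computational complexity from O(N^2) to O(N). Within the minimum-redundancy maximum-relevance feature selection algorithm, the proposed entropy estimate is used to estimate mutual information, which lowers the computational complexity of feature selection. Experimental results show that the proposed algorithm is much faster at a similar classification accuracy.
③ Application of information-theoretic classification learning to increasing the transparency of ecological data
Taking the forest cover type data set as an example, the thesis studies how information-theoretic classification learning can increase the transparency of ecological data. It first computes the amount of information contained in each attribute of the data set, that is, its entropy. It then analyzes the mutual information between each attribute and the class label to study how strongly each attribute relates to the class. Next it studies the mutual information between attributes, revealing their relationships and degree of redundancy. Finally, it ranks the attributes by their importance for classification using the proposed feature selection method.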
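As a small illustration of part ①, the sketch below computes the empirical mutual information between target and predicted labels from a confusion matrix. It is not the thesis's direction mutual information, whose definition the abstract does not give. The function names and the two example matrices are assumptions made for illustration. The example shows one simple way in which two very different classifiers can share the same empirical mutual information: a classifier and its label-inverted counterpart.

```python
import math


def empirical_mutual_information(confusion):
    """Mutual information (in bits) between target and predicted labels.

    confusion[i][j] counts samples with target class i and predicted class j.
    """
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    mi = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count == 0:
                continue  # the term 0 * log(0) is taken as 0
            # p(i, j) / (p(i) * p(j)) simplifies to count * total / (row * col)
            ratio = count * total / (row_sums[i] * col_sums[j])
            mi += (count / total) * math.log2(ratio)
    return mi


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


# Hypothetical confusion matrices: a good classifier and its label-inverted twin.
good = [[45, 5], [5, 45]]
inverted = [[5, 45], [45, 5]]

print(accuracy(good), empirical_mutual_information(good))
print(accuracy(inverted), empirical_mutual_information(inverted))
```

The two matrices have accuracies of 0.9 and 0.1 but the same empirical mutual information up to floating-point rounding. An ambiguity of this kind is what a criterion that increases monotonically with accuracy would remove.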
English abstract	Information-theoretic learning (ITL) is a branch of machine learning that has arisen in recent years. Entropy and divergence, obtained directly from the data, are proposed as ITL criteria in place of the conventional variance and covariance, and can be applied to supervised and unsupervised learning. Evaluating a learning algorithm with an IT criterion and proposing a new algorithm according to the evaluation result are the two most important tasks in the ITL framework. Our work in this thesis focuses on three issues of information-theoretic classification learning. First, we study how to evaluate a classifier with an IT criterion. Second, we propose a new feature selection (FS) algorithm via mutual information based on Renyi’s entropy. Last, we study the application of ITL to increasing the transparency of ecological data using the proposed FS algorithm. The details are as follows.
① Evaluating criterion of the classifier based on ITL
We prove that mutual information (MI) is a convex function of its free parameters in the binary classifier, and analyse the relation between its optimal solution and the condition that the target and predicted class labels are independent. Based on these results, we propose the concept of direction MI and give a theoretical proof that it increases monotonically with accuracy when the other free parameters are held fixed. Direction MI avoids the situation in which different classifiers have the same MI value. We then explore evaluating binary classifiers with a reject option using direction MI, which gives a better trade-off between performance and rejection than the existing MI method. We divide the multi-class classification task into multiple binary classification problems between different classes, instead of between one class and the other classes, so that direction MI extends to the evaluation of multi-class classifiers.
② Feature selection via mutual information based on Renyi’s entropy
Because computing speed is the bottleneck in data mining on large-scale data sets, we propose a feature selection (FS) algorithm via mutual information based on Renyi’s entropy. Using the law of large numbers and the characteristics of large-scale data sets, we propose an approximate estimate of Renyi’s entropy that decreases the computational complexity from O(N^2) to O(N). Based on min-redundancy-max-dependency, we estimate MI with the proposed approximate es...
③ Application of ITL to increase the transparency of ecological data
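For the entropy estimate in part ② above, the following is a minimal sketch of the standard Parzen-window estimator of Renyi's quadratic entropy, which sums a Gaussian kernel over all N^2 sample pairs. The abstract does not describe the thesis's O(N) approximation, so `renyi_quadratic_entropy_sampled` is only an assumed stand-in: it averages the kernel over N randomly drawn pairs and relies on the law of large numbers for convergence. All function names, the kernel width, and the synthetic data are hypothetical.

```python
import math
import random


def gaussian_kernel(d, sigma):
    """1-D Gaussian density with standard deviation sigma, evaluated at d."""
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def renyi_quadratic_entropy_full(xs, sigma):
    """Parzen-window estimate of H2 = -log E[p(X)] using all N^2 pairs."""
    n = len(xs)
    # Convolving two Gaussian windows of width sigma gives width sigma * sqrt(2).
    s = math.sqrt(2.0) * sigma
    potential = sum(gaussian_kernel(a - b, s) for a in xs for b in xs) / (n * n)
    return -math.log(potential)


def renyi_quadratic_entropy_sampled(xs, sigma, rng):
    """Assumed O(N) variant (not the thesis's method): average over N random pairs."""
    n = len(xs)
    s = math.sqrt(2.0) * sigma
    potential = sum(
        gaussian_kernel(xs[rng.randrange(n)] - xs[rng.randrange(n)], s)
        for _ in range(n)
    ) / n
    return -math.log(potential)


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic standard-normal sample

h_full = renyi_quadratic_entropy_full(data, sigma=0.3)
h_fast = renyi_quadratic_entropy_sampled(data, sigma=0.3, rng=rng)
print(h_full, h_fast)
```

The sampled estimate trades some variance for a linear number of kernel evaluations. Whether that variance is acceptable inside a feature selection loop depends on N and on the kernel width.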
Language	Chinese
Other identifier	200718014629090
Source URL	[http://ir.ia.ac.cn/handle/173211/6292]
Collection	Graduates_Doctoral dissertations
Recommended citation (GB/T 7714)	刘灿涛. 信息论分类学习的若干问题研究[D]. 中国科学院自动化研究所. 中国科学院研究生院. 2010.

Ingest method: OAI harvesting

Source: Institute of Automation
