Chinese Academy of Sciences Institutional Repositories Grid System: Research and Implementation of Data Cleaning Algorithm

Chinese Academy of Sciences Institutional Repositories Grid

Research and Implementation of Data Cleaning Algorithm (数据清洗算法研究与实现)

Document type: Thesis


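Author	郝胜轩 (Hao Shengxuan)
Degree type	Master's
Defense date	2015-05-26
Degree-granting institution	Shenyang Institute of Automation, Chinese Academy of Sciences
Place of conferral	Shenyang Institute of Automation, Chinese Academy of Sciences
Advisor	宋宏 (Song Hong)
Keywords	data cleaning; data preprocessing; missing data imputation algorithm; noisy data detection algorithm; data cleaning system
Alternative title	Research and Implementation of Data Cleaning Algorithm
Major	Computer Application Technology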
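Chinese abstract (translated)	With the wide application and development of data warehouse and data mining technologies, enterprise managers have placed higher demands on decision analysis. Senior and mid-level executives are now chiefly concerned with how to mine useful hidden information from the large volumes of data they already hold, and how to use that information to guide the enterprise's future development. When decisions and forecasts about an enterprise's future are based on a historical data warehouse, data quality becomes critical. By the "garbage in, garbage out" principle, data with quality problems such as missing, noisy, inconsistent, and redundant values leads to long response times and high operating costs, and undermines the accuracy of rules derived from the data and the correctness of mined patterns. As a result, decision support systems produce erroneous analyses that mislead decisions and degrade the quality of information services. Data cleaning is therefore becoming an important research topic in data mining and data warehousing. This thesis first introduces the theory of data cleaning in detail: its concept, its research background and significance, and the state of research and application in China and abroad. It summarizes the definition and basic workflow of data cleaning and describes in detail the common algorithms for missing data imputation and noisy data detection. It then studies these algorithms in depth, proposes improved algorithms, and designs a data cleaning system on that basis. Experiments and practice show that the proposed improved algorithms all perform well and that the implemented data cleaning system has high practical value. The main contributions are:
1. A KNN missing data imputation algorithm based on neighbor noise handling. By comparing how truly near each nearest neighbor of the record to be imputed is, the algorithm identifies potentially noisy nearest neighbors and then imputes the missing value using only the non-noisy nearest neighbors, which removes the influence of noisy neighbors on KNN imputation. Experiments show that the algorithm achieves high imputation accuracy.
2. A biclustering-based missing data imputation method. Exploiting the property that the smaller a bicluster's mean squared residue, the more similar its data, the algorithm converts missing data imputation into the problem of minimizing the mean squared residue of a specific bicluster and fills the missing elements of the dataset accordingly. It solves this minimization, for a bicluster that contains missing data, by finding the minimum of a quadratic function, and a mathematical analysis and proof is given. Experiments show that the algorithm achieves high imputation accuracy.
3. A noisy data detection method based on DBSCAN and SVDD. The algorithm first clusters the data with the classical DBSCAN algorithm and removes the noise points that DBSCAN identifies. It then trains an SVDD model on each cluster, obtaining one discriminant model per class. Finally, all models are applied in turn to every non-noise point in the dataset, and points that belong to no class are treated as noise and removed. Experiments show that the algorithm detects noise well.
4. A noisy data detection algorithm based on fast-search density peak clustering and information entropy. The algorithm first clusters the original dataset with the fast-search density peak clustering algorithm and removes the noise samples that it identifies. For each cluster it then constructs a rectangular window and partitions it into a grid, projects all samples of the cluster onto the grid, and computes the cluster's information entropy. The samples with the lowest local density are removed from the cluster one at a time, the change in the cluster's entropy before and after each removal is computed, and samples whose removal changes the entropy markedly are treated as noise. Experiments show that the algorithm detects noise well.
5. Building on the work above, an extensible and interactive data cleaning system was designed and implemented. It has four main functional modules: data preprocessing, missing data imputation, noisy data detection, and association analysis. The system is already in practical use and has produced good results.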
Call number	TP301.6/H24/2015
English abstract	With the wide application and development of data warehouse and data mining technology, business executives have higher requirements for decision analysis. Senior enterprise leaders now pay more attention to how to uncover useful hidden information behind large volumes of existing data, and how to use this information to make decisions and predictions about the enterprise's future development. Data quality therefore becomes critical when such decisions and predictions are based on a historical data warehouse. According to the principle of "garbage in, garbage out", data quality problems such as missing, noisy, inconsistent, and redundant data cause long response times and high operating costs. They also affect the correctness of the rules derived from the data and the accuracy of the mined patterns, which leads decision support systems to produce misleading analysis results and lowers the quality of information services. Data cleaning is therefore becoming an important research topic in the field of data warehousing and data mining. First, this thesis introduces the relevant theory of data cleaning, including its concept, research background, domestic and international research and application status, definition, principles, and basic process. Second, it describes in detail the various missing data imputation algorithms and noisy data detection algorithms, proposes corresponding improved algorithms, and designs a data cleaning system. Finally, simulation experiments are performed. The experimental results show that the proposed improved algorithms perform well and that the data cleaning system has high practical value. The main work of the thesis can be summarized as follows:
1. Based on the relationship among the nearest neighbors of missing data, the thesis presents a novel imputation method, ENN-KNN (Eliminate Neighbor Noise k-Nearest Neighbor). ENN-KNN identifies potentially noisy nearest neighbors by comparing the true degree of nearness of each nearest neighbor of the missing data. It then imputes the missing data using only the nearest neighbors that are not noisy, which eliminates the effect of noisy nearest neighbors on imputation. The experimental results indicate that ENN-KNN has high prediction accuracy.
2. The thesis presents a novel imputation method based on biclustering. First, using the property that the smaller a bicluster's mean squared residue, the higher the similarity of its data, the method transforms missing data imputation into the problem of finding the minimum mean squared residue of a specific bicluster, and predicts the missing data in the dataset on that basis. Second, the minimization of a quadratic function is used to solve this minimum mean squared residue problem, and the corresponding mathematical proof is given. Finally, simulation and verification are carried out, and the results show that the proposed method is more accurate than other imputation methods.
3. A new method for noisy data detection is proposed, based on Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and support vector data description (SVDD). First, the classical DBSCAN algorithm is used to cluster the data and remove the outliers. Second, SVDD is trained on each group of data according to the clustering result, giving a discriminant model for each group. All of these discriminant models are then applied to the whole dataset to classify the data. A point that does not belong to any class is identified as noise and removed. The experimental results show that the method is effective.
4. A new approach to noisy data detection is proposed, based on fast search and find of density peaks (FSFDP) and information entropy (IE). FSFDP is used to cluster the original dataset and remove the outliers. According to the clustering result, a rectangular window is then constructed for each class and divided into a grid. The IE of each class is calculated after projecting all of its samples onto the grid, and the samples with the lowest local density in the class are removed. If the IE value changes markedly after a sample is removed from the class, the sample is marked as noise. The experimental results show that the approach is effective and accurate.
5. Building on the work above, the thesis designs a scalable and interactive data cleaning system. The system mainly comprises data preprocessing, missing data imputation, noisy data detection, and correlation analysis modules. The system has been put into practical use and has achieved good results.
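Language	Chinese
Ownership ranking	1
Pages	80
Source URL	[http://ir.sia.ac.cn/handle/173321/16741]
Collection	Shenyang Institute of Automation_Digital Factory Research Laboratory
Recommended citation (GB/T 7714)	Hao Shengxuan. Research and Implementation of Data Cleaning Algorithm [D]. Shenyang Institute of Automation, Chinese Academy of Sciences. Shenyang Institute of Automation, Chinese Academy of Sciences. 2015.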
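
The abstract's second contribution recasts imputation as minimizing a bicluster's mean squared residue, solved as the minimum of a quadratic function. The sketch below illustrates that idea for the simplest case only: a single missing entry in an otherwise complete bicluster. It assumes the standard Cheng and Church definition of mean squared residue. This record does not give the thesis's exact formulation, so the function names `mean_squared_residue` and `impute_single_missing` and the closed form used here are illustrative assumptions, not the author's algorithm.

```python
def mean_squared_residue(block):
    """Mean squared residue (Cheng-Church definition, assumed here) of a
    fully observed bicluster given as a list of equal-length rows."""
    n, m = len(block), len(block[0])
    row_means = [sum(row) / m for row in block]
    col_means = [sum(block[i][j] for i in range(n)) / n for j in range(m)]
    overall = sum(row_means) / n
    return sum(
        (block[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n)
        for j in range(m)
    ) / (n * m)


def impute_single_missing(block, p, q):
    """Value for block[p][q] that minimises the mean squared residue.

    With one unknown x the residue is a quadratic a*x^2 + b*x + c whose
    leading coefficient (1 - 1/n)(1 - 1/m) / (n*m) is positive, so the
    vertex is the unique minimum. Hypothetical helper for illustration.
    """
    n, m = len(block), len(block[0])
    if n < 2 or m < 2:
        raise ValueError("need at least 2 rows and 2 columns")
    # Evaluate the residue of cell (p, q) with the unknown set to zero.
    trial = [list(row) for row in block]
    trial[p][q] = 0.0
    row_mean = sum(trial[p]) / m
    col_mean = sum(trial[i][q] for i in range(n)) / n
    overall = sum(sum(row) for row in trial) / (n * m)
    residue_at_zero = 0.0 - row_mean - col_mean + overall
    # Vertex of the quadratic: the x that drives this cell's residue to zero.
    return -residue_at_zero / ((1 - 1 / n) * (1 - 1 / m))


# Rows follow an additive pattern (row offset + column offset),
# so the missing corner should be recovered as 6.
block = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, None]]
print(round(impute_single_missing(block, 2, 2), 6))  # 6.0 for this additive block
```

For a bicluster with several missing entries, the residue becomes a quadratic in several unknowns. How the thesis handles that case is not stated in this record.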

Ingestion method: OAI harvesting

Source: Shenyang Institute of Automation

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.