Research and Application of Data Stream Decision Tree Classification Algorithms
Document type: Thesis
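Author | Hou Xushan (侯旭珊) |
Degree type | Master's |
Defense date | 2014-05-26 |
Degree-granting institution | University of Chinese Academy of Sciences |
Place conferred | Beijing |
Advisor | Lü Pin (吕品) |
Keywords | data stream; classification; decision tree; missing values; concept drift; simulation-based wargaming system |
Degree discipline | Computer Application Technology |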
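Abstract (translated from Chinese) | With the arrival of the big data era, data-intensive systems have come into wide use, and these systems continuously generate high-speed data streams. How to mine valuable information from data streams has become a new research focus in data mining. Compared with traditional static data, data streams are generated dynamically and are real-time, massive, continuous, and rapidly changing, which poses great challenges for research on data stream mining. Data stream classification is an important research direction in data stream mining and has been applied in practice in many areas, such as network intrusion detection and credit card fraud detection. In practice, data streams may lose data because of network transmission failures and similar causes, and may undergo concept drift over time. This thesis therefore studies missing-value handling and adaptation to concept drift in data stream decision tree classification, and their application to data analysis in a simulation-based wargaming system. First, this thesis surveys the state of research on data stream decision tree classification algorithms and analyzes the requirements that the wargaming system places on data stream classification. Through a detailed analysis of the classic algorithms for data stream decision tree classification, it identifies the problems in those algorithms. Second, to address missing data in data streams, this thesis proposes an efficient decision tree algorithm that adaptively handles data streams with missing values. By adaptively selecting the missing-value handling method, adopting an improved Bayesian classifier, and optimizing the update mechanism, the algorithm improves time performance. Simulation experiments show that, while maintaining the same classification accuracy as existing missing-value handling algorithms, the algorithm improves time performance by 20% to 70%. Third, to address concept drift in data streams, this thesis proposes an adaptive concept drift algorithm based on a multi-window mechanism. By adaptively determining the size of the sliding window, the algorithm strengthens its ability to adapt to concept drift, and by improving the mechanism for creating candidate nodes, it reduces running time. Simulation experiments show that the algorithm adapts to concept drift better and runs faster than existing concept drift algorithms. Finally, based on the requirements that the wargaming system places on data stream classification, this thesis designs a data stream classification system for use in wargaming. By applying the proposed missing-value handling algorithm and adaptive concept drift algorithm, the system can classify and mine the data streams produced during wargaming and provide decision support for the wargaming process. |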
Abstract (English) | Many application systems today generate continuous data streams. How to mine information from data streams has become a new research direction in data mining. Data streams are real-time, massive, continuous, and rapid, which poses a great challenge for research on data stream mining. Data stream classification is an important research topic in data stream mining and has been applied in network intrusion detection, credit card fraud detection, and many other areas. Real-world data streams often contain missing values and exhibit concept drift. Therefore, this thesis studies how to handle missing values and concept drift in data stream decision tree classification, and how to apply the resulting algorithms in the wargame. First, this thesis surveys the state of research on data stream decision tree classification and analyzes the requirements of the wargame for data stream classification. It analyzes the classic algorithms for data stream decision tree classification in detail and points out the problems in these algorithms. Second, this thesis presents an efficient algorithm for missing values in data stream decision tree classification (EAM) to avoid the impact of missing values. EAM adaptively selects the method for handling missing values, uses an improved Bayesian classifier, and optimizes the update mechanism, which improves time performance. Experimental results show that EAM reduces run time by 20% to 70% while matching the accuracy of existing algorithms. Third, this thesis presents a concept-adapting algorithm based on multiple windows for data stream decision tree classification (CAMW) to adapt to concept drift in the data stream. CAMW adaptively chooses the size of the sliding window, which enhances its ability to adapt to concept drift. CAMW also improves the mechanism for creating candidate nodes and reduces its time complexity. Experimental results show that CAMW adapts to concept drift better and has a lower run time than existing concept drift algorithms. Finally, this thesis designs a data stream classification system for the wargame, based on the wargame's requirements for data stream classification. The system uses the algorithms proposed in this thesis to classify the wargame's data streams and supports decision-making in the wargame. |
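Subject | Computer Applications |
Language | Chinese |
Release date | 2014-05-27 |
Source URL | [http://ir.iscas.ac.cn/handle/311060/16391] |
Collection | Institute of Software_National Key Laboratory of Integrated Information System Technology_Theses |
Recommended citation (GB/T 7714) | Hou Xushan (侯旭珊). Research and Application of Data Stream Decision Tree Classification Algorithms[D]. Beijing. University of Chinese Academy of Sciences. 2014. |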
Ingest method: OAI harvesting
Source: Institute of Software