中国科学院机构知识库网格系统: 数据流上的频繁项集挖掘技术研究

中国科学院机构知识库网格

Chinese Academy of Sciences Institutional Repositories Grid

数据流上的频繁项集挖掘技术研究

文献类型：学位论文


作者	李坤
学位类别	博士
答辩日期	2009-01-18
授予单位	中国科学院软件研究所
授予地点	软件研究所
关键词	数据流滑动窗口 ε-近似频繁元素频繁项集前缀树乐观裁剪方法
其他题名	A Study on Frequent Itemsets Mining in Data Streams
中文摘要	数据流是近年出现的一个新的应用类型，具有连续、无限、高速等特点。典型的数据流包括：无线传感器网络应用环境中由传感器传回的各种监测数据、股票交易所的股票价格信息、网络监测系统与道路交通监测系统的监测数据、电信部门的通话记录数据，以及网站的日志信息等。数据流的出现对传统的数据管理和挖掘技术提出了巨大的挑战。传统的数据挖掘技术往往对静态数据集合做多遍扫描，其时间和空间复杂度均较高，难以直接应用到数据流环境中。本文对数据流上的频繁项集挖掘问题做了深入研究，主要研究内容和创新性成果概述如下：本文首先对频繁项集挖掘问题做了一个全面的综述。综述部分先对静态数据集上的频繁项集挖掘的概念、分类、经典算法等相关研究做全面的介绍，然后分析了在数据流上进行频繁项集挖掘面临的问题和挑战、以及研究现状。针对数据流上的频繁元素挖掘问题，本文提出了一个简单而高效的算法，挖掘数据流滑动窗口上的频繁元素。算法既可以定期返回满足ε-近似要求的频繁元素，也可以响应用户在任意时间提交的请求，返回满足误差要求的结果。针对数据流上的频繁项集挖掘问题，本文提出了BFI-Stream算法，挖掘数据流滑动窗口上的所有频繁项集，实时返回精确结果。该算法使用前缀树数据结构，并且在创建和更新过程中裁剪了一部分非频繁节点，因此算法的空间和时间复杂度都较低。接着，本文针对现有的在数据流上挖掘频繁项集的算法存在维护过多非频繁项集而导致使用空间过大的问题，提出了一种乐观裁剪方法，大大降低了算法的空间复杂度。文中先对实际数据集分析了项集的频率分布情况，提出了乐观裁剪方法，裁剪大部分非频繁项集；实验结果表明乐观裁剪方法不仅大大降低了内存使用量，还提高了算法的更新效率。再次，本文针对用户指定最小支持度和允许误差的近似查询，提出了在数据流滑动窗口上挖掘频繁项集的近似算法AFI-Stream，返回满足误差要求的结果。AFI-Stream仅仅维护频繁项集，不维护非频繁项集，因此能大大降低算法使用的内存。为了满足在数据流上挖掘频繁项集研究的需要，本文设计并开发了一个数据流频繁项集挖掘原型系统StreamMiner，进行相关算法的评测和研究。
英文摘要	Data stream is a new emerging class of applications in recent years, which is often continuous, unbounded, high-speed and has a data distribution changing with time. Examples of such applications include financial analysis, network monitoring, sensor networks, telecommunication data management, and others. Mining in data streams incurs extra challenges. The traditional mining techniques usually require multiple scans on the static dataset, which result in high complexity of both time and space, so it is hard to use these traditional algorithms directly on data streams. Mining data streams requires fast, real-time processing in order to keep up with the high-speed data arrival and mining results must be returned within short response time. In this thesis, the design of several core technologies for mining frequent itemsets in data streams is investigated. First of all, the thesis gives a comprehensive overview of the frequent itemset mining algorithms. This overview first describes the concepts, classification and classical algorithms of mining frequent itemsets on static datasets. It also discusses the problems, challenges and current status of the frequent itemset mining in data streams. Second, the problem of mining frequent items in data streams is studied. A simple and efficient algorithm is proposed to find frequent items in sliding window data streams. It can return not only the deterministic ε-approximate results periodically, but also the approximate results for the random queries at any time. It is guaranteed that the errors of all the results are smaller than the user specified error bound. Third, the problem of mining frequent itemsets in data streams is studied. The thesis proposes the Bounded Frequent Itemsets stream (abbreviated as BFI-stream) algorithm, which uses a prefix-tree based structure, called BFI-tree, to maintain all accurate frequent itemsets from sliding windows over data streams. It can process the mining requests at any time and mining all frequent itemsets with accurate frequencies is just a traversal on the tree, so it has a real-time response to the requests. Next, the thesis considers the problem of huge space usage of the current algorithms. It first analyzes the distribution of itemsets' frequencies by real data sets and then proposes an optimistic pruning method to reduce the space complexity. The experimental results show that the space performance is improved greatly even when the user specified minimum support threshold is small. Furthermore, to answer queries with minimum support and error bound, the thesis proposes an approximate algorithm, called AFI-Stream, to mine frequent itemsets in sliding window streams. The AFI-Stream algorithm only maintains frequent itemsets, so the memory usage is reduced a lot. At last, a stream mining prototype, named StreamMiner, is developed to evaluate frequent itemset mining algorithms in data streams.
语种	中文
公开日期	2011-03-17
页码	147
源URL	[http://124.16.136.157/handle/311060/6402]
专题	软件研究所_人机交互技术与智能信息处理实验室_学位论文
推荐引用方式 GB/T 7714	李坤. 数据流上的频繁项集挖掘技术研究[D]. 软件研究所. 中国科学院软件研究所. 2009.

入库方式： OAI收割

来源：软件研究所

浏览0

下载0

收藏0

其他版本

除非特别说明，本系统中所有内容都受版权保护，并保留所有权利。