Chinese Academy of Sciences Institutional Repositories Grid System: Research and Applications on Outlier Detection and Analysis

Chinese Academy of Sciences Institutional Repositories Grid

Research and Applications on Outlier Detection and Analysis (离群模式挖掘的算法研究及应用)

Document type: Thesis


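Author	Gao Jun (高君)
Degree type	Doctor of Engineering
Defense date	2012-06-02
Degree-granting institution	Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral	Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisor	Hu Weiming (胡卫明)
Keywords	Outlier Detection; Distributed Intrusion Detection; Image Saliency Detection (离群模式挖掘; 分布式入侵检测; 图像显著性检测)
Alternative title	Research and Applications on Outlier Detection and Analysis
Degree discipline	Computer Application Technology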
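Chinese abstract (translated)	Outlier detection and analysis is one of the most closely watched research directions and most active topics in knowledge discovery and data mining. It aims to find, in massive datasets, outlying data that differ from and lie far from regular data objects, and to analyze such data further. Outliers differ markedly from regular data: they may result from errors made while the data were generated, or they may be an entirely new kind of data that carries highly important information and signals the emergence of a new pattern or new knowledge. Beyond its theoretical importance, outlier detection has broad application prospects and potential economic value in fields such as medical diagnosis analysis, fraud detection, intrusion detection, image processing, and bioinformatics. In recent years the field has developed rapidly, and local outlier detection, which is based on local features, has increasingly become one of the mainstream approaches. These methods separate normal data from outliers by analyzing the contextual relationship between each data object and its neighborhood. They adapt better to large, complex datasets of uneven density and avoid drawbacks of traditional methods, such as the need to assume a distribution model for normal data. Whereas methods based on global features can only label a data object as an outlier or not, local methods use a continuous outlier score as a soft label expressing how likely the object is to be an outlier, which benefits subsequent processing and analysis such as Top-N outlier analysis. This thesis studies local outlier detection algorithms and their applications in intrusion detection and image processing. The main work includes: 1. We propose an unsupervised outlier detection algorithm based on local kernel density estimation. Classical local outlier detection methods compute an object's outlier score by comparing its local density in feature space with the density of its neighborhood. Most algorithms based on local density estimation have two defects: the local density estimate is not accurate or smooth enough, and performance depends heavily on the parameter that sets the neighborhood size. To address both problems, we propose an outlier detection algorithm based on local kernel density estimation and weighted neighborhood density estimation, and we introduce a new kernel function tailored to outlier detection, the Volcano Kernel. 2. We propose a multi-level outlier detection algorithm based on a local kernel regression model. We give an abstract theoretical description of the outlier detection problem and recast the unsupervised problem as a supervised regression learning problem. Combining an information propagation mechanism with an unsupervised local kernel regression model, we propose a multi-level algorithm that fuses the global and local views and substantially improves the accuracy of the computed outlier scores. We also propose a context-based kernel function that measures the correlation between data objects from several perspectives to make the local kernel regression estimate more robust. 3. We propose ensemble outlier detection algorithms based on rank fusion. We present a basic framework for ensemble outlier detection aimed at Top-N outlier detection, and propose two ensemble learning algorithms, one that fuses outlier scores and one that fuses order information. The score-based fusion algorithm normalizes different types of outlier scores by converting them into posterior probabilities that an object is an outlier, and fuses the Top-N outlier lists produced by different detection models to obtain a more accurate ranking. The order-based fusion algorithm uses the Distance-based Mallows Model to describe the probabilistic relationship between the optimal ranking and the multiple observed rankings produced by base outlier detectors, and then solves for the model and the optimal ranking with an unsupervised EM algorithm. 4. We propose an online-learnable dynamic distrib...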
English abstract	Outlier detection and analysis is an important and attractive problem in knowledge discovery and data mining on large datasets. Compared with other knowledge discovery problems, outlier detection is arguably more valuable and effective for finding rare events and exceptional cases among normal data. An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Hence, some outliers can be labeled as noise or meaningless data generated by invalid operations, while others indicate a novel kind of pattern and knowledge. Outlier detection and analysis has been applied widely, in areas such as medical diagnosis analysis, fraud detection, intrusion detection, image processing, and bioinformatics. Over the past several decades, research on outlier detection has shifted from global computation to local analysis, and descriptions of outliers have shifted from binary interpretations to probabilistic representations. Global outlier detection assigns each observed object a binary label through global computation. Local outlier detection provides a probabilistic likelihood, called the outlier score, that captures how likely an object is to be an outlier. Outlier scores can be used not only to discriminate outliers from normal data but also to rank all the data in a database, as in top-n outlier detection. Compared with global outlier detection, local outlier detection separates outliers from normal data using the context information between objects and their k-nearest neighbors, which lets it perform better on large, complex databases. In this thesis, we focus on local density-based outlier detection and on applications of outlier detection and analysis to intrusion detection and image saliency detection.
The main contributions of our work are summarized as follows: 1. We propose a robust kernel-based local outlier detection algorithm. The classical local outlier detection framework computes outlier scores from the local density estimate of each object and the density estimate of its neighborhood. However, most local density-based outlier detection approaches have two disadvantages that restrict their application. First, the local density estimate is not accurate enough to detect outliers in large, complex databases. Second, the detection perfo...
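Language	Chinese
Other identifier	200918014629081
Source URL	[http://ir.ia.ac.cn/handle/173211/6474]
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Gao Jun (高君). Research and Applications on Outlier Detection and Analysis (离群模式挖掘的算法研究及应用)[D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences. 2012.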
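
The abstract describes the classical local framework as comparing each object's local density with the density of its neighborhood. The following is only a minimal Python sketch of that general idea, not the thesis's algorithm. It assumes a plain Gaussian kernel in place of the Volcano Kernel (which this record does not define) and an unweighted neighborhood average. The function names `local_kde_outlier_scores` and `top_n_outliers`, and the parameters `k` and `bandwidth`, are hypothetical and chosen for illustration.

```python
import numpy as np


def local_kde_outlier_scores(X, k=5, bandwidth=1.0):
    """Illustrative local outlier score: neighbourhood density / own density.

    Assumption: a Gaussian kernel over the k nearest neighbours stands in
    for the thesis's kernel, which is not specified in this record.
    """
    X = np.asarray(X, dtype=float)

    # Pairwise Euclidean distances between all objects.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Exclude each object from its own neighbourhood.
    np.fill_diagonal(dist, np.inf)

    # Indices of, and distances to, the k nearest neighbours of each object.
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dist, nn_idx, axis=1)

    # Local kernel density estimate using only the k nearest neighbours.
    # The kernel's normalising constant is omitted because it cancels
    # in the ratio computed below.
    density = np.exp(-0.5 * (nn_dist / bandwidth) ** 2).mean(axis=1)

    # Unweighted mean density of each object's neighbours.
    neighbour_density = density[nn_idx].mean(axis=1)

    # Objects that are much sparser than their neighbours get large scores.
    return neighbour_density / (density + 1e-12)


def top_n_outliers(scores, n):
    """Indices of the n objects with the largest outlier scores."""
    return np.argsort(scores)[::-1][:n]


# Toy data: a 5 x 5 grid of regular points plus one isolated point.
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
X = np.vstack([grid, [[20.0, 20.0]]])

scores = local_kde_outlier_scores(X, k=5, bandwidth=1.0)
print(top_n_outliers(scores, 3))
```

On this toy data the isolated point (index 25) receives by far the largest score and ranks first. The result is sensitive to `k` and `bandwidth`, which is the parameter-dependence problem the abstract says the thesis sets out to reduce.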

Ingest method: OAI harvesting

Source: Institute of Automation
