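Chinese Academy of Sciences Institutional Repositories Grid System: Expert Interest Modeling and Its Application in the Cyberspace for Workshop of Metasynthetic Engineering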

Chinese Academy of Sciences Institutional Repositories Grid

Expert Interest Modeling and Its Application in the Cyberspace for Workshop of Metasynthetic Engineering

Document type: Thesis


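Author	Liu Kai (刘凯)
Degree type	Master of Engineering
Defense date	2011-06-03
Degree-granting institution	Graduate University of Chinese Academy of Sciences
Place of conferral	Institute of Automation, Chinese Academy of Sciences
Advisor	Li Yaodong (李耀东)
Keywords	Cyberspace for Workshop of Metasynthetic Engineering (CWME); Webpage Body Text Extraction; Interest-Based Model; Non-negative Matrix Factorization (NMF); Personalized Recommendation
Alternative title	An Interest-Based Expert Modeling Method and Its Application for CWME
Degree discipline	Pattern Recognition and Intelligent Systems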
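Abstract (translated from Chinese)	"Metasynthesis from qualitative to quantitative" is a methodology proposed by Chinese scientists for addressing open complex giant systems and related problems. As a development of this methodology, the Cyberspace for Workshop of Metasynthetic Engineering (CWME) integrates the wisdom of experts, the high performance of computers, and existing bodies of knowledge into a whole. Through online discussion and argumentation among experts, combined with domain knowledge accumulated by predecessors and computer assistance in logical computation, it stimulates experts' creative thinking, deepens their knowledge, and produces a final solution. During discussion, the rich resources of the Internet are of great help in inspiring and activating the thinking of the expert group, so introducing these resources into CWME effectively is important for solving major decision-making problems. The existing active information retrieval system for the CWME environment has two main problems: the webpages recommended to experts contain useless information, and the recommendation process does not consider experts' interests and domain preferences. To address these two problems, this thesis studies a webpage body text extraction algorithm combined with webpage classification and an expert interest modeling method for the CWME environment. The work consists of the following three parts:
1. A body text extraction algorithm for topic webpages. Different types of webpages on the Internet take different forms and carry different amounts of information. Topic webpages usually describe a subject in long passages of text, and this text is of considerable help to experts. Recommending the body text of topic webpages directly to experts reduces their reading burden. Based on the characteristics of HTML pages, the proposed algorithm determines the page type by analyzing the ratio of the character count of anchor text to that of all text in the page, together with the number of anchor text items. It then extracts the body of pages judged "useful" (topic pages) using a method based on character counts and tag discrimination. Experimental results show that the proposed page type classification method outperforms simple threshold-based classification, and that the body text extraction method achieves a high success rate and performs better in anchor text discrimination and extraction.
2. A method for building expert interest models for the CWME environment. In the existing active information retrieval system, candidate webpages brought into the discussion environment are screened for important information through collaborative filtering among experts. This reduces, to some extent, the burden of manual information retrieval, but it ignores experts' domain backgrounds and interest preferences and cannot provide personalized information to individual experts. Based on the characteristics of the discussion process and the particular nature of expert statements, this thesis proposes an interest modeling method based on analysis of experts' historical statement records. The method uses non-negative matrix factorization to generate interest topics automatically, builds up expert interest information step by step by analyzing the relationship between feature words in expert statements and the interest topics, and finally integrates this information into a hierarchically organized expert interest model. Experimental results show that the model predicts discussion domains well and can serve as a basis for information filtering, providing experts with personalized information relevant to their interests.
3. A redesign and implementation of the active information retrieval prototype system for the CWME environment. To address the shortcomings of the original prototype in information recommendation, this thesis redesigns the system, adding to its architecture a personalized information filtering module that uses the expert interest model to screen information, thereby achieving personalized recommendation of discussion-support information. Experimental results show that the system runs well and recommends higher-quality information to individual experts, greatly reducing their workload.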
Abstract (English)	Metasynthesis is a methodology for Open Complex Giant Systems (OCGSs) originally proposed by Chinese scientists. The Cyberspace for Workshop of Metasynthetic Engineering (CWME) is a type of workspace that embodies this methodology. It is a human-computer cooperative intelligent system for solving complex problems, whose key goal is to synthesize the wisdom of experts, the intelligence of computers, and all sorts of information and knowledge into a whole. In CWME, experts express their opinions and exchange domain knowledge through online discussion, supported by knowledge from different disciplines and by the intelligence of the computer system. Web resources relevant to the discussion topic can greatly inspire experts' creative thinking, provided they are introduced into CWME in a timely and precise manner, so how to organize and use this information is an important problem. As a component of the CWME system, an Active Information Retrieval Prototype System (AIRPS) provides experts in CWME with webpages from the Internet. However, the webpages collected by AIRPS sometimes include useless ones, and the system ignores the interests of the experts. We therefore carried out research on webpage classification, webpage body text extraction, and an interest-based expert modeling method. More specifically, this thesis addresses the following issues: 1. Webpage body text extraction combined with webpage classification. Because webpages on the Internet take different forms, the amount of information they carry differs greatly. In general, webpages of the "topic" category are of great help to experts in CWME. The content extracted from these pages is plain text, which can be recommended directly to relieve the time pressure on participants and is also easy for computers to process. This thesis therefore proposes a webpage body text extraction method that meets this requirement.
Based on the characteristics of HTML pages, the method classifies a webpage as "useful" or "useless" by analyzing the ratio of the number of characters in anchor text to that in the whole page, together with the number of anchor texts. It then extracts the body text of the "useful" pages with a hybrid algorithm that combines character statistics and HTML tag analysis. Experimental results showed that the proposed webpage classification algorithm generally performed better than threshold-based methods. And the proposed w...
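The two page features named in the abstract (the share of characters that sit inside anchor text, and the number of anchor texts) can be computed with a few lines of Python. The sketch below is only an illustration under stated assumptions: the class and function names are hypothetical, the cut-offs `max_ratio` and `max_anchors` are invented, and the final decision is a plain threshold rule, which is the baseline the abstract says the thesis's classifier outperforms, not the thesis's own method.

```python
from html.parser import HTMLParser


class AnchorTextCounter(HTMLParser):
    """Counts visible characters inside and outside <a> elements."""

    SKIPPED = {"script", "style"}  # non-visible content is ignored

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0     # > 0 while inside <script> or <style>
        self.anchor_chars = 0
        self.total_chars = 0
        self.anchor_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
            self.anchor_count += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        n = len("".join(data.split()))  # count characters, ignoring whitespace
        self.total_chars += n
        if self.anchor_depth:
            self.anchor_chars += n


def page_features(html):
    """Return (anchor-text character ratio, number of anchors) for a page."""
    parser = AnchorTextCounter()
    parser.feed(html)
    parser.close()
    # A page with no visible text is treated as all links (ratio 1.0).
    ratio = parser.anchor_chars / parser.total_chars if parser.total_chars else 1.0
    return ratio, parser.anchor_count


def is_topic_page(html, max_ratio=0.3, max_anchors=50):
    """Hypothetical threshold rule: few links and mostly non-link text."""
    ratio, count = page_features(html)
    return ratio <= max_ratio and count <= max_anchors


if __name__ == "__main__":
    article = "<html><body><p>" + "x" * 200 + "</p><a href='#'>home</a></body></html>"
    index = "<html><body>" + "<a href='#'>link</a>" * 10 + "</body></html>"
    print(page_features(article), is_topic_page(article))
    print(page_features(index), is_topic_page(index))
```

A link-heavy index page yields a ratio near 1, while an article page with one navigation link yields a ratio near 0, which is the contrast the classification step relies on.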
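Language	Chinese
Other identifier	200828014628048
Source URL	[http://ir.ia.ac.cn/handle/173211/7572]
Collection	Graduates_Master's Theses
Recommended citation (GB/T 7714)	Liu Kai. Expert Interest Modeling and Its Application in the Cyberspace for Workshop of Metasynthetic Engineering [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences. 2011.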
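
The interest modeling step in the abstract relies on non-negative matrix factorization (NMF) to generate interest topics from experts' statements. As a rough illustration of that idea, the sketch below factors a small statement-by-term count matrix V into W (statement-to-topic weights) and H (topic-to-term weights) using the standard multiplicative update rule for the Frobenius objective. The record does not say which NMF variant, features, or topic count the thesis uses, so the update rule, the vocabulary, the counts, and all function names here are assumptions made for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (m x n) into W (m x k) and H (k x n)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    # Strictly positive random start, so no entry is stuck at zero.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return w, h


def top_terms(h, vocabulary, count=3):
    """List the highest-weighted terms of each topic (row of H)."""
    return [
        [vocabulary[j] for j in sorted(range(len(row)), key=lambda j: -row[j])[:count]]
        for row in h
    ]


if __name__ == "__main__":
    # Invented vocabulary and term counts; each row is one expert statement.
    vocabulary = ["traffic", "road", "vehicle", "inflation", "price", "market"]
    v = [
        [3, 2, 1, 0, 0, 0],
        [2, 3, 2, 0, 0, 0],
        [0, 0, 0, 3, 2, 2],
        [0, 0, 0, 1, 3, 2],
    ]
    w, h = nmf(v, k=2)
    print("reconstruction error:", frobenius_error(v, w, h))
    print("topic terms:", top_terms(h, vocabulary))
    print("statement-topic weights:", w)
```

Each row of H can be read as a candidate interest topic, and summing the rows of W that belong to one expert's statements gives a simple per-expert topic profile. The hierarchical organization of the interest model described in the abstract is not covered by this sketch.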

Ingest method: OAI harvesting

Source: Institute of Automation

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.