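Chinese Academy of Sciences Institutional Repositories Grid: Research on Automatic Software Defect Localization Techniques Based on Text Feature Extraction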


Research on Automatic Software Defect Localization Techniques Based on Text Feature Extraction

Document type: Thesis


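Author	Zhu Jiayou (朱家佑) 1,2
Degree type	Master's
Defense date	2020-05-26
Degree-granting institution	University of Chinese Academy of Sciences
Place conferred	Beijing
Advisor	He Yeping (贺也平)
Keywords	software defect localization, program analysis, text matching, deep learning
Degree major	Software Engineering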
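Chinese abstract (translated)	Software defect localization is the most important, most time-consuming, and most critical stage of software defect repair. Automating defect localization, or at least giving maintainers a well-founded direction for debugging, would greatly reduce the workload of software maintenance personnel. This thesis studies automatic analysis and localization of Java software defects based on the extraction of multiple text features. It targets two problems in current related research: poor localization accuracy and overly coarse granularity. The thesis attributes these problems to three main causes: (1) the amounts of information in bug reports and source files do not match; (2) the semantic information of bug reports and source files matches poorly; (3) the localization models mainly in use ignore the complementary relationship between relevance matching features and semantic matching features. To address these causes, the thesis combines the highly structured nature of program source code with the characteristics of text matching between bug reports and source code, applies program analysis and deep learning techniques, and approaches the problem from three directions: data set construction, modeling of the localization problem, and model analysis. The main work and contributions are as follows: 1. Exploiting the highly structured nature of source code, and using the JavaParser and JGit tools, the thesis proposes a way to link each bug report to the source code of the methods that caused the defect. This link is then used to extract the method source code and the description documents of the APIs it calls. Building on existing public file-level defect localization data sets, the thesis proposes a construction method for method-level semantic matching defect localization data sets. This effectively mitigates the mismatch in information volume between bug reports and source files, the poor semantic matching, and the coarse localization granularity. 2. Using the proposed construction method and three software repositories of different types, three method-level semantic matching data sets were built. Compared with existing public data sets, they offer finer code granularity, greater precision, and a closer match to the bug reports, and they provide a data foundation for future research. 3. The automatic defect analysis and localization problem is modeled as a two-text matching classification problem. Because defect localization involves relevance matching features, semantic matching features, and localization-specific features, the thesis designs, on the basis of existing text matching models, a text matching classification model for method-level defect localization, called the Hybrid Model. 4. Using the Hybrid Model, experiments and analysis were carried out on the three data sets. Comparative experiments verify the effectiveness of the proposed approach, and controlled-variable experiments analyze and verify the positive effect on localization of several kinds of text features: relevance matching features, semantic matching features, API description documents, and extended features.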
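The linking step in contribution 1 can be pictured with a minimal sketch. It assumes that the line span of each method and the line numbers changed by a bug-fix commit have already been extracted; the thesis names JavaParser and JGit for this work, but the `MethodSpan` type, the `methods_touched` helper, and the sample data below are hypothetical illustrations, not the thesis's implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MethodSpan:
    """Hypothetical record: a method name and its 1-based inclusive line range."""
    name: str
    start: int
    end: int


def methods_touched(spans: Iterable[MethodSpan], changed_lines: Iterable[int]) -> List[str]:
    """Return the names of methods whose line range contains a changed line.

    Methods touched by the fix commit are treated as the methods linked
    to the bug report (an assumption made for this sketch).
    """
    changed = set(changed_lines)
    return [s.name for s in spans if any(s.start <= ln <= s.end for ln in changed)]


# Made-up spans for one file and made-up changed lines from one fix commit.
spans = [
    MethodSpan("Parser.parse", 10, 42),
    MethodSpan("Parser.reset", 44, 50),
    MethodSpan("Parser.close", 52, 60),
]
touched = methods_touched(spans, [12, 13, 55])
print(touched)  # ['Parser.parse', 'Parser.close']
```

Each (bug report, touched method) pair would then serve as a positive example in a method-level data set, with untouched methods as candidates for negatives.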
English abstract	Software defect localization is the most important, time-consuming, and critical stage of software defect repair. Automating defect localization, or providing a well-founded direction for debugging, would greatly reduce the workload of software maintenance personnel. This thesis studies automatic analysis and localization of Java software defects based on multiple text feature extraction, addressing the poor accuracy and overly coarse granularity of current research. According to our analysis, these problems have three main causes: (a) The amount of information in the bug report and in the source file does not match. (b) The semantic information of the source file does not match that of the bug report well. (c) Most current localization models ignore the complementary relationship between relevance matching features and semantic matching features. In view of these causes, and combining the highly structured nature of program source code with the characteristics of the text matching problem from bug report to source code, we apply program analysis and deep learning techniques from three aspects: data set construction, modeling of defect localization, and model analysis. The main work and contributions of this study are as follows: 1. Based on the highly structured nature of software source code, this thesis proposes a method that uses JavaParser and JGit to establish a link between a bug report and the source code of the methods responsible for the defect. Using this link, it extracts the method source code and the description documents of the APIs it calls. Building on existing public file-level data sets for defect localization, it proposes a method for constructing method-level semantic matching defect localization data sets. This effectively mitigates the information mismatch, the semantic mismatch, and the coarse localization granularity between bug reports and source files. 2. Based on the proposed construction method, three method-level semantic matching data sets are established from three software repositories of different types. Compared with existing public data sets, these data sets have finer code granularity, are more accurate, and match the bug reports more closely. They lay a data foundation for future research. 3. The problem of automatic analysis and localization of software defects is modeled as a text matching classification problem. Taking into account the relevance matching features, semantic matching features, and localization-specific features of the defect localization problem, and building on existing text matching models, we design a text matching classification model named the Hybrid Model for method-level software defect localization. 4. Experimental verification and analysis are carried out with the Hybrid Model on the three data sets established in this study. On the one hand, comparative experiments verify the effectiveness of the approach; on the other hand, controlled-variable experiments verify the positive influence of relevance matching features, semantic matching features, API description documents, and extended features on localization performance.
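The framing in contribution 3, scoring each (bug report, method) pair and ranking the candidate methods, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not describe the Hybrid Model's architecture, so both signals here are plain bag-of-words cosine similarities rather than learned deep-learning features, and `tokenize`, `cosine`, `hybrid_score`, `rank_methods`, the weights, and the sample data are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Split camelCase identifiers, then keep lowercase alphabetic tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", spaced)]


def cosine(a: List[str], b: List[str]) -> float:
    """Cosine similarity between two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def hybrid_score(report: str, source: str, api_docs: str,
                 w_code: float = 0.5, w_doc: float = 0.5) -> float:
    """Weighted sum of report-vs-code and report-vs-API-doc similarity.

    The weights are arbitrary placeholders; a learned model would replace them.
    """
    r = tokenize(report)
    return w_code * cosine(r, tokenize(source)) + w_doc * cosine(r, tokenize(api_docs))


def rank_methods(report: str, candidates: List[Dict[str, str]]) -> List[str]:
    """Return candidate method names ordered from most to least suspicious."""
    scored = [(hybrid_score(report, c["source"], c["api_docs"]), c["name"]) for c in candidates]
    return [name for _, name in sorted(scored, key=lambda p: p[0], reverse=True)]


# Made-up bug report and candidate methods.
report = "NullPointerException when closing the file reader twice"
candidates = [
    {"name": "MathUtil.sum",
     "source": "int sum(int a, int b) { return a + b; }",
     "api_docs": "Returns the sum of two integers."},
    {"name": "FileReader.close",
     "source": "void close() { if (stream != null) stream.close(); }",
     "api_docs": "Closes the file reader and releases the underlying stream."},
]
ranking = rank_methods(report, candidates)
print(ranking)
```

On this toy input the API description contributes most of the overlap with the bug report, which matches the abstract's point that natural-language API documents help narrow the gap between reports and code.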
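Language	Chinese
Source URL	[http://ir.iscas.ac.cn/handle/311060/19230]
Collection	General Department_Theses
Author affiliations	1. University of Chinese Academy of Sciences 2. Institute of Software, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	Zhu Jiayou. Research on Automatic Software Defect Localization Techniques Based on Text Feature Extraction[D]. Beijing: University of Chinese Academy of Sciences, 2020.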

Ingestion method: OAI harvesting

Source: Institute of Software
