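Chinese Academy of Sciences Institutional Repositories Grid System: Detecting and Repairing Cell Array-Based Computational Semantic Errors in Spreadsheets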

Chinese Academy of Sciences Institutional Repositories Grid

Detecting and Repairing Cell Array-Based Computational Semantic Errors in Spreadsheets (基于单元阵列的电子表格计算语义错误检测与修复)

Document type: Dissertation


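Author	Dou Wensheng (窦文生)
Degree	Doctorate
Defense date	2015-05-26
Degree-granting institution	University of Chinese Academy of Sciences
Place	Beijing
Advisor	Wei Jun (魏峻)
Keywords	spreadsheet; cell array; computational semantic error
Major	Computer Software and Theory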
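Abstract (translated from Chinese)	Spreadsheets are among the most widely used end-user development tools and are applied in many domains, including record keeping, finance, and education. Both the data and the formulas in a spreadsheet carry computational semantics; when they fail to reflect the end user's intended semantics, a computational semantic error arises. Such errors readily lead to data inconsistencies and thus lower the quality of the spreadsheet. General computational semantic errors in spreadsheets are hard to detect and repair automatically, because deciding whether data and formulas are correct requires either the end user's judgment or strict checking against a specification, and spreadsheets contain no explicit specification. We observe that some consecutive cells in a row or column of a spreadsheet often share the same computational semantics; we call such consecutive cells a cell array. A cell array may contain computational semantic errors such as missing formulas, inconsistent formulas, and inconsistent data. Based on this new observation, we design a series of new methods that automatically detect and repair cell arrays and the computational semantic errors related to them. This dissertation carries out research in three respects. (1) We conduct a series of empirical studies on the EUSES and Enron spreadsheet corpora to analyze how cell arrays are actually used in spreadsheets. The results show that cell arrays are very common: 68.6% of the spreadsheets that contain formulas contain cell arrays; in 83.1% of cell arrays, each cell references other cells in its own row/column as inputs (we call this isomorphic data dependence); and, in terms of location, cell arrays rarely intersect. Based on these results, we study mechanisms for detecting and repairing cell arrays with isomorphic and with non-isomorphic data dependence, together with their related computational semantic errors. (2) For cell arrays with isomorphic data dependence, we propose a detection and repair method named SameCheck. It comprises a set of heuristic rules, designed around the isomorphic data dependence property, that identify cell arrays in a spreadsheet; and, for cell arrays that contain computational semantic errors, an improved program synthesis mechanism that generates the cell array's computational semantics, which is then used to repair the related errors. (3) For cell arrays with non-isomorphic data dependence, we propose ShareCheck, a detection and repair method based on data dependence similarity. It comprises a cell array identification algorithm based on the similarity of cells' data dependence, and a refinement algorithm that uses properties such as the non-intersection of cell arrays to filter out falsely reported cell arrays. We run a series of experiments on SameCheck and ShareCheck using the EUSES corpus and spreadsheets in actual use at the Institute of Software, Chinese Academy of Sciences. The results show that (1) computational semantic errors related to cell arrays are very common and do lower spreadsheet quality; (2) SameCheck and ShareCheck effectively detect and repair such errors and give end users useful guidance; and (3) ShareCheck, while detecting cell arrays with non-isomorphic data dependence, also effectively eliminates the cell arrays with isomorphic data dependence that SameCheck reports falsely.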
Abstract (English)	Spreadsheets are among the most widely used end-user development tools. They are used for data storage and tracking, financial reporting, education, and so on. Spreadsheet cells that contain data or formulas often have certain computational semantics. When a cell's data or formula fails to express the end user's intended semantics, the cell suffers from a computational semantic error. Errors of this kind can cause data inconsistency, which degrades the quality of spreadsheets. It is hard to identify which cells contain computational semantic errors, because doing so requires knowledge of the intended semantics, which in turn often requires human judgment or a specification. We observe that spreadsheet cells whose computation is subject to the same semantics are often clustered in a row or column (we call such a cluster a cell array). Cell arrays may suffer from several kinds of computational semantic errors, such as missing formula errors, inconsistent formula errors, and inconsistent data errors. Based on this observation, we propose two novel approaches that automatically detect and repair cell array-based computational semantic errors in spreadsheets. This dissertation presents three pieces of work related to cell array-based computational semantic errors. (1) We conduct the first empirical study of the key properties of cell arrays in real-life spreadsheets, using the EUSES and Enron corpora. The study yields several interesting findings: a) cell arrays are widely used in real-life spreadsheets, and about 68.6% of spreadsheets with formulas use cell arrays; b) about 83.1% of cell arrays obey isomorphic data dependence, in which each cell references cells in its own row/column as inputs; c) cell arrays in spreadsheets rarely overlap. Based on this empirical study, we propose two novel approaches to detect and repair cell array-based computational semantic errors. (2) We propose SameCheck, which detects and repairs cell arrays with isomorphic data dependence. SameCheck assumes that each cell in a cell array references other cells in its own row/column as inputs, and uses a novel algorithm to detect cell arrays of this kind. For smelly cell arrays, SameCheck applies an improved component-based program synthesis technique to recover the cell arrays' intended computational semantics. (3) We propose ShareCheck, which detects and repairs cell arrays with non-isomorphic data dependence. ShareCheck identifies cell arrays through a novel measure of data dependence similarity between cells, and then leverages properties of cell arrays (such as non-overlap) to filter out wrongly identified cell arrays. Evaluations on the EUSES corpus and case studies on real-life spreadsheets from the Institute of Software, Chinese Academy of Sciences show that (1) cell array-based computational semantic errors are very common and have degraded the quality of existing spreadsheets; (2) SameCheck and ShareCheck can detect and repair cell array-based computational semantic errors precisely; and (3) ShareCheck can detect cell arrays with non-isomorphic data dependence and, in addition, can filter out cell arrays with isomorphic data dependence that SameCheck wrongly identifies.
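Subject	Software Engineering
Language	Chinese
Release date	2015-07-10
Source URL	[http://ir.iscas.ac.cn/handle/311060/17164]
Collection	Institute of Software_Technology Center of Software Engineering_Dissertations
Recommended citation (GB/T 7714)	Dou Wensheng. Detecting and Repairing Cell Array-Based Computational Semantic Errors in Spreadsheets[D]. Beijing. University of Chinese Academy of Sciences. 2015.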
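
The abstract's notion of a cell array with isomorphic data dependence can be illustrated with a small sketch. This is not SameCheck itself (which, per the abstract, uses heuristic rules and program synthesis); it is a simplified majority-vote check under stated assumptions: the cell array is already given, references use A1 notation and are treated as relative (`$` markers are ignored), and function names that look like references (for example `LOG10`) do not occur. The names `relative_form` and `check_cell_array` are hypothetical.

```python
import re
from collections import Counter

# Matches A1-style references such as B2 or $C$7 (assumption: no sheet names).
REF = re.compile(r"\$?([A-Z]+)\$?(\d+)")

def col_to_num(col):
    """Convert a column label (A, B, ..., AA) to a 1-based index."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

def relative_form(formula, row, col):
    """Rewrite each reference as a row/column offset from the host cell."""
    def repl(m):
        dc = col_to_num(m.group(1)) - col
        dr = int(m.group(2)) - row
        return f"R[{dr}]C[{dc}]"
    return REF.sub(repl, formula)

def check_cell_array(cells):
    """cells maps (row, col) -> cell content for one presumed cell array.

    Returns the most common relative formula and a dict of flagged cells.
    """
    patterns = {
        pos: relative_form(text, *pos)
        for pos, text in cells.items()
        if text.startswith("=")
    }
    if not patterns:
        return None, {}
    majority, _ = Counter(patterns.values()).most_common(1)[0]
    issues = {}
    for pos in cells:
        if pos not in patterns:
            issues[pos] = "missing formula"        # plain value among formulas
        elif patterns[pos] != majority:
            issues[pos] = "inconsistent formula"   # differs from the majority
    return majority, issues

# Column D (index 4), rows 2 to 6: each cell should multiply columns B and C.
cells = {
    (2, 4): "=B2*C2",
    (3, 4): "=B3*C3",
    (4, 4): "120",
    (5, 4): "=B5+C5",
    (6, 4): "=B6*C6",
}
majority, issues = check_cell_array(cells)
print(majority)  # =R[0]C[-2]*R[0]C[-1]
print(issues)    # {(4, 4): 'missing formula', (5, 4): 'inconsistent formula'}
```

Expressing references as offsets makes cells that follow the same row-wise pattern compare equal as strings, which is why D4 (a hard-coded value) and D5 (a different operator) stand out in this example.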

Ingest method: OAI harvesting

Source: Institute of Software
