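Research on Recognition Methods for Printed Korean Characters
Document type: Thesis
Author | 许日俊 |
Degree type | Master of Engineering |
Defense date | 2005-05-15 |
Degree-granting institution | Graduate School of the Chinese Academy of Sciences (中国科学院研究生院) |
Place of conferral | Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所) |
Advisor | 刘昌平 |
Keywords | Hangul Recognition; Grapheme Segmentation; Consonant; Vowel; Post-processing |
Alternative title | Research of Printed Korean Character Recognition |
Degree discipline | Pattern Recognition and Intelligent Systems |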
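Abstract (translated from Chinese) | Korean script (Hangul) is built from basic consonant and vowel letters. It has much in common with Chinese characters, so some of the theory used in Chinese character recognition also applies to Hangul recognition. Depending on the type of vowel letter and on whether a final consonant is present, Hangul characters fall into 6 structures, which in theory can form more than 11,000 characters. Similar-looking characters are widespread in Hangul, and this has seriously hindered the development of Hangul recognition technology. To reduce the complexity of character recognition, this thesis proposes a letter-based recognition method. Starting from the candidate characters produced by coarse classification, it uses a background-thinning method to separate the basic letters that make up a character, extracts two-layer peripheral distance features, and recognizes the letters with a neural network and structural analysis. It also studies fine classification of Hangul characters based on the actual candidates and on the compositional properties of Hangul. In addition, recognition post-processing experiments were carried out using an existing Hangul word statistics table, with fairly good results. The main work of this thesis is as follows: (1) The structural characteristics of Hangul characters are analyzed, and vertical and horizontal projection histograms are used to determine the background regions to be thinned. Thinning these background regions yields the segmentation lines between letters, and each letter is separated. (2) Two-layer peripheral distance features are extracted from the separated letters, and a three-layer BP neural network is built with these feature vectors as input. The letters are then recognized using the neural network together with structural characteristics, and the recognized character is decided by analyzing the candidate groups produced by an existing printed Hangul recognition system. Recognition of similar-character candidate groups was studied for 4 commonly used printed Hangul fonts. (3) A preliminary recognition post-processing system is built. Using a bidirectional search method, stem words and attached words are looked up in a Hangul word statistics table, and misrecognized words in a sentence are corrected, which brings a measurable improvement to the recognition system. |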
English abstract | Hangul (Korean script) is a writing system whose characters are composed of consonants and vowels. Because Hangul characters are very similar to Chinese characters, some methods used in Chinese character recognition can also be applied to Hangul recognition. Hangul characters can be classified into 6 types according to the form of the vowel and the presence of a final consonant, which yields over 11,000 possible characters, many of which look remarkably similar. To reduce the complexity of character recognition, this thesis adopts the approach of separating the graphemes of a character and identifying each separated grapheme independently. Building on an existing Hangul recognition system, a background-thinning technique is proposed to separate the graphemes, and the separated graphemes are then recognized by a neural network classifier using peripheral features. Finally, a character is recognized by combining the recognized graphemes with information from the candidate characters. Furthermore, an efficient post-processing method based on Hangul word statistics is proposed. The main points of this thesis are: 1. By analyzing the structure of Hangul characters, horizontal and vertical projection histograms are used to determine the background area to be thinned. Thinning the background region of the character image then yields the segmentation lines between graphemes, which are used to separate each grapheme. 2. A three-layer BP neural network is trained on peripheral feature vectors extracted from the grapheme images. Consonants and vowels are recognized with the neural network and structural analysis, and similar characters are then distinguished by analyzing the candidate groups of similar Hangul characters in the 4 most frequently used printed Hangul fonts. 3. Misrecognized words are found and corrected by searching for substantive and function words in the Hangul word statistics with a two-direction search method. The recognition accuracy of character classification was improved with this post-processing method. |
Language | Chinese |
Other identifier | 200228014603567 |
Source URL | [http://ir.ia.ac.cn/handle/173211/6893] |
Collection | Graduates: Master's theses |
Recommended citation (GB/T 7714) | 许日俊. 印刷体朝鲜文识别方法研究[D]. 中国科学院自动化研究所. 中国科学院研究生院. 2005. |
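
The six-structure classification and the "more than 11,000" figure mentioned in the abstract can be illustrated with the arithmetic of the Unicode Hangul Syllables block. This is a minimal sketch, not the thesis's code. It assumes the Unicode layout (19 initial consonants, 21 vowels, and 27 final consonants plus "none", giving 11,172 syllables) and assumes that the three vowel types are vertical, horizontal, and composite, which this record does not spell out. The names `decompose` and `structure_type` are hypothetical.

```python
# Sketch under stated assumptions: Unicode Hangul Syllables block layout,
# and a vertical / horizontal / composite split of the 21 vowels.
HANGUL_BASE = 0xAC00
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # final index 0 means "no final consonant"

# Medial (vowel) indices in Unicode order:
# 0 ㅏ 1 ㅐ 2 ㅑ 3 ㅒ 4 ㅓ 5 ㅔ 6 ㅕ 7 ㅖ 8 ㅗ 9 ㅘ 10 ㅙ 11 ㅚ
# 12 ㅛ 13 ㅜ 14 ㅝ 15 ㅞ 16 ㅟ 17 ㅠ 18 ㅡ 19 ㅢ 20 ㅣ
VERTICAL = {0, 1, 2, 3, 4, 5, 6, 7, 20}   # vowel drawn to the right of the initial
HORIZONTAL = {8, 12, 13, 17, 18}          # vowel drawn below the initial
# every remaining medial index is treated as a composite vowel


def decompose(ch):
    """Split one precomposed Hangul syllable into (initial, medial, final) indices."""
    code = ord(ch) - HANGUL_BASE
    if not 0 <= code < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(code, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


def structure_type(ch):
    """Return (vowel shape, has final consonant): one of 3 x 2 = 6 structures."""
    _, medial, final = decompose(ch)
    if medial in VERTICAL:
        shape = "vertical"
    elif medial in HORIZONTAL:
        shape = "horizontal"
    else:
        shape = "composite"
    return shape, final != 0


if __name__ == "__main__":
    print(N_INITIAL * N_MEDIAL * N_FINAL)  # 11172
    print(decompose("한"))                  # (18, 0, 4)
    print(structure_type("한"))             # ('vertical', True)
```

Recognizing letters instead of whole characters reduces the label space from thousands of syllables to a few dozen consonant and vowel classes, which is the motivation the abstract gives for the letter-based approach.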
Ingest method: OAI harvesting
Source: Institute of Automation
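
The projection histograms and the two-layer peripheral distance features named in the abstract can be sketched as follows. This is a hedged illustration, not the thesis's implementation. It assumes a binary image stored as a list of rows (1 = ink, 0 = background), it only locates empty background bands from the projections (the background-thinning step that produces the actual segmentation lines is not shown), and it takes the common reading of "two-layer" as the distance along a scan line to the first and to the second ink run. All function names are hypothetical.

```python
# Sketch under stated assumptions: binary image as a list of rows, 1 = ink.

def projections(img):
    """Return (row_profile, column_profile): ink pixel counts per row and per column."""
    rows = [sum(row) for row in img]
    cols = [sum(col) for col in zip(*img)]
    return rows, cols


def background_runs(profile):
    """Return (start, end) pairs, end exclusive, of maximal runs of zeros in a profile."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def peripheral_depths(line):
    """Distances from the start of a scan line to its first and second ink runs.

    A missing run is reported as the line length.
    """
    n = len(line)
    first = next((i for i, v in enumerate(line) if v), n)
    j = first
    while j < n and line[j]:  # skip over the first ink run
        j += 1
    second = next((i for i in range(j, n) if line[i]), n)
    return first, second


def left_peripheral_feature(img):
    """Two-layer peripheral distances for every row, scanning from the left edge."""
    return [peripheral_depths(row) for row in img]


if __name__ == "__main__":
    toy = [
        [1, 1, 0, 0, 1],
        [0, 1, 0, 0, 1],
        [0, 1, 0, 0, 1],
    ]
    rows, cols = projections(toy)
    print(cols)                          # [1, 3, 0, 0, 3]
    print(background_runs(cols))         # [(2, 4)]
    print(left_peripheral_feature(toy))  # [(0, 4), (1, 4), (1, 4)]
```

Scanning from the other three sides amounts to applying the same function to reversed rows and to the transposed image. In practice the distances would likely be normalized by the image size before being fed to a classifier, which is again an assumption rather than something this record states.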