| Research on Tibetan information processing has a history of about 30 years. However, many Tibetan character sets and corresponding encodings are still in use, so encoding conversion remains a problem. Tibetan text also lacks word separators, which makes word segmentation a fundamental task in Tibetan natural language processing, and corpora for Tibetan natural language processing are scarce. Focusing on these problems, we study Tibetan encoding detection and conversion, Tibetan word segmentation, Tibetan web text mining, and some related issues. The contributions of this paper are as follows. First, we surveyed many Tibetan encodings and proposed a method to identify them automatically, using both the syllable dot and high-frequency syllables as features. We summarize three encoding models and three encoding implementation methods, and introduce many Tibetan encodings. The detection method combines two types of features, namely the distance between syllable dots and high-frequency syllables. Experimental data from large-scale application show that its accuracy is very close to 100%. We also developed tools to perform encoding conversion. Second, we have solved or partially solved many problems in rule-based Tibetan word segmentation, such as word disambiguation and Tibetan number identification, and implemented a segmenter. Experimental data show that the accuracy of Tibetan number identification is 99.21% and the F-score of the segmenter is 96.98%. An iterative training method is proposed to collect word frequency statistics before a good word segmenter is available; these statistics are used to resolve crossing ambiguities. A fast critical word detection method based on the double-array trie structure is also proposed. 
Applying these methods, a segmenter named "SegT" is implemented. We have summarized the structure of Tibetan numbers and sorted Tibetan number components into different classes, namely basic number, number prefix, number linker, number suffix, and independent number. Based on this classification, a method is proposed to identify Tibetan numbers. During segmentation, the method first tags each number component according to its class and then updates the tag sequence according to predefined rules. Finally, adjacent number components are combined into a Tibetan number if they meet certain requirements. Experimental data show that the fast critical word detection method improves segmentation speed by about 15% but does not improve segmentation precision. The accuracy of Tibetan number identification is 99.21%. The F-score of the segmenter is 96.98% on a corpus of 1,000 manually segmented Tibetan sentences. Third, we have reformulated Tibetan word segmentation as a syllable labelling problem and applied a statistical method to it. We compared the effects of different feature templates and corpus sizes on performance. The segmenter achieves an F-score of 95.12% on the test set. We propose a novel approach to Tibetan word segmentation using conditional random fields. The approach treats segmentation as a syllable tagging problem: it labels each syllable with a word-internal position tag and combines syllables into words according to their tags. As there is no publicly available Tibetan word segmentation corpus, the training corpus is generated by "SegT", which has an F-score of 96.94% on the test set. Two feature template sets, TMPT-6 and TMPT-10, are used and compared, and the results show that the former performs better. Experiments also show that a larger training set improves performance significantly. 
Trained on a set of 131,903 sentences, the segmenter achieves an F-score of 95.12% on the test set of 1,000 sentences. Finally, we have analyzed the distribution of Tibetan text on the internet and built a large-scale Tibetan text corpus containing nearly 1.59 million sentences, or 35 million syllables in total. A general-purpose Tibetan search engine has also been developed. |
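
The syllable tagging formulation described above can be illustrated with a minimal sketch. It assumes a four-tag scheme (B = word-initial, M = word-internal, E = word-final, S = single-syllable word); the abstract does not state which tag set is actually used, so the tag names and the function names `words_to_tags` and `tags_to_words` are hypothetical. The first function shows how a segmented corpus could be turned into training labels, and the second shows how predicted tags are combined back into words. The conditional random field itself is not shown.

```python
from typing import List


def words_to_tags(words: List[List[str]]) -> List[str]:
    """Turn segmented words (each a list of syllables) into position tags.

    Assumed tag set (not confirmed by the source): B, M, E, S.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-syllable word
        else:
            # first syllable, any middle syllables, last syllable
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(syllables: List[str], tags: List[str]) -> List[List[str]]:
    """Combine syllables into words according to their position tags."""
    if len(syllables) != len(tags):
        raise ValueError("syllables and tags must have the same length")
    words: List[List[str]] = []
    current: List[str] = []
    for syllable, tag in zip(syllables, tags):
        # B and S open a new word; flush anything left from a malformed sequence
        if tag in ("B", "S") and current:
            words.append(current)
            current = []
        current.append(syllable)
        # E and S close the current word
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # sequence ended without a closing E tag
        words.append(current)
    return words


if __name__ == "__main__":
    # Placeholder syllables stand in for real Tibetan syllables.
    segmented = [["s1", "s2"], ["s3"], ["s4", "s5", "s6"]]
    tags = words_to_tags(segmented)
    flat = [syllable for word in segmented for syllable in word]
    print(tags)
    print(tags_to_words(flat, tags))
```

Decoding tolerates malformed tag sequences (for example a B that is never closed by an E) by starting a new word at the next B or S. This is one possible design choice, not necessarily the one used by the segmenter described here.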