A Trainable Rule-based Algorithm for Word Segmentation

David D. Palmer
The MITRE Corporation
202 Burlington Rd.
Bedford, MA 01730, USA

Abstract

This paper presents a trainable rule-based algorithm for performing word segmentation. The algorithm provides a simple, language-independent alternative to large-scale lexical-based segmenters requiring large amounts of knowledge engineering. As a stand-alone segmenter, we show our algorithm to produce high-performance Chinese segmentation. In addition, we show the transformation-based algorithm to be effective in improving the output of several existing word segmentation algorithms in three different languages.

1 Introduction

This paper presents a trainable rule-based algorithm for performing word segmentation. Our algorithm is effective both as a high-accuracy stand-alone segmenter and as a postprocessor that improves the output of existing word segmentation algorithms. In the writing systems of many languages, including Chinese, Japanese, and Thai, words are not delimited by spaces. Determining the word boundaries, thus tokenizing the text, is usually one of the first necessary processing steps, making tasks such as part-of-speech tagging and parsing possible.
A variety of methods have recently been developed to perform word segmentation, and the results have been published [1]. A major difficulty in evaluating segmentation algorithms is that there are no widely-accepted guidelines as to what constitutes a word, and there is therefore no agreement on how to correctly segment a text in an unsegmented language. It is frequently mentioned in segmentation papers that native speakers of a language do not always agree about the correct segmentation, and that the same text could be segmented into several very different and equally correct sets of words by different native speakers.

[1] Most published segmentation work has been done for Chinese. For a discussion of recent Chinese segmentation work, see Sproat et al. (1996).
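To make the disagreement problem concrete, the following minimal sketch compares two segmentations of the same unspaced text by the word boundaries they propose. It is an illustration only, not the evaluation measure used in this paper: the function names `boundaries` and `boundary_agreement`, the agreement ratio, and the English example strings are all assumptions introduced here.

```python
from typing import List, Set


def boundaries(words: List[str]) -> Set[int]:
    """Return the character offsets where a word ends, excluding the end of the text."""
    offsets = set()
    position = 0
    for word in words[:-1]:
        position += len(word)
        offsets.add(position)
    return offsets


def boundary_agreement(seg_a: List[str], seg_b: List[str]) -> float:
    """Fraction of boundaries proposed by either segmentation that both share."""
    if "".join(seg_a) != "".join(seg_b):
        raise ValueError("segmentations must cover the same text")
    a, b = boundaries(seg_a), boundaries(seg_b)
    union = a | b
    # Two segmentations with no internal boundaries agree trivially.
    return len(a & b) / len(union) if union else 1.0


# Hypothetical example: two defensible segmentations of "theblackboardisclean".
seg_a = ["the", "black", "board", "is", "clean"]
seg_b = ["the", "blackboard", "is", "clean"]

print(sorted(boundaries(seg_a)))          # [3, 8, 13, 15]
print(sorted(boundaries(seg_b)))          # [3, 13, 15]
print(boundary_agreement(seg_a, seg_b))   # 0.75
```

Both segmentations are arguably correct, yet they disagree on one of four proposed boundaries, so a score computed against either one alone would penalize the other.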
