Scientific report: "Conditional Random Fields for Word Hyphenation"

Nikolaos Trogkanis
Computer Science and Engineering
University of California, San Diego
La Jolla, California 92093-0404
tronikos@

Charles Elkan
Computer Science and Engineering
University of California, San Diego
La Jolla, California 92093-0404
elkan@

Abstract

Finding allowable places in words to insert hyphens is an important practical problem. The algorithm that is used most often nowadays has remained essentially unchanged for 25 years. This method is the TeX hyphenation algorithm of Knuth and Liang. We present here a hyphenation method that is clearly more accurate. The new method is an application of conditional random fields. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than for correctly allowed hyphens, and less than for Dutch. Experiments show that both the Knuth-Liang method and a leading current commercial alternative have error rates several times higher for both languages.

1 Introduction

The task that we investigate is learning to split words into parts that are conventionally agreed to be individual written units. In many languages, it is acceptable to separate these units with hyphens, but it is not acceptable to split words arbitrarily.
Another way of stating the task is that we want to learn to predict, for each letter in a word, whether or not it is permissible for the letter to be followed by a hyphen. This means that we tag each letter with either 1, for hyphen allowed following this letter, or 0, for hyphen not allowed after this letter. The hyphenation task is also called orthographic syllabification (Bartlett et al., 2008). It is an important issue in real-world text processing, as described further in Section 2 below. It is also useful as a preprocessing step to improve letter-to-phoneme conversion, and more generally for text-to-speech conversion. In the well-known NETtalk system for example .
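
The letter-tagging formulation described above can be illustrated with a minimal Python sketch. It only shows the encoding of a hyphenated word as a 0/1 tag per letter, and the inverse decoding; it is not the authors' code and does not include the conditional random field itself. The function names `word_to_tags` and `tags_to_word` and the example word are assumptions chosen for illustration, not taken from the paper or from CELEX.

```python
def word_to_tags(hyphenated: str) -> tuple[str, list[int]]:
    """Split a hyphenated word into its letters and one 0/1 tag per letter.

    Tag 1 means a hyphen is allowed after that letter; tag 0 means it is not.
    (Hypothetical helper, written only to illustrate the encoding.)
    """
    letters: list[str] = []
    tags: list[int] = []
    for ch in hyphenated:
        if ch == "-":
            # A hyphen marks the preceding letter as a permitted break point.
            if tags:
                tags[-1] = 1
        else:
            letters.append(ch)
            tags.append(0)
    return "".join(letters), tags


def tags_to_word(word: str, tags: list[int]) -> str:
    """Insert a hyphen after every letter whose tag is 1 (inverse of word_to_tags)."""
    if len(word) != len(tags):
        raise ValueError("word and tags must have the same length")
    parts: list[str] = []
    for ch, tag in zip(word, tags):
        parts.append(ch)
        if tag == 1:
            parts.append("-")
    return "".join(parts)


if __name__ == "__main__":
    # Illustrative input only; not drawn from the paper's training data.
    word, tags = word_to_tags("hy-phen-ate")
    print(word, tags)
    print(tags_to_word(word, tags))
```

Under this encoding, a sequence model only has to predict the tag list for an unhyphenated word; the permitted hyphen positions then follow directly from the predicted tags.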
