tailieunhanh - Scientific report: "Unsupervised Segmentation of Words Using Prior Distributions of Morph Length and Frequency"

Mathias Creutz
Neural Networks Research Centre, Helsinki University of Technology
9800, FIN-02015 HUT, Finland

Abstract

We present a language-independent and unsupervised algorithm for the segmentation of words into morphs. The algorithm is based on a new generative probabilistic model, which makes use of relevant prior information on the length and frequency distributions of morphs in a language. Our algorithm is shown to outperform two competing algorithms when evaluated on data from a language with agglutinative morphology (Finnish), and also to perform well on English data.

1 Introduction

In order to artificially understand or produce natural language, a system presumably has to know the elementary building blocks, i.e., the lexicon, of the language. Additionally, the system needs to model the relations between these lexical units. Many existing NLP (natural language processing) applications make use of words as such units. For instance, in statistical language modelling, probabilities of word sequences are typically estimated, and bag-of-word models are common in information retrieval.

However, for some languages it is infeasible to construct lexicons for NLP applications if the lexicons contain entire words. Especially in agglutinative languages¹ such as Finnish and Turkish, the number of possible different word forms is simply too high. For example, in Finnish a single verb may appear in thousands of different forms (Karlsson, 1987).

According to linguistic theory, words are built from smaller units, morphemes. Morphemes are the smallest meaning-bearing elements of language and could be used as lexical units instead of entire words. However, the construction of a comprehensive morphological lexicon or analyzer based on linguistic theory requires a considerable amount of work by .

¹ In agglutinative languages, words are formed by the concatenation of morphemes.
