Scientific report: "Combining Distributional and Morphological Information for Part of Speech Induction"

Alexander Clark
ISSCO / TIM, University of Geneva
UNI-MAIL, Boulevard du Pont-d'Arve
CH-1211 Genève 4, Switzerland

Abstract

In this paper we discuss algorithms for clustering words into classes from unlabelled text using unsupervised algorithms, based on distributional and morphological information. We show how the use of morphological information can improve the performance on rare words, and that this is robust across a wide range of languages.

1 Introduction

The task studied in this paper is the unsupervised learning of parts-of-speech, that is to say lexical categories corresponding to traditional notions of, for example, nouns and verbs. As is often the case in machine learning of natural language, there are two parallel motivations: first, a simple engineering one, since the induction of these categories can help in smoothing and generalising other models, particularly in language modelling for speech recognition, as explored by Ney et al. (1994); and secondly, a cognitive science motivation, namely exploring how evidence in the primary linguistic data can account for first language acquisition by infant children (Finch and Chater, 1992a; Finch and Chater, 1992b; Redington et al., 1998). At this early phase of learning, only limited sources of information can be used: primarily distributional evidence, about the contexts in which words occur, and morphological evidence (more strictly, phonotactic or orthotactic evidence) about the sequence of symbols (letters or phonemes) of which each word is formed.

A number of different approaches have been presented for this task using exclusively distributional evidence to cluster the words together, starting with (Lamb, 1961), and these have been shown to produce good results in English, Japanese and Chinese. These languages have, however, rather simple morphology, and thus words will tend to have higher frequency than in more morphologically complex languages. In
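To make the two sources of evidence concrete, the following is a minimal sketch of how each might be extracted from raw text. It is an illustration only, not the algorithm of this paper: the function names `context_counts` and `suffixes`, the boundary markers `<s>` and `</s>`, and the choice of word-final character sequences as a stand-in for orthotactic evidence are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def context_counts(tokens):
    """Distributional evidence (illustrative): for each word, count the
    (left neighbour, right neighbour) pairs it occurs between."""
    contexts = defaultdict(Counter)
    # Hypothetical boundary markers so the first and last word have a context.
    padded = ["<s>"] + list(tokens) + ["</s>"]
    for i in range(1, len(padded) - 1):
        contexts[padded[i]][(padded[i - 1], padded[i + 1])] += 1
    return contexts


def suffixes(word, max_len=3):
    """Orthotactic evidence (illustrative): word-final character
    sequences of length 1 up to max_len."""
    return [word[-n:] for n in range(1, min(max_len, len(word)) + 1)]


tokens = "the dog barked and the cat walked".split()
ctx = context_counts(tokens)

# "dog" and "cat" both follow "the", which is distributional evidence
# that they may belong to the same class.
print(ctx["dog"])
print(ctx["cat"])

# "barked" and "walked" share the ending "ed", which is available even
# when a word is too rare to have reliable context counts.
print(suffixes("barked"))
print(suffixes("walked"))
```

In a sketch like this, a rare word contributes very few context counts, while its character sequence is always fully observed; this is the intuition behind adding morphological information for rare words.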
