Scientific report: "Retrieving Collocations by Co-occurrences and Word Order Constraints"

Sayori Shimohata, Toshiyuki Sugio and Junji Nagata
Kansai Laboratory, Research & Development Group
Oki Electric Industry Co., Ltd.
Crystal Tower, 1-2-27 Shiromi, Chuo-ku, Osaka 540, Japan
sayori, sugio, nagata @

Abstract

In this paper, we describe a method for automatically retrieving collocations from large text corpora. The method retrieves collocations in two stages: (1) extracting strings of characters as units of collocations, and (2) extracting recurrent combinations of those strings, in accordance with their word order in a corpus, as collocations. The method retrieves a wide range of collocations, especially domain-specific collocations. It is practical because it uses plain text without any language-dependent information such as lexical knowledge or parts of speech.

1 Introduction

A collocation is a recurrent combination of words, ranging from word level to sentence level. In this paper, we classify collocations into two types according to their structure. One is the uninterrupted collocation, which consists of a sequence of words; the other is the interrupted collocation, which consists of words with one or several gaps that are filled in by substitutable words or phrases belonging to the same category. The features of collocations are defined as follows: collocations are recurrent; collocations consist of one or several lexical units; and the order of units is rigid in a collocation.
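The two collocation types defined above can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names `recurrent_ngrams` and `recurrent_ordered_pairs`, the toy corpus, the whitespace tokenization, and the `min_count` threshold are all assumptions made for illustration (the paper itself starts from strings of characters rather than pre-split words). The first function counts recurrent contiguous word sequences (uninterrupted candidates); the second counts word pairs that recur in the same order with at least one intervening word (interrupted candidates).

```python
from collections import Counter
from itertools import combinations


def recurrent_ngrams(sentences, n, min_count=2):
    """Return contiguous word n-grams that occur at least min_count times."""
    counts = Counter()
    for sent in sentences:
        words = sent.split()  # assumption: whitespace tokenization
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def recurrent_ordered_pairs(sentences, min_count=2):
    """Return ordered word pairs (a, b), with a before b and at least one
    word between them, that occur in at least min_count sentences."""
    counts = Counter()
    for sent in sentences:
        words = sent.split()
        seen = set()  # count each pair once per sentence
        for i, j in combinations(range(len(words)), 2):
            if j - i > 1:  # require a gap, so the pair is "interrupted"
                seen.add((words[i], words[j]))
        counts.update(seen)
    return {pair: c for pair, c in counts.items() if c >= min_count}


# Hypothetical toy corpus, not taken from the paper.
corpus = [
    "refer to the manual for details",
    "refer to the appendix for details",
    "please refer to the index",
]

ngrams = recurrent_ngrams(corpus, 3)
pairs = recurrent_ordered_pairs(corpus)
```

On this toy corpus the only recurrent trigram is "refer to the", while the pair ("the", "for") recurs with a gap filled by "manual" or "appendix", which matches the idea of a gap filled by substitutable words. Because the pairs are ordered, ("for", "the") is never counted, reflecting the rigid order of units.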
For language processing tasks such as machine translation, knowledge of domain-specific collocations is indispensable, because the meaning of a collocation differs from its literal meaning, and both the usage and the meaning of a collocation depend entirely on the domain. In addition, new collocations are produced one after another, and most of them are technical jargon. There has been growing interest in corpus-based approaches that retrieve collocations from large corpora (Nagao and Mori, 1994; Ikehara et .
