Accurate Collocation Extraction Using a Multilingual Parser

Violeta Seretan
Language Technology Laboratory, University of Geneva
2 rue de Candolle, 1211 Geneva

Eric Wehrli
Language Technology Laboratory, University of Geneva
2 rue de Candolle, 1211 Geneva

Abstract

This paper focuses on the use of advanced techniques of text analysis as support for collocation extraction. A hybrid system is presented that combines statistical methods and multilingual parsing for detecting accurate collocational information from English, French, Spanish and Italian corpora. The advantage of relying on full parsing over using a traditional window method (which ignores the syntactic information) is first theoretically motivated, then empirically validated by a comparative evaluation experiment.

1 Introduction

Recent computational linguistics research has fully acknowledged the stringent need for a systematic and appropriate treatment of phraseological units in natural language processing applications (Sag et al., 2002). Syntagmatic relations between words, also called multi-word expressions or "idiosyncratic interpretations that cross word boundaries" (Sag et al., 2002, p. 2), constitute an important part of the lexicon of a language: according to Jackendoff (1997), they are at least as numerous as single words, while according to Mel'čuk (1998) they outnumber single words ten to one.
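As a point of reference for the window method that the abstract contrasts with full parsing, the following is a minimal sketch of window-based candidate extraction followed by a statistical association score. It is not the system described in this paper. The function names `window_pairs` and `pmi_scores`, the window size, and the choice of pointwise mutual information as the association measure are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def window_pairs(tokens, window=5):
    """Return every ordered pair (w1, w2) where w2 follows w1
    within `window` tokens. No syntactic information is used."""
    pairs = []
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + 1 + window]:
            pairs.append((w1, w2))
    return pairs


def pmi_scores(sentences, window=5):
    """Score window-based candidate pairs with pointwise mutual
    information (an illustrative measure, assumed here)."""
    pair_counts = Counter()
    for tokens in sentences:
        pair_counts.update(window_pairs(tokens, window))

    total = sum(pair_counts.values())
    left = Counter()   # how often a word occurs as first element
    right = Counter()  # how often a word occurs as second element
    for (w1, w2), count in pair_counts.items():
        left[w1] += count
        right[w2] += count

    return {
        (w1, w2): log2(count * total / (left[w1] * right[w2]))
        for (w1, w2), count in pair_counts.items()
    }


# Toy input: the window pairs "schedule" with "meeting", but it
# equally pairs "schedule" with "a", since proximity is the only
# criterion for candidate selection.
candidates = window_pairs(["schedule", "a", "meeting"], window=2)
```

In this toy input, `candidates` contains `("schedule", "meeting")` alongside `("schedule", "a")` and `("a", "meeting")`. A parse-based extractor would instead restrict candidates to word pairs that stand in a syntactic relation, which is the advantage the abstract argues for.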
Phraseological units include a wide range of phenomena, among which we mention compound nouns (dead end), phrasal verbs (ask out), idioms (lend somebody a hand), and collocations (fierce battle, daunting task, schedule a meeting). They pose important problems for NLP applications, from the perspective of both text analysis and text production. In particular, collocations are highly problematic for at least two reasons: first, because their linguistic status and properties are unclear; as pointed out by McKeown and Radev (2000), their definition is rather vague and the distinction from other .