Scientific report: "Experiments on Candidate Data for Collocation Extraction"

Experiments on Candidate Data for Collocation Extraction

Stefan Evert and Hannah Kermes
Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart
evert kermes @

Abstract

The paper describes ongoing work on the evaluation of methods for extracting collocation candidates from large text corpora. Our research is based on a German treebank corpus used as gold standard. Results are available for adjective+noun pairs, which proved to be a comparatively easy extraction task. We plan to extend the evaluation to other types of collocations (e.g., PP+verb pairs).

1 Introduction

While a mostly British tradition based on the ideas of J. R. Firth defines collocations as significantly frequent combinations of words cooccurring within a given text span, applications in terminology, lexicography, and natural language processing prefer a more restricted view. Collocations are understood as unpredictable combinations of words in a particular morpho-syntactic relation (such as adjectives modifying nouns, direct objects of verbs, or English noun-noun compounds). The extraction of such collocations from text corpora is usually performed in a three-stage process (cf. Krenn 2000, 28-32, and references therein):

1. The source corpus is annotated with varying amounts of linguistic information, ranging from part-of-speech tags to full parse trees, depending on the tools available. Then a list of word pairs satisfying the required morpho-syntactic constraints is extracted (typically based on part-of-speech patterns). This first candidate list will contain both collocational and non-collocational pairs.

2. Linguistic and/or heuristic filters may be applied to reduce the size of the candidate set. For instance, certain generic adjectives, as well as those derived from verb participles, are rarely found in adjective+noun collocations.

3. The remaining candidates are ranked by statistical measures based on their frequency profiles. Usually word pairs are considered likely to be collocations if their
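
The three-stage process described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the tag names ADJ and NOUN, the stop list of generic adjectives, and the function names are all hypothetical. The text does not name a specific ranking measure at this point, so pointwise mutual information is used here purely as an assumed stand-in for "statistical measures based on frequency profiles".

```python
import math
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (word, POS tag) pairs.
# The tag names ADJ and NOUN are illustrative, not the paper's tagset.
CORPUS = [
    [("strong", "ADJ"), ("tea", "NOUN")],
    [("strong", "ADJ"), ("tea", "NOUN")],
    [("strong", "ADJ"), ("wind", "NOUN")],
    [("other", "ADJ"), ("tea", "NOUN")],
    [("other", "ADJ"), ("wind", "NOUN")],
    [("other", "ADJ"), ("house", "NOUN")],
]

# Assumed stop list of generic adjectives for the filtering stage.
GENERIC_ADJECTIVES = {"other"}


def extract_candidates(corpus):
    """Stage 1: count adjacent ADJ NOUN pairs using a part-of-speech pattern."""
    pairs = Counter()
    for sentence in corpus:
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            if t1 == "ADJ" and t2 == "NOUN":
                pairs[(w1, w2)] += 1
    return pairs


def filter_candidates(pairs, stop_adjectives):
    """Stage 2: drop pairs whose adjective is on a heuristic stop list."""
    return Counter({p: f for p, f in pairs.items() if p[0] not in stop_adjectives})


def rank_candidates(pairs, all_pairs):
    """Stage 3: rank by pointwise mutual information (an assumed measure).

    Marginal frequencies are taken from the unfiltered candidate list.
    """
    total = sum(all_pairs.values())
    adj_freq, noun_freq = Counter(), Counter()
    for (adj, noun), freq in all_pairs.items():
        adj_freq[adj] += freq
        noun_freq[noun] += freq

    def pmi(pair):
        adj, noun = pair
        return math.log2(pairs[pair] * total / (adj_freq[adj] * noun_freq[noun]))

    return sorted(pairs, key=pmi, reverse=True)


candidates = extract_candidates(CORPUS)
filtered = filter_candidates(candidates, GENERIC_ADJECTIVES)
ranking = rank_candidates(filtered, candidates)
```

In this sketch the marginal frequencies come from the full stage-1 list rather than the filtered one, so that filtering does not distort the frequency profile of the remaining pairs. That is one possible design choice, not something the text prescribes.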
