tailieunhanh - Scientific report: "A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora"

E. Gaussier, Renders, I. Matveeva, C. Goutte, H. Dejean
Xerox Research Centre Europe, 6 Chemin de Maupertuis, 38320 Meylan, France
Dept. of Computer Science, University of Chicago, 1100 E. 58th St., Chicago, IL 60637, USA
matveeva@

Abstract

We present a geometric view on bilingual lexicon extraction from comparable corpora, which allows one to re-interpret the methods proposed so far and to identify unresolved problems. This motivates three new methods that aim at solving these problems. Empirical evaluation shows the strengths and weaknesses of these methods, as well as a significant gain in the accuracy of extracted lexicons.

1 Introduction

Comparable corpora contain texts written in different languages that, roughly speaking, "talk about the same thing". In comparison to parallel corpora, i.e. corpora which are mutual translations, comparable corpora have not received much attention from the research community, and very few methods have been proposed to extract bilingual lexicons from such corpora. However, except for those found in translation services or in a few international organisations, which by nature produce parallel documentation, most existing multilingual corpora are not parallel but comparable. This concern is reflected in major evaluation conferences on cross-language information retrieval (CLIR), e.g. CLEF, which only use comparable corpora for their multilingual tracks.
We adopt here a geometric view on bilingual lexicon extraction from comparable corpora, which allows one to re-interpret the methods proposed thus far and to formulate new ones inspired by latent semantic analysis (LSA), which was developed within the information retrieval (IR) community to treat synonymous and polysemous terms (Deerwester et al., 1990). We will explain in this paper the motivations behind the use of such methods for bilingual lexicon extraction from comparable corpora and show how to
