Scientific report: "Clustering Comparable Corpora For Bilingual Lexicon Extraction"

Bo Li, Eric Gaussier
UJF-Grenoble 1, CNRS, France
LIG UMR 5217

Akiko Aizawa
National Institute of Informatics, Tokyo, Japan
aizawa@

Abstract

We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus and preserves most of the vocabulary of the original corpus. Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches.

1 Introduction

Bilingual lexicons are an important resource in multilingual natural language processing tasks such as statistical machine translation (Och and Ney, 2003) and cross-language information retrieval (Ballesteros and Croft, 1997). Because it is expensive to manually build bilingual lexicons adapted to different domains, researchers have tried to automatically extract bilingual lexicons from various corpora. Compared with parallel corpora, it is much easier to build high-volume comparable corpora, i.e. corpora consisting of documents in different languages covering overlapping information. Several studies have focused on the extraction of bilingual lexicons from comparable corpora (Fung and McKeown, 1997; Fung and Yee, 1998; Rapp, 1999; Dejean et al., 2002; Gaussier et al., 2004; Robitaille et al., 2006; Morin et al., 2007; Garera et al., 2009; Yu and Tsujii, 2009; Shezaf and Rappoport, 2010).

The basic assumption behind most studies on lexicon extraction from comparable corpora is a distributional hypothesis, stating that words which are translations of each other are likely to appear in similar contexts across languages. On top of this hypothesis, researchers have ...
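The distributional hypothesis stated in the introduction can be illustrated with a minimal Python sketch. This is a generic illustration of context-vector comparison, not the clustering method of the paper: the toy sentences, the seed dictionary, the window size, and all function names (`context_vectors`, `project`, `cosine`, `rank_candidates`) are assumptions introduced here for illustration only.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = vectors.setdefault(word, Counter())
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def project(vector, seed_dict):
    """Translate a source context vector with a seed dictionary.

    Context words missing from the dictionary are simply dropped.
    """
    projected = Counter()
    for word, count in vector.items():
        for translation in seed_dict.get(word, []):
            projected[translation] += count
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def rank_candidates(source_word, src_vectors, tgt_vectors, seed_dict):
    """Rank target words by similarity to the projected context of `source_word`."""
    projected = project(src_vectors[source_word], seed_dict)
    scored = [(cosine(projected, vec), word) for word, vec in tgt_vectors.items()]
    # Highest similarity first; ties broken alphabetically for determinism.
    return [word for _, word in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical toy "comparable corpus": similar content, not sentence-aligned translations.
english = [["the", "cat", "drinks", "milk"], ["the", "dog", "drinks", "water"]]
french = [["le", "chat", "boit", "lait"], ["le", "chien", "boit", "eau"]]

# Hypothetical seed dictionary; "cat" and "dog" are deliberately absent.
seed = {"the": ["le"], "drinks": ["boit"], "milk": ["lait"], "water": ["eau"]}

ranking = rank_candidates("cat", context_vectors(english), context_vectors(french), seed)
print(ranking)
```

On this toy data the top-ranked candidate for "cat" is "chat", because the two words share the translated context words "le", "boit" and "lait". The less comparable the two corpora are, the less such shared context exists, which is the motivation for enhancing corpus comparability before extraction.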
