Scientific report: "Multilingual Lexical Database Generation from parallel texts in 20 European languages"

Multilingual Lexical Database Generation from parallel texts in 20 European languages with endogenous resources

GIGUET Emmanuel
GREYC CNRS UMR 6072, Université de Caen, 14032 Caen Cedex, France
giguet@

LUQUET Pierre-Sylvain
GREYC CNRS UMR 6072, Université de Caen, 14032 Caen Cedex, France
psluquet@

Abstract

This paper deals with multilingual database generation from parallel corpora. The idea is to contribute to the enrichment of lexical databases for languages with few linguistic resources. Our approach is endogenous: it relies on the raw texts only and does not require external linguistic resources such as stemmers or taggers. The system produces alignments for the 20 European languages of the ‘Acquis Communautaire’ Corpus.

1 Introduction

Automatic processing of bilingual and multilingual corpora

Processing bilingual and multilingual corpora constitutes a major area of investigation in natural language processing. The linguistic and translational information they contain makes them a valuable resource for translators, lexicographers, and terminologists. They constitute the nucleus of example-based machine translation and translation memory systems.

Another field of interest is the construction of multilingual lexical databases, such as the project planned by the European Commission's Joint Research Centre (JRC) or the more established Papillon project. Multilingual lexical databases are databases for structured lexical data which can be used either by humans, e.g. to define their own dictionaries, or by natural language processing (NLP) applications.

Parallel corpora are freely available for research purposes, and their increasing size demands the exploration of automatic methods. The Acquis Communautaire (AC) Corpus is one such corpus. Many research teams are involved in the JRC project for the enrichment of a multilingual lexical database. The aim of the project is to achieve automatic extraction of lexical tuples from the AC Corpus.
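
To make the idea of extracting lexical tuples from raw parallel text more concrete, here is a minimal sketch. It is not the method described in this paper: it swaps in a simple Dice co-occurrence score over aligned segments, chosen only because it needs no stemmer, tagger, or dictionary. The function name `dice_candidates`, the `min_score` parameter, and the toy English/French segments are all hypothetical and are not taken from the AC Corpus.

```python
from collections import Counter
from itertools import product


def dice_candidates(aligned_pairs, min_score=0.8):
    """Score (source word, target word) pairs by segment co-occurrence.

    aligned_pairs: iterable of (source_segment, target_segment) strings.
    Returns a dict mapping word pairs to their Dice coefficient.
    """
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in aligned_pairs:
        # Whitespace tokenization only: no external linguistic resources.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        # Count each word pair once per aligned segment.
        joint.update(product(src_words, tgt_words))

    scores = {}
    for (s, t), n in joint.items():
        # Dice coefficient: 2 * joint count / (source count + target count).
        score = 2 * n / (src_freq[s] + tgt_freq[t])
        if score >= min_score:
            scores[(s, t)] = score
    return scores


# Hand-written toy segments (assumption: illustrative data only).
toy = [
    ("the council adopted the regulation", "le conseil a adopté le règlement"),
    ("the commission adopted the directive", "la commission a adopté la directive"),
    ("the council rejected the directive", "le conseil a rejeté la directive"),
]
candidates = dice_candidates(toy, min_score=1.0)
```

On data this small, correct pairs such as ("council", "conseil") are kept, but so are spurious ones such as ("the", "a"), since function words co-occur in every segment. This suggests why corpus size and more careful alignment strategies matter for this kind of extraction.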