Scientific report: "Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization"

Alfio Gliozzo and Carlo Strapparava
ITC-Irst, via Sommarive, I-38050 Trento, ITALY
gliozzo strappa @

Abstract

Cross-language Text Categorization is the task of assigning semantic classes to documents written in a target language (e.g. English) while the system is trained using labeled documents in a source language (e.g. Italian). In this work we present several solutions, depending on the availability of bilingual resources, and we show that it is possible to deal with the problem even when no such resources are accessible. The core technique relies on the automatic acquisition of Multilingual Domain Models from comparable corpora. Experiments show the effectiveness of our approach, providing a low-cost solution for the Cross Language Text Categorization task. In particular, when bilingual dictionaries are available, the performance of the categorization gets close to that of monolingual text categorization.

1 Introduction

In the worldwide scenario of the Web age, multilinguality is a crucial issue to deal with and to investigate, leading us to reformulate most of the classical Natural Language Processing (NLP) problems in a multilingual setting. For instance, the classical monolingual Text Categorization (TC) problem can be reformulated as a Cross Language Text Categorization (CLTC) task, in which the system is trained using labeled examples in a source language (e.g. English) and classifies documents in a different target language (e.g. Italian).

The practical interest of CLTC is immediately clear in the globalized Web scenario. For example, in community-based trade (e.g. eBay) it is often necessary to archive texts in different languages by adopting common merceological categories, very often defined by collections of documents in a source language (e.g. English). Another application along this direction is Cross Lingual Question Answering, in which it would be very useful to
