An IR Approach for Translating New Words from Nonparallel, Comparable Texts

Pascale Fung and Lo Yuen Yee
HKUST Human Language Technology Center
Department of Electrical and Electronic Engineering
University of Science and Technology
Clear Water Bay, Hong Kong
pascale eeyy @

1 Introduction

In recent years there has been phenomenal growth in the amount of online text material available from the greatest information repository known, the World Wide Web. Various traditional information retrieval (IR) techniques, combined with natural language processing (NLP) techniques, have been re-targeted to enable efficient access to the WWW: search engines, indexing, relevance feedback, query term and keyword weighting, document analysis, document classification, etc. Most of these techniques aim at efficient online search for information already on the Web.

Meanwhile, the corpus linguistics community regards the WWW as a vast potential source of corpus resources. It is now possible to download a large amount of text with automatic tools when one needs to compute, for example, a list of synonyms, or to download domain-specific monolingual texts by specifying a keyword to the search engine and then use these texts to extract domain-specific terms. It remains to be seen how we can also make use of multilingual texts as NLP resources.

In the years since the appearance of the first papers on using statistical models for bilingual lexicon compilation and machine translation (Brown et al., 1993; Brown et al., 1991; Gale and Church, 1993; Church, 1993; Simard et al., 1992), a large amount of human effort and time has been invested in collecting parallel corpora of translated texts. Our goal is to alleviate this effort and enlarge the scope of corpus resources by looking into monolingual, comparable texts. Texts of this type are known as nonparallel corpora. Such nonparallel monolingual texts should be much more prevalent than parallel texts. However, previous attempts at using nonparallel corpora for terminology [...] were constrained by the inadequate availability of same-domain, comparable texts in electronic form. The types of nonparallel texts obtained from the LDC or university libraries were often restricted, and were usually out of date as soon as they became available. For new word translation, the timeliness of corpus resources is a prerequisite, as is the continuous and automatic availability of nonparallel, comparable texts in electronic form.