tailieunhanh - Scientific report: "a new text alignment architecture"
ATLAS - a new text alignment architecture

Bettina Schrader
Institute of Cognitive Science
University of Osnabrück
49069 Osnabrück
bschrade@

Abstract

We are presenting a new, hybrid alignment architecture for aligning bilingual, linguistically annotated parallel corpora. It is able to align simultaneously at paragraph, sentence, phrase and word level, using statistical and heuristic cues, along with linguistics-based rules. The system currently aligns English and German texts, and the linguistic annotation used covers POS-tags, lemmas and syntactic constituents. However, as the system is highly modular, we can easily adapt it to new language pairs and other types of annotation. The hybrid nature of the system allows experiments with a variety of alignment cues to find solutions to word alignment problems like the correct alignment of rare words and multiwords, or how to align despite syntactic differences between two languages. First performance tests are promising, and we are setting up a gold standard for a thorough evaluation of the system.

1 Introduction

Aligning parallel text, i.e. automatically setting the sentences or words in one text into correspondence with their equivalents in a translation, is a very useful preprocessing step for a range of applications, including but not limited to machine translation (Brown et al., 1993), cross-language information retrieval (Hiemstra, 1996), dictionary creation (Smadja et al., 1996) and induction of NLP-tools (Kuhn, 2004). Aligned corpora can also be used in translation studies (Neumann and Hansen-Schirra, 2005).

The alignment of sentences can be done sufficiently well using cues such as sentence length (Gale and Church, 1993) or cognates (Simard et al., 1992). Word alignment, however, is almost exclusively done using statistics (Brown et al., 1993; Hiemstra, 1996; Vogel et al., 1999; Toutanova et al., 2002). Hence it is difficult to align so-called rare events, i.e. tokens with a frequency below 10. This is a considerable drawback as rare events ...
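To make the notion of rare events concrete, the following minimal sketch counts token frequencies in a tokenised text and collects the types that occur fewer than 10 times. It is an illustration only and not part of ATLAS: the function names `rare_tokens` and `rare_type_ratio`, the constant `RARE_THRESHOLD` and the toy corpus are assumptions introduced here; only the threshold of 10 comes from the text above.

```python
from collections import Counter

# Threshold taken from the text: tokens with a frequency below 10 count as rare.
RARE_THRESHOLD = 10


def rare_tokens(tokens, threshold=RARE_THRESHOLD):
    """Return the set of token types that occur fewer than `threshold` times."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n < threshold}


def rare_type_ratio(tokens, threshold=RARE_THRESHOLD):
    """Return the share of token types that are rare (0.0 for an empty input)."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    rare = sum(1 for n in counts.values() if n < threshold)
    return rare / len(counts)


# Hypothetical toy corpus: two frequent types and two rare ones.
corpus = ("the " * 12 + "alignment " * 10 + "corpus " * 3 + "Osnabrück").split()

print(sorted(rare_tokens(corpus)))
print(rare_type_ratio(corpus))
```

In this toy corpus, half of the types fall below the threshold, which hints at why purely frequency-based word alignment has little evidence to work with for such tokens.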