Scientific report: "Using Similarity Scoring To Improve the Bilingual Dictionary for Word Alignment"

Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 409-416.

Katharina Probst, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA 15213, kathrin@
Ralf Brown, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA 15213, ralf@

Abstract

We describe an approach to improve the bilingual cooccurrence dictionary that is used for word alignment, and evaluate the improved dictionary using a version of the Competitive Linking algorithm. We demonstrate a problem faced by the Competitive Linking algorithm and present an approach to ameliorate it. In particular, we rebuild the bilingual dictionary by clustering similar words in a language and assigning them a higher cooccurrence score with a given word in the other language than each single word would have otherwise. Experimental results show a significant improvement in precision and recall for word alignment when the improved dictionary is used.

1 Introduction and Related Work

Word alignment is a well-studied problem in Natural Language Computing.
This is hardly surprising given its significance in many applications: word-aligned data is crucial for example-based machine translation and statistical machine translation, but also for other applications such as cross-lingual information retrieval. Since hand-aligning bilingual data is a hard and time-consuming task, the automation of this task receives a fair amount of attention.

In this paper, we present an approach to improve the bilingual dictionary that is used by word alignment algorithms. Our method is based on similarity scores between words, which in effect results in the clustering of morphological variants.

One line of related work is research in clustering based on word similarities. This problem is an area of active research in the Information Retrieval community. For .