Scientific report: "Cross Language Dependency Parsing using a Bilingual Lexicon∗"
Hai Zhao, Yan Song, Chunyu Kit, Guodong Zhou
Department of Chinese, Translation and Linguistics, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
School of Computer Science and Technology, Soochow University, Suzhou, China 215006
haizhao yansong ctckit @ gdzhou@

Abstract
This paper proposes an approach to enhance dependency parsing in a language by using a translated treebank from another language. A simple statistical machine translation method, word-by-word decoding, where not a parallel corpus but a bilingual lexicon is necessary, is adopted for the treebank translation. Using an ensemble method, the key information extracted from word pairs with dependency relations in the translated text is effectively integrated into the parser for the target language. The proposed method is evaluated on English and Chinese treebanks. It is shown that a translated English treebank helps a Chinese parser obtain a state-of-the-art result.

1 Introduction
Although supervised learning methods achieve state-of-the-art results in dependency parsing (McDonald et al., 2005; Hall et al., 2007), methods of this type often require a sufficiently large data set to reach a given parsing accuracy. However, annotating syntactic structure, whether phrase- or dependency-based, is a costly job.
Until now, the largest treebanks in various languages for syntax learning contain around one million words (or some other similar units). Limited data stand in the way of further performance enhancement. This is the case for each individual language, at least. But it is not the case when we consider all treebanks in different languages as a whole. For example, of the ten treebanks for the CoNLL-2007 shared task, none includes more than 500K

∗ The study is partially supported by City University of Hong Kong through the Strategic Research Grant 7002037 and 7002388. The first author is sponsored by a research .
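
The word-by-word decoding named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Token` type, the function names, the toy lexicon, and the rule of taking the first listed candidate are all assumptions made for illustration. Because each source word maps to exactly one target word and word order is unchanged, the head indices of the source tree carry over directly, and head-dependent word pairs can then be counted in the translated text.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    form: str  # surface word
    head: int  # 1-based index of the head token; 0 marks the root


def translate_word_by_word(sentence: List[Token],
                           lexicon: Dict[str, List[str]]) -> List[Token]:
    """Replace each word with a lexicon candidate, keeping the tree unchanged.

    Assumption: the first listed candidate is taken; words missing from the
    lexicon (or with no candidates) are copied through as-is.
    """
    translated = []
    for tok in sentence:
        candidates = lexicon.get(tok.form) or [tok.form]
        translated.append(Token(candidates[0], tok.head))
    return translated


def dependency_word_pairs(sentence: List[Token]) -> Counter:
    """Count (head word, dependent word) pairs, skipping the root arc."""
    pairs = Counter()
    for tok in sentence:
        if tok.head > 0:
            pairs[(sentence[tok.head - 1].form, tok.form)] += 1
    return pairs


# Hypothetical toy lexicon and source sentence, for illustration only.
toy_lexicon = {"I": ["我"], "read": ["读"], "books": ["书"]}
source = [Token("I", 2), Token("read", 0), Token("books", 2)]

translated = translate_word_by_word(source, toy_lexicon)
pairs = dependency_word_pairs(translated)
```

In this sketch the pair counts stand in for the "key information extracted from word pairs with dependency relations"; how such information is passed to the target-language parser is not covered here.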