Summary of Computer Science Master Thesis: Enhancing the Quality of Machine Translation System Using Cross-Lingual Word Embedding Models
The purpose of this thesis is to propose two models that use cross-lingual word embeddings to address two impediments in machine translation. The first model enhances the quality of the phrase-table in SMT, and the second tackles the unknown-word problem in NMT.

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

NGUYEN MINH THUAN

ENHANCING THE QUALITY OF MACHINE TRANSLATION SYSTEM USING CROSS-LINGUAL WORD EMBEDDING MODELS

Major: Computer Science
Code: 8480101.01

SUMMARY OF COMPUTER SCIENCE MASTER THESIS

SUPERVISOR: Associate Professor Nguyen Phuong Thai

Publication: Minh-Thuan Nguyen, Van-Tan Bui, Huy-Hien Vu, Phuong-Thai Nguyen, Chi-Mai Luong. Enhancing the quality of Phrase-table in Statistical Machine Translation for Less-Common and Low-Resource Languages. In the 2018 International Conference on Asian Language Processing (IALP 2018).

Hanoi, 10/2018

Chapter 1 Introduction

This chapter introduces the motivation of the thesis, related work, and our proposed models.

Machine translation (MT) systems have achieved considerable success in practice, and the two most widely used approaches are phrase-based statistical machine translation (PBSMT) and neural machine translation (NMT). In PBSMT, a good phrase-table can improve translation quality. However, obtaining a rich phrase-table is a challenge, because the phrase-table is extracted and trained from large bilingual corpora, which require much effort and financial support to build, especially for less-common languages such as Vietnamese, Lao, etc. In NMT, to reduce computational complexity, conventional systems often limit their vocabularies to the top 30K-80K most frequent words in the source and target languages, and all words outside the vocabulary, called unknown words, are replaced with a single unk symbol.
This approach makes it impossible to generate proper translations for these unknown words during testing. Recently, several approaches have been proposed to address the above impediments. In particular, techniques using word embeddings have received much interest from natural .
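The vocabulary truncation described in the introduction can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `build_vocab` and `replace_unknown`, the `<unk>` spelling of the unknown-word symbol, and the toy corpus are all assumptions made for illustration. The sketch keeps only the most frequent tokens and maps every other token to the single unknown symbol, which shows why distinct out-of-vocabulary words become indistinguishable to the model.

```python
from collections import Counter

UNK = "<unk>"  # assumed spelling of the unknown-word symbol


def build_vocab(sentences, max_size):
    """Return the set of the max_size most frequent tokens in the corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}


def replace_unknown(sentence, vocab):
    """Map every token outside the vocabulary to the single UNK symbol."""
    return [tok if tok in vocab else UNK for tok in sentence]


# Toy corpus (hypothetical); a real system would keep 30K-80K words.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
vocab = build_vocab(corpus, max_size=3)

# "dog" and "ran" fall outside the vocabulary and collapse to the same symbol.
print(replace_unknown(["the", "dog", "ran"], vocab))
```

With a vocabulary of three words, both "dog" and "ran" are replaced by the same symbol, so the information needed to translate them is lost before the model ever sees the sentence.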