Scientific report: "Does more data always yield better translations?"
Guillem Gasco, Martha-Alicia Rocha, German Sanchis-Trilles, Jesus Andres-Ferrer and Francisco Casacuberta
Departament de Sistemes Informatics i Computacio
Universitat Politecnica de Valencia
Cami de Vera s/n, 46022 Valencia, Spain
{ggasco,mrocha,gsanchis,jandres,fcn}@dsic.upv.es

Abstract

Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is not clear whether all the training data actually help or not. A system trained on a subset of such huge bilingual corpora might outperform the use of all the bilingual data. This paper studies such issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on infrequent n-gram occurrence. Experimental results not only report significant improvements over random sentence selection but also an improvement over a system trained with the whole available data. Surprisingly, the improvements are obtained with just a small fraction of the data that accounts for less than 0.5% of the sentences. Afterwards, we show that a much larger room for improvement exists, although this is done under non-realistic conditions.
1 Introduction

Globalisation and the popularisation of the Internet have led to a rapid increase in the amount of bilingual corpora available. Entities such as the European Union, the United Nations and other multinational organisations need to translate all the documentation they generate. Such translations happen every day and provide very large multilingual corpora, which are oftentimes difficult to process and significantly increase the computational requirements needed to train statistical machine translation (SMT) systems. For instance, the corpora made available for recent machine translation evaluations are in the order of 1 billion running words (Callison-Burch et al., 2010). However, two main problems arise when attempting to use this huge pool