
Sub-sentential Alignment Using Substring Co-Occurrence Counts

Fabien Cromieres
GETA-CLIPS-IMAG
BP 53, 38041 Grenoble Cedex 9, France

Abstract

In this paper, we present an efficient method to compute the co-occurrence counts of any pair of substrings in a parallel corpus, and an algorithm that makes use of these counts to create sub-sentential alignments on such a corpus. This algorithm has the advantage of being as general as possible with regard to the segmentation of the text.

1 Introduction

An interesting and important problem in the Statistical Machine Translation (SMT) domain is the creation of sub-sentential alignments in a parallel corpus (a bilingual corpus already aligned at the sentence level). These alignments can later be used, for example, to train SMT systems or to extract bilingual lexicons.

Many algorithms have already been proposed for sub-sentential alignment. Some of them focus on word-to-word alignment (Brown, 1997; Melamed, 1997). Others allow the generation of phrase-level alignments, such as (Och et al., 1999), (Marcu and Wong, 2002) or (Zhang, Vogel and Waibel, 2003). However, with the exception of Marcu and Wong, these phrase-level alignment algorithms still place their analyses at the word level, whether by first creating a word-to-word alignment or by computing correlation coefficients between pairs of individual words. This is, in our opinion, a limitation of these algorithms, mainly because it makes them rely heavily on our capacity to segment a sentence into words. And defining what a word is is not as easy as it might seem. In particular, many Asian writing systems (Japanese, Chinese or Thai, for example) have no special symbol to delimit words (such as the blank in most non-Asian writing systems). Current systems usually work around this problem by using a segmentation tool to pre-process the data. This has, however, two major disadvantages:
- These tools usually need a lot of linguistic knowledge, such as lexical dictionaries and hand-crafted rules.
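The excerpt names the quantity being computed (substring-pair co-occurrence counts over a sentence-aligned corpus) but does not reproduce the paper's efficient method. As a point of reference only, the Python sketch below fixes the definition with the naive approach: for each (source substring, target substring) pair, count the number of sentence pairs in which both occur. The function names, the max_len cutoff, and the toy corpus are our own illustrative choices, not the paper's.

from collections import Counter

def substrings(s, max_len=4):
    # Enumerate the distinct substrings of s, up to max_len characters.
    return {s[i:i + n]
            for i in range(len(s))
            for n in range(1, min(max_len, len(s) - i) + 1)}

def cooccurrence_counts(corpus, max_len=4):
    # For every (source substring, target substring) pair, count the
    # number of sentence pairs in which both substrings occur.
    counts = Counter()
    for src, tgt in corpus:
        for u in substrings(src, max_len):
            for v in substrings(tgt, max_len):
                counts[u, v] += 1
    return counts

# Toy sentence-aligned corpus (hypothetical data).
corpus = [("the cat", "le chat"), ("the dog", "le chien")]
counts = cooccurrence_counts(corpus)
print(counts["the", "le c"])  # 2: both substrings occur in both sentence pairs

Note that this enumeration considers on the order of m²n² substring pairs for a sentence pair of lengths m and n, which is exactly the cost an efficient counting method, such as the one the paper announces, must avoid; the sketch only makes the counted quantity concrete.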
