Scientific report: "Adaptive String Distance Measures for Bilingual Dialect Lexicon Induction"
Yves Scherrer
Language Technology Laboratory (LATL), University of Geneva, 1211 Geneva 4, Switzerland

Abstract

This paper compares different measures of graphemic similarity applied to the task of bilingual lexicon induction between a Swiss German dialect and Standard German. The measures have been adapted to this particular language pair by training stochastic transducers with the Expectation-Maximisation algorithm or by using handmade transduction rules. These adaptive metrics show up to 11% F-measure improvement over a static metric like Levenshtein distance.

1 Introduction

Building lexical resources is a very important step in the development of any natural language processing system. However, it is a time-consuming and repetitive task, which makes research on automatic induction of lexicons particularly appealing. In this paper, we discuss different ways of finding lexical mappings for a translation lexicon between a Swiss German dialect and Standard German.

The choice of this language pair has important consequences for the methodology. On the one hand, given the sociolinguistic conditions of dialect use (diglossia), it is difficult to find written data of high quality; parallel corpora are virtually non-existent. These data constraints place our work in the context of scarce-resource language processing. On the other hand, as the two languages are closely related, the lexical relations to be induced are less complex.
We argue that this point alleviates the restrictions imposed by the scarcity of the resources. In particular, we claim that if two languages are close, even if one of them is scarcely documented, we can successfully use techniques that require training. Finding lexical mappings amounts to finding word pairs that are maximally similar with respect to a particular definition of similarity. Similarity measures can be based on any level of linguistic analysis.
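As a rough illustration of what "maximally similar" can mean at the graphemic level, the sketch below uses plain Levenshtein distance, the static baseline named in the abstract, to pick the closest candidate for a dialect word. It is not the paper's implementation: the function names `levenshtein` and `best_match` and the toy word lists are assumptions made for this example. The adaptive measures discussed in the paper would replace the uniform cost of 1 per edit with costs learned from data or derived from handmade rules.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: each insertion, deletion, or substitution costs 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb, or match
            ))
        prev = curr
    return prev[-1]


def best_match(source: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest distance to the source word."""
    return min(candidates, key=lambda c: levenshtein(source.lower(), c.lower()))


# Toy data chosen for illustration only, not taken from the paper.
dialect_word = "Huus"
standard_candidates = ["Hase", "Maus", "Haus"]
print(best_match(dialect_word, standard_candidates))
```

With uniform costs, every character substitution is equally expensive, so a frequent dialect correspondence counts as much as an arbitrary mismatch. That limitation is what motivates adapting the costs to the language pair.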