An Automatic Treebank Conversion Algorithm for Corpus Sharing

Jong-Nae Wang
Behavior Design Corporation
No. 28, 2F, R&D Road II, Science-Based Industrial Park, Hsinchu, Taiwan 30077
wjn@

Abstract

An automatic treebank conversion method is proposed in this paper to convert a treebank into another treebank. A new treebank associated with a different grammar can be generated automatically from the old one, so that the information in the original treebank can be transferred to the new one and shared among different research communities. The simple algorithm achieves conversion accuracy of when tested on 8,867 sentences between two major grammar revisions of a large MT system.

Motivation

Corpus-based research is now a major branch of language processing. One major resource for corpus-based research is the treebanks available in many research organizations (Marcus et al., 1993), which carry skeletal syntactic structures or brackets that have been manually verified. Unfortunately, such resources may be based on the different tag sets and grammar systems of the respective research organizations. As a result, reusability of such resources across research laboratories is poor, and cross-checking among different grammar systems and algorithms based on the same corpora cannot be conducted effectively. In fact, even within the same research organization, a major revision of the original grammar system may require a reconstruction of the system corpora because of the variations between the revisions.
As a side effect, the evolution of a system is often blocked or discouraged by the unavailability of the corresponding corpora that were previously constructed. Under such circumstances, much energy and cost may have to be devoted to the re-tagging or reconstruction of those previously available corpora. It is therefore highly desirable to automatically convert an existing treebank, either from a previous revision of the current system or from another research organization, into .