Scientific report: "Unsupervized Word Segmentation: the case for Mandarin Chinese"
Pierre Magistry
Alpage, INRIA, Univ. Paris 7
175 rue du Chevaleret, 75013 Paris, France

Benoit Sagot
Alpage, INRIA, Univ. Paris 7
175 rue du Chevaleret, 75013 Paris, France

Abstract

In this paper, we present an unsupervized segmentation system tested on Mandarin Chinese. Following Harris's Hypothesis in Kempe (1999) and Tanaka-Ishii's (2005) reformulation, we base our work on the Variation of Branching Entropy. We improve on (Jin and Tanaka-Ishii, 2006) by adding normalization and Viterbi decoding. This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide evaluation on different corpora available from the Segmentation bake-off II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervized systems available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007).

1 Introduction

The Chinese script has no explicit word boundaries. Therefore, tokenization itself, although the very first step of many text processing systems, is a challenging task. Supervized segmentation systems exist, but they rely on manually segmented corpora, which are often specific to a genre or a domain and use many different segmentation guidelines. In order to deal with a larger variety of genres and domains, or to tackle more theoretical questions about linguistic units, unsupervized segmentation is still an important issue.

After a short review of the corresponding literature in Section 2, we discuss the challenging issue of evaluating unsupervized word segmentation systems in Section 3. Section 4 and Section 5 present the core of our system. Finally, in Section 6, we detail and discuss our results.

2 State of the Art

Unsupervized word segmentation systems tend to make use of three different types of information: the cohesion of the resulting units ...
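
The abstract grounds the system in the Variation of Branching Entropy. As a rough illustration only, the sketch below assumes the usual definition of right branching entropy, h(x) = -sum over c of P(c | x) log2 P(c | x), where c ranges over the characters observed after the n-gram x, and takes the variation to be h(x_0..n) - h(x_0..n-1). The function names, the toy corpus, and the choice to ignore n-grams that end a sentence are assumptions made for this example and are not the authors' implementation. The normalization and Viterbi decoding mentioned in the abstract are not shown.

```python
import math
from collections import Counter, defaultdict


def right_branching_entropy(corpus, max_n):
    """Map each character n-gram (length <= max_n) to the entropy, in bits,
    of the distribution of characters observed immediately after it.

    Assumption: n-grams at the very end of a sentence have no follower
    and are simply not counted.
    """
    followers = defaultdict(Counter)
    for sentence in corpus:
        for i in range(len(sentence)):
            for n in range(1, max_n + 1):
                if i + n < len(sentence):
                    followers[sentence[i:i + n]][sentence[i + n]] += 1

    entropy = {}
    for gram, counts in followers.items():
        total = sum(counts.values())
        entropy[gram] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def variation_of_branching_entropy(entropy, gram):
    """Return h(gram) - h(gram without its last character).

    Only meaningful for n-grams of length >= 2 that were seen with a follower.
    """
    return entropy[gram] - entropy[gram[:-1]]


# Hypothetical toy corpus; Latin letters stand in for Chinese characters.
toy_corpus = ["abab", "abac"]
h = right_branching_entropy(toy_corpus, max_n=2)
vbe_ab = variation_of_branching_entropy(h, "ab")
vbe_ba = variation_of_branching_entropy(h, "ba")
```

In this toy corpus, "ab" is always followed by the same character, so its entropy falls relative to "a" and the variation is negative. "ba" is followed by two different characters, so its entropy rises relative to "b". A rise of this kind is the signal that Harris-style approaches read as a likely boundary.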