Scientific report: "Unsupervized Word Segmentation: the case for Mandarin Chinese"

Pierre Magistry
Alpage, INRIA, Univ. Paris 7
175 rue du Chevaleret, 75013 Paris, France

Benoît Sagot
Alpage, INRIA, Univ. Paris 7
175 rue du Chevaleret, 75013 Paris, France

Abstract

In this paper, we present an unsupervized segmentation system tested on Mandarin Chinese. Following Harris's Hypothesis in Kempe (1999) and Tanaka-Ishii's (2005) reformulation, we base our work on the Variation of Branching Entropy. We improve on (Jin and Tanaka-Ishii, 2006) by adding normalization and Viterbi decoding. This enables us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide evaluation on different corpora available from the Segmentation bake-off II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervized systems available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007).

1 Introduction

The Chinese script has no explicit word boundaries. Therefore, tokenization itself, although the very first step of many text processing systems, is a challenging task. Supervized segmentation systems exist, but they rely on manually segmented corpora, which are often specific to a genre or a domain and follow many different segmentation guidelines. In order to deal with a larger variety of genres and domains, or to tackle more theoretical questions about linguistic units, unsupervized segmentation is still an important issue.

After a short review of the corresponding literature in Section 2, we discuss the challenging issue of evaluating unsupervized word segmentation systems in Section 3. Section 4 and Section 5 present the core of our system. Finally, in Section 6, we detail and discuss our results.

2 State of the Art

Unsupervized word segmentation systems tend to make use of three different types of information: the cohesion of the resulting units ...
