tailieunhanh - Scientific report: "A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging"
Weiwei Sun
Department of Computational Linguistics, Saarland University
German Research Center for Artificial Intelligence (DFKI)
D-66123, Saarbrücken, Germany
wsun@

Abstract

The large combined search space of joint word segmentation and Part-of-Speech (POS) tagging makes efficient decoding very hard. As a result, effective high-order features representing rich contexts are inconvenient to use. In this work, we propose a novel stacked sub-word model for this task, concerning both efficiency and effectiveness. Our solution is a two-step process. First, one word-based segmenter, one character-based segmenter, and one local character classifier are trained to produce coarse segmentation and POS information. Second, the outputs of the three predictors are merged into sub-word sequences, which are further bracketed and labeled with POS tags by a fine-grained sub-word tagger. The coarse-to-fine search scheme is efficient, while in the sub-word tagging step rich contextual features can be approximately derived. Evaluation on the Penn Chinese Treebank shows that our model yields improvements over the best system reported in the literature.

1 Introduction

Word segmentation and part-of-speech (POS) tagging are necessary initial steps for more advanced Chinese language processing tasks, such as parsing and semantic role labeling. Joint approaches that resolve the two tasks simultaneously have received much attention in recent research.
Previous work has shown that joint solutions lead to accuracy improvements over pipelined systems by avoiding segmentation error propagation and exploiting POS information to help segmentation. A challenge for joint approaches is the large combined search space, which makes efficient decoding and structured learning of parameters very hard. Moreover, the representation ability of models is limited, since using rich contextual word features makes the search
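
The merging step of the two-step process described in the abstract can be illustrated with a small sketch. The text available here does not state the exact merging rule, so the rule below is an assumption made only for illustration: a boundary is kept wherever any of the coarse predictors places one, so each resulting sub-word is a span that no predictor splits. The function names `boundaries` and `merge_to_subwords` are hypothetical, all three predictors are treated as plain segmenters, and the coarse POS information is ignored.

```python
from typing import List, Set


def boundaries(words: List[str]) -> Set[int]:
    """Return the character offsets at which a segmentation ends a word."""
    cuts: Set[int] = set()
    pos = 0
    for word in words:
        pos += len(word)
        cuts.add(pos)
    return cuts


def merge_to_subwords(sentence: str, *segmentations: List[str]) -> List[str]:
    """Merge several segmentations of one sentence into a sub-word sequence.

    Assumed rule (not taken from the paper): take the union of all
    predicted boundaries, so a sub-word is never split by any predictor.
    """
    cuts: Set[int] = set()
    for seg in segmentations:
        if "".join(seg) != sentence:
            raise ValueError("segmentation does not cover the sentence")
        cuts |= boundaries(seg)

    subwords: List[str] = []
    start = 0
    for end in sorted(cuts):
        if end > start:  # skip degenerate cuts from empty tokens
            subwords.append(sentence[start:end])
        start = end
    return subwords


# Placeholder characters stand in for a Chinese sentence.
word_based = ["AB", "CDE"]
char_based = ["AB", "C", "DE"]
local_classifier = ["A", "B", "CDE"]
print(merge_to_subwords("ABCDE", word_based, char_based, local_classifier))
```

Under this assumed rule, spans on which the predictors disagree fall apart into smaller units, and the fine-grained sub-word tagger is then left to bracket them back into words and assign POS tags.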