Scientific report: "Joint Word Segmentation and POS Tagging using a Single Perceptron"

Yue Zhang and Stephen Clark
Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD, UK

Abstract

For Chinese POS tagging, word segmentation is a preliminary step. To avoid error propagation and improve segmentation by utilizing POS information, segmentation and tagging can be performed simultaneously. A challenge for this joint approach is the large combined search space, which makes efficient decoding very hard. Recent research has explored the integration of segmentation and POS tagging by decoding under restricted versions of the full combined search space. In this paper, we propose a joint segmentation and POS tagging model that does not impose any hard constraints on the interaction between word and POS information. Fast decoding is achieved by using a novel multiple-beam search algorithm. The system uses a discriminative statistical model, trained using the generalized perceptron algorithm. The joint model gives an error reduction in both segmentation accuracy and tagging accuracy compared to the traditional pipeline approach.

1 Introduction

Since Chinese sentences do not contain explicitly marked word boundaries, word segmentation is a necessary step before POS tagging can be performed. Typically, a Chinese POS tagger takes segmented inputs, which are produced by a separate word segmentor.

This two-step approach, however, has an obvious flaw of error propagation, since word segmentation errors cannot be corrected by the POS tagger. A better approach would be to utilize POS information to improve word segmentation. For example, the POS-word pattern "number word" + "个 (a common measure word)" can help in segmenting the character sequence "一个人" into the word sequence "一 (one) 个 (measure word) 人 (person)" instead of "一 (one) 个人 (personal; adj)". Moreover, the comparatively rare POS pattern "number word" + "number word" can help to prevent ...
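
The large combined search space mentioned in the abstract can be made concrete with a small enumeration. The sketch below is illustrative only and is not the paper's decoder: it lists every joint analysis (a segmentation plus one tag per word) of a three-character input by brute force. The tagset `TOY_TAGSET`, the input string, and all function names are assumptions introduced for this example. With n characters and T tags, a segmentation into k words can be chosen in C(n-1, k-1) ways and tagged in T^k ways, so the total is T(1 + T)^(n-1), which grows exponentially with sentence length.

```python
from itertools import product


def segmentations(chars):
    """Yield every way to split a character sequence into contiguous words."""
    n = len(chars)
    if n == 0:
        return
    # Each of the n-1 gaps between characters is either a word boundary or not.
    for gaps in product([False, True], repeat=n - 1):
        words, start = [], 0
        for i, is_boundary in enumerate(gaps, start=1):
            if is_boundary:
                words.append(chars[start:i])
                start = i
        words.append(chars[start:])
        yield words


def joint_analyses(chars, tagset):
    """Yield every (word, tag) sequence over every possible segmentation."""
    for words in segmentations(chars):
        for tags in product(tagset, repeat=len(words)):
            yield list(zip(words, tags))


def joint_space_size(n_chars, n_tags):
    """Closed form: sum over k of C(n-1, k-1) * T^k = T * (1 + T)^(n-1)."""
    return n_tags * (1 + n_tags) ** (n_chars - 1)


# Hypothetical four-tag tagset and example input (assumptions, not from the paper).
TOY_TAGSET = ["NUM", "MEAS", "NOUN", "ADJ"]
sentence = "一个人"

analyses = list(joint_analyses(sentence, TOY_TAGSET))
print(len(analyses), joint_space_size(len(sentence), len(TOY_TAGSET)))
```

Even this toy input yields far more candidates than a segmentor or tagger would consider alone, which is why exhaustive search is impractical for real sentences and realistic tagsets, and why some form of approximate decoding such as beam search is needed.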
