tailieunhanh - Scientific report: "A Novel Word Segmentation Approach for Written Languages with Word Boundary Markers"

Han-Cheol Cho, Do-Gil Lee, Jung-Tae Lee, Pontus Stenetorp, Jun'ichi Tsujii, and Hae-Chang Rim
Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Dept. of Computer & Radio Communications Engineering, Korea University, Seoul, Korea
{hccho, pontus, tsujii}@ {dglee, jtlee, rim}@

Abstract

Most NLP applications work under the assumption that user input is error-free; thus, word segmentation (WS) for written languages that use word boundary markers (WBMs), such as spaces, has been regarded as a trivial issue. However, noisy real-world texts, such as blogs, e-mails, and SMS, may contain spacing errors that require correction before further processing can take place. For the Korean language, many researchers have adopted a traditional WS approach, which eliminates all spaces in the user input and re-inserts proper word boundaries. Unfortunately, such an approach often degrades the word spacing quality of user input that has few or no spacing errors, because a perfect WS model does not exist. In this paper, we propose a novel WS method that takes into consideration the initial word spacing information of the user input. Our method generates better output than the original user input, even if the user input has few spacing errors.
Moreover, the proposed method significantly outperforms a state-of-the-art Korean WS model when the user input initially contains less than 10% spacing errors, and performs comparably for cases containing more spacing errors. We believe that the proposed method will be a very practical pre-processing module.

1 Introduction

Word segmentation (WS) has been a fundamental research issue for languages that do not have word boundary markers (WBMs); in contrast, languages that do have WBMs have regarded the issue as a trivial task. Texts segmented with such WBMs, however, could contain a human writer's .
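
The contrast drawn in the abstract, between discarding the user's spaces and taking them into consideration, can be illustrated with a minimal sketch. It is not the authors' model: `toy_gap_score` is a hypothetical stand-in for any imperfect per-boundary WS score, and `user_bias` is an assumed weight expressing how much the user's original spacing is trusted. English text is used only for readability.

```python
from typing import Callable, List

def strip_spaces(text: str) -> str:
    """Traditional first step: eliminate all spaces from the user input."""
    return text.replace(" ", "")

def user_space_flags(text: str) -> List[bool]:
    """flags[i] is True if the user typed a space after the i-th non-space character."""
    flags: List[bool] = []
    for ch in text:
        if ch == " ":
            if flags:              # ignore leading spaces
                flags[-1] = True
        else:
            flags.append(False)
    return flags

# Hypothetical, deliberately imperfect model: a positive score favours a space
# between the two characters, a negative score favours no space.
TOY_SCORES = {("e", "c"): 3.0, ("c", "a"): 0.5, ("t", "s"): -0.5}

def toy_gap_score(chars: str, i: int) -> float:
    return TOY_SCORES.get((chars[i], chars[i + 1]), -3.0)

def resegment(text: str,
              gap_score: Callable[[str, int], float],
              user_bias: float = 0.0) -> str:
    """Re-insert spaces; user_bias=0.0 ignores the user's spacing entirely."""
    chars = strip_spaces(text)
    flags = user_space_flags(text)
    out: List[str] = []
    for i, ch in enumerate(chars):
        out.append(ch)
        if i == len(chars) - 1:
            break                  # no boundary decision after the last character
        score = gap_score(chars, i)
        # Assumed combination rule: shift the score towards the user's choice.
        score += user_bias if flags[i] else -user_bias
        if score > 0:
            out.append(" ")
    return "".join(out)

noisy = "thecat sat"               # one missing space
print(resegment(noisy, toy_gap_score))                  # the c atsat
print(resegment(noisy, toy_gap_score, user_bias=1.0))   # the cat sat
```

With `user_bias=0.0` the weak model both misses the space the user typed correctly and adds a wrong one. With a modest bias, the confident model decision still repairs the missing space, while the user's spacing settles the two boundaries where the model is unsure.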