tailieunhanh - Scientific report: "A Phrase-based Statistical Model for SMS Text Normalization"

One advantage of this pre-translation normalization is that the diversity in different user groups and domains can be modeled separately without accessing and adapting the language model of the MT system for each SMS application. Another advantage is [...]

A Phrase-based Statistical Model for SMS Text Normalization

AiTi Aw, Min Zhang, Juan Xiao, Jian Su
Institute of Infocomm Research
21 Heng Mui Keng Terrace, Singapore 119613
{aaiti, mzhang, stuxj, sujian}@[...]

Abstract

Short Messaging Service (SMS) texts behave quite differently from normal written texts and have some very special phenomena. To translate SMS texts, traditional approaches model such irregularities directly in Machine Translation (MT). However, such approaches suffer from a customization problem, as tremendous effort is required to adapt the language model of the existing translation system to handle the SMS text style. We offer an alternative approach that resolves such irregularities by normalizing SMS texts before MT. In this paper, we view the task of SMS normalization as a translation problem from the SMS language to the English language [1], and we propose to adapt a phrase-based statistical MT model for the task. Evaluation by 5-fold cross validation on a parallel SMS normalized corpus of 5000 sentences shows that our method can achieve [...] in BLEU score against the baseline BLEU score [...]. Another experiment of translating SMS texts from English to Chinese on a separate SMS text corpus shows that using SMS normalization as MT preprocessing can largely boost SMS translation performance from [...] to [...] in BLEU score.

1 Motivation

SMS translation is a mobile Machine Translation (MT) application that translates a message from one language to another. Though many commercial MT systems exist, direct use of such systems fails to work well due to the special phenomena in SMS texts, such as the unique relaxed and creative writing style and the frequent use of unconventional and not yet standardized shortforms. Direct modeling of these special phenomena in MT requires tremendous effort. Alternatively, we can normalize SMS texts into grammatical texts before MT. In this way the traditional MT is [...]

[1] This paper only discusses English SMS text normalization.
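The idea of treating normalization as a translation step that runs before MT can be illustrated with a toy example. The sketch below is not the authors' model: it assumes a hypothetical hand-written phrase table (`PHRASE_TABLE`, with invented entries and probabilities), scores candidates by phrase probabilities alone, and omits the language model that a full phrase-based statistical MT decoder would normally combine with them. It uses monotone dynamic programming over segmentations and copies unknown tokens unchanged with an assumed penalty.

```python
import math

# Hypothetical toy phrase table: SMS phrase -> {English phrase: probability}.
# Every entry and probability here is invented for illustration only.
PHRASE_TABLE = {
    ("b4",): {"before": 0.9, "b4": 0.1},
    ("u",): {"you": 0.95, "u": 0.05},
    ("2",): {"to": 0.6, "two": 0.3, "2": 0.1},
    ("c", "u"): {"see you": 0.9},
    ("2moro",): {"tomorrow": 1.0},
}
MAX_PHRASE_LEN = 2                 # longest source phrase in the toy table
UNKNOWN_LOGPROB = math.log(1e-3)   # assumed penalty for copying an unknown token


def normalize(sms: str) -> str:
    """Return the highest-scoring monotone phrase-by-phrase normalization."""
    tokens = sms.lower().split()
    n = len(tokens)
    # best[i] = (log-probability, output phrases) for the prefix tokens[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        score_i, out_i = best[i]
        for j in range(i + 1, min(n, i + MAX_PHRASE_LEN) + 1):
            src = tuple(tokens[i:j])
            entry = PHRASE_TABLE.get(src)
            if entry is not None:
                candidates = [(tgt, math.log(p)) for tgt, p in entry.items()]
            elif j - i == 1:
                # Unknown single token: copy it through unchanged.
                candidates = [(tokens[i], UNKNOWN_LOGPROB)]
            else:
                continue  # unknown multi-token span: no candidate
            for tgt, logp in candidates:
                if score_i + logp > best[j][0]:
                    best[j] = (score_i + logp, out_i + [tgt])
    return " ".join(best[n][1])


if __name__ == "__main__":
    print(normalize("c u 2moro"))      # see you tomorrow
    print(normalize("hello b4 u go"))  # hello before you go
```

In a pipeline of the kind the paper motivates, the output of `normalize` would be passed to an unmodified MT system, so the MT language model never has to be adapted to SMS style.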
