Scientific report: "A hybrid rule/model-based finite-state framework for normalizing SMS messages"

Richard Beaufort (1), Sophie Roekhaut (2), Louise-Amélie Cougnon (1), Cédrick Fairon (1)
(1) CENTAL, Université catholique de Louvain, 1348 Louvain-la-Neuve, Belgium
(2) TCTS Lab, Université de Mons, 7000 Mons, Belgium

Abstract

In recent years, research in natural language processing has increasingly focused on normalizing SMS messages. Different well-defined approaches have been proposed, but the problem remains far from solved: the best systems achieve an 11% Word Error Rate. This paper presents a method that shares similarities with both spell checking and machine translation approaches. The normalization part of the system is entirely based on models trained from a corpus. The system is evaluated in French by 10-fold cross-validation, using Word Error Rate and BLEU score.

1 Introduction

Introduced a few years ago, the Short Message Service (SMS) offers the possibility of exchanging written messages between mobile phones. SMS was quickly adopted by users. These messages often deviate greatly from traditional spelling conventions. As shown by specialists (Thurlow and Brown, 2003; Fairon et al., 2006; Bieswanger, 2007), this variability is due to the simultaneous use of numerous coding strategies, such as phonetic plays (2m1 read "demain", "tomorrow"), phonetic transcriptions (kom instead of "comme", "like"), consonant skeletons (tjrs for "toujours", "always"), and misapplied, missing or incorrect separators (j esper for "j'espère", "I hope"; j croibi1k instead of "je crois bien que", "I am pretty sure that"), etc. These deviations are due to three main factors: the small number of characters allowed per text message by the service (140 bytes), the constraints of the small phones' keypads and, last but not least, the fact that people mostly communicate between friends and relatives in an informal register. Whatever their causes, these deviations considerably hamper any
