Scientific report: "Language Dynamics and Capitalization using Maximum Entropy"

Fernando Batista, Nuno Mamede (a, c) and Isabel Trancoso (a, c)
(a) L2F - Spoken Language Systems Laboratory - INESC ID Lisboa, R. Alves Redol 9, 1000-029 Lisboa, Portugal
http
(b) ISCTE - Instituto de Ciências do Trabalho e da Empresa, Portugal
(c) IST - Instituto Superior Técnico, Portugal
{fmmb, njm, imt}@

Abstract

This paper studies the impact of written language variations and the way they affect the capitalization task over time. A discriminative approach, based on maximum entropy models, is proposed to perform capitalization, taking the language changes into consideration. The proposed method makes it possible to use large corpora for training. The evaluation is performed over newspaper corpora using different testing periods. The achieved results reveal a strong relation between the capitalization performance and the elapsed time between the training and testing data periods.

1 Introduction

The capitalization task, also known as truecasing (Lita et al., 2003), consists of rewriting each word of an input text with its proper case information. The capitalization of a word sometimes depends on its current context, and the intelligibility of texts is strongly influenced by this information.
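To make the task definition concrete, the following is a minimal Python sketch that treats capitalization as assigning one tag to each lowercased word and then rewriting the word from that tag. The tag names (`lower`, `first_cap`, `upper`, `mixed`), the function names, and the `mixed_forms` lookup are assumptions made for this illustration; they are not taken from the paper, and no maximum entropy model is involved here.

```python
from typing import Dict, List, Tuple

def case_tag(word: str) -> str:
    """Map a cased word to a capitalization tag (hypothetical tag set)."""
    if word.islower() or not any(c.isalpha() for c in word):
        return "lower"          # all lowercase, or no letters at all
    if word.isupper() and len(word) > 1:
        return "upper"          # e.g. acronyms
    if word[0].isupper() and word[1:] == word[1:].lower():
        return "first_cap"      # only the first letter is capitalized
    return "mixed"              # anything else, e.g. internal capitals

def to_tagged_pairs(text: str) -> List[Tuple[str, str]]:
    """Turn cased text into (lowercased word, tag) pairs, as training data would be."""
    return [(tok.lower(), case_tag(tok)) for tok in text.split()]

def apply_tag(word: str, tag: str, mixed_forms: Dict[str, str]) -> str:
    """Rewrite a lowercased word according to its tag."""
    if tag == "first_cap":
        return word[:1].upper() + word[1:]
    if tag == "upper":
        return word.upper()
    if tag == "mixed":
        # Mixed-case forms cannot be derived from the tag alone;
        # fall back to a lookup table (assumed to be available).
        return mixed_forms.get(word, word)
    return word

def restore_case(pairs: List[Tuple[str, str]], mixed_forms: Dict[str, str]) -> str:
    """Rebuild cased text from (word, tag) pairs."""
    return " ".join(apply_tag(w, t, mixed_forms) for w, t in pairs)

if __name__ == "__main__":
    pairs = to_tagged_pairs("Lisboa hosts INESC and McDonald")
    print(pairs)
    print(restore_case(pairs, {"mcdonald": "McDonald"}))
```

In a real system the tags would be predicted from context by a trained model rather than read off already-cased text; the sketch only shows the encoding and decoding steps around that prediction.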
Different practical applications benefit from automatic capitalization as a preprocessing step: when applied to speech recognition output, which usually consists of raw text, automatic capitalization provides relevant information for automatic content extraction, named entity recognition, and machine translation; many computer applications, such as word processing and e-mail clients, perform automatic capitalization along with spelling correction and grammar checking. The capitalization problem can be seen as a sequence tagging problem (Chelba and Acero, 2004; Lita et al., 2003; Kim and Woodland, 2004), where each lower-case word is associated with a tag that describes its capitalization form. Chelba and Acero (2004) study the impact of .