tailieunhanh - Scientific report: "truecasing"

tRuEcasIng

Lucian Vlad Lita, Carnegie Mellon, llita@
Abe Ittycheriah, IBM T.J. Watson, abei@
Salim Roukos, IBM T.J. Watson, roukos@
Nanda Kambhatla, IBM T.J. Watson, nanda@

Work done at IBM T.J. Watson Research Center.

Abstract

Truecasing is the process of restoring case information to badly-cased or non-cased text. This paper explores truecasing issues and proposes a statistical, language-modeling-based truecaser which achieves an accuracy of ∼98% on news articles. Task-based evaluation shows a 26% F-measure improvement in named entity recognition when using truecasing. In the context of automatic content extraction, mention detection on automatic speech recognition text is also improved by a factor of 8. Truecasing also enhances machine translation output legibility and yields a BLEU score improvement. This paper argues for the use of truecasing as a valuable component in text processing applications.

1 Introduction

While it is true that large, high-quality text corpora are becoming a reality, it is also true that the digital world is flooded with enormous collections of low-quality natural language text. Transcripts from various audio sources, automatic speech recognition, optical character recognition, online messaging and gaming, email, and the web are just a few examples of raw text sources with content often produced in a hurry, containing misspellings, insertions, deletions, grammatical errors, neologisms, jargon terms, etc. We want to enhance the quality of such sources in order to produce better rule-based systems and sharper statistical models.
This paper focuses on truecasing, which is the process of restoring case information to raw text. Besides text rEaDaBILiTY, truecasing enhances the quality of case-carrying data, brings into the picture new corpora originally considered too noisy for various NLP tasks, and performs case normalization across styles, sources, and genres. Consider the following mildly ambiguous sentence: us rep. james pond .
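To make the task concrete, the following is a minimal sketch of a much simpler technique than the language-modeling approach the abstract describes: a unigram truecaser that maps each lowercased token to the surface form seen most often in cased training text. The function names and the toy corpus are hypothetical and chosen only for illustration; nothing here is taken from the paper's implementation.

```python
from collections import Counter, defaultdict


def train_unigram_truecaser(cased_sentences):
    """Count how often each surface form occurs for every lowercased token."""
    forms = defaultdict(Counter)
    for sentence in cased_sentences:
        for token in sentence.split():
            forms[token.lower()][token] += 1
    # Keep only the most frequent surface form of each lowercased token.
    return {lower: counts.most_common(1)[0][0] for lower, counts in forms.items()}


def truecase(text, model):
    """Replace each token with its most frequent cased form.

    Tokens never seen in training are left unchanged.
    """
    return " ".join(model.get(token.lower(), token) for token in text.split())


# Toy training data (hypothetical, for illustration only).
corpus = [
    "IBM hired James Pond",
    "James Pond met the IBM team",
    "the team met in New York",
]
model = train_unigram_truecaser(corpus)
print(truecase("james pond met the ibm team in new york", model))
# prints "James Pond met the IBM team in New York"
```

A lookup of this kind ignores context, so it cannot decide whether a token such as "us" should become "US" or stay "us"; resolving that sort of ambiguity is presumably where a model that conditions on neighboring words has the advantage.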
