Scientific report: "Tokenization: Returning to a Long Solved Problem"

Tokenization: Returning to a Long Solved Problem
A Survey, Contrastive Experiment, Recommendations, and Toolkit

Rebecca Dridan and Stephan Oepen
Institutt for Informatikk, Universitetet i Oslo

Abstract

We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genre- or domain-specific idiosyncrasies).

1 Introduction: Motivation

The task of tokenization is hardly counted among the grand challenges of NLP, and is conventionally interpreted as breaking up natural language text into distinct meaningful units (or tokens) (Kaplan, 2005). Practically speaking, however, tokenization is often combined with other string-level preprocessing, for example normalization of punctuation (of different conventions for dashes, say), disambiguation of quotation marks (into opening vs. closing quotes), or removal of unwanted mark-up, where the specifics of such preprocessing depend both on properties of the input text and on assumptions made in downstream processing. Applying some string-level normalization prior to the identification of token boundaries can improve (or simplify) tokenization, and a sub-task like the disambiguation of quote marks would in fact be hard to perform after tokenization, seeing that it depends on adjacency to whitespace.
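To illustrate the whitespace dependency mentioned above, here is a minimal sketch (in Python, not part of the original paper, and much simpler than Penn Treebank practice) of disambiguating straight double quotes into opening and closing quotes while whitespace adjacency is still observable:

```python
import re

def disambiguate_quotes(text):
    """Rewrite straight double quotes as opening (``) or closing ('')
    quotes: a quote at the start of the string, or preceded by
    whitespace or an opening bracket, opens; all others close.
    A simplified illustration, not the toolkit's actual rules."""
    # Opening quotes: start of string, or after whitespace / open bracket.
    text = re.sub(r'(^|(?<=[\s(\[{]))"', "``", text)
    # Any remaining straight double quote is treated as closing.
    text = text.replace('"', "''")
    return text

print(disambiguate_quotes('She said "yes" to him.'))
```

Once tokens have been separated by spaces, every quote is adjacent to whitespace on both sides, so this decision can no longer be made; that is why such normalization must precede tokenization.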
In the following, we thus assume a generalized notion of tokenization, comprising all string-level processing up to and including the conversion of a sequence of characters (a string) to a sequence of tokens. Obviously, some of the normalization we include in the tokenization task (in this generalized interpretation) could be left to downstream analysis, where a tagger or parser, for example, could be expected to accept non-disambiguated quote marks.
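The idea of tokens carrying exact stand-off pointers back into the original text can be sketched as follows; this is a hypothetical illustration in Python, not the interface of the toolkit the paper describes:

```python
from dataclasses import dataclass

@dataclass
class Token:
    form: str   # the (possibly normalized) token string
    start: int  # character offset of the token in the original text
    end: int    # exclusive end offset

def tokenize(text):
    """Toy whitespace tokenizer that records stand-off pointers.
    A real Penn Treebank-style tokenizer would also split punctuation."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        j = i
        while j < len(text) and not text[j].isspace():
            j += 1
        tokens.append(Token(text[i:j], i, j))
        i = j
    return tokens

text = "a  long solved problem"
toks = tokenize(text)
# Stand-off pointers let us recover the exact original substring,
# even when normalization has changed the token form.
assert all(t.form == text[t.start:t.end] for t in toks)
```

Because each token keeps its character offsets, any amount of string-level normalization can be undone or inspected later, which is what makes the generalized notion of tokenization compatible with exact traceability to the input.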
