One Tokenization per Source

Jin GUO
Kent Ridge Digital Labs
21 Heng Mui Keng Terrace
Singapore 119613

Abstract

We report in this paper the observation of one tokenization per source: the same critical fragment in different sentences from the same source almost always realizes one and the same of its many possible tokenizations. This observation is demonstrated to be very helpful in sentence tokenization practice, and is argued to have far-reaching implications for natural language processing.

1 Introduction

This paper sets out to establish the hypothesis of one tokenization per source: if an ambiguous fragment appears two or more times in different sentences from the same source, it is extremely likely that all occurrences will share the same tokenization.

Sentence tokenization is the task of mapping sentences from character strings into streams of tokens. It is a long-standing problem in Chinese language processing, since Chinese has no explicit word delimiters comparable to the white-spaces of English. Researchers have gradually turned to modelling the task as a general lexicalization or bracketing problem in computational linguistics, in the hope that the work might also benefit the study of similar problems in other languages. For instance, in machine translation it is widely agreed that many multiple-word expressions, such as idioms, compounds, and some collocations, while not explicitly delimited in sentences, are ideally treated as single lexicalized units.

The primary obstacle in sentence tokenization is the existence of uncertainty, both in the notion of a word (token) and in the recognition of words (tokens) in context. The same fragment in different contexts may have to be tokenized differently. For instance, the character string "todayissunday" would normally be tokenized as "today is Sunday", but could also reasonably be "today is sun day". In terms of possibility, it has been argued that no lexically possible tokenization can not .
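As a minimal sketch of the ambiguity just described (not code from the paper, and using a hypothetical toy lexicon), the following Python fragment enumerates every lexically possible tokenization of an undelimited string. It recovers both "today is sunday" and "to day is sun day" for the example above, which is exactly the kind of ambiguity the one-tokenization-per-source observation helps resolve: once one occurrence of such a fragment has been disambiguated for a given source, the same choice can simply be reused for its other occurrences in that source.

    # Sketch only: enumerate lexically possible tokenizations of an
    # undelimited character string against a small, hypothetical lexicon.
    from functools import lru_cache

    LEXICON = {"to", "today", "day", "is", "sun", "sunday"}  # toy lexicon (assumption)
    MAX_TOKEN_LEN = max(len(w) for w in LEXICON)

    def tokenizations(s):
        """Return every way of splitting s into a sequence of lexicon words."""
        @lru_cache(maxsize=None)
        def from_pos(i):
            # All tokenizations of the suffix s[i:], as tuples of words.
            if i == len(s):
                return ((),)
            results = []
            for j in range(i + 1, min(i + MAX_TOKEN_LEN, len(s)) + 1):
                word = s[i:j]
                if word in LEXICON:
                    for rest in from_pos(j):
                        results.append((word,) + rest)
            return tuple(results)
        return [list(t) for t in from_pos(0)]

    if __name__ == "__main__":
        for seg in tokenizations("todayissunday"):
            print(" ".join(seg))
        # Prints, among others, "today is sunday" and "to day is sun day":
        # the fragment is lexically ambiguous, and the hypothesis says a
        # given source will consistently realize just one of these readings.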