Scientific report: "Chinese sentence segmentation as comma classification"

We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.

Chinese sentence segmentation as comma classification

Nianwen Xue and Yaqin Yang
Brandeis University, Computer Science Department
Waltham, MA 02453
xuen yaqin @

1 Introduction

Sentence segmentation, or the detection of sentence boundaries, is very much a solved problem for English. Sentence boundaries can be determined by looking for periods, exclamation marks, and question marks. Although the symbol (dot) that is used to represent the period is ambiguous, because it is also used as the decimal point or in abbreviations, its resolution only requires local context. It can be resolved fairly easily with rules in the form of regular expressions or in a machine-learning framework (Reynar and Ratnaparkhi, 1997).

Chinese also uses periods (albeit with a different symbol), question marks, and exclamation marks to indicate sentence boundaries. Where these punctuation marks exist, sentence boundaries can be unambiguously detected. The difference is that the Chinese comma also functions similarly to the English period in some contexts and signals the boundary of a sentence.
As a result, if the commas are not disambiguated, Chinese would have run-on sentences that can only be plausibly translated into multiple English sentences. An example is given in (1), where one Chinese sentence is plausibly translated into three English sentences.

(1) [Chinese characters lost in extraction; only the English gloss survives] this period time AS AS pay attention to this nano 3 [...] CL Nano 3 even in person visit AS [...] a few AS computer market comparatively [...] speaking Zhuoyue's price relatively [...]
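The distinction drawn in the introduction, between punctuation that unambiguously ends a sentence and commas that may or may not do so, can be sketched in a few lines of Python. This is only an illustrative sketch, not the model described in the paper: the function names `split_on_final_punctuation` and `comma_candidates` are hypothetical, and the code merely locates the commas a classifier would have to label, without classifying them.

```python
import re

# Full-width marks that unambiguously end a Chinese sentence.
SENTENCE_FINAL = "。？！"
# Full-width comma: ambiguous, it may or may not signal a sentence boundary.
COMMA = "，"

# One chunk = a run of non-final characters plus any final marks that follow it.
_CHUNK = re.compile(r"[^{0}]+[{0}]*".format(SENTENCE_FINAL))


def split_on_final_punctuation(text):
    """Split text after each unambiguous sentence-final mark (hypothetical helper)."""
    return [c.strip() for c in _CHUNK.findall(text) if c.strip()]


def comma_candidates(sentence):
    """Return the character offsets of commas that would still need a label."""
    return [i for i, ch in enumerate(sentence) if ch == COMMA]


if __name__ == "__main__":
    text = "我来了，他走了。你好吗？"
    for chunk in split_on_final_punctuation(text):
        print(chunk, comma_candidates(chunk))
```

Splitting on the sentence-final marks is the easy, rule-based part. Every offset returned by `comma_candidates` is a decision point where the comma either does or does not signal a sentence boundary, which is the classification problem the paper addresses.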
