tailieunhanh - Scientific report: "Fragments and Text Categorization"

Fragments and Text Categorization

Jan Blatak and Eva Mrakova and Lubos Popelinsky
Knowledge Discovery Lab, Faculty of Informatics, Masaryk University
602 00 Brno, Czech Republic
xblatak glum popel @

Abstract

We introduce two novel methods of text categorization in which documents are split into fragments. We conducted experiments on English, French and Czech. In all cases, the task was binary document classification. We find that both methods increase the accuracy of text categorization. For the Naive Bayes classifier this increase is significant.

1 Motivation

In the automatic classification of documents into several predefined classes, known as text categorization (Sebastiani, 2002), text documents are usually seen as sets or bags of all the words that appear in a document, possibly after removing words in a stop-list. In this paper we describe a novel approach to text categorization in which each document is first split into subparts called fragments. Each fragment is then treated as a new document that shares the label of its source document. We introduce two variants of this approach: skip-tail and fragments. Both methods are briefly described below. We demonstrate the increased accuracy that we observed.

Skipping the tail of a document

The first method uses only the first X sentences of a document and is henceforth referred to as skip-tail. The idea behind this approach is that the beginning of each document contains enough information for classification. During learning, each document is first replaced by its initial part.
The learning algorithm then uses only these initial fragments as learning and test examples. We also sought the minimum length of initial fragments that preserves the accuracy of the classification.

Splitting a document into fragments

The second method splits documents into fragments that are classified independently of each other. This method is henceforth referred to as fragments.
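
The two preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The regex-based sentence splitter, the function names, and the choice of fixed-size, non-overlapping fragments measured in sentences are all assumptions made for illustration; the text here does not specify how fragments are delimited.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (an assumption, not from the paper):
    break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def skip_tail(text, x):
    """Skip-tail: keep only the first x sentences of a document."""
    return " ".join(split_sentences(text)[:x])


def split_into_fragments(text, label, size):
    """Fragments: cut a document into consecutive pieces of `size`
    sentences (hypothetical fragment definition). Every piece becomes
    a new example that carries the label of its source document."""
    if size < 1:
        raise ValueError("size must be at least 1")
    sentences = split_sentences(text)
    return [
        (" ".join(sentences[i:i + size]), label)
        for i in range(0, len(sentences), size)
    ]


if __name__ == "__main__":
    doc = "One. Two! Three? Four."
    print(skip_tail(doc, 2))
    print(split_into_fragments(doc, "pos", 3))
```

Either function would be applied to each labeled document before the examples are passed to a classifier such as Naive Bayes. How the independent fragment predictions are combined into a document-level decision is not covered in this excerpt, so the sketch stops at producing the labeled fragments.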
