Scientific report: "Exploring the Use of Linguistic Features in Domain and Genre Classification"

The central questions are: How useful is information about part-of-speech frequency for text categorisation? Is it feasible to limit word features to content words for text classification? This is examined for 5 domain and 4 genre classification tasks using LIMAS, the German equivalent of the Brown corpus. Because LIMAS is too heterogeneous, neither question can be answered reliably for any of the tasks. However, the results suggest that both questions have to be examined separately for each task at hand, because in some cases the additional information can indeed improve performance.

Proceedings of EACL '99

Maria Wolters (1) and Mathias Kirsten (2)
(1) Inst. f. Kommunikationsforschung u. Phonetik, Bonn, wolters@
(2) German Natl. Res. Center for St. Augustin

1 Introduction

The greater the amounts of text people can access and have to process, the more important efficient methods for text categorisation become. So far, most research has concentrated on content-based categories. But determining the genre of a text can also be very important, for example when having to distinguish an EU press release on the introduction of the euro from a newspaper commentary on the same topic.

The results of Lewis (1992) and Yang and Pedersen (1997) indicate that for good content classification, we basically need a vector which contains the most relevant words of the text. Using n-grams hardly yields significant improvements, because the dimension of the document representation space increases exponentially. But do word-based vectors also work well for genre detection? Or do we need additional linguistically motivated features to capture the different styles of writing associated with different genres?

In this paper, we present a pilot study based on a set of easily computable linguistic features, namely the frequency of part-of-speech (POS) tags, and a corpus of German, LIMAS (Glas, 1975), which
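
To make the two feature types discussed above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the text has already been POS-tagged and uses a small, purely illustrative tag set (`TAGSET`, `CONTENT_TAGS`); the tag inventory actually used with LIMAS is not given in this excerpt. The function names `pos_frequency_features` and `content_word_counts` are hypothetical.

```python
from collections import Counter

# Illustrative tag inventory (an assumption; not the tag set used in the paper).
TAGSET = ["ADJ", "ADV", "DET", "NOUN", "PUNCT", "VERB"]
# Tags treated as marking content words (also an assumption).
CONTENT_TAGS = {"ADJ", "ADV", "NOUN", "VERB"}


def pos_frequency_features(tagged_tokens, tagset=TAGSET):
    """Return the relative frequency of each POS tag in one document."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        # Empty document: every tag has frequency zero.
        return {tag: 0.0 for tag in tagset}
    return {tag: counts[tag] / total for tag in tagset}


def content_word_counts(tagged_tokens, content_tags=CONTENT_TAGS):
    """Count lower-cased word forms, keeping only content words."""
    return Counter(
        word.lower() for word, tag in tagged_tokens if tag in content_tags
    )


# A toy pre-tagged German sentence: "Der Euro kommt bald."
doc = [
    ("Der", "DET"),
    ("Euro", "NOUN"),
    ("kommt", "VERB"),
    ("bald", "ADV"),
    (".", "PUNCT"),
]

pos_features = pos_frequency_features(doc)
word_features = content_word_counts(doc)
```

Here `pos_features` is a fixed-length vector (one entry per tag) that does not grow with vocabulary size, whereas `word_features` grows with the number of distinct content words. Either representation, or their concatenation, could then be passed to a classifier.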
