Scientific report: "Linguistic Profiling for Author Recognition and Verification"
Hans van Halteren
Language and Speech, Univ. of Nijmegen
Box 9103, NL-6500 HD Nijmegen, The Netherlands
hvh@

Abstract

A new technique is introduced, linguistic profiling, in which large numbers of counts of linguistic features are used as a text profile, which can then be compared to average profiles for groups of texts. The technique proves to be quite effective for authorship verification and recognition. The best parameter settings yield a False Accept Rate of [...] at a False Reject Rate equal to zero for the verification task on a test corpus of student essays, and a [...] 2-way recognition accuracy on the same corpus.

1 Introduction

There are several situations in language research or language engineering where we need a specific type of extra-linguistic information about a text document, and we would like to determine this information on the basis of linguistic properties of the text. Examples are the determination of the language variety or genre of a text, or a classification for document routing or information retrieval. For each of these applications, techniques have been developed that focus on specific aspects of the text, often based on frequency counts of function words in linguistics and of content words in language engineering. In the technique we are introducing in this paper, linguistic profiling, we make no a priori choice for a specific type of word (or more complex feature) to be counted.
Instead, all possible features are included, and it is determined by the statistics for the texts under consideration (and the distinction to be made) how much weight, if any, each feature is to receive. Furthermore, the frequency counts are not used as absolute values, but rather as deviations from a norm, which is again determined by the situation at hand. Our hypothesis is that this technique can bring a useful contribution to all tasks where it is necessary to distinguish one group of texts from [...]
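
The idea of using counts as deviations from a norm, rather than as absolute values, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the feature set (word unigrams and bigrams), the use of relative frequencies, the choice of mean and standard deviation over a reference collection as the norm, and the function names `feature_counts`, `build_norm` and `profile`.

```python
from collections import Counter
from statistics import mean, pstdev


def feature_counts(text):
    """Relative frequencies of word unigrams and bigrams (illustrative feature set)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return {feat: n / total for feat, n in counts.items()}


def build_norm(reference_texts):
    """Per-feature mean and standard deviation over a reference collection."""
    profiles = [feature_counts(t) for t in reference_texts]
    features = set().union(*profiles)
    norm = {}
    for feat in features:
        # A feature absent from a text counts as frequency 0 for that text.
        values = [p.get(feat, 0.0) for p in profiles]
        norm[feat] = (mean(values), pstdev(values))
    return norm


def profile(text, norm):
    """Express each feature as a deviation from the norm, in standard deviations."""
    freqs = feature_counts(text)
    result = {}
    for feat, (mu, sigma) in norm.items():
        if sigma > 0:  # features that never vary in the reference set are skipped
            result[feat] = (freqs.get(feat, 0.0) - mu) / sigma
    return result


# Hypothetical toy data: the norm is taken from three short reference texts.
reference = ["the cat sat on the mat", "a dog sat on a log", "the dog saw the cat"]
norm = build_norm(reference)
example = profile("the the the cat", norm)
```

In this sketch no feature is selected in advance: every unigram and bigram seen in the reference texts enters the profile, and a feature that is overused relative to the reference collection receives a positive value while an underused one receives a negative value. How much weight each deviation should get for a particular distinction is a separate step that the sketch does not attempt.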