Scientific report: "Language Identification of Search Engine Queries"
Language Identification of Search Engine Queries

Hakan Ceylan
Department of Computer Science, University of North Texas, Denton, TX 76203
hakan@

Yookyung Kim
Yahoo! Inc., 2821 Mission College Blvd., Santa Clara, CA 95054
ykim@

Abstract

We consider the language identification problem for search engine queries. First, we propose a method to automatically generate a data set, which uses clickthrough logs of the Yahoo! Search Engine to derive the language of a query indirectly from the language of the documents clicked by the users. Next, we use this data set to train two decision tree classifiers: one that uses only linguistic features and is aimed at textual language identification, and one that additionally uses a non-linguistic feature and is geared towards identifying the language intended by the users of the search engine. Our results show that our method produces a highly reliable data set very efficiently, and that our decision tree classifier outperforms some of the best methods that have been proposed for the task of written language identification in the domain of search engine queries.

1 Introduction

The language identification problem refers to the task of deciding in which natural language a given text is written.
Although the problem is heavily studied by the Natural Language Processing community, most of the research carried out to date has been concerned with relatively long texts, such as articles or web pages, which usually contain enough text for the systems built for this task to reach almost perfect accuracy. Figure 1 shows the performance of six different language identification methods on written texts of 10 European languages that use the Roman alphabet. It can be seen that the methods reach a very high accuracy when the text has 100 or more characters. However, search engine queries are very short: they have about 2 to 3 words on average

Figure 1 (x-axis: input size in characters): Performance of six Language Identification methods on varying
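
The data set generation idea stated in the abstract, deriving a query's language indirectly from the languages of the documents users clicked, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `(query, document_language)` log format, the majority-vote rule, and the `min_clicks` and `min_agreement` thresholds are all assumptions made for illustration, since this excerpt does not give the actual procedure.

```python
from collections import Counter


def label_query_language(clicked_doc_languages, min_clicks=3, min_agreement=0.8):
    """Return the dominant language of the clicked documents, or None.

    Both thresholds are hypothetical values chosen for this sketch.
    """
    total = len(clicked_doc_languages)
    # Too few clicks: not enough evidence to label the query.
    if total == 0 or total < min_clicks:
        return None
    language, votes = Counter(clicked_doc_languages).most_common(1)[0]
    # The clicked documents disagree too much: leave the query unlabeled.
    if votes / total < min_agreement:
        return None
    return language


def build_data_set(click_log, min_clicks=3, min_agreement=0.8):
    """Map each query to a language label derived from its clicked documents.

    click_log is an iterable of (query, document_language) pairs, an
    assumed format; real clickthrough logs carry more fields.
    """
    languages_by_query = {}
    for query, doc_language in click_log:
        languages_by_query.setdefault(query, []).append(doc_language)

    labeled = {}
    for query, languages in languages_by_query.items():
        label = label_query_language(languages, min_clicks, min_agreement)
        if label is not None:
            labeled[query] = label
    return labeled


if __name__ == "__main__":
    toy_log = (
        [("wetter berlin", "de")] * 4
        + [("wetter berlin", "en")]
        + [("paris", "fr"), ("paris", "en"), ("paris", "en")]
    )
    print(build_data_set(toy_log))
```

In this toy log, "wetter berlin" is labeled because most of its clicked documents agree on one language, while "paris" is left unlabeled because its clicks are split. Discarding ambiguous queries is one possible way to trade coverage for label reliability.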