Blog Categorization Exploiting Domain Dictionary and Dynamically Estimated Domains of Unknown Words

Chikara Hashimoto
Graduate School of Science and Engineering, Yamagata University
Yonezawa-shi, Yamagata, 992-8510, Japan
ch@

Sadao Kurohashi
Graduate School of Informatics, Kyoto University
Sakyo-ku, Kyoto, 606-8501, Japan
kuro@

Abstract

This paper presents an approach to text categorization that i) uses no machine learning and ii) reacts on-the-fly to unknown words. These features are important for categorizing Blog articles, which are updated on a daily basis and filled with newly coined words. We categorize 600 Blog articles into 12 domains. Our categorization method achieved an accuracy of 94.0% (564/600).

1 Introduction

This paper presents a simple but high-performance method for text categorization. The method assigns domain tags to words in an article and categorizes the article as the most dominant domain. In this study, the 12 domains in Table 1 are used, following Hashimoto and Kurohashi (2007) (H&K hereafter) [1]. Fundamental words are assigned a domain tag by H&K's domain dictionary, while the domains of non-fundamental words (i.e., unknown words) are dynamically estimated, which makes the method different from previous ones. Another hallmark of the method is that it requires no machine learning. All you need is the domain dictionary and access to the Web.

[1] In addition, NODOMAIN is prepared for words belonging to no particular domain, like blue or people.

Table 1: Domains Assumed in H&K

CULTURE      LIVING          SCIENCE
RECREATION   DIET            BUSINESS
SPORTS       TRANSPORTATION  MEDIA
HEALTH       EDUCATION       GOVERNMENT

2 The Domain Dictionary

H&K constructed a domain dictionary in which about 30,000 Japanese fundamental content words (JFWs) are associated with appropriate domains. For example, homer is associated with SPORTS.

Construction Process

Preparing Keywords for each Domain

About 20 keywords for each domain were collected manually from words that .
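
The categorization rule stated in the Introduction (tag each word with a domain, then label the article with the most dominant domain) can be illustrated with a minimal sketch. Several points here are assumptions rather than details from the paper: "most dominant" is read as "most frequent", NODOMAIN words are left out of the vote, and words missing from the dictionary are simply skipped, whereas the paper estimates their domains dynamically. The function name categorize and the dictionary entries other than homer, blue, and people are hypothetical.

```python
from collections import Counter

# Toy stand-in for the H&K domain dictionary (hypothetical entries;
# only "homer" -> SPORTS and the NODOMAIN examples come from the text).
DOMAIN_DICT = {
    "homer": "SPORTS",
    "pitcher": "SPORTS",
    "recipe": "DIET",
    "blue": "NODOMAIN",
    "people": "NODOMAIN",
}


def categorize(words, domain_dict=DOMAIN_DICT):
    """Return the most frequent domain among the words of an article."""
    counts = Counter()
    for word in words:
        domain = domain_dict.get(word)
        # Assumption: unknown words and NODOMAIN words do not vote.
        # The paper instead estimates domains of unknown words on the fly.
        if domain is None or domain == "NODOMAIN":
            continue
        counts[domain] += 1
    if not counts:
        # Assumption: an article with no domain-bearing words gets NODOMAIN.
        return "NODOMAIN"
    return counts.most_common(1)[0][0]
```

For instance, categorize(["homer", "pitcher", "recipe", "blue"]) returns "SPORTS", because two words vote for SPORTS, one for DIET, and blue is ignored. How ties are broken is left unspecified in this sketch.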
