Scientific report: "Automatic Labelling of Topic Models"

Jey Han Lau, Karl Grieser, David Newman and Timothy Baldwin
NICTA Victoria Research Laboratory
Dept of Computer Science and Software Engineering, University of Melbourne
Dept of Computer Science, University of California Irvine
jhlau@ kgrieser@ newman@ tb@

Abstract

We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.

1 Introduction

Topic modelling is an increasingly popular framework for simultaneously soft-clustering terms and documents into a fixed number of "topics", which take the form of a multinomial distribution over terms in the document collection (Blei et al., 2003). It has been demonstrated to be highly effective in a wide range of tasks, including multi-document summarisation (Haghighi and Vanderwende, 2009), word sense discrimination (Brody and Lapata, 2009), sentiment analysis (Titov and McDonald, 2008), information retrieval (Wei and Croft, 2006) and image labelling (Feng and Lapata, 2010).
One standard way of interpreting a topic is to use the marginal probabilities p(w_i | t_j) associated with each term w_i in a given topic t_j to extract the 10 terms with the highest marginal probability. This results in term lists such as:[1]

    stock market investor fund trading investment firm exchange companies share

which are clearly associated with the domain of stock market trading. The aim of this

[1] Here and throughout the paper, we will represent a topic t_j via its ranking of top-10 topic terms, based on p(w_i | t_j).
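
The top-10 extraction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `top_terms`, the dictionary representation of a topic, and the probability values are all assumptions made for the example, and the numbers are invented rather than taken from the paper.

```python
from typing import Dict, List


def top_terms(term_probs: Dict[str, float], n: int = 10) -> List[str]:
    """Return the n terms with the highest probability within one topic."""
    # Sort by descending probability; ties are broken alphabetically
    # so that the output is deterministic.
    ranked = sorted(term_probs.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


# Illustrative (made-up) values of p(w_i | t_j) for a single topic t_j.
example_topic = {
    "stock": 0.090,
    "market": 0.080,
    "investor": 0.050,
    "fund": 0.040,
    "weather": 0.001,
}

top_three = top_terms(example_topic, n=3)
print(top_three)
```

With a real LDA model, `term_probs` would hold one entry per vocabulary term for the chosen topic, and the default `n=10` would give the top-10 representation used to display a topic.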
