Scientific report: "Employing Topic Models for Pattern-based Semantic Class Discovery"

Huibin Zhang1, Mingjie Zhu2, Shuming Shi3, Ji-Rong Wen3
1Nankai University
2University of Science and Technology of China
3Microsoft Research Asia
{v-huibzh, v-mingjz, shumings, jrwen}@
This work was performed when the authors were interns at Microsoft Research Asia.

Abstract

A semantic class is a collection of items (words or phrases) that have a semantically peer or sibling relationship. This paper studies the use of topic models to automatically construct semantic classes, taking as source data a collection of raw semantic classes (RASCs) that were extracted by applying predefined patterns to web pages. The primary requirement (and challenge) here is dealing with multi-membership: an item may belong to multiple semantic classes, and we need to discover as many as possible of the different semantic classes the item belongs to. To adopt topic models, we treat RASCs as "documents", items as "words", and the final semantic classes as "topics". Appropriate preprocessing and postprocessing are performed to improve result quality, to reduce computation cost, and to tackle the fixed- constraint of a typical topic model. Experiments conducted on 40 million web pages show that our approach could yield better results than alternative approaches.

1 Introduction

Semantic class construction (Lin and Pantel, 2001; Pantel and Lin, 2002; Pasca, 2004; Shinzato and Torisawa, 2005; Ohshima et al., 2006) tries to discover the peer or sibling relationship among terms or phrases by organizing them into semantic classes. For example, {red, white, black, …} is a semantic class consisting of color instances. A popular approach to semantic class discovery is the pattern-based approach, where predefined patterns (Table 1) are applied to a collection of web pages or an online web search engine to produce some raw semantic classes (abbreviated as RASCs, Table 2). RASCs cannot be treated as the ultimate semantic classes because they are typically noisy and .
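The mapping described in the abstract (RASCs as "documents", items as "words", semantic classes as "topics") can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes LDA fitted by collapsed Gibbs sampling as the topic model, which this passage does not specify. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy RASCs are all hypothetical choices made for illustration. The toy data places "orange" in both color lists and fruit lists to mimic multi-membership.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; each doc is one RASC (a list of items).

    Returns one smoothed distribution p(item | topic) per topic.
    All default values are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    vocab = sorted({item for doc in docs for item in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]         # count(doc, topic)
    topic_item = [Counter() for _ in range(num_topics)]  # count(topic, item)
    topic_total = [0] * num_topics                       # count(topic)
    assign = []

    # Give every item occurrence a random initial topic.
    for d, doc in enumerate(docs):
        topics = []
        for item in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_item[k][item] += 1
            topic_total[k] += 1
        assign.append(topics)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, item in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_item[k][item] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional p(topic | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_item[t][item] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_item[k][item] += 1
                topic_total[k] += 1

    return [
        {
            w: (topic_item[k][w] + beta) / (topic_total[k] + vocab_size * beta)
            for w in vocab
        }
        for k in range(num_topics)
    ]


# Hypothetical toy RASCs: "orange" appears in both color and fruit lists.
rascs = [
    ["red", "white", "black", "orange"],
    ["red", "black", "orange", "green"],
    ["white", "green", "red", "black"],
    ["apple", "banana", "orange", "pear"],
    ["banana", "pear", "apple", "orange"],
    ["pear", "orange", "banana", "apple"],
]

phi = lda_gibbs(rascs, num_topics=2)
for k, dist in enumerate(phi):
    top_items = sorted(dist, key=dist.get, reverse=True)[:4]
    print(f"topic {k}: {top_items}")
```

Each returned distribution ranks items by their weight in one topic, so the top-ranked items of a topic can be read as a candidate semantic class. The number of topics has to be chosen in advance in this sketch.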