Scientific report: "Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining"
Fan Zhang²*, Shuming Shi¹, Jing Liu², Shuqi Sun³*, Chin-Yew Lin¹
¹Microsoft Research Asia; ²Nankai University, China; ³Harbin Institute of Technology, China
{shumings, cyl}@
* This work was performed when Fan Zhang and Shuqi Sun were interns at Microsoft Research Asia.

Abstract

This paper focuses on mining the hyponymy (or is-a) relation from large-scale, open-domain web documents. A nonlinear probabilistic model is exploited to model the correlation between sentences in the aggregation of pattern matching results. Based on the model, we design a set of evidence combination and propagation algorithms. These significantly improve the result quality of existing approaches. Experimental results conducted on 500 million web pages and hypernym labels for 300 terms show over 20% performance improvement in terms of P@5, MAP and R-Precision.

1 Introduction

An important task in text mining is the automatic extraction of entities and their lexical relations; this has wide applications in natural language processing and web search. This paper focuses on mining the hyponymy (or is-a) relation from large-scale, open-domain web documents. From the viewpoint of entity classification, the problem is to automatically assign fine-grained class labels to terms.

There have been a number of approaches (Hearst, 1992; Pantel & Ravichandran, 2004; Snow et al., 2005; Durme & Pasca, 2008; Talukdar et al., 2008) to address the problem. These methods typically exploited manually-designed or automatically-learned patterns (e.g., "NP such as NP", "NP like NP", "NP is a NP").
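
To make the pattern-based approach concrete, the following is a minimal sketch of matching the three example patterns with regular expressions. It is not the authors' implementation: the name `extract_pairs`, the `PATTERNS` list, and the approximation of a noun phrase (NP) by a single word token are all assumptions made for illustration. A real system would identify noun phrases with a chunker or parser.

```python
import re

# Assumption: a noun phrase (NP) is approximated by one word token.
NP = r"[A-Za-z][A-Za-z-]*"

# Hypothetical regex versions of the patterns named in the text.
# Named groups mark which side is the hypernym and which the hyponym.
PATTERNS = [
    re.compile(rf"\b(?P<hyper>{NP}) such as (?P<hypo>{NP})"),
    re.compile(rf"\b(?P<hyper>{NP}) like (?P<hypo>{NP})"),
    re.compile(rf"\b(?P<hypo>{NP}) is an? (?P<hyper>{NP})"),
]


def extract_pairs(sentence):
    """Return (hyponym, hypernym) pairs matched by the toy patterns."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            pairs.append((match.group("hypo"), match.group("hyper")))
    return pairs


if __name__ == "__main__":
    print(extract_pairs("He visited cities such as Beijing last year."))
    print(extract_pairs("Python is a language"))
```

Each match is a single piece of evidence from one sentence. The sketch stops at matching and does not attempt the aggregation across sentences that the paper models.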
Although some degree of success has been achieved with these efforts, the results are still far from perfect in terms of both recall and precision. As will be demonstrated in this paper, even by processing a large corpus of 500 million web pages with the most popular patterns, we are not able to extract correct labels for many (especially rare) entities. Even for popular terms, incorrect ...