Scientific report: "Untangling the Cross-Lingual Link Structure of Wikipedia"

Gerard de Melo, Max Planck Institute for Informatics, Saarbrücken, Germany, demelo@
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany, weikum@

Abstract

Wikipedia articles in different languages are connected by interwiki links that are increasingly being recognized as a valuable source of cross-lingual information. Unfortunately, large numbers of links are imprecise or simply wrong. In this paper, techniques to detect such problems are identified. We formalize their removal as an optimization task based on graph repair operations. We then present an algorithm with provable properties that uses linear programming and a region growing technique to tackle this challenge. This allows us to transform Wikipedia into a much more consistent multilingual register of the world's entities and concepts.

1 Introduction

Motivation. The open, community-maintained encyclopedia Wikipedia has not only turned the Internet into a more useful and linguistically diverse source of information, but is also increasingly being used in computational applications as a large-scale source of linguistic and encyclopedic knowledge. To allow cross-lingual navigation, Wikipedia offers cross-lingual interwiki links that, for instance, connect the Indonesian article about Albert Einstein to the corresponding articles in over 100 other languages. Such links are extraordinarily valuable for cross-lingual applications. In the ideal case, a set of articles connected directly or indirectly via such links would all describe the same entity or concept.
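The notion of articles being "connected directly or indirectly" can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the algorithm of the paper: it treats interwiki links as undirected edges, represents each article as a hypothetical `(language, title)` pair, and groups articles into connected components with a simple union-find structure. The function name `connected_components` and the sample links are invented for this example.

```python
from collections import defaultdict


def connected_components(links):
    """Group articles that are joined directly or indirectly by links.

    `links` is an iterable of (article_a, article_b) pairs. Treating the
    links as undirected is an assumption made for this illustration.
    """
    parent = {}

    def find(x):
        # Register unseen articles as their own representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the components of the two endpoints of every link.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect articles by their final representative.
    groups = defaultdict(set)
    for article in list(parent):
        groups[find(article)].add(article)
    return list(groups.values())


# Hypothetical sample data: articles are (language, title) pairs.
links = [
    (("id", "Albert Einstein"), ("en", "Albert Einstein")),
    (("en", "Albert Einstein"), ("de", "Albert Einstein")),
    (("en", "Economics"), ("fr", "Économie")),
]
components = connected_components(links)
```

In this toy input, the Indonesian and German articles end up in the same component even though no link connects them directly, which is the sense of "indirectly" used here.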
Due to conceptual drift, different granularities, as well as mistakes made by editors, we frequently find concepts as different as economics and manager in the same connected component. Filtering out inaccurate links enables us to exploit Wikipedia's multilinguality in a much safer manner and allows us to create a multilingual register of named entities.

Contribution.
