Scientific report: "Concept Unification of Terms in Different Languages for IR"
Qing Li, Sung-Hyon Myaeng
Information Communications University, Korea
liqing myaeng @

Yun Jin
Chungnam National University, Korea
wkim@

Bo-yeong Kang
Seoul National University, Korea
comeng99@

Abstract

For historical and cultural reasons, English phrases, especially proper nouns and new words, frequently appear in Web pages written primarily in Asian languages such as Chinese and Korean. Although these English terms and their equivalents in the Asian languages refer to the same concept, they are erroneously treated as independent index units in traditional Information Retrieval (IR). This paper describes the degree to which the problem arises in IR and suggests a novel technique to solve it. Our method first extracts an English phrase from Asian-language Web pages and then unifies the extracted phrase and its equivalent(s) in the language as one index unit. Experimental results show that the high precision of our conceptual unification approach greatly improves IR performance.

1 Introduction

The mixed use of English and local languages presents a classical problem of vocabulary mismatch in monolingual information retrieval (MIR). The problem is especially significant in Asian languages because words in the local languages are often mixed with English words.
Although English terms and their equivalents in a local language refer to the same concept, they are erroneously treated as independent index units in traditional MIR. Such separation of semantically identical words in different languages may limit retrieval performance. For instance, as shown in Figure 1, there are three kinds of Chinese Web pages containing information related to the Viterbi Algorithm (维特比算法). The first case contains "Viterbi Algorithm" but not its Chinese equivalent 维特比算法. The second [...]

[Figure 1 excerpt: "... as the states of HMM (hidden Markov models), in which the Viterbi algorithm is employed for ..."]
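
The vocabulary mismatch described in this introduction can be illustrated with a minimal Python sketch. It is not the authors' extraction or unification method: the equivalence table `EQUIVALENTS`, the concept identifier, the three sample documents, and the helper names `index_units`, `build_index`, and `search` are all hypothetical, and the Chinese string is simply the common rendering of "Viterbi algorithm". The sketch only shows how mapping a term and its equivalent to one index unit changes which documents a query can reach.

```python
from collections import defaultdict

# Hypothetical, hand-written equivalence table (an assumption for illustration;
# the paper obtains such pairs automatically from Web pages).
EQUIVALENTS = {
    "viterbi algorithm": "concept:viterbi_algorithm",
    "维特比算法": "concept:viterbi_algorithm",
}


def index_units(text, unify):
    """Return the index units found in text.

    With unify=False each surface term is its own unit (the traditional
    behaviour); with unify=True equivalent terms share one concept unit.
    """
    lowered = text.lower()
    units = set()
    for term, concept in EQUIVALENTS.items():
        if term in lowered:
            units.add(concept if unify else term)
    return units


def build_index(docs, unify):
    """Build a tiny inverted index: index unit -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for unit in index_units(text, unify):
            index[unit].add(doc_id)
    return index


def search(index, query, unify):
    """Return sorted ids of documents sharing an index unit with the query."""
    hits = set()
    for unit in index_units(query, unify):
        hits |= index.get(unit, set())
    return sorted(hits)


# Made-up sample documents (not taken from the paper).
DOCS = {
    1: "HMM 中使用 Viterbi Algorithm 进行解码",   # English term only
    2: "维特比算法是一种动态规划算法",             # Chinese term only
    3: "维特比算法 (Viterbi Algorithm) 的介绍",    # both terms
}

plain = build_index(DOCS, unify=False)
unified = build_index(DOCS, unify=True)

print(search(plain, "Viterbi Algorithm", unify=False))   # [1, 3]
print(search(unified, "Viterbi Algorithm", unify=True))  # [1, 2, 3]
```

With separate index units, the English query misses the document that uses only the Chinese term; with a shared unit, all three documents are retrieved. A real system would also need tokenization and a way to discover the equivalents, which this sketch deliberately leaves out.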