Semantic classification of Chinese unknown words

Huihsin Tseng
Linguistics
University of Colorado at Boulder
tseng@colorado.edu

Abstract

This paper describes a classifier that assigns semantic thesaurus categories to unknown Chinese words (words not already in the CiLin thesaurus and the Chinese Electronic Dictionary, but in the Sinica Corpus). The focus of the paper differs in two ways from previous research in this particular area. Prior research on Chinese unknown words mostly focused on proper nouns (Lee 1993; Lee, Lee and Chen 1994; Huang, Hong and Chen 1994; Chen and Chen 2000). This paper does not address proper nouns, focusing rather on common nouns, adjectives, and verbs. My analysis of the Sinica Corpus shows that, contrary to expectation, most unknown words in Chinese are common nouns, adjectives, and verbs rather than proper nouns. Other previous research has focused on features related to unknown word contexts (Caraballo 1999; Roark and Charniak 1998). While context is clearly an important feature, this paper focuses on non-contextual features, which may play a key role for unknown words that occur only once and hence have limited context. The feature I focus on, following Ciaramita (2002), is morphological similarity to words whose semantic category is known.
My nearest neighbor approach to lexical acquisition computes the distance between an unknown word and examples from the CiLin thesaurus based upon its morphological structure. The classifier improves on baseline semantic categorization performance for adjectives and verbs, but not for nouns.

1 Introduction

The biggest problem for assigning semantic categories to words lies in the incompleteness of dictionaries. It is impractical to construct a dictionary that will contain all words that may occur in some previously unseen corpora. This issue is particularly problematic for natural language processing applications that work with Chinese texts. Specifically, for the Sinica Corpus¹, Bai, Chen and Chen (1998) found that .