Scientific report: "A Syllable Based Word Recognition Model for Korean Noun Extraction"

Do-Gil Lee and Hae-Chang Rim
Dept. of Computer Science & Engineering, Korea University
1, 5-ka, Anam-dong, Seongbuk-ku, Seoul 136-701, Korea

Heui-Seok Lim
Dept. of Information & Communications, Chonan University
115 AnSeo-dong, CheonAn 330-704, Korea

dglee rim @ limhs@

Abstract

Noun extraction is very important for many NLP applications such as information retrieval, automatic text classification, and information extraction. Most previous Korean noun extraction systems use a morphological analyzer or a Part-of-Speech (POS) tagger. Therefore, they require a great deal of linguistic knowledge, such as morpheme dictionaries and rules (e.g., morphosyntactic rules and morphological rules). This paper proposes a new noun extraction method that uses a syllable based word recognition model. It finds the most probable syllable-tag sequence of the input sentence by using statistical information automatically acquired from a POS tagged corpus, and it extracts nouns by detecting word boundaries. Furthermore, it does not require any labor for constructing and maintaining linguistic knowledge. We have performed various experiments with a wide range of variables influencing the performance. The experimental results show that, without morphological analysis or POS tagging, the proposed method achieves performance comparable to that of the previous methods.

1 Introduction

Noun extraction is the process of finding every noun in a document (Lee et al., 2001).
In Korean, nouns are used as the most important terms (features) that express the document in NLP applications such as information retrieval, document categorization, text summarization, and information extraction. Korean is a highly agglutinative language, and nouns are included in Eojeols. An Eojeol is a surface-level form consisting of more than one combined morpheme. Therefore, morphological analysis or POS tagging is required to extract Korean nouns. The .
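
To make the idea stated in the abstract concrete (find the most probable syllable-tag sequence, then read nouns off the detected word boundaries), the following is a minimal sketch and not the authors' model. It assumes a first-order HMM with add-one smoothing, a hypothetical four-tag scheme (B-N, I-N, B-O, I-O for the beginning or inside of a noun or of any other word), and a three-sentence toy corpus invented for illustration. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tag set: B-/I- mark word-initial/word-internal syllables,
# N marks nouns, O marks every other word class.
TAGS = ["B-N", "I-N", "B-O", "I-O"]

def to_syllable_tags(words):
    """Convert [(word, pos), ...] with pos in {'N', 'O'} to [(syllable, tag), ...].
    Assumes precomposed Hangul, so one Python character is one syllable."""
    out = []
    for word, pos in words:
        for i, syl in enumerate(word):
            out.append((syl, ("B-" if i == 0 else "I-") + pos))
    return out

def train(corpus):
    """Collect tag-transition and syllable-emission counts from a tagged corpus."""
    trans, emit, prev_count, tag_count = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for words in corpus:
        prev = "<s>"
        for syl, tag in to_syllable_tags(words):
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            emit[(tag, syl)] += 1
            tag_count[tag] += 1
            vocab.add(syl)
            prev = tag
    return trans, emit, prev_count, tag_count, len(vocab)

def viterbi(syllables, model):
    """Return the most probable tag sequence under the smoothed bigram HMM."""
    if not syllables:
        return []
    trans, emit, prev_count, tag_count, vocab_size = model

    def log_t(p, t):  # add-one smoothed transition log-probability
        return math.log((trans[(p, t)] + 1) / (prev_count[p] + len(TAGS)))

    def log_e(t, s):  # add-one smoothed emission log-probability (+1 for unseen)
        return math.log((emit[(t, s)] + 1) / (tag_count[t] + vocab_size + 1))

    best = {t: (log_t("<s>", t) + log_e(t, syllables[0]), [t]) for t in TAGS}
    for s in syllables[1:]:
        new = {}
        for t in TAGS:
            score, path = max(
                ((best[p][0] + log_t(p, t), best[p][1]) for p in TAGS),
                key=lambda x: x[0],
            )
            new[t] = (score + log_e(t, s), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

def extract_nouns(syllables, tags):
    """Join syllables tagged B-N, I-N, ... into nouns, using tags as boundaries."""
    nouns, current = [], ""
    for syl, tag in zip(syllables, tags):
        if tag == "B-N":
            if current:
                nouns.append(current)
            current = syl
        elif tag == "I-N" and current:
            current += syl
        else:
            if current:
                nouns.append(current)
            current = ""
    if current:
        nouns.append(current)
    return nouns

# Toy POS-tagged corpus (invented for illustration only).
CORPUS = [
    [("학생", "N"), ("이", "O"), ("학교", "N"), ("에", "O"), ("간다", "O")],
    [("학교", "N"), ("가", "O"), ("크다", "O")],
    [("학생", "N"), ("이", "O"), ("온다", "O")],
]

model = train(CORPUS)
sentence = "학교에온다"
tags = viterbi(sentence, model)
print(extract_nouns(sentence, tags))
```

Because the tagger works on syllables, the noun inside an Eojeol such as 학교에 can be separated from its particle without a morpheme dictionary. With a corpus this small the result rests heavily on the smoothing choice, so the sketch illustrates only the mechanics and says nothing about accuracy.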