Scientific report: "Feature-Rich Part-of-speech Tagging for Morphologically Complex Languages: Application to Bulgarian"

Georgi Georgiev and Valentin Zhikov
Ontotext AD, 135 Tsarigradsko Sh., Sofia, Bulgaria

Petya Osenova and Kiril Simov
IICT, Bulgarian Academy of Sciences, 25A Acad. G. Bonchev, Sofia, Bulgaria

Preslav Nakov
Qatar Computing Research Institute, Qatar Foundation, Tornado Tower, floor 10, P.O. Box 5825, Doha, Qatar

Abstract

We present experiments with part-of-speech tagging for Bulgarian, a Slavic language with rich inflectional and derivational morphology. Unlike most previous work, which has used a small number of grammatical categories, we work with 680 morpho-syntactic tags. We combine a large morphological lexicon with prior linguistic knowledge and guided learning from a POS-annotated corpus, achieving accuracy of […], which is a significant improvement over the state-of-the-art for Bulgarian.

1 Introduction

Part-of-speech (POS) tagging is the task of assigning each of the words in a given piece of text a contextually suitable grammatical category. This is not trivial since words can play different syntactic roles in different contexts, e.g., can is a noun in "I opened a can of coke." but a verb in "I can write."

Traditionally, linguists have classified English words into the following eight basic POS categories: noun, pronoun, adjective, verb, adverb, preposition, conjunction, and interjection; this list is often extended a bit, e.g., with determiners, particles, participles, etc., but the number of categories considered is rarely more than 15.

Computational linguistics works with a larger inventory of POS tags, e.g., the Penn Treebank (Marcus et al., 1993) uses 48 tags: 36 for part-of-speech, and 12 for punctuation and currency symbols. This increase in the number of tags is partially due to finer granularity, e.g., there are special tags for determiners, particles, modal verbs, cardinal numbers, foreign words, existential there
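
The ambiguity of "can" described in the Introduction can be illustrated with a toy lexicon-constrained tagger. This is only a minimal sketch under stated assumptions: the lexicon entries, the coarse tag names, the fallback for unknown words, and the single context rule are all invented for illustration and are not the method or the tagset used in the paper.

```python
# Toy illustration only: the lexicon, the tag names, and the context rule are
# assumptions made for this example; they are not taken from the paper.
LEXICON = {
    "i": {"PRON"},
    "opened": {"VERB"},
    "a": {"DET"},
    "can": {"NOUN", "VERB"},  # ambiguous when seen out of context
    "of": {"PREP"},
    "coke": {"NOUN"},
    "write": {"VERB"},
}


def tag(tokens):
    """Assign one tag per token, using the previous tag to resolve ambiguity."""
    tags = []
    for token in tokens:
        # Assumed fallback: treat words missing from the lexicon as nouns.
        candidates = LEXICON.get(token.lower(), {"NOUN"})
        if len(candidates) == 1:
            choice = next(iter(candidates))
        else:
            previous = tags[-1] if tags else None
            # Hypothetical rule: prefer a noun after a determiner, else a verb.
            preferred = "NOUN" if previous == "DET" else "VERB"
            # If the preferred tag is not licensed by the lexicon, pick the
            # alphabetically first candidate so the result is deterministic.
            choice = preferred if preferred in candidates else sorted(candidates)[0]
        tags.append(choice)
    return tags


print(tag("I opened a can of coke".split()))
print(tag("I can write".split()))
```

The lexicon restricts each word to the tags it may take, and the context decides among them. A hand-written rule like this one would not scale to an inventory of hundreds of morpho-syntactic tags, which is where learning from an annotated corpus comes in.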
