Scientific report: "Morphological Richness Offsets Resource Demand - Experiences in Constructing a POS Tagger for Hindi"

Smriti Singh, Kuhoo Gupta, Manish Shrivastava, Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology Bombay
Powai, Mumbai 400076, Maharashtra, India
smriti, kuhoo, manshri, pb @

Abstract

In this paper we report our work on building a POS tagger for a morphologically rich language, Hindi. The theme of the research is to vindicate the stand that if morphology is strong and harnessable, then lack of training corpora is not debilitating. We establish a methodology of POS tagging which resource-disadvantaged languages (those lacking annotated corpora) can make use of. The methodology makes use of locally annotated, modestly sized corpora (15,562 words), exhaustive morphological analysis backed by a high-coverage lexicon, and a decision-tree-based learning algorithm (CN2). The system was evaluated with 4-fold cross-validation of the corpora in the news domain hindi. The current accuracy of POS tagging is [...] and can be further improved.

1 Motivation and Problem Definition

Part-of-Speech (POS) tagging is a complex task fraught with challenges such as ambiguity of parts of speech and the handling of lexical absence (proper nouns, foreign words, derivationally morphed words, spelling variations and other unknown words) (Manning and Schutze, 2002). For English there are many POS taggers employing machine learning techniques such as transformation-based error-driven learning (Brill, 1995), decision trees (Black et al., 1992), Markov models (Cutting et al., 1992), maximum entropy methods (Ratnaparkhi, 1996), etc. There are also hybrid taggers that use both stochastic and rule-based approaches, such as CLAWS (Garside and Smith, 1997). The accuracy of these taggers ranges from approximately 93% to 98%. English has annotated corpora in abundance, enabling the use of powerful data-driven machine learning methods. But very few languages in the world have
