Scientific report: "Tagging Urdu Text with Parts of Speech: A Tagger Comparison"
Hassan Sajjad
Universität Stuttgart
Stuttgart, Germany
sajjad@

Helmut Schmid
Universität Stuttgart
Stuttgart, Germany
schmid@

Abstract

In this paper, four state-of-the-art probabilistic taggers, namely TnT tagger, TreeTagger, RF tagger and SVM tool, are applied to the Urdu language. For the purpose of the experiment, a syntactic tagset is proposed. A training corpus of 100,000 tokens is used to train the models. Using the lexicon extracted from the training corpus, SVM tool shows the best accuracy of . After providing a separate lexicon of 70,568 types, SVM tool again shows the best accuracy of .

1 Urdu Language

Urdu belongs to the Indo-Aryan language family. It is the national language of Pakistan and one of the official languages of India. The majority of Urdu speakers are spread over South Asia, South Africa and the United Kingdom. Urdu is a free word order language whose general word order is SOV. It shares its phonological, morphological and syntactic structures with Hindi. Some linguists consider them two different dialects of one language (Bhatia and Koul, 2000). However, Urdu is written in Perso-Arabic script and inherits most of its vocabulary from Arabic and Persian, whereas Hindi is written in Devanagari script and inherits its vocabulary from Sanskrit.

Urdu is a morphologically rich language. Forms of the verb, as well as case, gender and number, are expressed by the morphology.
Urdu represents case with a separate character after the head noun of the noun phrase. Because case markers occur separately and in this position, they are sometimes considered postpositions. Treating them as case markers, Urdu has nominative, ergative, accusative, dative, instrumental, genitive and locative cases (Butt, 1995, pg 10). The Urdu verb phrase contains a main verb, a light verb describing the aspect, and a tense verb describing the tense of the phrase (Hardie).
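
Because case markers are written as separate tokens after the head noun, a tagger sees them as words of their own. The following minimal sketch illustrates that token-level view. The romanized marker forms, the label strings and the `mark_cases` helper are assumptions made for illustration only; they are not the tagset proposed in the paper, and several of these forms are ambiguous in real text, which is one reason a context-sensitive tagger is needed rather than a lookup table.

```python
# Illustrative sketch only: romanized forms and label names are assumptions,
# not the tagset proposed in the paper. Nominative is unmarked, so it has no entry.
CASE_MARKERS = {
    "ne": "ergative",
    "ko": "accusative/dative",
    "se": "instrumental",
    "ka": "genitive",
    "ki": "genitive",
    "ke": "genitive",
    "mein": "locative",
    "par": "locative",
}


def mark_cases(tokens):
    """Pair each token with a case label if it is a known marker, else None.

    This is a context-free lookup; it ignores the ambiguity of these forms.
    """
    return [(tok, CASE_MARKERS.get(tok)) for tok in tokens]


# "Ali ne kitab parhi" (roughly "Ali read the book"): the ergative marker
# "ne" is a separate token following the noun "Ali".
tokens = "Ali ne kitab parhi".split()
for token, case in mark_cases(tokens):
    print(token, case)
```

A lookup like this only shows how the markers surface as independent tokens; deciding between competing readings of a form depends on context and is left to the trained tagger.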