Scientific report: "Manually Annotated Hungarian Corpus"
Manually Annotated Hungarian Corpus

Zoltán Alexin, Department of Informatics, University of Szeged, alexin
Tibor Gyimóthy, Research Group on Artificial Intelligence at University of Szeged, gyimothy@
Csaba Hatvani, Department of Informatics, University of Szeged, hacso@
László Tihanyi, MorphoLogic, Budapest
János Csirik, Department of Informatics, University of Szeged, csirik@
Károly Bibok, Slavic Institute, University of Szeged, kbibok@
Gabor Proszeky, MorphoLogic, Budapest, proszeky@

Abstract

This paper presents the results of a two-year project during which a consortium of the University of Szeged and MorphoLogic Ltd., Budapest, developed a morpho-syntactically parsed and annotated (disambiguated) corpus for Hungarian. For morpho-syntactic encoding, the Hungarian version of MSD (Morpho-Syntactic Description) has been used. The corpus contains texts from five different topic areas: schoolchildren's compositions, fiction, computer-related texts, news, and legal texts. During annotation, linguists checked the morpho-syntactic parsing of each word. The researchers of the consortium also studied how machine learning algorithms can find disambiguation rules for part-of-speech tagging. Because the size of the corpus reaches 1 million text words, not counting punctuation characters, it may serve as a reference source for numerous future research applications.
The corpus is freely available over the Internet for research and educational purposes.

1 Introduction

The work dates back to 1998, when the authors started a research project on applying ILP (Inductive Logic Programming) learning methods to part-of-speech tagging. This research was carried out within the framework of the European ESPRIT project LTR 20237 ILP2, where the first studies were based on the so-called TELRI corpus (Erjavec et al., 1998). Since the corpus annotation had several deficiencies and