Scientific report: "Manually Annotated Hungarian Corpus"
Manually Annotated Hungarian Corpus

Zoltán Alexin, Department of Informatics, University of Szeged, alexin@inf.u-szeged.hu
Tibor Gyimóthy, Research Group on Artificial Intelligence at University of Szeged, gyimothy@inf.u-szeged.hu
Csaba Hatvani, Department of Informatics, University of Szeged, hacso@inf.u-szeged.hu
László Tihanyi, MorphoLogic, Budapest, tihanyi@morphologic.hu
János Csirik, Department of Informatics, University of Szeged, csirik@inf.u-szeged.hu
Károly Bibok, Slavic Institute, University of Szeged, kbibok@lit.u-szeged.hu
Gabor Proszeky, MorphoLogic, Budapest, proszeky@morphologic.hu

Abstract

This paper presents the results of a two-year project in which a consortium of the University of Szeged and MorphoLogic Ltd., Budapest, developed a morpho-syntactically parsed and annotated (disambiguated) corpus for Hungarian. The Hungarian version of MSD (Morpho-Syntactic Description) was used for morpho-syntactic encoding. The corpus contains texts from five topic areas: schoolchildren's compositions, fiction, computer-related texts, news, and legal texts. During annotation, linguists checked the morpho-syntactic parsing of each word. Researchers of the consortium also studied how machine learning algorithms can find disambiguation rules for part-of-speech tagging.
Because the size of the corpus reaches 1 million text words (not counting punctuation characters), it may serve as a reference source for numerous future research applications. The corpus can be obtained free of charge via the Internet for research and educational purposes.

1 Introduction

The work dates back to 1998, when the authors started a research project on applying ILP (Inductive Logic Programming) learning methods to part-of-speech tagging. This research was carried out within the framework of the European ESPRIT project LTR 20237 ILP2, where the first studies were based on the so-called TELRI corpus (Erjavec et al. 1998). Since the corpus annotation had several deficiencies and