Scientific report: "Manually Annotated Hungarian Corpus"


Zoltán Alexin, Department of Informatics, University of Szeged, alexin@inf.u-szeged.hu
Tibor Gyimóthy, Research Group on Artificial Intelligence at University of Szeged, gyimothy@inf.u-szeged.hu
Csaba Hatvani, Department of Informatics, University of Szeged, hacso@inf.u-szeged.hu
László Tihanyi, MorphoLogic Budapest, tihanyi@morphologic.hu
János Csirik, Department of Informatics, University of Szeged, csirik@inf.u-szeged.hu
Károly Bibok, Slavic Institute, University of Szeged, kbibok@lit.u-szeged.hu
Gábor Prószéky, MorphoLogic Budapest, proszeky@morphologic.hu

Abstract

This paper presents the results of a two-year project during which a consortium of the University of Szeged and MorphoLogic Ltd. Budapest developed a morpho-syntactically parsed and annotated (disambiguated) corpus for Hungarian. For morpho-syntactic encoding, the Hungarian version of MSD (Morpho-Syntactic Description) has been used. The corpus contains texts from five different topic areas: schoolchildren's compositions, fiction, computer-related texts, news, and legal texts. During annotation, linguists checked the morpho-syntactic parsing of each word. The researchers of the consortium also studied how to find part-of-speech tagging (disambiguation) rules with machine learning algorithms. Because the size of the corpus reaches 1 million text words, not counting punctuation characters, it may serve as a reference source for numerous future research applications. The corpus can be obtained freely via the Internet for research and educational purposes.

1 Introduction

The work began in 1998, when the authors started a research project on applying ILP (Inductive Logic Programming) learning methods to part-of-speech tagging. This research was carried out within the framework of a European ESPRIT project (LTR 20237, ILP2), where the first studies were based on the so-called TELRI corpus (Erjavec et al., 1998). Since the corpus annotation had several deficiencies and
