Scientific report: "The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content"

Omar F. Zaidan and Chris Callison-Burch
Dept. of Computer Science, Johns Hopkins University
Baltimore, MD 21218, USA
{ozaidan,ccb}@cs.jhu.edu

Abstract

The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which as having dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.

1 Introduction

The Arabic language is characterized by an interesting linguistic dichotomy, whereby the written form of the language, Modern Standard Arabic (MSA), differs in a non-trivial fashion from the various spoken varieties of Arabic. As the variant of choice for written and official communication, MSA content significantly dominates dialectal content, and in turn MSA dominates in datasets available for linguistic research, especially in textual form.

The abundance of MSA data has greatly aided research on computational methods applied to Arabic, but only the MSA variant of it. A state-of-the-art Arabic-to-English machine translation system performs quite well when translating MSA source sentences, but often produces incomprehensible output when the input is dialectal. For example, most words

[Figure: example source sentences (Src) with transliteration (TL) and machine translation (MT) output]
Src (MSA), TL: mtY snrY h h Alvlp mn Almjrmyn txDE llmHAkmp
MT: When will we see this group of offenders subject to a trial
Src (Lev), TL: AymtY rH n wf hAl lp mn