Scientific report: "An Unsupervised System for Identifying English Inclusions in German Text"

Beatrice Alex
School of Informatics, University of Edinburgh, Edinburgh EH8 9LW, UK
v1balex@

Abstract

We present an unsupervised system that exploits linguistic knowledge resources, namely English and German lexical databases and the World Wide Web, to identify English inclusions in German text. We describe experiments with this system and the corpus which was developed for this task. We report the classification results of our system and compare them to the performance of a trained machine learner in a series of in- and cross-domain experiments.

1 Introduction

The recognition of foreign words and foreign named entities (NEs) in otherwise monolingual text is beyond the capability of many existing approaches and is only starting to be addressed. This language mixing phenomenon is prevalent in German, where the number of anglicisms has increased considerably. We have developed an unsupervised and highly efficient system that identifies English inclusions in German text by means of a computationally inexpensive lookup procedure. By unsupervised we mean that the system does not require any annotated training data and relies only on lexicons and the Web. Our system allows linguists and lexicographers to observe language changes over time and to investigate the use and frequency of foreign words in a given language and domain.
The output also represents valuable information for a number of applications, including polyglot text-to-speech (TTS) synthesis and machine translation (MT). We will first explain the issue of foreign inclusions in German text in greater detail, with examples, in Section 2. Sections 3 and 4 describe the data we used and the architecture of our system. In Section 5, we provide an evaluation of the system output and compare the results with those of a series of in- and cross-domain machine learning experiments outlined in Section 6. We conclude and outline future work in Section
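
To make the idea of a lexicon-based lookup concrete, the following is a minimal sketch and not the system's actual procedure. The two word lists, the function names `classify_token` and `tag_sentence`, and the decision rule are all hypothetical assumptions introduced for illustration. The Web-based step is omitted: tokens found in both lexicons or in neither are simply left undecided.

```python
# Hypothetical toy lexicons (assumption); a real system would rely on
# full English and German lexical databases.
GERMAN_LEXICON = {"das", "ist", "ein", "die", "der", "und", "hat", "computer"}
ENGLISH_LEXICON = {"the", "is", "a", "computer", "security", "tool", "software"}


def classify_token(token):
    """Label one token using lexicon membership only (illustrative rule)."""
    word = token.lower()
    in_german = word in GERMAN_LEXICON
    in_english = word in ENGLISH_LEXICON
    if in_english and not in_german:
        return "EN"
    if in_german and not in_english:
        return "DE"
    # Found in both lexicons or in neither: the lexicons alone cannot
    # decide, so further evidence (for example from the Web) would be needed.
    return "UNKNOWN"


def tag_sentence(sentence):
    """Split on whitespace and label every token."""
    return [(token, classify_token(token)) for token in sentence.split()]


if __name__ == "__main__":
    for token, label in tag_sentence("Das Security Tool ist neu"):
        print(token, label)
```

Each token costs two set-membership tests, which is one way a lookup of this kind can stay computationally inexpensive and work without annotated training data.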
