Scientific report: "Lemmatisation as a Tagging Task"

Lemmatisation as a Tagging Task

Andrea Gesmundo
Department of Computer Science
University of Geneva
andrea.gesmundo@unige.ch

Tanja Samardžić
Department of Linguistics
University of Geneva
tanja.samardzic@unige.ch

Abstract

We present a novel approach to the task of word lemmatisation. We formalise lemmatisation as a category tagging task by describing how a word-to-lemma transformation rule can be encoded in a single label and how a set of such labels can be inferred for a specific language. In this way, a lemmatisation system can be trained and tested using any supervised tagging model. In contrast to previous approaches, the proposed technique allows us to easily integrate relevant contextual information. We test our approach on eight languages, reaching a new state-of-the-art level for the lemmatisation task.

1 Introduction

Lemmatisation and part-of-speech (POS) tagging are necessary steps in automatic processing of language corpora. This annotation is a prerequisite for developing systems for more sophisticated automatic processing, such as information retrieval, as well as for using language corpora in linguistic research and in the humanities. Lemmatisation is especially important for processing morphologically rich languages, where the number of different word forms is too large to be included in the part-of-speech tag set. The work on morphologically rich languages suggests that using comprehensive morphological dictionaries is necessary for achieving good results (Hajič, 2000; Erjavec and Džeroski, 2004). However, such dictionaries are constructed manually and they cannot be expected to be developed quickly for many languages.

In this paper, we present a new general approach to the task of lemmatisation which can be used to overcome the shortage of comprehensive dictionaries for languages for which they have not been developed. Our approach is based on redefining the task of lemmatisation as a category tagging task. Formulating lemmatisation as a tagging task ...
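
To make the idea of encoding a word-to-lemma transformation in a single label more concrete, the following is a minimal sketch. It assumes a simplified, suffix-only rule (strip a number of trailing characters, then append a string); this scheme is an illustrative assumption and not necessarily the label format defined by the authors. The function names `encode_label` and `apply_label` and the `n|suffix` label syntax are hypothetical.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest prefix shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode_label(word: str, lemma: str) -> str:
    """Encode a word-to-lemma edit as one label: '<chars to strip>|<suffix to add>'.

    Assumption: only the end of the word changes (suffix-only rule).
    """
    stem_len = common_prefix_len(word, lemma)
    strip = len(word) - stem_len
    return f"{strip}|{lemma[stem_len:]}"


def apply_label(word: str, label: str) -> str:
    """Apply a label produced by encode_label to a (possibly unseen) word."""
    strip_str, add = label.split("|", 1)
    strip = int(strip_str)
    stem = word[: len(word) - strip]
    return stem + add


# The label learned from one pair can be predicted for other words
# by any tagger that treats labels as ordinary categories.
label = encode_label("studies", "study")
print(label, apply_label("parties", label))
```

Because each label is just a category, the same rule can be shared by many word forms, which is what lets a standard supervised tagger predict it from the word and its context.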