Scientific report: "Evolving new lexical association measures using genetic programming"

Jan Snajder, Bojana Dalbelo Basic, Sasa Petrovic, Ivan Sikiric
Faculty of Electrical Engineering and Computing, University of Zagreb
Unska 3, Zagreb, Croatia

Abstract

Automatic extraction of collocations from large corpora has been the focus of many research efforts. Most approaches concentrate on improving and combining known lexical association measures. In this paper, we describe a genetic programming approach for evolving new association measures, which is not limited to any specific language, corpus, or type of collocation. Our preliminary experimental results show that the evolved measures outperform three known association measures.

1 Introduction

A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things (Manning and Schütze, 1999). Related to the term collocation is the term n-gram, which is used to denote any sequence of n words. There are many possible applications of collocations: automatic language generation, word sense disambiguation, improving text categorization, information retrieval, etc. As different applications require different types of collocations that are often not found in dictionaries, automatic extraction of collocations from large textual corpora has been the focus of much research in the last decade; see, for example, Pecina and Schlesinger (2006) and Evert and Krenn (2005).
Automatic extraction of collocations is usually performed by employing lexical association measures (AMs) to indicate how strongly the words comprising an n-gram are associated. However, the use of lexical AMs for the purpose of collocation extraction has reached a plateau: recent research in this field has focused on combining the existing AMs in the hope of improving the results (Pecina and Schlesinger, 2006). In this paper, we propose an approach for deriving new AMs for collocation extraction based on genetic programming.
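
To make the notion of a lexical AM concrete, the following minimal sketch scores every bigram (2-gram) in a token sequence with pointwise mutual information (PMI). PMI is used here only as a familiar illustration; this excerpt does not say which AMs the paper evaluates. The function name `bigram_pmi`, the toy sentence, and the maximum-likelihood probability estimates are all assumptions made for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI score} for every adjacent word pair in tokens.

    Hypothetical helper for illustration; probabilities are plain
    relative frequencies with no smoothing.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bigrams          # joint probability of the pair
        p_x = unigrams[w1] / n_unigrams   # marginal probability of w1
        p_y = unigrams[w2] / n_unigrams   # marginal probability of w2
        # PMI compares the observed joint probability with the one
        # expected if the two words occurred independently.
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus (an assumption, not data from the paper).
tokens = "new york is a big city and new york is busy".split()
scores = bigram_pmi(tokens)

# Candidate collocations ranked from strongest to weakest association.
ranked = sorted(scores, key=scores.get, reverse=True)
```

An extraction system would rank candidate n-grams by such a score and keep the top of the list. On this toy input, pairs that occur only once can outscore the repeated pair "new york", which illustrates why a single fixed measure is not always adequate.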