Scientific report: "Co-dispersion: A Windowless Approach to Lexical Association"
Justin Washtell
University of Leeds, Leeds, UK
washtell@

Abstract

We introduce an alternative approach to extracting word pair associations from corpora, based purely on surface distances in the text. We contrast it with the prevailing window-based co-occurrence model and show it to be more statistically robust and to disclose a broader selection of significant associative relationships, owing largely to the property of scale-independence. In the process we provide insights into the limiting characteristics of window-based methods, which complement the sometimes conflicting application-oriented literature in this area.

1 Introduction

The principle of using statistical measures of co-occurrence from corpora as a proxy for word association, by comparing observed frequencies of co-occurrence with expected frequencies, is relatively young. One of the best-known computational studies is that of Church and Hanks (1989). The method by which co-occurrences are counted, now as then, is based on a device which dates back at least to Weaver (1949): the context window.
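
As a rough illustration of the window-based model described above, the following minimal sketch counts how often two distinct words fall within a fixed number of tokens of each other and compares that observed count with the count expected if words were placed independently. The function name `window_association`, the symmetric look-ahead window, and the particular expected-count normalisation are assumptions made for this example. They are not taken from Church and Hanks (1989) or from this paper.

```python
import math
from collections import Counter


def window_association(tokens, window=5):
    """Count unordered word pairs within `window` tokens and score them.

    Hypothetical helper for illustration only. The score is
    log2(observed / expected), where `expected` assumes that the words
    at any two positions are drawn independently from the unigram
    distribution of the text.
    """
    n = len(tokens)
    freq = Counter(tokens)
    observed = Counter()
    position_pairs = 0  # number of (i, j) position pairs inside the window

    for i, left in enumerate(tokens):
        # Look ahead only, so every pair of positions is visited once.
        for j in range(i + 1, min(i + window, n - 1) + 1):
            position_pairs += 1
            right = tokens[j]
            if left != right:
                observed[tuple(sorted((left, right)))] += 1

    scores = {}
    for (a, b), count in observed.items():
        # Factor 2: the unordered pair can occur as (a, b) or as (b, a).
        expected = position_pairs * 2 * (freq[a] / n) * (freq[b] / n)
        scores[(a, b)] = math.log2(count / expected)
    return observed, scores


tokens = "the cat sat on the mat while the dog sat on the rug".split()
for w in (1, 4):
    observed, scores = window_association(tokens, window=w)
    print(w, observed[("cat", "mat")], scores.get(("cat", "mat")))
```

In this toy text "cat" and "mat" are four tokens apart, so the pair is invisible to a window of 1 and only registers once the window reaches 4. That dependence on the chosen window is the kind of sensitivity the paper sets out to examine.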

While variations on the specific notion of context have been explored (separation of content and function words, asymmetrical and non-contiguous contexts, the sentence or the document as context) and increasingly sophisticated association measures have been proposed (see Evert 2007 for a thorough review), the basic principle - that of counting token frequencies within a context region - remains ubiquitous. Herein we discuss some of the intrinsic limitations of this approach, as are being felt in recent research, and present a principled solution which does not rely on co-occurrence windows at all, but instead on measurements of the surface distance between words.

2 The impact of window size

The issue of how to determine appropriate window size and shape has often been glossed over in the literature, with such parameters being determined arbitrarily or ...