Scientific report: "Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs"
William Gale, Kenneth Ward Church, David Yarowsky
AT&T Bell Laboratories, 600 Mountain Ave., Murray Hill, NJ 07974
kwc@

Abstract

We have recently reported on two new word-sense disambiguation systems, one trained on bilingual material (the Canadian Hansards) and the other trained on monolingual material (Roget's Thesaurus and Grolier's Encyclopedia). After using both the monolingual and bilingual classifiers for a few months, we have convinced ourselves that the performance is remarkably good. Nevertheless, we would really like to be able to make a stronger statement, and therefore, we decided to try to develop some more objective evaluation measures.

Although there has been a fair amount of literature on sense disambiguation, the literature does not offer much guidance on how we might establish the success or failure of a proposed solution such as the two systems mentioned in the previous paragraph. Many papers avoid quantitative evaluations altogether because it is so difficult to come up with credible estimates of performance.

This paper will attempt to establish upper and lower bounds on the level of performance that can be expected in an evaluation. An estimate of the lower bound of 75% (averaged over ambiguous types) is obtained by measuring the performance produced by a baseline system that ignores context and simply assigns the most likely sense in all cases.
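The baseline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `(word, sense)` pair format, and the toy data are all hypothetical. The sketch picks each word's most frequent sense from training data, scores that fixed guess on test data, and then averages accuracy over word types rather than over tokens.

```python
from collections import Counter, defaultdict


def most_frequent_sense_baseline(train, test):
    """Per-word accuracy of always guessing the word's most common training sense.

    `train` and `test` are lists of (word, sense) pairs (a hypothetical format).
    """
    sense_counts = defaultdict(Counter)
    for word, sense in train:
        sense_counts[word][sense] += 1

    # The baseline ignores context: one fixed guess per word type.
    guess = {w: c.most_common(1)[0][0] for w, c in sense_counts.items()}

    correct, total = Counter(), Counter()
    for word, sense in test:
        if word not in guess:
            continue  # skip words never seen in training
        total[word] += 1
        correct[word] += (sense == guess[word])
    return {w: correct[w] / total[w] for w in total}


def average_over_types(per_word_accuracy):
    """Unweighted mean over word types, so frequent words do not dominate."""
    return sum(per_word_accuracy.values()) / len(per_word_accuracy)


# Toy data, invented purely for illustration.
train = ([("bank", "money")] * 3 + [("bank", "river")]
         + [("duty", "obligation")] * 3 + [("duty", "tax")])
test = [("bank", "money"), ("bank", "money"), ("bank", "river"), ("bank", "money"),
        ("duty", "obligation"), ("duty", "tax")]

per_word = most_frequent_sense_baseline(train, test)
lower_bound = average_over_types(per_word)
```

Averaging over types is one reading of "averaged over ambiguous types"; averaging over tokens would weight frequent words more heavily and would generally give a different figure.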
An estimate of the upper bound is obtained by assuming that our ability to measure performance is largely limited by our ability to obtain reliable judgments from human informants. Not surprisingly, the upper bound is very dependent on the instructions given to the judges. Jorgensen, for example, suspected that lexicographers tend to depend too much on judgments by a single informant and found considerable variation over judgments (only 68% agreement), as she had suspected. In our own experiments, we have set out to find ...
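To make the notion of agreement among judges concrete, here is a minimal sketch of simple pairwise percent agreement. The text does not say how the agreement figure cited above was computed, so this measure is an assumption; the function name and the toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(judgments):
    """Fraction of (judge pair, item) comparisons where both judges chose the same sense.

    `judgments` is a list with one equal-length label list per judge.
    """
    agree = 0
    total = 0
    for a, b in combinations(judgments, 2):
        for x, y in zip(a, b):
            agree += (x == y)
            total += 1
    return agree / total


# Toy labels for four occurrences of one ambiguous word, invented for illustration.
judge_a = ["money", "river", "money", "money"]
judge_b = ["money", "money", "money", "river"]
judge_c = ["money", "river", "money", "river"]

rate = pairwise_agreement([judge_a, judge_b, judge_c])
```

If judges agree with each other only at some rate, a system scored against any single judge cannot be reliably measured above roughly that rate, which is why agreement serves as an estimate of the upper bound.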