Scientific report: "Active Learning for Statistical Natural Language Parsing"

Min Tang
Spoken Language Systems Group, MIT Laboratory for Computer Science
Cambridge, Massachusetts 02139, USA
mtang@

Xiaoqiang Luo, Salim Roukos
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
xiaoluo roukos@

Abstract

It is necessary to have a (large) annotated corpus to build a statistical parser. Acquisition of such a corpus is costly and time-consuming. This paper presents a method to reduce this demand using active learning, which selects which samples to annotate instead of blindly annotating the whole training corpus. Sample selection for annotation is based upon "representativeness" and "usefulness". A model-based distance is proposed to measure the difference between two sentences and their most likely parse trees. Based on this distance, the active learning process analyzes the sample distribution by clustering and calculates the density of each sample to quantify its representativeness. Furthermore, a sentence is deemed useful if the existing model is highly uncertain about its parses, where uncertainty is measured by various entropy-based scores. Experiments are carried out on the shallow semantic parser of an air travel dialog system. Our results show that, for about the same parsing accuracy, we only need to annotate a third of the samples compared to the usual random selection method.

1 Introduction

A prerequisite for building statistical parsers (Jelinek et al., 1994; Collins, 1996; Ratnaparkhi, 1997; Charniak, 1997) is the availability of a large corpus of parsed sentences. Acquiring such a corpus is expensive and time-consuming, and is often the bottleneck in building a parser for a new application or domain. The goal of this study is to reduce the number of annotated sentences, and hence the development time, required for a statistical parser to achieve satisfactory performance, using active learning. Active learning has been studied in the context of many natural language processing
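
The selection criterion summarized in the abstract (prefer sentences that are both representative and uncertain) can be illustrated with a small sketch. This is not the authors' implementation: the function names `parse_entropy` and `select_for_annotation` are hypothetical, the density values are assumed to have been computed beforehand from the clustering step, and multiplying density by entropy is only one plausible way to combine the two criteria, chosen here for illustration.

```python
import math


def parse_entropy(parse_probs):
    """Shannon entropy (in nats) of the normalized distribution over candidate parses.

    parse_probs: scores of the candidate parses of one sentence under the
    current model (assumed non-negative; they are renormalized here).
    """
    total = sum(parse_probs)
    if total <= 0:
        raise ValueError("parse scores must have a positive sum")
    entropy = 0.0
    for p in parse_probs:
        if p > 0:
            q = p / total
            entropy -= q * math.log(q)
    return entropy


def select_for_annotation(candidates, batch_size):
    """Pick the sentences to annotate next.

    candidates: list of (sentence_id, density, parse_probs) tuples, where
    density is an assumed, precomputed representativeness score.
    The ranking score density * entropy is an illustrative assumption.
    """
    scored = [
        (density * parse_entropy(probs), sentence_id)
        for sentence_id, density, probs in candidates
    ]
    scored.sort(reverse=True)  # highest combined score first
    return [sentence_id for _, sentence_id in scored[:batch_size]]


if __name__ == "__main__":
    pool = [
        ("a", 1.0, [0.9, 0.1]),  # representative, but the model is fairly confident
        ("b", 1.0, [0.5, 0.5]),  # representative and maximally uncertain
        ("c", 0.1, [0.5, 0.5]),  # uncertain, but an outlier (low density)
    ]
    print(select_for_annotation(pool, batch_size=2))
```

In this toy pool, sentence "b" ranks first because it is both dense and uncertain, while the outlier "c" is pushed down despite its high entropy, which is the intuition behind combining the two criteria.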
