Scientific report: "Big Data versus the Crowd: Looking for Relationships in All the Right Places"
Big Data versus the Crowd: Looking for Relationships in All the Right Places

Ce Zhang, Feng Niu, Christopher Ré, Jude Shavlik
Department of Computer Sciences, University of Wisconsin-Madison, USA
{czhang, leonn, chrisre, shavlik}@cs.wisc.edu

Abstract

Classically, training relation extractors relies on high-quality, manually annotated training data, which can be expensive to obtain. To mitigate this cost, NLU researchers have considered two newly available sources of less expensive (but potentially lower quality) labeled data from distant supervision and crowd sourcing. There is, however, no study comparing the relative impact of these two sources on the precision and recall of post-learning answers. To fill this gap, we empirically study how state-of-the-art techniques are affected by scaling these two sources. We use corpus sizes of up to 100 million documents and tens of thousands of crowd-source labeled examples. Our experiments show that increasing the corpus size for distant supervision has a statistically significant, positive impact on quality (F1 score). In contrast, human feedback has a positive and statistically significant, but lower, impact on precision and recall.

1 Introduction

Relation extraction is the problem of populating a target relation (representing an entity-level relationship or attribute) with facts extracted from natural-language text. Sample relations include people's titles, birth places, and marriage relationships. Traditional relation-extraction systems rely on manual annotations or domain-specific rules provided by experts, both of which are scarce resources that are not portable across domains.

To remedy these problems, recent years have seen interest in the distant supervision approach for relation extraction (Wu and Weld, 2007; Mintz et al., 2009). The input to distant supervision is a set of seed facts for the target relation together with an unlabeled text corpus, and the output is a set of noisy annotations that can be used by any machine learning technique.
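
The input/output contract of distant supervision described above can be illustrated with a minimal sketch. It is not the system used in the paper: the function name `distant_labels`, the plain substring matching, and the toy birth-place seed fact and sentences are all assumptions made for illustration. A real pipeline would rely on entity linking rather than string matching.

```python
from typing import List, Set, Tuple

# Hypothetical helper: names and matching strategy are illustrative assumptions.
def distant_labels(
    seed_facts: Set[Tuple[str, str]],
    sentences: List[str],
) -> List[Tuple[str, str, str]]:
    """Annotate every sentence that mentions both entities of a seed fact.

    Returns (entity1, entity2, sentence) triples that serve as noisy
    positive training examples for the target relation.
    """
    annotations = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Sort the seed facts so the output order is deterministic.
        for e1, e2 in sorted(seed_facts):
            if e1.lower() in lowered and e2.lower() in lowered:
                annotations.append((e1, e2, sentence))
    return annotations


# Toy input: one seed fact for a birth-place relation and a tiny corpus.
seed_facts = {("Barack Obama", "Honolulu")}
sentences = [
    "Barack Obama was born in Honolulu.",
    "Barack Obama visited Honolulu last week.",
    "Honolulu is the capital of Hawaii.",
]

annotations = distant_labels(seed_facts, sentences)
for e1, e2, sentence in annotations:
    print(f"({e1}, {e2}) <- {sentence}")
```

In this toy corpus the second sentence mentions both entities without expressing the birth-place relation, yet it is still annotated as a positive example. This is the sense in which the annotations are noisy.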