tailieunhanh - Scientific report: "SystemT: An Algebraic Approach to Declarative Information Extraction"
Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick R. Reiss, Shivakumar Vaithyanathan
IBM Research - Almaden, San Jose, CA, USA
{chiti, sekar, yunyaoli, rsriram, frreiss, vaithyan}@

Abstract

As information extraction (IE) becomes more central to enterprise applications, rule-based IE engines have become increasingly important. In this paper, we describe SystemT, a rule-based IE system whose basic design removes the expressivity and performance limitations of current systems based on cascading grammars. SystemT uses a declarative rule language, AQL, and an optimizer that generates high-performance algebraic execution plans for AQL rules. We compare SystemT's approach against cascading grammars, both theoretically and with a thorough experimental evaluation. Our results show that SystemT can deliver result quality comparable to the state-of-the-art and an order of magnitude higher annotation throughput.

1 Introduction

In recent years, enterprises have seen the emergence of important text analytics applications like compliance and data redaction. This increase, combined with the inclusion of text into traditional applications like Business Intelligence, has dramatically increased the use of information extraction (IE) within the enterprise.
While the traditional requirement of extraction quality remains critical, enterprise applications also demand efficiency, transparency, customizability, and maintainability. In recent years, these systemic requirements have led to renewed interest in rule-based IE systems (Doan et al., 2008; SAP, 2010; IBM, 2010; SAS, 2010). Until recently, rule-based IE systems (Cunningham et al., 2000; Boguraev, 2003; Drozdzynski et al., 2004) were predominantly based on the cascading grammar formalism exemplified by the Common Pattern Specification Language (CPSL) specification (Appelt and Onyshkevych, 1998). In CPSL, the input text is viewed as a sequence of annotations, and extraction rules