A Feature-Rich Constituent Context Model for Grammar Induction

Dave Golland, University of California, Berkeley, dsg@cs.berkeley.edu
John DeNero, Google, denero@google.com
Jakob Uszkoreit, Google, uszkoreit@google.com

Abstract

We present LLCCM, a log-linear variant of the constituent context model (CCM) of grammar induction. LLCCM retains the simplicity of the original CCM but extends robustly to long sentences. On sentences of up to length 40, LLCCM outperforms CCM by 13.9% bracketing F1 and outperforms a right-branching baseline in regimes where CCM does not.

1 Introduction

Unsupervised grammar induction is a fundamental challenge of statistical natural language processing (Lari and Young, 1990; Pereira and Schabes, 1992; Carroll and Charniak, 1992). The constituent context model (CCM) for inducing constituency parses (Klein and Manning, 2002) was the first unsupervised approach to surpass a right-branching baseline. However, the CCM only effectively models short sentences. This paper shows that a simple reparameterization of the model, which ties together the probabilities of related events, allows the CCM to extend robustly to long sentences.

Much recent research has explored dependency grammar induction. For instance, the dependency model with valence (DMV) of Klein and Manning (2004) has been extended to utilize multilingual information (Berg-Kirkpatrick and Klein, 2010; Cohen et al., 2011), lexical information (Headden III et al., 2009), and linguistic universals (Naseem et al., 2010). Nevertheless, simplistic dependency models like the DMV do not contain information present in a constituency parse, such as the attachment order of object and subject to a verb.

Unsupervised constituency parsing is also an active research area. Several studies (Seginer, 2007; Reichart and Rappoport, 2010; Ponvert et al., 2011) have considered the problem of inducing parses over raw lexical items rather than part-of-speech (POS) tags. Additional advances have come from more complex models, such as combining CCM and DMV (Klein and Manning, 2004).
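To make the reparameterization idea above concrete, the standard way to tie together the probabilities of related events is to replace each multinomial distribution with a log-linear form over shared features. The sketch below shows the general shape of such a parameterization; the feature function f and weight vector w are illustrative placeholders, not the paper's exact feature set:

$$P(e \mid c; \mathbf{w}) \;=\; \frac{\exp\!\big(\mathbf{w}^{\top} \mathbf{f}(e, c)\big)}{\sum_{e'} \exp\!\big(\mathbf{w}^{\top} \mathbf{f}(e', c)\big)}$$

Here e is an outcome (e.g., a POS-tag span or its surrounding context), c is the conditioning event (constituent or distituent), and f extracts overlapping features of the outcome. Because related outcomes (for instance, spans sharing a boundary tag) activate common features, they share parameters, which is what allows statistics from short sentences to inform predictions on the longer, sparser spans of long sentences.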