Scientific report: "Probabilistic CFG with latent annotations"

Takuya Matsuzaki†, Yusuke Miyao†, Jun'ichi Tsujii†‡
†Graduate School of Information Science and Technology, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033
‡CREST, JST (Japan Science and Technology Agency), Honcho 4-1-8, Kawaguchi-shi, Saitama 332-0012
{matuzaki, yusuke, tsujii}@

Abstract

This paper defines a generative probabilistic model of parse trees, which we call PCFG-LA. This model is an extension of PCFG in which non-terminal symbols are augmented with latent variables. Fine-grained CFG rules are automatically induced from a parsed corpus by training a PCFG-LA model using an EM-algorithm. Because exact parsing with a PCFG-LA is NP-hard, several approximations are described and empirically compared. In experiments using the Penn WSJ corpus, our automatically trained model gave a performance (F1, sentences ≤ 40 words) comparable to that of an unlexicalized PCFG parser created using extensive manual feature selection.

1 Introduction

Variants of PCFGs form the basis of several broad-coverage and high-precision parsers (Collins, 1999; Charniak, 1999; Klein and Manning, 2003). In those parsers, the strong conditional independence assumption made in vanilla treebank PCFGs is weakened by annotating non-terminal symbols with many 'features' (Goodman, 1997; Johnson, 1998).
Examples of such features are head words of constituents, labels of ancestor and sibling nodes, and subcategorization frames of lexical heads. Effective features and their good combinations are normally explored using trial-and-error.

This paper defines a generative model of parse trees that we call PCFG with latent annotations (PCFG-LA). This model is an extension of PCFG models in which non-terminal symbols are annotated with latent variables. The latent variables work just like the features attached to non-terminal symbols. A fine-grained PCFG is automatically induced from parsed corpora by training a PCFG-LA model using an EM-algorithm, which ...
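
To make the idea of latent annotations concrete, the following is a minimal sketch, not the authors' implementation. It assumes each non-terminal carries a hidden value from a small domain of size `H`, and that the probability of an observed (unannotated) tree is obtained by summing over all assignments of hidden values to its nodes. The toy tree, the parameter tables `root_prob`, `binary_prob` and `lex_prob`, and all numbers in them are hypothetical and chosen only for illustration.

```python
from itertools import product

H = 2  # assumed size of the latent domain for every non-terminal

# Toy observed tree: (label, left, right) for binary nodes, (label, word) for preterminals.
tree = ("S", ("NP", "dogs"), ("VP", "bark"))

# Hypothetical parameters over annotated symbols such as S[0], NP[1].
root_prob = {("S", 0): 0.5, ("S", 1): 0.5}

# P(S[x] -> NP[y] VP[z]); for each x the four (y, z) entries sum to 1.
split = {0: [0.4, 0.1, 0.2, 0.3], 1: [0.1, 0.2, 0.3, 0.4]}
binary_prob = {
    ("S", x, "NP", y, "VP", z): split[x][2 * y + z]
    for x in range(H) for y in range(H) for z in range(H)
}

# P(A[x] -> word); other words would take the remaining probability mass.
lex_prob = {
    ("NP", 0, "dogs"): 0.8, ("NP", 1, "dogs"): 0.2,
    ("VP", 0, "bark"): 0.6, ("VP", 1, "bark"): 0.4,
}


def inside(node):
    """Return b where b[x] is the probability of the subtree at node given annotation x."""
    label = node[0]
    if isinstance(node[1], str):  # preterminal: (label, word)
        return [lex_prob[(label, x, node[1])] for x in range(H)]
    left, right = node[1], node[2]
    b_left, b_right = inside(left), inside(right)
    return [
        sum(
            binary_prob[(label, x, left[0], y, right[0], z)] * b_left[y] * b_right[z]
            for y in range(H) for z in range(H)
        )
        for x in range(H)
    ]


def tree_prob(t):
    """Probability of the observed tree, marginalizing over the latent annotations."""
    b = inside(t)
    return sum(root_prob[(t[0], x)] * b[x] for x in range(H))


def brute_force_prob():
    """Same quantity for the toy tree, by explicit enumeration of all annotations."""
    total = 0.0
    for x, y, z in product(range(H), repeat=3):
        total += (
            root_prob[("S", x)]
            * binary_prob[("S", x, "NP", y, "VP", z)]
            * lex_prob[("NP", y, "dogs")]
            * lex_prob[("VP", z, "bark")]
        )
    return total


print(tree_prob(tree), brute_force_prob())
```

The bottom-up pass in `inside` visits each node once and does work proportional to `H` cubed per binary node, whereas the explicit enumeration grows exponentially with the number of nodes. Both compute the same marginal on this toy tree; the per-node tables are also the kind of quantity an EM procedure would typically reuse when re-estimating parameters.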