Compiling Boostexter Rules into a Finite-state Transducer

Compiling Boostexter Rules into a Finite-state Transducer

Srinivas Bangalore
AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ 07932

Abstract

A number of NLP tasks have been effectively modeled as classification tasks using a variety of classification techniques. Most of these tasks have been pursued in isolation, with the classifier assuming unambiguous input. In order for these techniques to be more broadly applicable, they need to be extended to apply to weighted packed representations of ambiguous input. One approach for achieving this is to represent the classification model as a weighted finite-state transducer (WFST). In this paper, we present a compilation procedure to convert the rules resulting from an AdaBoost classifier into a WFST. We validate the compilation technique by applying the resulting WFST to a call-routing application.

1 Introduction

Many problems in Natural Language Processing (NLP) can be modeled as classification tasks, either at the word or at the sentence level. For example, part-of-speech tagging, named-entity identification, supertagging, and word sense disambiguation are tasks that have been modeled as classification problems at the word level. In addition, there are problems that classify an entire sentence or document into one of a set of categories. These problems are loosely characterized as semantic classification and have been used in many practical applications, including call routing and text classification.
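As background for the compilation idea, a BoosTexter/AdaBoost classifier scores an input by summing the weighted votes of many simple weak rules, each of which typically tests for the presence of a word. The following is a minimal sketch of that scoring scheme; the rule set, weights, and class names are hypothetical illustrations, not taken from the paper.

```python
# Sketch of BoosTexter-style classification: each weak rule tests whether a
# word occurs in the input and contributes a real-valued vote to each class,
# one weight for "word present" and one for "word absent". The final class
# is the one with the highest summed score. All rules/weights are invented.

def classify(words, rules, classes):
    scores = {c: 0.0 for c in classes}
    for test_word, votes in rules:
        present = test_word in words
        for c, (w_present, w_absent) in votes.items():
            scores[c] += w_present if present else w_absent
    return max(scores, key=scores.get)

CLASSES = ["billing", "support"]
RULES = [
    # (word to test, {class: (weight if present, weight if absent)})
    ("bill",  {"billing": (0.9, -0.1), "support": (-0.9, 0.1)}),
    ("crash", {"billing": (-0.7, 0.1), "support": (0.7, -0.1)}),
]

print(classify({"my", "bill", "is", "wrong"}, RULES, CLASSES))  # → billing
```

Because each weak rule is a simple, local test on the input, the paper's key observation is that such rule sets are amenable to compilation into a finite-state representation.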
Most of these problems have been addressed in isolation, assuming unambiguous (one-best) input. Typically, however, in NLP applications these modules are chained together, with each module introducing some amount of error. In order to alleviate the errors introduced by a module, it is typical for a module to provide multiple weighted solutions, ideally as a packed representation, that serve as input to the next module. For example, a speech recognizer provides a lattice of possible recognition outputs that is to be annotated with part-of-speech
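The motivation above can be illustrated with a toy example: rather than classifying only the recognizer's one-best string, we can combine the recognizer's confidence with the classifier's score over several weighted hypotheses. The sketch below uses a plain n-best list as a simple stand-in for a packed lattice; the scoring function, hypotheses, and weights are all hypothetical.

```python
# Sketch: route a call by searching over an n-best list of weighted
# recognition hypotheses instead of only the single best one. Each
# hypothesis carries a recognizer log-score; we add a (toy) classifier
# log-score per class and pick the globally best (hypothesis, class) pair.

def route(nbest, score_fn):
    """nbest: list of (hypothesis_words, recognizer_log_score)."""
    best_class, best_score = None, float("-inf")
    for words, rec_score in nbest:
        for cls, cls_score in score_fn(words).items():
            total = rec_score + cls_score  # log-domain combination
            if total > best_score:
                best_class, best_score = cls, total
    return best_class

def toy_scores(words):
    # Invented stand-in for a real classifier's per-class scores.
    return {"billing": 1.0 if "bill" in words else -1.0,
            "support": 1.0 if "crash" in words else -1.0}

nbest = [({"pay", "my", "bill"}, -0.2),   # more confident hypothesis
         ({"play", "my", "bill"}, -1.5)]  # less confident hypothesis

print(route(nbest, toy_scores))  # → billing
```

Representing the classifier itself as a WFST makes this combination natural: the recognition lattice and the classification model can be composed directly, rather than enumerated hypothesis by hypothesis as in this sketch.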