Scientific report: "Using Machine Learning to Maintain Rule-based Named-Entity Recognition and Classification Systems"

Georgios Petasis†, Frantz Vichot‡, Francis Wolinski‡, Georgios Paliouras†, Vangelis Karkaletsis† and Constantine D. Spyropoulos†

† Institute of Informatics and Telecommunications, National Centre for Scientific Research "Demokritos", 15310 Ag. Paraskevi, Athens, Greece
‡ Informatique-CDC, 4 rue Berthollet, 94114 Arcueil, France
{petasis, paliourg, vangelis, costass}@

Abstract

This paper presents a method that assists in maintaining a rule-based named-entity recognition and classification system. The underlying idea is to use a separate system, constructed with the use of machine learning, to monitor the performance of the rule-based system. The training data for the second system is generated with the use of the rule-based system, thus avoiding the need for manual tagging. The disagreement of the two systems acts as a signal for updating the rule-based system. The generality of the approach is illustrated by applying it to large corpora in two different languages: Greek and French. The results are very encouraging, showing that this alternative use of machine learning can assist significantly in the maintenance of rule-based systems.
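The monitoring loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the gazetteer, the single context feature, the frequency-table learner and all names below are hypothetical stand-ins chosen to keep the example short. The rule-based tagger labels the training text, a second tagger learns only from context, and any token on which the two disagree is flagged for a human to inspect.

```python
from collections import Counter, defaultdict

# Hypothetical gazetteer standing in for the rule-based system's resources.
KNOWN_ORGS = {"Globex", "Initech"}


def rule_based_tag(tokens):
    """Toy rule-based tagger: a token is ORG only if the gazetteer lists it."""
    return ["ORG" if tok in KNOWN_ORGS else "O" for tok in tokens]


def features(tokens, i):
    """Context-only features: capitalisation and the following token."""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "<END>"
    return (tokens[i][:1].isupper(), nxt)


def train(sentences):
    """Count labels per feature, using the rule-based output as training data."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, label in enumerate(rule_based_tag(tokens)):
            counts[features(tokens, i)][label] += 1
    return counts


def learned_tag(counts, tokens):
    """Predict the most frequent label for each feature; unseen features get O."""
    labels = []
    for i in range(len(tokens)):
        seen = counts.get(features(tokens, i))
        labels.append(seen.most_common(1)[0][0] if seen else "O")
    return labels


def disagreements(counts, tokens):
    """Return (token, rule_label, learned_label) wherever the two systems differ."""
    rule = rule_based_tag(tokens)
    learned = learned_tag(counts, tokens)
    return [(t, r, l) for t, r, l in zip(tokens, rule, learned) if r != l]


# No manual tagging: the training labels come from rule_based_tag itself.
TRAINING = [
    "Shares of Globex Inc rose".split(),
    "Initech Inc hired new staff".split(),
    "Analysts praised Globex Inc today".split(),
]
model = train(TRAINING)

# "Acme" is missing from the gazetteer, but its context resembles known ORGs.
flagged = disagreements(model, "Regulators fined Acme Inc yesterday".split())
print(flagged)
```

In this toy setting the flagged token points at a name the gazetteer lacks, which is the kind of signal a maintainer would use to decide whether the rule-based resources need updating. A real system would use a proper learning algorithm and a richer feature set.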
1 Introduction

Machine learning has recently been proposed as a promising solution to a major problem in language engineering: the construction of lexical resources. Most real-world language engineering systems make use of a variety of lexical resources, in particular grammars and lexicons. The use of general-purpose resources is ineffective, since most applications use a specialised vocabulary that is not supported by general-purpose lexicons and grammars. For this reason, significant effort is currently put into the construction of generic tools that can quickly adapt to a particular thematic domain. The adaptation of these tools mainly involves the adaptation of domain-specific .