Scientific report: "Tagging English by Path Voting Constraints"
Gökhan Tür and Kemal Oflazer
Department of Computer Engineering and Information Science, Bilkent University, Bilkent, Ankara, TR-06533, Turkey

Abstract

We describe a constraint-based tagging approach where individual constraint rules vote on sequences of matching tokens and tags. Disambiguation of all tokens in a sentence is performed at the very end by selecting the tags that appear on the path that receives the highest vote. This constraint application paradigm makes the outcome of the disambiguation independent of the rule sequence, and hence relieves the rule developer from worrying about potentially conflicting rule sequencing. The approach can also combine statistically and manually obtained constraints, and incorporate negative constraint rules to rule out certain patterns. We have applied this approach to tagging English text from the Wall Street Journal and the Brown Corpora. Our results from the Wall Street Journal Corpus indicate that with 400 statistically derived constraint rules and about 800 hand-crafted constraint rules, we can attain an average accuracy of on the training corpus and an average accuracy of on the testing corpus. We can also relax the single tag per token limitation and allow ambiguous tagging, which lets us trade recall and precision.
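The voting idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy sentence, the tag sets, the rule patterns, the vote values, and the function names `path_vote`, `best_path`, and `ambiguous_tags` are all assumptions made for illustration, and the brute-force enumeration of paths stands in for whatever machinery a real tagger would use. Each rule adds its vote to every path it matches, and disambiguation happens only once, at the end, so the order of the rules cannot change the result.

```python
from itertools import product

# Hypothetical toy input: each token paired with its candidate tags.
sentence = [
    ("the", ["DT"]),
    ("can", ["MD", "NN", "VB"]),
    ("rusts", ["VBZ", "NNS"]),
]

# Hypothetical constraint rules: (tag pattern, vote).
# A negative vote penalizes a pattern instead of rewarding it.
rules = [
    (("DT", "NN"), 100),
    (("NN", "VBZ"), 60),
    (("DT", "MD"), -100),
    (("DT", "VB"), -100),
]

def path_vote(tags, rules):
    """Sum the votes of every rule at every position where its pattern matches."""
    total = 0
    for pattern, vote in rules:
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                total += vote
    return total

def best_path(sentence, rules):
    """Enumerate all tag paths (brute force) and return the highest-voted one."""
    candidates = product(*(tags for _, tags in sentence))
    return max(candidates, key=lambda tags: path_vote(tags, rules))

def ambiguous_tags(sentence, rules, margin):
    """Assumed relaxation: keep, per token, every tag that appears on a path
    whose vote is within `margin` of the best vote."""
    paths = list(product(*(tags for _, tags in sentence)))
    votes = [path_vote(p, rules) for p in paths]
    top = max(votes)
    kept = [p for p, v in zip(paths, votes) if v >= top - margin]
    return [set(column) for column in zip(*kept)]

best = best_path(sentence, rules)
# Votes are summed, so reversing the rule list selects the same path.
same = best_path(sentence, list(reversed(rules)))
```

In this sketch a larger `margin` keeps more tags per token, which is one simple way to picture the trade between recall and precision mentioned in the abstract.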
1 Introduction

Part-of-speech tagging is one of the preliminary steps in many natural language processing systems, in which the proper part-of-speech tags of the tokens comprising the sentences are disambiguated using either statistical or symbolic local contextual information. Tagging systems have used either a statistical approach, where a large corpus is employed to train a probabilistic model which is then used to tag unseen text (Church, 1988; Cutting et al., 1992; DeRose, 1988), or a constraint-based approach, which employs a large number of hand-crafted linguistic constraints that are used to eliminate impossible sequences or .