Scientific report: "A Suite of Shallow Processing Tools for Portuguese: LX-Suite"

António Branco
Department of Informatics, University of Lisbon
ahb@

João Ricardo Silva
Department of Informatics, University of Lisbon
jsilva@

Abstract

In this paper we present LX-Suite, a set of tools for the shallow processing of Portuguese. This suite comprises several modules, namely a sentence chunker, a tokenizer, a POS tagger, featurizers and lemmatizers.

1 Introduction

The purpose of this paper is to present LX-Suite, a set of tools for the shallow processing of Portuguese, developed under the TagShare¹ project by the NLX² group. The tools included in this suite are: a sentence chunker; a tokenizer; a POS tagger; a nominal featurizer; a nominal lemmatizer; and a verbal featurizer and lemmatizer.

These tools were implemented as autonomous modules. This design makes it easy to replace any module with an updated version, or even with a third-party tool. It also allows any of the tools to be used separately, outside the pipeline of the suite.

The evaluation results mentioned in the next sections were obtained using an accurately hand-tagged 280,000-token corpus composed of newspaper articles and short novels.

2 Sentence chunker

The sentence chunker is a finite state automaton (FSA) in which the state transitions are triggered by specified character sequences in the input, and the emitted symbols correspond to sentence (<s>) and paragraph (<p>) boundaries. Within this setup, a transition rule could define, for example, that a period, when followed by a space and a capital letter, marks a sentence boundary:

    ... . A ...  ->  ... .</s><s>A ...

Being a rule-based chunker, it was tailored to handle orthographic conventions that are specific to Portuguese, in particular those governing dialog excerpts. This allowed the tool to reach a very good performance with values of .

¹ http
² NLX: Natural Language and Speech Group at the Department of Informatics of the University of Lisbon, Faculty of Sciences, http
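The example rule described for the sentence chunker (a period, then a space, then a capital letter marks a sentence boundary) can be illustrated with a small hand-written state machine. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `chunk_sentences`, the state names, and the `<s>`/`<p>` tag spelling are hypothetical, and the single rule below ignores abbreviations and the Portuguese dialog conventions that the actual tool is said to handle.

```python
def chunk_sentences(paragraph: str) -> str:
    """Mark sentence boundaries in one paragraph with a three-state automaton.

    States (hypothetical names):
      TEXT          ordinary text
      PERIOD        the previous character was a period
      PERIOD_SPACE  a period followed by one space; the space is held back
                    until the next character decides whether a boundary fires
    """
    out = ["<p><s>"]
    state = "TEXT"
    for ch in paragraph:
        if state == "PERIOD_SPACE":
            if ch.isupper():
                # Rule fires: close the current sentence and open the next.
                out.append("</s><s>" + ch)
                state = "TEXT"
                continue
            # No boundary: emit the space that was held back.
            out.append(" ")
            state = "TEXT"
        if state == "PERIOD" and ch == " ":
            # Hold the space back and wait for the next character.
            state = "PERIOD_SPACE"
            continue
        out.append(ch)
        state = "PERIOD" if ch == "." else "TEXT"
    if state == "PERIOD_SPACE":
        out.append(" ")  # flush a held-back trailing space
    out.append("</s></p>")
    return "".join(out)


if __name__ == "__main__":
    print(chunk_sentences("Ele saiu. Ela ficou."))
```

Because `str.isupper` is Unicode-aware, accented capitals such as "É" also trigger the rule. A string like "Sr. Silva" would be split incorrectly by this sketch, which is exactly the kind of case a tailored rule set has to cover.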
