
NLTK: The Natural Language Toolkit

Steven Bird
Department of Computer Science and Software Engineering, University of Melbourne, Victoria 3010, AUSTRALIA
Linguistic Data Consortium, University of Pennsylvania, Philadelphia PA 19104-2653, USA

Abstract

The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language. This paper reports on the simplified toolkit and explains how it is used in teaching NLP.

1 Introduction

NLTK, the Natural Language Toolkit, is a suite of Python modules providing many NLP data types, processing tasks, corpus samples and readers, together with animated algorithms, tutorials, and problem sets (Loper and Bird, 2002). Data types include tokens, tags, chunks, trees, and feature structures. Interface definitions and reference implementations are provided for tokenizers, stemmers, taggers (regexp, ngram, Brill), chunkers, parsers (recursive-descent, shift-reduce, chart, probabilistic), clusterers, and classifiers. Corpus samples and readers include: Brown Corpus, CoNLL-2000 Chunking Corpus, CMU Pronunciation Dictionary, NIST IEER Corpus, PP Attachment Corpus, Penn Treebank, and the SIL Shoebox corpus format.
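The pairing of an interface definition with a reference implementation can be illustrated with a minimal sketch in plain Python. The names `TaggerI` and `RegexpTagger`, the tag labels, and the patterns below are assumptions made for illustration only; they are not taken from the toolkit's own code, and the toolkit's actual classes may differ.

```python
import re
from abc import ABC, abstractmethod


class TaggerI(ABC):
    """Hypothetical interface: a tagger maps tokens to (token, tag) pairs."""

    @abstractmethod
    def tag(self, tokens):
        """Return a list of (token, tag) tuples, one per input token."""


class RegexpTagger(TaggerI):
    """Illustrative reference implementation: the first matching pattern wins."""

    def __init__(self, patterns, default="NN"):
        # patterns is a list of (regular expression, tag) pairs, tried in order
        self._patterns = [(re.compile(p), t) for p, t in patterns]
        self._default = default

    def tag(self, tokens):
        tagged = []
        for tok in tokens:
            for regex, tag in self._patterns:
                if regex.fullmatch(tok):
                    tagged.append((tok, tag))
                    break
            else:
                # no pattern matched, so fall back to the default tag
                tagged.append((tok, self._default))
        return tagged


# Tokens are plain strings; tagged tokens are (token, tag) tuples.
tagger = RegexpTagger([
    (r"\d+", "CD"),          # cardinal numbers
    (r".*ing", "VBG"),       # gerunds
    (r"(?i)the|a|an", "DT"), # determiners
])
tagged = tagger.tag("the dog is barking at 3 cats".split())
```

Any other tagger that implements `tag` with the same contract could be swapped in without changing the calling code, which is the point of separating the interface from its implementations.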
NLTK is ideally suited to students who are learning NLP or conducting research in NLP or closely related areas. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems (Liddy and McCracken, 2005; Sætre et al., 2005). We chose Python for its shallow learning curve, transparent syntax, and good string-handling. Python permits exploration via its interactive interpreter. As an object-oriented language, Python permits data and code to be encapsulated and re-used
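As a rough illustration of the string-handling and encapsulation mentioned above, the following sketch uses only the Python standard library. The class `WhitespaceTokenizer` and the sample text are hypothetical and do not come from the toolkit.

```python
from collections import Counter


class WhitespaceTokenizer:
    """Hypothetical tokenizer that bundles its settings with its behaviour."""

    def __init__(self, lowercase=True, strip_chars=".,;:!?\"'()"):
        self._lowercase = lowercase
        self._strip_chars = strip_chars

    def tokenize(self, text):
        tokens = []
        for raw in text.split():
            # built-in string methods handle punctuation and case
            tok = raw.strip(self._strip_chars)
            if self._lowercase:
                tok = tok.lower()
            if tok:
                tokens.append(tok)
        return tokens


text = "The cat sat on the mat. The mat was flat!"
tokens = WhitespaceTokenizer().tokenize(text)
freq = Counter(tokens)  # word frequencies in one line
```

The same statements can be typed one at a time at the interactive interpreter, inspecting `tokens` and `freq` after each step.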