
zipfR: Word Frequency Distributions in R

Stefan Evert
IKW, University of Osnabrück
Albrechtstr. 28
49069 Osnabrück, Germany

Marco Baroni
CIMeC, University of Trento
Bettini 31
38068 Rovereto, Italy

Abstract

We introduce the zipfR package, a powerful and user-friendly open-source tool for LNRE modeling of word frequency distributions in the R statistical environment. We give some background on LNRE models, discuss related software and the motivation for the toolkit, describe the implementation, and conclude with a complete sample session showing a typical LNRE analysis.

1 Introduction

As has been known at least since the seminal work of Zipf (1949), words and other type-rich linguistic populations are characterized by the fact that even the largest samples (corpora) do not contain instances of all types in the population. Consequently, the number and distribution of types in the available sample are not reliable estimators of the number and distribution of types in the population. Large-Number-of-Rare-Events (LNRE) models (Baayen, 2001) are a class of specialized statistical models that estimate the distribution of occurrence probabilities in such type-rich linguistic populations from our limited samples.

LNRE models have applications in many branches of linguistics and NLP. A typical use case is to predict the number of different types (the vocabulary size) in a larger sample or the whole population, based on the smaller sample available to the researcher. For example, one could use LNRE models to infer how many words a 5-year-old child knows in total, given a sample of her writing.
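
The claim that a sample's type count underestimates the population's can be made concrete with a toy calculation. The Python sketch below assumes a finite population of S = 50,000 types with Zipfian probabilities (exponent 1) and independent random sampling of tokens. These choices and all function names are illustrative assumptions; the code is neither an LNRE model nor part of the zipfR API. Under these assumptions, the expected vocabulary size in a sample of N tokens is the sum over types of 1 - (1 - p_i)^N, and the expected number of types seen exactly once is the sum of N p_i (1 - p_i)^(N - 1).

```python
# Illustrative sketch only: a toy finite Zipfian population, not an LNRE model
# and not the zipfR API. The population size, the exponent, and the sample
# sizes are all assumed values chosen for demonstration.


def zipf_probabilities(num_types, exponent=1.0):
    """Return probabilities p_i proportional to 1 / i**exponent, i = 1..num_types."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_vocabulary(probs, n_tokens):
    """E[V(N)]: expected number of distinct types in a random sample of N tokens."""
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


def expected_hapaxes(probs, n_tokens):
    """E[V1(N)]: expected number of types occurring exactly once in N tokens."""
    return sum(n_tokens * p * (1.0 - p) ** (n_tokens - 1) for p in probs)


POPULATION_TYPES = 50_000  # assumed population vocabulary size S
probs = zipf_probabilities(POPULATION_TYPES)

# Tabulate expected type counts for growing sample sizes.
for n in (1_000, 10_000, 100_000, 1_000_000):
    ev = expected_vocabulary(probs, n)
    ev1 = expected_hapaxes(probs, n)
    unseen = POPULATION_TYPES - ev
    print(f"N={n:>9,}  E[V]={ev:10.1f}  E[V1]={ev1:10.1f}  unseen={unseen:10.1f}")
```

Because every term 1 - (1 - p_i)^N is strictly below 1, the expected vocabulary stays below S at every finite sample size, which is why the observed type count cannot simply be read as the population's. In practice the p_i are unknown, and estimating their distribution from the sample is the task LNRE models address.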
LNRE models can also be used to quantify the relative productivity of two morphological processes (as illustrated below) or of two rival syntactic constructions, by looking at their vocabulary growth rate as sample size increases. Practical NLP applications include making informed guesses about type counts in very large data sets. How many typos are there on
