tailieunhanh - Scientific report: "A STOCHASTIC PROCESS FOR WORD FREQUENCY DISTRIBUTIONS"

Harald Baayen
Max-Planck-Institut für Psycholinguistik
Wundtlaan 1, NL-6525 XD Nijmegen
Internet: baayen@

ABSTRACT

A stochastic model based on insights of Mandelbrot (1953) and Simon (1955) is discussed against the background of new criteria of adequacy that have become available recently as a result of studies of the similarity relations between words as found in large computerized text corpora.

1 I am indebted to Klaas van Harn, Richard Gill, Bert Hoeks and Erik Schils for stimulating discussions on the statistical analysis of lexical similarity relations.

FREQUENCY DISTRIBUTIONS

Various models for word frequency distributions have been developed since Zipf (1935) applied the zeta distribution to describe a wide range of lexical data. Mandelbrot (1953, 1962) extended Zipf's distribution "law"

    $f_i = \frac{K}{i^{\gamma}}$    (1)

where $f_i$ is the sample frequency of the $i$th type in a ranking according to decreasing frequency, with the parameter $B$:

    $f_i = \frac{K}{(B + i)^{\gamma}}$    (2)

by means of which fits are obtained that are more accurate with respect to the higher frequency words. Simon (1955, 1960) developed a stochastic process which has the Yule distribution

    $f_i = A\,B(i, \rho + 1)$    (3)

with the parameter $A$ and $B(i, \rho + 1)$ the Beta function in $(i, \rho + 1)$, as its stationary solutions. For $i \to \infty$, (3) can be written as

    $f_i \sim A\,\Gamma(\rho + 1)\, i^{-(\rho + 1)}$

in other words, (3) approximates Zipf's law with respect to the lower frequency words, the tail of the distribution. Other models, such as Good (1953), Waring-Herdan (Herdan 1960, Muller 1979) and Sichel (1975), have been put forward, all of which have Zipf's law as some special or limiting form.
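The relation between the three rank-frequency formulas above can be checked numerically. The following is a minimal Python sketch, not code from the paper: the function names, the exponent symbol gamma, and the parameter values are illustrative assumptions. It evaluates Zipf's law, Mandelbrot's extension and the Yule distribution, and compares the Yule distribution with its power-law approximation for large i.

```python
import math


def zipf(i, K, gamma):
    """Zipf's law: f_i = K / i**gamma (gamma is an assumed exponent symbol)."""
    return K / i ** gamma


def zipf_mandelbrot(i, K, B, gamma):
    """Mandelbrot's extension: f_i = K / (B + i)**gamma."""
    return K / (B + i) ** gamma


def log_beta(a, b):
    """Logarithm of the Beta function, computed via log-gamma to avoid overflow."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def yule(i, A, rho):
    """Yule distribution: f_i = A * B(i, rho + 1)."""
    return A * math.exp(log_beta(i, rho + 1.0))


def yule_tail(i, A, rho):
    """Large-i approximation: A * Gamma(rho + 1) * i**-(rho + 1)."""
    return A * math.gamma(rho + 1.0) * i ** -(rho + 1.0)


if __name__ == "__main__":
    A, rho = 1000.0, 1.0  # illustrative values, not taken from the paper
    print(f"{'i':>6} {'Yule':>12} {'power law':>12} {'ratio':>8}")
    for i in (1, 10, 100, 1000, 10000):
        exact = yule(i, A, rho)
        approx = yule_tail(i, A, rho)
        print(f"{i:>6} {exact:>12.6g} {approx:>12.6g} {approx / exact:>8.4f}")
```

For rho = 1 the Beta function reduces to 1 / (i (i + 1)), so the ratio of the approximation to the exact value is (i + 1) / i. It is far from 1 at the head of the ranking and approaches 1 in the tail, which is the sense in which the Yule distribution approximates Zipf's law for the lower frequency words.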
Unrelated to Zipf's law is the lognormal hypothesis, advanced for word frequency distributions by Carroll (1967, 1969), which gives rise to reasonable fits and is widely used in psycholinguistic research on word frequency effects in mental processing. A problem that immediately arises in the context of the study of word frequency distributions concerns the fact that these distributions have two important .
