Scientific report: "Entropy Rate Constancy in Text"

Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 199-206.

Entropy Rate Constancy in Text

Dmitriy Genzel and Eugene Charniak
Brown Laboratory for Linguistic Information Processing
Department of Computer Science
Brown University
Providence, RI, USA, 02912
{dg,ec}@cs.brown.edu

Abstract

We present a constancy rate principle governing language generation. We show that this principle implies that local measures of entropy (ignoring context) should increase with the sentence number. We demonstrate that this is indeed the case by measuring entropy in three different ways. We also show that this effect has both lexical (which words are used) and non-lexical (how the words are used) causes.

1 Introduction

It is well known from Information Theory that the most efficient way to send information through noisy channels is at a constant rate. If humans try to communicate in the most efficient way, then they must obey this principle. The communication medium we examine in this paper is text, and we present some evidence that this principle holds here.

Entropy is a measure of information first proposed by Shannon (1948). Informally, the entropy of a random variable is proportional to the difficulty of correctly guessing the value of this variable (when the distribution is known). Entropy is the highest when all values are equally probable, and is lowest (equal to 0) when one of the choices has probability of 1, i.e. deterministically known in advance.
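The two extremes described above can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes the standard Shannon definition H = -sum(p * log2(p)) measured in bits, and the function name `entropy` and the three example distributions are chosen here purely for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution.

    `probs` is a sequence of probabilities that should sum to 1.
    The base-2 logarithm (bits) is an assumption of this sketch.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 contribute nothing, since p * log(p) -> 0 as p -> 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Four equally probable values: the hardest case to guess.
uniform = entropy([0.25, 0.25, 0.25, 0.25])

# One value is more likely than the others: easier to guess.
skewed = entropy([0.7, 0.1, 0.1, 0.1])

# One value has probability 1: the outcome is known in advance.
certain = entropy([1.0, 0.0, 0.0, 0.0])

print(uniform, skewed, certain)
```

With four possible values, the uniform distribution gives 2 bits, the deterministic one gives 0, and any other distribution over the same four values falls in between.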
In this paper we are concerned with entropy of English as exhibited through written text, though these results can easily be extended to speech as well. The random variable we deal with is therefore a unit of text (a word, for our purposes¹) that a random person who has produced all the previous words in the text stream is likely to produce next. We have as many random variables as we have words in a text. The distributions of these variables are obviously different and depend on all previous