THE DISTRIBUTION OF WORD LENGTH IN TECHNICAL RUSSIAN
Mechanical Translation, December 1954, pp. 38-40
Anthony G. Oettinger
Computation Laboratory, Harvard University

In the course of an analysis of several samples of technical Russian, undertaken as part of a study in mechanical translation, a number of statistical data reflecting the structure of these samples were compiled. One of these, the distribution of word length, is presented here as Fig. 1.

The theoretical interest of this distribution arises from the possibility of using it as a basis for an operational definition of words in printed texts. If texts are considered purely as sequences of symbols, including the letters, punctuation marks, and space, the resulting sequences are of a length which no practicable machine can manage. A study of the distribution of the number of symbols between pairs of successive symbols of certain classes would be one way to reveal structural characteristics of the text sequences potentially useful toward the definition of manageable and significant subsequences.

The subsequences included between successive occurrences of letter pairs have not been investigated. Those included between successive pairs of periods, exclamation points, or question marks can be identified with the classical sentence, and finally, those included between successive pairs of punctuation marks or spaces can be identified with words. The length distribution of the latter subsequences has the desirable property, not shared by the others, of being concentrated at relatively low values of length and of having no elements exceeding a certain length (Fig. 1).
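The operational definition above, in which words are the subsequences lying between successive punctuation marks or spaces, can be illustrated with a minimal modern sketch in Python. It is not the procedure used in the original study: the delimiter set, the function names `split_words` and `word_length_counts`, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed delimiter class: whitespace plus a small set of punctuation marks.
# The source does not list the exact symbols it treated as delimiters.
DELIMITERS = re.compile(r"[\s.,;:!?()\"'-]+")


def split_words(text):
    """Return the non-empty subsequences lying between successive delimiters."""
    return [w for w in DELIMITERS.split(text) if w]


def word_length_counts(text):
    """Count how many words of each length (in characters) occur in the text."""
    return Counter(len(w) for w in split_words(text))


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the study's corpus.
    sample = "Слово, и слово. Машина переводит!"
    counts = word_length_counts(sample)
    for length in sorted(counts):
        print(length, counts[length])
    print("longest word:", max(counts))
```

The longest length observed in such a count is the kind of figure that would bound the storage allotted to a single word, under the assumptions stated above.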
Words defined in this fashion can readily be identified by a machine, and they are of limited variety, so that their listing in a dictionary is practicable. From the practical point of view, the distribution is useful in planning input and storage facilities in experimental translating equipment. The samples used were relatively small and Fig. 1 should .