Scientific report: "Multilingual Text Processing in a Two-Byte Code"

Lloyd B. Anderson
Ecological Linguistics
316 A St. S.E.
Washington, D.C. 20003

ABSTRACT

National and international standards committees are now discussing a two-byte code for multilingual information processing. This provides for 65,536 separate character and control codes, enough to make permanent code assignments for all the characters of all national alphabets of the world, and also to include Chinese/Japanese characters. This paper discusses the kinds of flexibility required to handle both Roman and non-Roman alphabets. It is crucial to separate information units (codes) from graphic forms, to maximize processing power. Comparing alphabets around the world, we find that the graphic devices (letters, digraphs, accent marks, punctuation, spacing, etc.) represent a very limited number of information units. It is possible to arrange alphabet codes to provide transliteration equivalence, the best of three solutions compared as a framework for code assignments.

Information vs. Form. In developing proposals for codes in information processing, the most important decisions are the choices of what to code.
In a proposal for a multilingual two-byte code, Xerox Corporation has made explicit a principle which we can state precisely as follows: Basic codes stand for independently functioning information units, not for visual forms. The choice of type font, presence or absence of serifs, and variations like boldface, italics, or underlining are matters of form. Such choices are normally made once for spans at least as long as one word. We do not use ComPLeX mIXturEs but consistent strings like this, THIS, this, or THIS. By assigning the same basic code to variations of a single letter (as a, a, A, a), all variants will automatically be alphabetized the same way, which is as it should be. The choice of variant forms is specified by supplementary "looks" information. The capitalization of first letters of sentences, proper names, or nouns is a kind of punctuation.
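
The principle can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the Xerox proposal or the scheme developed in this paper: the `CodedWord` type, the `PLAIN`/`BOLD`/`ITALIC`/`UNDERLINE` flags, and the use of Python's built-in code points as stand-ins for basic codes are all hypothetical. It shows only the separation itself: each word carries a sequence of basic codes, one "looks" value chosen once for the whole word, and capitalization recorded apart from the codes, so that alphabetizing compares information units alone.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Hypothetical "looks" flags; the source names the concept, not an encoding.
PLAIN, BOLD, ITALIC, UNDERLINE = 0, 1, 2, 4


@dataclass(frozen=True)
class CodedWord:
    codes: Tuple[int, ...]            # basic codes: information units only
    looks: int = PLAIN                # form choice, made once per word
    caps: FrozenSet[int] = frozenset()  # positions rendered as capitals


def encode(word: str, looks: int = PLAIN) -> CodedWord:
    """Split a word into basic codes plus supplementary form information."""
    codes: List[int] = []
    caps = set()
    for i, ch in enumerate(word):
        low = ch.lower()
        if len(low) != 1:      # rare multi-character lowercase forms: keep as is
            low = ch
        if low != ch:          # capital and small letter share one basic code
            caps.add(i)
        code = ord(low)        # assumption: code point stands in for the basic code
        if code >= 1 << 16:
            raise ValueError("does not fit in a two-byte code")
        codes.append(code)
    return CodedWord(tuple(codes), looks, frozenset(caps))


def decode(word: CodedWord) -> str:
    """Rebuild the plain characters; fonts and styles are left to a renderer."""
    return "".join(
        chr(c).upper() if i in word.caps else chr(c)
        for i, c in enumerate(word.codes)
    )


def pack(word: CodedWord) -> bytes:
    """Serialize the basic codes as two bytes each (big-endian by assumption)."""
    return b"".join(c.to_bytes(2, "big") for c in word.codes)


def alphabetize(words: List[CodedWord]) -> List[CodedWord]:
    """Order words by basic codes only; looks and capitals play no part."""
    return sorted(words, key=lambda w: w.codes)


words = [encode("this", ITALIC), encode("THIS", BOLD),
         encode("Apple", UNDERLINE), encode("apple")]
ordered = [decode(w) for w in alphabetize(words)]
```

Because `sorted` is stable and the key ignores form, variants with identical basic codes stay adjacent in their original relative order. A real scheme would also need rules for digraphs and accent marks, which this sketch does not attempt.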
