Scientific report: "GRAMMATICAL ANALYSIS BY COMPUTER OF THE LANCASTER-OSLO/BERGEN (LOB) CORPUS OF BRITISH ENGLISH TEXTS."

Andrew David Beale
Unit for Computer Research on the English Language
Bowland College, University of Lancaster
Bailrigg, Lancaster, England LA1 4YT

ABSTRACT

Research has been under way at the Unit for Computer Research on the English Language at the University of Lancaster, England, to develop a suite of computer programs which provide a detailed grammatical analysis of the LOB corpus, a collection of about 1 million words of British English texts available in machine-readable form. The first phase of the project, completed in September 1983, produced a grammatically annotated version of the corpus giving a tag showing the word class of each word token. Over 93 per cent of the word tags were correctly selected by using a matrix of tag pair probabilities, and this figure was upgraded by a further 3 per cent by retagging problematic strings of words prior to disambiguation and by altering the probability weightings for sequences of three tags. The remaining 3 to 4 per cent were corrected by a human post-editor. The system was originally designed to run in batch mode over the corpus, but we have recently modified procedures to run interactively for sample sentences typed in by a user at a terminal. We are currently extending the word tag set and improving the word tagging procedures to further reduce manual intervention.
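
The idea of selecting word tags from a matrix of tag pair probabilities can be illustrated with a minimal Python sketch. It is not the Lancaster programs themselves: the tag names (`ART`, `NOUN`, `VERB`), the probability values, the `DEFAULT_PROB` fallback and the function names are all invented for illustration, and the exhaustive search is only practical for short ambiguous spans.

```python
from itertools import product

# Hypothetical tag-pair probabilities P(tag | previous tag).
# All values are invented for illustration; they are not LOB statistics.
TAG_PAIR_PROB = {
    ("START", "ART"): 0.6,
    ("START", "NOUN"): 0.3,
    ("START", "VERB"): 0.1,
    ("ART", "NOUN"): 0.9,
    ("ART", "VERB"): 0.01,
    ("NOUN", "NOUN"): 0.2,
    ("NOUN", "VERB"): 0.4,
    ("VERB", "NOUN"): 0.3,
    ("VERB", "VERB"): 0.05,
}
DEFAULT_PROB = 0.001  # assumed fallback for tag pairs missing from the matrix


def sequence_score(tags):
    """Multiply the tag-pair probabilities along one candidate tag sequence."""
    score = 1.0
    previous = "START"
    for tag in tags:
        score *= TAG_PAIR_PROB.get((previous, tag), DEFAULT_PROB)
        previous = tag
    return score


def disambiguate(candidates):
    """Return the highest-scoring tag sequence.

    `candidates` holds one list of possible tags per word token.
    Every combination is enumerated, so cost grows exponentially with
    the number of ambiguous words.
    """
    return max(product(*candidates), key=sequence_score)


words = ["the", "run", "ends"]
candidates = [["ART"], ["NOUN", "VERB"], ["NOUN", "VERB"]]
best = disambiguate(candidates)
for word, tag in zip(words, best):
    print(f"{word}\t{tag}")
```

With these invented figures, the sequence article, noun, verb scores highest because the article-to-noun and noun-to-verb transitions carry the largest weights. Weighting sequences of three tags, as the abstract mentions, would need a further table keyed on tag triples.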
A similar probabilistic system is being developed for phrase and clause tagging.

THE STRUCTURE AND PURPOSE OF THE LOB CORPUS.

The LOB Corpus (Johansson, Leech and Goodluck 1978), like its American English counterpart, the Brown Corpus (Kučera and Francis 1964; Hauge and Hofland 1978), is a collection of 500 samples of British English texts, each containing about 2,000 word tokens. The samples are representative of 15 different text categories:

A. Press Reportage
B. Press Editorial
C. Press Reviews
D. Religion
E. Skills and Hobbies
F. Popular Lore
G. Belles Lettres
...