Scientific report: "A PROBABILISTIC APPROACH TO GRAMMATICAL ANALYSIS OF WRITTEN ENGLISH BY COMPUTER"

Andrew David Beale
Unit for Computer Research on the English Language
University of Lancaster
Bowland College, Bailrigg, Lancaster, England LA1 4YT

ABSTRACT

Work at the Unit for Computer Research on the English Language at the University of Lancaster has been directed towards producing a grammatically annotated version of the Lancaster-Oslo/Bergen (LOB) Corpus of written British English texts as the preliminary stage in developing computer programs and data files for providing a grammatical analysis of unrestricted English text. From 1981-83, a suite of PASCAL programs was devised to automatically produce a single level of grammatical description, with one word tag representing the word class or part of speech of each word token in the corpus. Error analysis and subsequent modification to the system resulted in over 96 per cent of word tags being correctly assigned automatically. The remaining 3 to 4 per cent were corrected by human post-editors.

Work is now in progress to devise a suite of programs to provide a constituent analysis of the sentences in the corpus. So far, sample sentences have been automatically assigned phrase and clause tags using a probabilistic system similar to word tagging. It is hoped that the entire corpus will eventually be parsed.
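The abstract does not spell out the probabilistic model behind word tagging, so the following is only a minimal sketch of one way such a system could choose a single tag per word token. It assumes a first-order model in which a tag sequence is scored by the product of tag-pair transition probabilities. The lexicon, the tag labels, the probability values and the names `LEXICON`, `TRANSITIONS`, `score` and `tag_sentence` are all hypothetical and are not taken from the LOB tagging suite.

```python
from itertools import product

# Hypothetical lexicon: candidate tags for each word (illustrative only).
LEXICON = {
    "the": ["AT"],
    "old": ["JJ", "NN"],
    "man": ["NN", "VB"],
    "sleeps": ["VBZ", "NNS"],
}

# Hypothetical tag-pair transition probabilities P(tag | previous tag).
# The numbers are invented for the example, not estimated from a corpus.
TRANSITIONS = {
    ("START", "AT"): 0.6,
    ("AT", "JJ"): 0.3, ("AT", "NN"): 0.6,
    ("JJ", "NN"): 0.7, ("JJ", "VB"): 0.01,
    ("NN", "NN"): 0.1, ("NN", "VB"): 0.05,
    ("NN", "VBZ"): 0.2, ("NN", "NNS"): 0.05,
    ("VB", "VBZ"): 0.01, ("VB", "NNS"): 0.1,
}
UNSEEN = 1e-4  # small fallback probability for tag pairs not listed above


def score(tags):
    """Return the product of transition probabilities along a tag sequence."""
    probability = 1.0
    previous = "START"
    for tag in tags:
        probability *= TRANSITIONS.get((previous, tag), UNSEEN)
        previous = tag
    return probability


def tag_sentence(words):
    """Pick the highest-scoring tag sequence by exhaustive enumeration."""
    candidates = [LEXICON[word.lower()] for word in words]
    best = max(product(*candidates), key=score)
    return list(zip(words, best))


print(tag_sentence(["The", "old", "man", "sleeps"]))
```

Exhaustive enumeration keeps the sketch short but grows exponentially with sentence length; a practical tagger would more likely use dynamic programming over the same scores.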
THE LOB CORPUS

The LOB Corpus (Johansson, Leech and Goodluck 1978) is a collection of 500 text samples, each containing about 2,000 word tokens of written British English published in a single year (1961). The 500 text samples fall into 15 different text categories representing a variety of styles, such as press reporting, science fiction, scholarly and scientific writing, romantic fiction and religious writing. There are two main sections: informative prose and imaginative prose. The corpus contains just over 1 million word tokens in all. Preparation of the LOB corpus in machine-readable form began at the Department of Linguistics and Modern ...
