Large linguistically-processed Web corpora for multiple languages

Marco Baroni, SSLMIT, University of Bologna, Italy (baroni@)
Adam Kilgarriff, Lexical Computing Ltd. and University of Sussex, Brighton, UK (adam@)

Abstract

The Web contains vast amounts of linguistic data. One key issue for linguists and language technologists is how to access it. Commercial search engines give highly compromised access. An alternative is to crawl the Web ourselves, which also allows us to remove duplicates and near-duplicates, navigational material, and a range of other kinds of non-linguistic matter. We can also tokenize, lemmatise and part-of-speech tag the corpus, and load the data into a corpus query tool which supports sophisticated linguistic queries. We have now done this for German and Italian, with corpus sizes of over 1 billion words in each case. We provide Web access to the corpora in our query tool, the Sketch Engine.

1 Introduction

The Web contains vast amounts of linguistic data for many languages (Kilgarriff and Grefenstette, 2003). One key issue for linguists and language technologists is how to access it. The drawbacks of using commercial search engines are presented in Kilgarriff (2003). An alternative is to crawl the Web ourselves.[1] We have done this for two languages, German and Italian, and here we report on the pipeline of processes which give us reasonably well-behaved, clean corpora for each language. We use the German corpus, which was developed first, as our example throughout. The procedure was carried out on a server running RH Fedora Core 3 with 4 GB RAM and dual Xeon CPUs.

[1] Another Web access option is Alexa, a company who allow the user, for a modest fee, to access their cached Web directly. Using Alexa would mean one did not need to crawl; however, in our experience, crawling, given free software like Heritrix, is not the bottleneck. The point at which input is required is the filtering out of non-linguistic material.

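The abstract and introduction mention the removal of duplicates and near-duplicates from the crawled data, but this section does not spell out how that is done. The sketch below illustrates one standard technique for this step, n-gram "shingling" with hashed fingerprints; it is not the authors' actual tool chain, and the n-gram length, fingerprint size and overlap threshold are illustrative assumptions rather than values taken from the paper.

```python
# Illustrative sketch of near-duplicate detection via n-gram "shingling".
# NOT the authors' implementation; n-gram length, fingerprint size and
# the overlap threshold are arbitrary values chosen for the example.
import hashlib
from itertools import islice

def shingles(tokens, n=5):
    """Return the set of word n-grams (shingles) of a token sequence."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def fingerprint(tokens, n=5, size=25):
    """Keep the `size` smallest hashed shingles as the document fingerprint."""
    hashes = sorted(int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)
                    for s in shingles(tokens, n))
    return set(islice(hashes, size))

def near_duplicates(fp_a, fp_b, threshold=0.5):
    """Flag two documents whose fingerprints overlap above `threshold`."""
    if not fp_a or not fp_b:
        return False
    overlap = len(fp_a & fp_b) / min(len(fp_a), len(fp_b))
    return overlap >= threshold

doc1 = "the web contains vast amounts of linguistic data for many languages".split()
doc2 = "the web contains vast amounts of linguistic data in many languages".split()
print(near_duplicates(fingerprint(doc1), fingerprint(doc2)))  # True for this pair
```

In a crawl-scale setting the fingerprints would be computed once per downloaded page and compared pairwise (or via an inverted index over the hashes) so that near-duplicate pages can be discarded before tokenisation, lemmatisation and part-of-speech tagging.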