Scientific report: "Analysis of Selective Strategies to Build a Dependency-Analyzed Corpus"

Kiyonori Ohtake
National Institute of Information and Communications Technology (NICT)
ATR Spoken Language Communication Research Labs.
2-2-2 Hikaridai, Keihanna Science City, Kyoto 619-0288, Japan

Abstract

This paper discusses sampling strategies for building a dependency-analyzed corpus and analyzes them with different kinds of corpora. We used the Kyoto Text Corpus, a dependency-analyzed corpus of newspaper articles, and prepared the IPAL corpus, a dependency-analyzed corpus of example sentences in dictionaries, as a new and different kind of corpus. The experimental results revealed that the length of the test set controlled the accuracy and that the longest-first strategy was good for an expanding corpus, but this was not the case when constructing a corpus from scratch.

1 Introduction

Dependency-structure analysis plays a very important role in natural language processing (NLP). Thus far, much research has been done on this subject, and many analyzers have been developed, including rule-based analyzers and corpus-based analyzers that use machine-learning techniques. However, the maximum accuracy achieved by state-of-the-art analyzers is almost 90% for newspaper articles, and it seems very difficult to exceed this figure.
To improve our analyzers, we have to write more rules for rule-based analyzers or prepare more corpora for corpus-based analyzers. If we take a machine-learning approach, it is important to consider which features are used. However, several machine-learning techniques, such as support vector machines (SVMs) with a kernel function, have strong generalization ability and are very robust to the choice of features. If we use such techniques, we no longer need to select a feature set by hand, because all possible features can be used with little or no decline in performance. Actually, Sasano tried to expand the feature set for a
