Scientific report: "Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing"
Genevieve Gorrell
Department of Computer and Information Science
Linköping University
581 83 LINKÖPING, Sweden
gengo@

Abstract

An algorithm based on the Generalized Hebbian Algorithm is described that allows the singular value decomposition of a dataset to be learned based on single observation pairs presented serially. The algorithm has minimal memory requirements, and is therefore interesting in the natural language domain, where very large datasets are often used, and datasets quickly become intractable. The technique is demonstrated on the task of learning word and letter bigram pairs from text.

1 Introduction

Dimensionality reduction techniques are of great relevance within the field of natural language processing. A persistent problem within language processing is the over-specificity of language and the sparsity of data. Corpus-based techniques depend on a sufficiency of examples in order to model human language use, but the Zipfian nature of frequency behaviour in language means that this approach has diminishing returns with corpus size. In short, there are a large number of ways to say the same thing, and no matter how large your corpus is, you will never cover all the things that might reasonably be said. Language is often too rich for the task being performed; for example, it can be difficult to establish that two documents are discussing the same topic.
Likewise, no matter how much data your system has seen during training, it will invariably see something new at run-time in a domain of any complexity. Any approach to automatic natural language processing will encounter this problem on several levels, creating a need for techniques which compensate for this.

Imagine we have a set of data stored as a matrix. Techniques based on eigen decomposition allow such a matrix to be transformed into a set of orthogonal vectors, each with an associated strength or