tailieunhanh - Scientific report: "Word Clustering and Word Selection based Feature Reduction for MaxEnt based Hindi NER"
Word Clustering and Word Selection based Feature Reduction for MaxEnt based Hindi NER

Sujan Kumar Saha
Indian Institute of Technology, Kharagpur, West Bengal, India-721302

Pabitra Mitra
Indian Institute of Technology, Kharagpur, West Bengal, India-721302
r pabitra@

Sudeshna Sarkar
Indian Institute of Technology, Kharagpur, West Bengal, India-721302
shudeshna@

Abstract

Statistical machine learning methods are employed to train a Named Entity Recognizer from annotated data. Methods like Maximum Entropy and Conditional Random Fields make use of features for training. These methods tend to overfit when the available training corpus is limited, especially if the number of features is large or the number of values for a feature is large. To overcome this, we propose two techniques for feature reduction based on word clustering and word selection. A number of word similarity measures are proposed for clustering words for the Named Entity Recognition task. A few corpus-based statistical measures are used for important word selection. The feature reduction techniques lead to a substantial performance improvement over the baseline Maximum Entropy technique.

1 Introduction

Named Entity Recognition (NER) involves locating and classifying the names in a text.
NER is an important task with applications in information extraction, question answering, machine translation, and most other Natural Language Processing (NLP) applications. NER systems with high accuracy have been developed for English and a few other languages. These fall into two main categories: those based on machine learning (Bikel et al., 1997; Borthwick, 1999; McCallum and Li, 2003) and those based on language- or domain-specific rules (Grishman, 1995; Wakao et al., 1996). In English, names are usually capitalized, which is an important clue for identifying a name. The absence of capitalization makes the Hindi NER task difficult. Also, person names are more diverse in Indian languages, many common words being used as .
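
To make the feature-reduction idea from the abstract concrete, the following minimal Python sketch shows how replacing word-identity features with cluster-identity features shrinks the number of distinct feature values a classifier has to fit. It is an illustration under stated assumptions, not the method of the paper: the transliterated example tokens, the cluster labels `C1` and `C2`, the cluster assignments, and the function names `word_features` and `cluster_features` are all hypothetical, and the similarity measures used to build the clusters are not reproduced here.

```python
def word_features(tokens):
    """Return one feature per distinct surface word (the unreduced setting)."""
    return {f"word={tok}" for tok in tokens}


def cluster_features(tokens, word_to_cluster, unknown="UNK"):
    """Return one feature per distinct cluster; unseen words map to `unknown`."""
    return {f"cluster={word_to_cluster.get(tok, unknown)}" for tok in tokens}


# Hypothetical transliterated tokens and a hand-made clustering (assumption:
# in practice the clusters would come from a word similarity measure).
tokens = ["shri", "shrimati", "nadi", "sagar", "shri"]
word_to_cluster = {
    "shri": "C1",
    "shrimati": "C1",
    "nadi": "C2",
    "sagar": "C2",
}

if __name__ == "__main__":
    print(len(word_features(tokens)), "distinct word features")
    print(len(cluster_features(tokens, word_to_cluster)), "distinct cluster features")
```

With fewer distinct feature values, each value is observed more often in a small training corpus, which is the intuition behind using clustering to limit overfitting.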