Scientific report: "Finding Hedges by Chasing Weasels: Hedge Detection Using Wikipedia Tags and Shallow Linguistic Features"
Viola Ganter and Michael Strube, EML Research gGmbH, Heidelberg, Germany, http nlp
Abstract
We investigate the automatic detection of sentences containing linguistic hedges using corpus statistics and syntactic patterns. We take Wikipedia as an already annotated corpus, using its tagged weasel words, which mark sentences and phrases as non-factual. We evaluate the quality of Wikipedia as training data for hedge detection, as well as shallow linguistic features.
1 Introduction
While most research in natural language processing deals with identifying, extracting, and classifying facts, recent years have seen a surge in research on sentiment and subjectivity (see Pang & Lee (2008) for an overview). However, even opinions have to be backed up by facts to be effective as arguments. Distinguishing facts from fiction requires detecting subtle variations in the use of linguistic devices such as linguistic hedges, which indicate that speakers do not back up their opinions with facts (Lakoff, 1973; Hyland, 1998). Many NLP applications could benefit from identifying linguistic hedges, e.g. question answering systems (Riloff et al., 2003), information extraction from biomedical documents (Medlock & Briscoe, 2007; Szarvas, 2008), and deception detection (Bachenko et al., 2008). While NLP research on classifying linguistic hedges has been restricted to analysing biomedical documents, the above incomplete list of applications suggests that domain- and language-independent approaches for hedge detection need to be developed.
We investigate Wikipedia as a source of training data for hedge classification. We adopt Wikipedia's notion of weasel words, which we argue to be closely related to hedges and private states. Many Wikipedia articles contain a specific weasel tag, so that Wikipedia can be viewed as a readily annotated corpus. Based on this data, we have built a system to detect sentences that
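
To illustrate how weasel tags could turn Wikipedia markup into labeled sentences, here is a minimal Python sketch. It is not the authors' system. The template names in `WEASEL_TAG`, the function name `label_sentences`, and the naive sentence splitter are all assumptions made for illustration, and the tag inventory actually used in the paper may differ.

```python
import re

# Assumed tag inventory: these template names are illustrative only and are
# not taken from the paper.
WEASEL_TAG = re.compile(
    r"\{\{\s*(?:weasel[ -]inline|weasel[ -]word|who|which)\b[^{}]*\}\}",
    re.IGNORECASE,
)


def label_sentences(wikitext):
    """Return (clean_sentence, is_tagged) pairs for a piece of wikitext.

    is_tagged is True when the sentence contained one of the assumed tags.
    The tag itself is removed from the returned sentence text.
    """
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    # A tag placed directly after the final period is not handled here.
    sentences = re.split(r"(?<=[.!?])\s+", wikitext.strip())
    labeled = []
    for sentence in sentences:
        is_tagged = WEASEL_TAG.search(sentence) is not None
        clean = WEASEL_TAG.sub("", sentence)
        clean = re.sub(r"\s+", " ", clean).strip()
        if clean:
            labeled.append((clean, is_tagged))
    return labeled


if __name__ == "__main__":
    sample = (
        "Some critics say{{who}} that the film is overrated. "
        "The film was released in 2005."
    )
    for text, tagged in label_sentences(sample):
        print(tagged, text)
```

Sentences labeled `True` would serve as hedge candidates and the rest as negative examples. A real pipeline would need a proper sentence splitter and a wikitext parser rather than regular expressions.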