Scientific report: "Entity Type Variation across Two Biomedical Subdomains"
What's in a Name? Entity Type Variation across Two Biomedical Subdomains

Claudiu Mihăilă and Riza Theresa Batista-Navarro
National Centre for Text Mining, School of Computer Science, University of Manchester
Manchester Interdisciplinary Biocentre, 131 Princess Street, M1 7DN, Manchester, UK
claudiu.mihaila@cs.man.ac.uk, riza.batista-navarro@cs.man.ac.uk

Abstract

There are lexical, syntactic, semantic and discourse variations amongst the languages used in various biomedical subdomains. It is important to recognise such differences and understand that biomedical tools that work well on some subdomains may not work as well on others. We report here on the semantic variations that occur in the sublanguages of two biomedical subdomains, i.e. cell biology and pharmacology, at the level of named entity information. By building a classifier using ratios of named entities as features, we show that named entity information can discriminate between documents from each subdomain. More specifically, our classifier can distinguish between documents belonging to each subdomain with an F-score of 91.1.

1 Introduction

Biomedical information extraction efforts in the past decade have focussed on fundamental tasks needed to create intelligent systems capable of improving search engine results and easing the work of biologists.
More specifically, researchers have concentrated mainly on named entity recognition, mapping the recognised entities to concepts in curated databases (Krallinger et al., 2008) and extracting simple binary relations between entities. Recently, an increasing number of resources that facilitate the training of systems to extract more detailed information have become available, e.g. PennBioIE (Kulick et al., 2004), GENETAG (Tanabe et al., 2005), BioInfer (Pyysalo et al., 2007), GENIA (Kim et al., 2008), GREC (Thompson et al., 2009) and Metaknowledge GENIA (Thompson et al., 2011). Moreover, several other annotated corpora have been developed for shared task purposes, such as BioCreative I, II, III (Arighi et al., 2011) and