tailieunhanh - Scientific report: "Development of Corpora within the CLaRK System: The BulTreeBank Project Experience"

Kiril Simov, Alexander Simov, Milen Kouylekov, Krasimira Ivanova, Ilko Grigorov, Hristo Ganev
BulTreeBank Project, Linguistic Modelling Laboratory - CLPPI, BAS, Sofia, Bulgaria
kivs@, adis_78@, mkouylekov@, krassy_v@abv.bg, ilko_grigorov@yahoo.com, hristo_ganev79@yahoo.com

Abstract

CLaRK is an XML-based software system for corpora development. It incorporates several technologies: XML technology; Unicode; Regular Cascaded Grammars; Constraints over XML Documents. The basic components of the system are: a tagger, a concordancer, an extractor, a grammar processor, a constraint engine.

1 Introduction

The CLaRK System is an XML-based system for corpora development; see Simov et al. (2001). The main aim behind the design of the system is to minimize human intervention during the creation of language resources. It incorporates the following technologies: XML technology; Unicode; Regular Cascaded Grammars; Constraints over XML Documents.

For document management, storage and querying we chose XML technology because of its popularity and its ease of understanding. The core of CLaRK is a Unicode XML Editor, which is the main interface to the system. Besides the XML language itself, we implemented an XPath language for navigation in documents and an XSLT engine for transformation of XML documents. The XSL transformations can be applied locally to an XML element and its content.

For multilingual processing tasks, CLaRK is based on a Unicode encoding of the text inside the system. There is a mechanism for the creation of a hierarchy of tokenisers. They can be attached to the elements in the DTDs, so that different tokenisers apply to different parts of the documents.

The basic mechanism of CLaRK for linguistic processing of text corpora is the cascaded regular grammar processor. The main challenge to the grammars in question is how to apply them on XML encoding
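
The idea of attaching tokenisers to elements, mentioned in the introduction, can be illustrated with a minimal Python sketch. It is not CLaRK's implementation. The tokeniser functions, the `TOKENISERS` mapping, the element names `s` and `num`, and the rule that an element without its own tokeniser inherits its parent's are all assumptions made for illustration. In CLaRK the attachment is declared through the DTD, whereas here a plain dictionary keyed by element name stands in for it.

```python
import re
import xml.etree.ElementTree as ET


def word_tokeniser(text):
    # Hypothetical default tokeniser: words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def number_tokeniser(text):
    # Hypothetical specialised tokeniser: keeps decimals like "3.14" whole.
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", text)


# Assumed mapping from element names to tokenisers (stands in for the DTD).
TOKENISERS = {"s": word_tokeniser, "num": number_tokeniser}


def tokenise(element, inherited=word_tokeniser):
    """Tokenise an element's text in document order.

    The tokeniser is chosen by element name; an element with no entry
    in TOKENISERS inherits the tokeniser of its parent (an assumption).
    """
    tokeniser = TOKENISERS.get(element.tag, inherited)
    tokens = []
    if element.text:
        tokens.extend(tokeniser(element.text))
    for child in element:
        tokens.extend(tokenise(child, tokeniser))
        # Text after a child belongs to the enclosing element.
        if child.tail:
            tokens.extend(tokeniser(child.tail))
    return tokens


doc = ET.fromstring("<s>The price is <num>3.14</num> leva.</s>")
tokens = tokenise(doc)
print(tokens)
```

In this sketch the content of `num` is handled by the number tokeniser, so `3.14` stays one token, while the surrounding sentence text goes through the default word tokeniser, which would have split the same string into three tokens.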
