Scientific report: "Syntagmatic and Paradigmatic Representations of Term Variation"

Christian Jacquemin
LIMSI-CNRS, BP 133, 91403 ORSAY Cedex, FRANCE
jacquemin@

Abstract

A two-tier model for the description of morphological, syntactic, and semantic variations of multi-word terms is presented. It is applied to term normalization of French and English corpora in the medical and agricultural domains. Five different sources of morphological and semantic knowledge are exploited (MULTEXT, CELEX, AGROVOC, [...], and the Microsoft Word97 thesaurus).

1 Introduction

In the classical approach to text retrieval, terms are assigned to queries and documents. The terms are generated by a process called automatic indexing. Then, given a query, the similarity between the query and the documents is computed, and a ranked list of documents is produced as the output of the information access system (Salton and McGill, 1983). The similarity between queries and documents depends on the terms they have in common.

The same concept can be formulated in many different ways, known as variants, which should be conflated in order to avoid missing relevant documents. For this purpose, this paper proposes a novel model of term variation that integrates linguistic knowledge and performs accurate term normalization. It relies on previous or ongoing linguistic studies on this topic (Sparck Jones and Tait, 1984; Jacquemin et al., 1997; Hamon et al., 1998). Terms are described in a two-tier framework composed of a paradigmatic level and a syntagmatic level that account for the three linguistic dimensions of term variability (morphology, syntax, and semantics).
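The retrieval setting described above (terms assigned to queries and documents, variants conflated, documents ranked by shared terms) can be illustrated with a minimal sketch. It is not the model proposed in the paper and it is not FASTR: the variant table, the function names, and the choice of cosine similarity are all assumptions made for illustration, and the example terms are invented.

```python
from collections import Counter
from math import sqrt

# Hypothetical variant table: maps surface variants to one canonical term.
# The entries are illustrative only and do not come from the paper.
VARIANT_TO_TERM = {
    "blood cell": "blood cell",
    "blood cells": "blood cell",
    "cell of the blood": "blood cell",
    "cell division": "cell division",
    "division of cells": "cell division",
}


def normalize(terms):
    """Conflate known variants to their canonical term; keep unknown terms as-is."""
    return Counter(VARIANT_TO_TERM.get(t, t) for t in terms)


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(query_terms, docs):
    """Return (score, doc_id) pairs sorted by decreasing similarity to the query."""
    q = normalize(query_terms)
    scored = [(cosine(q, normalize(terms)), doc_id) for doc_id, terms in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    docs = {
        "d1": ["cell of the blood", "division of cells"],
        "d2": ["soil erosion"],
    }
    # Without conflation, "blood cells" would share no term with d1.
    for score, doc_id in rank(["blood cells"], docs):
        print(doc_id, round(score, 3))
```

In this sketch, conflation is a simple table lookup. It only shows why normalization matters for matching: the query and document d1 have no surface string in common, yet they share a term once variants are mapped to a canonical form.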
Term variants are extracted from tagged corpora through FASTR¹, a unification-based transformational parser described in (Jacquemin et al., 1997). Four experiments are performed on the French and English languages, and a measure of precision is provided for each of them. Two experiments are made on a French corpus [AGRIC] composed of [...] × 10^6 words of scientific abstracts in

¹ FASTR