Scientific report: "A Morphologically Sensitive Clustering Algorithm for Identifying Arabic Roots"

Anne N. DE ROECK
Department of Computer Science, University of Essex, Colchester, CO4 3SQ, U.K.
deroe@essex.ac.uk

Abstract

We present a clustering algorithm for Arabic words sharing the same root. Root-based clusters can substitute for dictionaries in indexing for IR. Modifying Adamson and Boreham (1974), our Two-stage algorithm applies light stemming before calculating word pair similarity coefficients using techniques sensitive to Arabic morphology. Tests show a successful treatment of infixes and clustering accuracy of up to 94.06% for unedited Arabic text samples, without the use of dictionaries.

Introduction

Canonisation of words for indexing is an important and difficult problem for Arabic IR. Arabic is a highly inflectional language with 85% of words derived from tri-lateral roots (Al-Fedaghi and Al-Anzi 1989). Stems are derived from roots through the application of a set of fixed patterns. Addition of affixes to stems yields words. Words sharing a root are semantically related, and root indexing is reported to outperform stem and word indexing on both recall and precision (Hmeidi et al. 1997). However, Arabic morphology is excruciatingly complex (the Appendix attempts a brief introduction), and root identification on a scale useful for IR remains problematic.

Research on Arabic IR tends to treat automatic indexing and stemming separately. Al-Shalabi and Evans (1998) and El-Sadany and Hashish (1989) developed stemming algorithms.
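To make the abstract's idea of "light stemming before calculating word pair similarity coefficients" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the similarity measure is the Dice coefficient over unique adjacent letter pairs (the measure used by Adamson and Boreham), and it uses a simplified Latin transliteration of Arabic words. The affix lists `PREFIXES` and `SUFFIXES`, the `min_len` threshold, and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical affix lists for illustration only; they are NOT the paper's lists.
PREFIXES = ("Al", "w")
SUFFIXES = ("At", "wn", "p")


def light_stem(word, min_len=3):
    """Strip at most one listed prefix and one listed suffix.

    An affix is removed only if at least `min_len` letters remain,
    so short words are left untouched (an assumed safeguard).
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def bigrams(word):
    """Return the set of unique adjacent letter pairs in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient: 2 * shared unique bigrams / total unique bigrams."""
    bigrams_a, bigrams_b = bigrams(a), bigrams(b)
    if not bigrams_a and not bigrams_b:
        return 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def similarity(word1, word2):
    """Stage 1: light stemming. Stage 2: bigram similarity of the stems."""
    return dice(light_stem(word1), light_stem(word2))


if __name__ == "__main__":
    print(similarity("AlktAb", "ktAb"))  # prefix "Al" is stripped first
    print(dice("ktAb", "kAtb"))          # same consonants, different infix position
```

In this toy setting, `similarity("AlktAb", "ktAb")` is 1.0 because the prefix is removed before comparison, while `dice("ktAb", "kAtb")` is 0.0 even though both forms share the consonants k, t, b. That second result illustrates why plain adjacent bigrams struggle with infixes, the problem the abstract says the Two-stage algorithm treats with morphology-sensitive techniques (which are not reproduced here).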
Waleed AL-FARES
Computer Science Department, College of Business Studies, Hawaly, Kuwait
al-fareswaleed@usa.net

Hmeidi et al. (1997) developed an information retrieval system with an index, but do not explain the underlying stemming algorithm. In Al-Kharashi and Evans (1994), stemming is done manually and the IR index is built by manual insertion of roots, stems and words. Typically, Arabic stemming algorithms operate by trial and error. Affixes are stripped away and stems undone according