An improved indexing method for querying big XML files

Journal of Computer Science and Cybernetics, 2023, pp. 323-342, DOI no. 1813-9663/19018

AN IMPROVED INDEXING METHOD FOR QUERYING BIG XML FILES

DINH DUC LUONG¹, VUONG QUANG PHUONG², HOANG DO THANH TUNG²

¹ Food Industrial College, Phu Tho, Viet Nam
² Institute of Information Technology, Vietnam Academy of Science and Technology, Ha Noi, Viet Nam

Abstract. The exponential growth of bioinformatics in the healthcare domain has revolutionized our understanding of DNA, proteins, and other biomolecular entities. This remarkable progress has generated an overwhelming volume of data, necessitating big data technologies for efficient storage and indexing. While big data technologies like Hadoop offer substantial support for big XML file storage, the challenges of indexing data size and XPath query performance persist. To enhance the efficiency of XPath queries and address the data size problem, a novel approach is proposed that is derived from the spatial indexing method of the R-tree family. The proposed method modifies the structure of leaf nodes in the indexing tree to preserve XML sibling connections. New algorithms are then introduced for constructing the new tree structure and for processing sibling queries more efficiently. Experimental results demonstrate the superior efficiency of sibling XPath queries with reduced data sizes for indexing, while other XPath queries exhibit notable performance improvements.
This research contributes to the development of more effective indexing methods for managing and querying large XML datasets in bioinformatics applications, ultimately advancing biomedical research and healthcare initiatives.

Keywords. Big data; Indexing; Analysis of XML; Bio-XML files; XML query processing.

1. INTRODUCTION

XML documents store structured text data, also known as semi-structured data [40, 41]. They have been popular for decades because of their flexible data structure and easy sharing over the Internet. Usually, the XML documents used on the Internet are not very large. However, the