Lecture Notes in Computer Science - P15

This year, we received about 170 submissions to ICWL 2008. A total of 52 full papers were accepted, representing an acceptance rate of about 30%, plus one invited paper accepted for inclusion in this LNCS proceedings. The authors of these accepted papers ...

60 J. Qiu et al.

To solve the above problems, a new method needs to be proposed. It should be able to distinguish content pages from non-content pages and then extract the main contents from content pages without using a template or a DOM tree. In this paper we propose a novel method for extracting main contents. The main contributions include:

(1) We define a new concept of block and propose a block-partition method for web pages. Without using a DOM tree or a template, main contents and noise can be well partitioned into different blocks.

(2) We define the concept of Block Distribution and study its features. Based on these features, we employ a classification method to distinguish content pages from non-content pages, and then employ outlier analysis to obtain the main contents from the Block Distribution.

The remainder of this paper is organized as follows. Section 2 gives a brief introduction to related work. Section 3 presents the block-partition method for web pages. Section 4 introduces the Block Distribution concept and its statistical features. Section 5 gives a thorough study of the performance of the new method. Section 6 summarizes our work.

2 Related Works

Some works [3, 4, 5] have studied template-based methods for extracting the contents of web pages. Li [3] proposes a hybrid method that employs both tag sequence matching and tree matching to extract news from news web pages. Geng [4] first generates mapping rules from specified news pages, and then employs these rules to extract information from web pages that have the same page structure. Yi [5] assumes that the layout of web pages is fixed within the same website and builds a Style Tree for the website. The contents of the website's pages can then be well extracted by using the Style Tree.

Lin [6] partitions a web page into blocks and then builds a profile vector for each block. The entropy of a block is derived from the entropy values of the features in that content block. Based on this entropy, each block is determined to be either informative or redundant. Cai [8] utilizes visual cues of web pages such as layout, font size
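
The entropy-based block classification attributed to Lin [6] above can be illustrated with a minimal sketch. It is not the implementation from [6]: it assumes that the features are terms, that a feature's entropy is the normalized Shannon entropy of its counts across a set of pages, that a block's entropy is the mean of its feature entropies, and that a hypothetical threshold of 0.5 separates redundant from informative blocks. All function names and data are illustrative.

```python
import math


def feature_entropy(counts_per_page):
    """Normalized Shannon entropy of one feature's counts across pages.

    Assumption: a term spread evenly over all pages (e.g. menu text)
    scores near 1, a term concentrated on one page scores near 0.
    """
    total = sum(counts_per_page)
    n_pages = len(counts_per_page)
    if total == 0 or n_pages < 2:
        return 0.0
    entropy = 0.0
    for count in counts_per_page:
        if count > 0:
            p = count / total
            # Log base n_pages keeps the result in [0, 1].
            entropy -= p * math.log(p, n_pages)
    return entropy


def block_entropy(block_terms, term_page_counts):
    """Mean entropy of the known features (terms) found in a block."""
    values = [
        feature_entropy(term_page_counts[term])
        for term in set(block_terms)
        if term in term_page_counts
    ]
    return sum(values) / len(values) if values else 0.0


def classify_block(block_terms, term_page_counts, threshold=0.5):
    """Label a block; the threshold is a hypothetical tuning parameter."""
    entropy = block_entropy(block_terms, term_page_counts)
    return "redundant" if entropy > threshold else "informative"


# Illustrative data: per-term counts over four pages of one site.
term_page_counts = {
    "home": [1, 1, 1, 1],
    "login": [1, 1, 1, 1],
    "earthquake": [5, 0, 0, 0],
    "tsunami": [3, 0, 0, 0],
}

navigation_block = ["home", "login"]
article_block = ["earthquake", "tsunami"]

print(classify_block(navigation_block, term_page_counts))
print(classify_block(article_block, term_page_counts))
```

With this toy data, the navigation terms appear uniformly on every page and so receive high entropy, while the article terms are concentrated on a single page and receive low entropy. A real system would need a principled way to choose the threshold and the feature set.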
