DocEng 2011: An Efficient Language-Independent Method to Extract Content from News Webpages

Eduardo Teixeira Cardoso, Google Tech Talks

Published September 29, 2011

About this talk

The 11th ACM Symposium on Document Engineering
Mountain View, California, USA
September 19-22, 2011

An Efficient Language-Independent Method to Extract Content from News Webpages
Eduardo Teixeira Cardoso, Iam Jabour, Eduardo Laber, Rogério Ferreira Rodrigues, Pedro Lazéra Cardoso
Presented by Eduardo Teixeira Cardoso.

ABSTRACT

We tackle the task of news webpage segmentation, specifically identifying the news title, publication date and story body. While there are very good results in the literature, most of them rely on webpage rendering, which is a very time-consuming step. We focus on scenarios with a high volume of documents, where performance is a must.

The chosen approach extends our previous work in the area, combining structural properties with hints of visual presentation styles, computed with a quicker method than regular rendering, and machine learning algorithms.

In our experiments, we paid special attention to some aspects that are often overlooked in the literature, such as processing time and the generalization of the extraction results to unseen domains. Our approach has been shown to be about an order of magnitude faster than an equivalent full rendering alternative while retaining a good quality of extraction.