title: Extracting Data Records from the Web Using Tag Path Clustering
creator: Miao, Gengxin
creator: Tatemura, Junichi
creator: Hsiung, Wang-Pin
creator: Sawires, Arsany
creator: Moser, Louise E.
description: Fully automatic methods that extract lists of objects from the Web have been studied extensively. Record extraction, the first step of this object extraction process, identifies a set of Web page segments, each of which represents an individual object (e.g., a product). State-of-the-art methods suffice for simple search, but they often fail to handle more complicated or noisy Web page structures due to a key limitation – their greedy manner of identifying a list of records through pairwise comparison (i.e., similarity match) of consecutive segments. This paper introduces a new method for record extraction that captures a list of objects in a more robust way based on a holistic analysis of a Web page. The method focuses on how a distinct tag path appears repeatedly in the DOM tree of the Web document. Instead of comparing a pair of individual segments, it compares a pair of tag path occurrence patterns (called visual signals) to estimate how likely these two tag paths represent the same list of objects. The paper introduces a similarity measure that captures how closely the visual signals appear and interleave. Clustering of tag paths is then performed based on this similarity measure, and sets of tag paths that form the structure of data records are extracted. Experiments show that this method achieves higher accuracy than previous methods.
date: 2009-04
type: Conference or Workshop Item
type: PeerReviewed
format: application/pdf
identifier: http://www2009.eprints.org/99/1/p981.pdf
identifier: Miao, Gengxin <http://www2009.eprints.org/view/author/Miao=3AGengxin=3A=3A.html> and Tatemura, Junichi <http://www2009.eprints.org/view/author/Tatemura=3AJunichi=3A=3A.html> and Hsiung, Wang-Pin <http://www2009.eprints.org/view/author/Hsiung=3AWang-Pin=3A=3A.html> and Sawires, Arsany <http://www2009.eprints.org/view/author/Sawires=3AArsany=3A=3A.html> and Moser, Louise E. <http://www2009.eprints.org/view/author/Moser=3ALouise_E=2E=3A=3A.html> (2009) Extracting Data Records from the Web Using Tag Path Clustering. In: 18th International World Wide Web Conference, April 20th-24th, 2009, Madrid, Spain.
relation: http://www2009.eprints.org/99/