AKT EPrint Archive

Large scale acquisition and maintenance from the web without source access

Leonard, Thomas and Glaser, Hugh (2001) Large scale acquisition and maintenance from the web without source access. In Proceedings K-CAP 2001.

Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically. Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback. The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation. In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
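
The record-and-replay idea in the abstract can be illustrated with a minimal Python sketch. This is not Dome's implementation or API: the step format, the tree paths, the function names, and the sample pages are all hypothetical, and the sketch assumes well-formed markup that the standard-library ElementTree parser can read.

```python
import xml.etree.ElementTree as ET

# Hypothetical recorded sequence: each step pairs a field name with the tree
# path that was selected when the extraction was first done by hand.
RECORDED_STEPS = [
    ("title", "./body/h1"),
    ("author", "./body/div[@class='meta']/span[1]"),
    ("year", "./body/div[@class='meta']/span[2]"),
]


def replay(steps, page_source):
    """Replay a recorded sequence of path selections against one page."""
    root = ET.fromstring(page_source)
    record = {}
    for field, path in steps:
        node = root.find(path)
        # None signals that the page no longer matches the recorded layout.
        if node is not None and node.text:
            record[field] = node.text.strip()
        else:
            record[field] = None
    return record


def replay_batch(steps, pages):
    """Apply the same recorded sequence to a collection of similar pages."""
    return [replay(steps, page) for page in pages]


# Two made-up pages sharing one layout, as if generated from a database.
PAGE_A = (
    '<html><body><h1>Paper One</h1>'
    '<div class="meta"><span>A. Author</span><span>2001</span></div>'
    '</body></html>'
)
PAGE_B = (
    '<html><body><h1>Paper Two</h1>'
    '<div class="meta"><span>B. Writer</span><span>2002</span></div>'
    '</body></html>'
)

if __name__ == "__main__":
    for record in replay_batch(RECORDED_STEPS, [PAGE_A, PAGE_B]):
        print(record)
```

Storing the steps as data rather than code is what lets the same sequence be rerun when a source page is updated, or applied unchanged to other pages with the same layout.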

Subjects: Status > AKT Showcase Papers; AKT Challenges > Knowledge acquisition
ID Code: 8
Deposited By: Glaser, Hugh
Deposited On: 22 January 2002