Large scale acquisition and maintenance from the web without source access

Leonard, Thomas and Glaser, Hugh (2001) Large scale acquisition and maintenance from the web without source access. In Proceedings K-CAP 2001.

Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically. Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback. The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation. In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
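The record-and-replay idea in the abstract can be illustrated with a minimal sketch. It is not Dome's implementation or file format: the step tuples, the function names `replay` and `replay_batch`, and the sample pages are all hypothetical, and the pages are assumed to be well-formed XML so that Python's standard `xml.etree.ElementTree` can parse them.

```python
import xml.etree.ElementTree as ET

# Hypothetical recording: an ordered list of edit steps captured once on a sample page.
RECORDED_STEPS = [
    ("remove", ".//p[@class='nav']"),     # delete navigation residue
    ("extract", "name", "./body/p[1]"),   # first remaining paragraph
    ("extract", "email", "./body/p[2]"),  # second remaining paragraph
]


def replay(steps, page_source):
    """Apply the recorded steps, in order, to one well-formed page."""
    root = ET.fromstring(page_source)
    fields = {}
    for op, *args in steps:
        if op == "remove":
            (path,) = args
            # ElementTree nodes have no parent pointer, so build a child -> parent map.
            parents = {child: parent for parent in root.iter() for child in parent}
            for node in root.findall(path):
                parents[node].remove(node)
        elif op == "extract":
            field, path = args
            node = root.find(path)
            # Record None when the page does not match the expected layout.
            fields[field] = "".join(node.itertext()).strip() if node is not None else None
        else:
            raise ValueError(f"unknown step: {op}")
    return fields


def replay_batch(steps, pages):
    """Replay the same recording over a collection of similarly structured pages."""
    return [replay(steps, page) for page in pages]


# Made-up sample pages sharing one layout, as if generated from the same database.
PAGE_A = ("<html><body><p class='nav'>Home</p>"
          "<p>A. Example</p><p>a.example@example.org</p></body></html>")
PAGE_B = ("<html><body><p class='nav'>Home</p>"
          "<p>B. Sample</p><p>b.sample@example.org</p></body></html>")

if __name__ == "__main__":
    for record in replay_batch(RECORDED_STEPS, [PAGE_A, PAGE_B]):
        print(record)
```

The extraction steps are positional, so they only find the right paragraphs after the removal step has run. This is why the sketch replays the recording strictly in order, and why a step that no longer matches is recorded as `None` rather than silently skipped.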

Subjects:	Status > AKT Showcase Papers; AKT Challenges > Knowledge acquisition
ID Code:	8
Deposited By:	Glaser, Hugh
Deposited On:	22 January 2002
