Large scale acquisition and maintenance from the web without source access
(2001) Large scale acquisition and maintenance from the web without source access. In Proceedings K-CAP 2001.
Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically. Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback. The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation. In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
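The record-and-replay idea in the abstract can be illustrated with a minimal sketch. This is not Dome's implementation: the operation names (`delete`, `rename`), the tuple encoding, and the `replay` and `replay_batch` helpers are all hypothetical, and the sketch assumes the pages are well-formed XML handled with Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# A recorded edit sequence: each entry is one tree operation.
# Hypothetical encoding: ("delete", tag) or ("rename", old_tag, new_tag).
recorded_ops = [
    ("delete", "nav"),        # drop navigation clutter
    ("rename", "b", "name"),  # turn presentational markup into a field
]

def replay(ops, xml_text):
    """Apply a recorded sequence of tree edits to one document."""
    root = ET.fromstring(xml_text)
    for op in ops:
        if op[0] == "delete":
            # Snapshot the node list first so removal is safe during traversal.
            for parent in list(root.iter()):
                for child in list(parent):
                    if child.tag == op[1]:
                        parent.remove(child)
        elif op[0] == "rename":
            for node in list(root.iter(op[1])):
                node.tag = op[2]
        else:
            raise ValueError(f"unknown operation: {op[0]}")
    return ET.tostring(root, encoding="unicode")

def replay_batch(ops, pages):
    """Apply the same recorded sequence to a collection of similar pages."""
    return [replay(ops, page) for page in pages]

page_a = "<page><nav>menu</nav><item><b>Dome</b></item></page>"
page_b = ("<page><nav>other</nav><item><b>Tool</b></item>"
          "<item><b>Editor</b></item></page>")

results = replay_batch(recorded_ops, [page_a, page_b])
```

Because the sequence refers to structure (tag names) rather than to specific content, the same recording can be rerun when a source page changes or applied to other pages with the same layout.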
Subjects: | Status > AKT Showcase Papers; AKT Challenges > Knowledge acquisition |
---|---|
ID Code: | 8 |
Deposited By: | Glaser, Hugh |
Deposited On: | 22 January 2002 |