AKT EPrint Archive

Mining the Semantic Web: Requirements for Machine Learning

Ciravegna, Prof. Fabio and Chapman, Mr. Sam (2005) Mining the Semantic Web: Requirements for Machine Learning. Machine Learning for the Semantic Web Dagstuhl Seminar 05071, Dagstuhl, DE.

In its current form, the Web's content is designed and published for human reading and is not typically tractable by machines; the Semantic Web (SW) is expected to extend this by providing structured content through the addition of annotations. A prerequisite for the SW is the availability of structured knowledge, so methods must be employed to generate it from existing unstructured content (document annotation). A number of tools have been proposed for manual annotation of documents; some of them use Information Extraction (IE) to reduce the burden on the user. Relying on a manual process presents risks for the SW because it creates a bottleneck: convincing millions of users to annotate documents requires a worldwide effort that is unlikely to succeed. Moreover, there are serious concerns about the quality of manual annotation, due to user inability or to spamming. To produce a viable and maintainable SW, large-scale automatic annotation services, similar to today's search engines, are needed. They must be (1) easily defined for a specific ontological component or service and (2) able to constantly re-index documents (so as to solve the problem of obsolete or misaligned annotations). Machine Learning (ML) and IE thus become indispensable for developing SW tools able to extract and structure information: in this paper we focus on identifying requirements and challenges for future research in ML and IE applied to the SW. When detailing the requirements and challenges we refer, as an example, to Armadillo. Armadillo is a tool for extracting and integrating information from large repositories (e.g. the Web), developed at Sheffield. Armadillo is able to (1) learn to extract facts and entities in a largely unsupervised way and (2) cope with semi-structured and free-text documents. The learning algorithm currently integrated into Armadillo is (LP)², implemented in Amilcare.
The requirements and challenges that we identify, however, are not specific to Armadillo; they are shared by other SW tools with similar aims.

Subjects: AKT Challenges > Knowledge acquisition
ID Code: 399
Deposited By: Norton, Mr Barry
Deposited On: 12 March 2005
Alternative Locations: http://www.smi.ucd.ie/Dagstuhl-MLSW/proceedings/ciravegna-chapman.pdf
