(LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts

Ciravegna, Dr. Fabio (2001) (LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts. In Proceedings IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, Seattle.


(LP)2 is an algorithm for adaptive Information Extraction from Web-related text that induces symbolic rules by learning from a corpus tagged with SGML tags. Induction is performed by bottom-up generalisation of examples in a training corpus. Training is performed in two steps: initially a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in tagging. Shallow NLP is used to generalise rules beyond the flat word structure. Generalisation allows better coverage of unseen texts, as it limits data sparseness and overfitting in the training phase. In experiments on publicly available corpora, the algorithm outperforms every other algorithm presented in the literature and tested on the same corpora. Experiments also show a significant gain from using NLP in terms of (1) effectiveness, (2) reduction of training time, and (3) reduction of training corpus size. In this paper we present the machine learning algorithm for rule induction. In particular, we focus on the NLP-based generalisation and the strategy for pruning both the search space and the final rule set.
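
The idea of bottom-up generalisation can be illustrated with a toy sketch in Python. This is not the (LP)2 implementation; the abstract does not give that level of detail. It assumes a rule conditions only on the two tokens to the left of the point where an opening tag is inserted, that each token carries a hand-assigned part-of-speech label standing in for shallow NLP, and that a candidate rule is kept only if it makes no errors on the training corpus. All names (`generalisations`, `induce_rule`, `max_error`) and the tiny corpus are hypothetical, and the second training step (correction rules) is not shown.

```python
from itertools import product

# A token is (word, pos); a document is (tokens, gold), where gold is the set
# of indices before which the opening tag should be inserted.
# The POS labels below are hand-assigned for illustration only.
corpus = [
    ([("Speaker", "NN"), (":", "PUNCT"), ("Dr", "NNP"), ("Smith", "NNP")], {2}),
    ([("Lecturer", "NN"), (":", "PUNCT"), ("Prof", "NNP"), ("Jones", "NNP")], {2}),
    ([("Room", "NN"), ("is", "VBZ"), ("B12", "NNP")], set()),
]

def generalisations(tokens, i, window=2):
    """Relax the seed example at position i: each slot in the left window
    may constrain the exact word, only its POS label, or nothing at all."""
    slots = []
    for word, pos in tokens[i - window:i]:
        slots.append([("word", word), ("pos", pos), ("any", None)])
    return [tuple(p) for p in product(*slots)]

def matches(pattern, tokens, i):
    """True if the pattern's constraints hold on the tokens just before i."""
    window = len(pattern)
    if i < window:
        return False
    for (kind, value), (word, pos) in zip(pattern, tokens[i - window:i]):
        if kind == "word" and word != value:
            return False
        if kind == "pos" and pos != value:
            return False
    return True

def score(pattern, docs):
    """Count correct and wrong tag insertions over the whole corpus."""
    correct = wrong = 0
    for tokens, gold in docs:
        for i in range(len(tokens) + 1):
            if matches(pattern, tokens, i):
                if i in gold:
                    correct += 1
                else:
                    wrong += 1
    return correct, wrong

def induce_rule(docs, seed_doc, seed_pos, window=2, max_error=0.0):
    """Return (pattern, correct, wrong) for the candidate with the widest
    correct coverage whose error rate does not exceed max_error."""
    tokens, _ = docs[seed_doc]
    best = None
    for pattern in generalisations(tokens, seed_pos, window):
        correct, wrong = score(pattern, docs)
        # Every candidate matches its own seed, so correct + wrong >= 1.
        if wrong / (correct + wrong) > max_error:
            continue
        if best is None or correct > best[1]:
            best = (pattern, correct, wrong)
    return best

best = induce_rule(corpus, seed_doc=0, seed_pos=2)
print(best)
```

On this toy corpus the word-only seed ("Speaker", ":") covers a single example, whereas replacing "Speaker" with its POS label covers both the "Speaker" and "Lecturer" documents without matching the third, which is the kind of coverage gain the abstract attributes to NLP-based generalisation.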

Keywords:	Natural Language Processing, Adaptive Information Extraction, rule induction, corpus annotation
Subjects:	Doctoral Symposia > First Doctoral Symposium
ID Code:	120
Deposited By:	Brewster, Christopher
Deposited On:	27 February 2003
Alternative Locations:	http://www.dcs.shef.ac.uk/~fabio/cira-papers.html
