(LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts
(2001) (LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts. In Proceedings IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, Seattle.
(LP)2 is an algorithm for adaptive Information Extraction from Web-related text that induces symbolic rules by learning from a corpus tagged with SGML tags. Induction is performed by bottom-up generalisation of examples in a training corpus. Training is performed in two steps: initially a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in tagging. Shallow NLP is used to generalise rules beyond the flat word structure. Generalisation allows better coverage of unseen texts, as it limits data sparseness and overfitting in the training phase. In experiments on publicly available corpora the algorithm outperforms every other algorithm presented in the literature and tested on the same corpora. Experiments also show a significant gain from using NLP in terms of (1) effectiveness, (2) reduction of training time, and (3) training corpus size. In this paper we present the machine learning algorithm for rule induction. In particular we focus on the NLP-based generalisation and the strategy for pruning both the search space and the final rule set.
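The bottom-up generalisation described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy corpus, the window size and the error threshold are all assumptions made for illustration. It covers only the first training step (learning tagging rules), leaves out the correction rules, and enumerates every generalisation exhaustively rather than pruning the search space. A rule is a set of constraints on the words around a tag insertion point. Each constraint starts as an exact word match and may be relaxed to a part-of-speech match (standing in for shallow NLP) or dropped.

```python
from itertools import product

def generalisations(tokens, position, window=2):
    """Yield every relaxation of the seed rule built at `position`.

    The seed rule constrains the exact word at each offset in the window.
    Each constraint may be kept, relaxed to its part-of-speech tag, or dropped.
    """
    options = []
    for offset in range(-window, window):
        i = position + offset
        if 0 <= i < len(tokens):
            word, pos = tokens[i]
            options.append([(offset, "word", word), (offset, "pos", pos), None])
    for combo in product(*options):
        rule = tuple(c for c in combo if c is not None)
        if rule:  # skip the empty rule, which would fire everywhere
            yield rule

def matches(rule, tokens, position):
    """True if every constraint of `rule` holds around `position`."""
    for offset, kind, value in rule:
        i = position + offset
        if not 0 <= i < len(tokens):
            return False
        word, pos = tokens[i]
        if (word if kind == "word" else pos) != value:
            return False
    return True

def fired_positions(rule, tokens):
    """All positions in the corpus where `rule` would insert a tag."""
    return [p for p in range(len(tokens) + 1) if matches(rule, tokens, p)]

def induce(tokens, gold, window=2, max_error=0.0):
    """Covering loop: pick an uncovered example, keep its best generalisation."""
    rules, uncovered = [], set(gold)
    while uncovered:
        position = min(uncovered)
        best = None
        for rule in generalisations(tokens, position, window):
            fired = fired_positions(rule, tokens)
            correct = sum(1 for p in fired if p in gold)
            if (len(fired) - correct) / len(fired) > max_error:
                continue  # too many wrong insertions on the training corpus
            key = (correct, -len(rule))  # prefer coverage, then generality
            if best is None or key > best[0]:
                best = (key, rule, fired)
        if best is None:  # no acceptable rule for this example (assumed policy)
            uncovered.discard(position)
            continue
        rules.append(best[1])
        uncovered -= set(best[2])
    return rules

# Toy corpus of (word, part-of-speech) pairs; a tag opens before tokens 2 and 9.
tokens = [("Speaker", "NNP"), (":", "SYM"), ("John", "NNP"), ("Smith", "NNP"),
          ("will", "MD"), ("talk", "VB"), (".", "."),
          ("Speaker", "NNP"), (":", "SYM"), ("Mary", "NNP"), ("Jones", "NNP"),
          ("will", "MD"), ("talk", "VB"), (".", ".")]
gold = {2, 9}
rules = induce(tokens, gold)
```

On this toy corpus the seed rule built from the first example mentions "John" and "Smith" and therefore covers only that example, whereas a generalised rule that drops those word constraints covers both insertion points. That is the coverage benefit the abstract attributes to generalisation.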
Keywords: | Natural Language Processing, Adaptive Information Extraction, rule induction, corpus annotation |
---|---|
Subjects: | Doctoral Symposia > First Doctoral Symposium |
ID Code: | 120 |
Deposited By: | Brewster, Christopher |
Deposited On: | 27 February 2003 |
Alternative Locations: | http://www.dcs.shef.ac.uk/~fabio/cira-papers.html |