---
abstract: |-
This paper describes the National Research Council (NRC)
Word Sense Disambiguation (WSD) system, as applied to the
English Lexical Sample (ELS) task in Senseval-3. The NRC system
approaches WSD as a classical supervised machine learning problem,
using familiar tools such as the Weka machine learning software
and Brill's rule-based part-of-speech tagger. Head words are
represented as feature vectors with several hundred features.
Approximately half of the features are syntactic and the other
half are semantic. The main novelty in the system is the method for
generating the semantic features, based on word co-occurrence
probabilities. The probabilities are estimated using
the Waterloo MultiText System with a corpus of about one terabyte of
unlabeled text, collected by a web crawler.
altloc: []
chapter: ~
commentary: ~
commref: ~
confdates: July 25-26
conference: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (SENSEVAL-3)
confloc: 'Barcelona, Spain'
contact_email: ~
creators_id: []
creators_name:
- family: Turney
given: Peter D.
honourific: ''
lineage: ''
date: 2004
date_type: published
datestamp: 2004-07-30
department: ~
dir: disk0/00/00/37/32
edit_lock_since: ~
edit_lock_until: ~
edit_lock_user: ~
editors_id: []
editors_name: []
eprint_status: archive
eprintid: 3732
fileinfo: /style/images/fileicons/application_pdf.png;/3732/1/NRC%2D47167.pdf
full_text_status: public
importid: ~
institution: ~
isbn: ~
ispublished: pub
issn: ~
item_issues_comment: []
item_issues_count: 0
item_issues_description: []
item_issues_id: []
item_issues_reported_by: []
item_issues_resolved_by: []
item_issues_status: []
item_issues_timestamp: []
item_issues_type: []
keywords: ~
lastmod: 2011-03-11 08:55:39
latitude: ~
longitude: ~
metadata_visibility: show
note: ~
number: ~
pagerange: 239-242
pubdom: FALSE
publication: ~
publisher: ~
refereed: TRUE
referencetext: |-
Eric Brill. 1994. Some advances in transformation-based part of speech tagging. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), pages 722-727.
Charles L.A. Clarke and Gordon V. Cormack. 2000. Shortest substring retrieval and ranking. ACM Transactions on Information Systems (TOIS), 18(1):44-78.
Charles L.A. Clarke, G.V. Cormack, and F.J. Burkowski. 1995. An algebra for structured text search and a framework for its implementation. The Computer Journal, 38(1):43-56.
Egidio L. Terra and Charles L.A. Clarke. 2003. Frequency estimates for statistical word similarity measures. In Proceedings of the Human Language Technology and North American Chapter of Association of Computational Linguistics Conference 2003 (HLT/NAACL 2003), pages 244-251.
Peter D. Turney. 2001. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), pages 491-502.
Peter D. Turney. 2003. Coherent keyphrase extraction via Web mining. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), pages 434-439.
Ian H. Witten and Eibe Frank. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Mateo, CA.
D. Yarowsky, S. Cucerzan, R. Florian, C. Schafer, and R. Wicentowski. 2001. The Johns Hopkins SENSEVAL2 system descriptions. In Proceedings of SENSEVAL2, pages 163-166.
relation_type: []
relation_uri: []
reportno: ~
rev_number: 12
series: ~
source: ~
status_changed: 2007-09-12 16:53:07
subjects:
- comp-sci-lang
- ling-comput
- ling-sem
- comp-sci-mach-learn
succeeds: ~
suggestions: ~
sword_depositor: ~
sword_slug: ~
thesistype: ~
title: Word Sense Disambiguation by Web Mining for Word Co-occurrence Probabilities
type: confpaper
userid: 2175
volume: ~