---
abstract: |-
  This paper describes the National Research Council (NRC) Word Sense Disambiguation (WSD) system, as applied to the English Lexical Sample (ELS) task in Senseval-3. The NRC system approaches WSD as a classical supervised machine learning problem, using familiar tools such as the Weka machine learning software and Brill's rule-based part-of-speech tagger. Head words are represented as feature vectors with several hundred features. Approximately half of the features are syntactic and the other half are semantic. The main novelty in the system is the method for generating the semantic features, based on word co-occurrence probabilities. The probabilities are estimated using the Waterloo MultiText System with a corpus of about one terabyte of unlabeled text, collected by a web crawler.
altloc: []
chapter: ~
commentary: ~
commref: ~
confdates: July 25-26
conference: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (SENSEVAL-3)
confloc: 'Barcelona, Spain'
contact_email: ~
creators_id: []
creators_name:
  - family: Turney
    given: Peter D.
    honourific: ''
    lineage: ''
date: 2004
date_type: published
datestamp: 2004-07-30
department: ~
dir: disk0/00/00/37/32
edit_lock_since: ~
edit_lock_until: ~
edit_lock_user: ~
editors_id: []
editors_name: []
eprint_status: archive
eprintid: 3732
fileinfo: /style/images/fileicons/application_pdf.png;/3732/1/NRC%2D47167.pdf
full_text_status: public
importid: ~
institution: ~
isbn: ~
ispublished: pub
issn: ~
item_issues_comment: []
item_issues_count: 0
item_issues_description: []
item_issues_id: []
item_issues_reported_by: []
item_issues_resolved_by: []
item_issues_status: []
item_issues_timestamp: []
item_issues_type: []
keywords: ~
lastmod: 2011-03-11 08:55:39
latitude: ~
longitude: ~
metadata_visibility: show
note: ~
number: ~
pagerange: 239-242
pubdom: FALSE
publication: ~
publisher: ~
refereed: TRUE
referencetext: |-
  Eric Brill. 1994. Some advances in transformation-based part of speech tagging. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), pages 722-727.
  Charles L.A. Clarke and Gordon V. Cormack. 2000. Shortest substring retrieval and ranking. ACM Transactions on Information Systems (TOIS), 18(1):44-78.
  Charles L.A. Clarke, G.V. Cormack, and F.J. Burkowski. 1995. An algebra for structured text search and a framework for its implementation. The Computer Journal, 38(1):43-56.
  Egidio L. Terra and Charles L.A. Clarke. 2003. Frequency estimates for statistical word similarity measures. In Proceedings of the Human Language Technology and North American Chapter of Association of Computational Linguistics Conference 2003 (HLT/NAACL 2003), pages 244-251.
  Peter D. Turney. 2001. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), pages 491-502.
  Peter D. Turney. 2003. Coherent keyphrase extraction via Web mining. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), pages 434-439.
  Ian H. Witten and Eibe Frank. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Mateo, CA.
  D. Yarowsky, S. Cucerzan, R. Florian, C. Schafer, and R. Wicentowski. 2001. The Johns Hopkins SENSEVAL2 system descriptions. In Proceedings of SENSEVAL2, pages 163-166.
relation_type: []
relation_uri: []
reportno: ~
rev_number: 12
series: ~
source: ~
status_changed: 2007-09-12 16:53:07
subjects:
  - comp-sci-lang
  - ling-comput
  - ling-sem
  - comp-sci-mach-learn
succeeds: ~
suggestions: ~
sword_depositor: ~
sword_slug: ~
thesistype: ~
title: Word Sense Disambiguation by Web Mining for Word Co-occurrence Probabilities
type: confpaper
userid: 2175
volume: ~