---
abstract: "This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing)."
altloc:
  - http://extractor.iit.nrc.ca/reports/ecml2001.html
chapter: ~
commentary: ~
commref: ~
confdates: 'September 3-7, 2001'
conference: Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001)
confloc: 'Freiburg, Germany'
contact_email: ~
creators_id: []
creators_name:
  - family: Turney
    given: Peter
    honourific: ''
    lineage: ''
date: 2001
date_type: published
datestamp: 2001-09-12
department: ~
dir: disk0/00/00/17/96
edit_lock_since: ~
edit_lock_until: ~
edit_lock_user: ~
editors_id: []
editors_name:
  - family: De Raedt
    given: Luc
    honourific: ''
    lineage: ''
  - family: Flach
    given: Peter
    honourific: ''
    lineage: ''
eprint_status: archive
eprintid: 1796
fileinfo: /style/images/fileicons/application_postscript.png;/1796/1/ECML2001.ps|/style/images/fileicons/application_pdf.png;/1796/5/ECML2001.pdf
full_text_status: public
importid: ~
institution: ~
isbn: ~
ispublished: pub
issn: ~
item_issues_comment: []
item_issues_count: 0
item_issues_description: []
item_issues_id: []
item_issues_reported_by: []
item_issues_resolved_by: []
item_issues_status: []
item_issues_timestamp: []
item_issues_type: []
keywords: 'PMI-IR, synonyms, LSA, LSI, Latent Semantic Analysis, text mining, web mining, TOEFL, mutual information'
lastmod: 2011-03-11 08:54:47
latitude: ~
longitude: ~
metadata_visibility: show
note: ~
number: ~
pagerange: 491-502
pubdom: FALSE
publication: ~
publisher: Springer-Verlag
refereed: TRUE
referencetext: |-
  1. Church, K.W., Hanks, P.: Word Association Norms, Mutual Information and Lexicography.
  In: Proceedings of the 27th Annual Conference of the Association of Computational Linguistics,
  (1989) 76-83.
  2. Church, K.W., Gale, W., Hanks, P., Hindle, D.: Using Statistics in Lexical Analysis. In: Uri
  Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. New
  Jersey: Lawrence Erlbaum (1991) 115-164.
  3. AltaVista, AltaVista Company, Palo Alto, California, http://www.altavista.com/.
  4. Test of English as a Foreign Language (TOEFL), Educational Testing Service, Princeton,
  New Jersey, http://www.ets.org/.
  5. Tatsuki, D.: Basic 2000 Words - Synonym Match 1. In: Interactive JavaScript Quizzes for
  ESL Students, http://www.aitech.ac.jp/~iteslj/quizzes/js/dt/mc-2000-01syn.html (1998).
  6. Landauer, T.K., Dumais, S.T.: A Solution to Plato’s Problem: The Latent Semantic Analysis
  Theory of the Acquisition, Induction, and Representation of Knowledge. Psychological Review,
  104 (1997) 211-240.
  7. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by
  Latent Semantic Analysis. Journal of the American Society for Information Science, 41
  (1990) 391-407.
  8. Berry, M.W., Dumais, S.T., Letsche, T.A.: Computational Methods for Intelligent Information
  Access. Proceedings of Supercomputing ’95, San Diego, California, (1995).
  9. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. Cambridge,
  Massachusetts: MIT Press (1999).
  10. Firth, J.R.: A Synopsis of Linguistic Theory 1930-1955. In Studies in Linguistic Analysis,
  pp. 1-32. Oxford: Philological Society (1957). Reprinted in F.R. Palmer (ed.), Selected Papers
  of J.R. Firth 1952-1959, London: Longman (1968).
  11. AltaVista: AltaVista Advanced Search Cheat Sheet, AltaVista Company, Palo Alto, California,
  http://doc.altavista.com/adv_search/syntax.html (2001).
  12. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. Cambridge, Massachusetts:
  MIT Press (1998). For more information: http://www.cogsci.princeton.edu/~wn/.
  13. Haase, K.: Interlingual BRICO. IBM Systems Journal, 39 (2000) 589-596. For more information:
  http://www.framerd.org/brico/.
  14. Vossen, P. (ed.): EuroWordNet: A Multilingual Database with Lexical Semantic Networks.
  Dordrecht, Netherlands: Kluwer (1998). See: http://www.hum.uva.nl/~ewn/.
  15. Turney, P.D.: Learning Algorithms for Keyphrase Extraction. Information Retrieval, 2
  (2000) 303-336.
  16. Grefenstette, G.: Finding Semantic Similarity in Raw Text: The Deese Antonyms. In: R.
  Goldman, P. Norvig, E. Charniak and B. Gale (eds.), Working Notes of the AAAI Fall Symposium
  on Probabilistic Approaches to Natural Language. AAAI Press (1992) 61-65.
  17. Schütze, H.: Word Space. In: S.J. Hanson, J.D. Cowan, and C.L. Giles (eds.), Advances in
  Neural Information Processing Systems 5, San Mateo California: Morgan Kaufmann (1993)
  895-902.
  18. Lin, D.: Automatic Retrieval and Clustering of Similar Words. In: Proceedings of the 17th
  International Conference on Computational Linguistics and 36th Annual Meeting of the Association
  for Computational Linguistics, Montreal (1998) 768-773.
  19. Richardson, R., Smeaton, A., Murphy, J.: Using WordNet as a Knowledge Base for Measuring
  Semantic Similarity between Words. In Proceedings of AICS Conference. Trinity
  College, Dublin (1994).
  20. Lee, J.H., Kim, M.H., Lee, Y.J.: Information Retrieval Based on Conceptual Distance in IS-A
  Hierarchies. Journal of Documentation, 49 (1993) 188-207.
  21. Resnik, P.: Semantic Similarity in a Taxonomy: An Information-Based Measure and its
  Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence
  Research, 11 (1998) 95-130.
  22. Jiang, J., Conrath, D.: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy.
  In: Proceedings of the 10th International Conference on Research on Computational
  Linguistics, Taiwan, (1997).
  23. Brin, S., Motwani, R., Ullman, J., Tsur, S.: Dynamic Itemset Counting and Implication
  Rules for Market Basket Data. In: Proceedings of the 1997 ACM-SIGMOD International
  Conference on the Management of Data (1997) 255-264.
  24. Sullivan, D.: Search Engine Sizes. SearchEngineWatch.com, internet.com Corporation,
  Darien, Connecticut, http://searchenginewatch.com/reports/sizes.html (2000).
  25. Papadimitriou, C.H., Raghavan, P., Tamaki, H., Vempala, S.: Latent Semantic Indexing: A
  Probabilistic Analysis. In: Proceedings of the Seventeenth ACM-SIGACT-SIGMOD-SIGART
  Symposium on Principles of Database Systems, Seattle, Washington (1998) 159-168.
  26. Sparck Jones, K.: Comparison Between TREC2 and TREC3. In: D. Harman (ed.), The
  Third Text REtrieval Conference (TREC3), National Institute of Standards and Technology
  Special Publication 500-226, Gaithersburg, Maryland (1994) C1-C4.
  27. Buckley, C., Salton, G., Allan, J., Singhal, A.: Automatic Query Expansion Using SMART:
  TREC 3. In: The Third Text REtrieval Conference (TREC3), D. Harman (ed.), National Institute
  of Standards and Technology Special Publication 500-226, Gaithersburg, Maryland
  (1994) 69-80.
relation_type: []
relation_uri: []
reportno: ~
rev_number: 14
series: ~
source: ~
status_changed: 2007-09-12 16:40:36
subjects:
  - comp-sci-lang
  - comp-sci-mach-learn
  - comp-sci-stat-model
succeeds: ~
suggestions: ~
sword_depositor: ~
sword_slug: ~
thesistype: ~
title: 'Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL'
type: confpaper
userid: 2175
volume: ~