---
abstract: 'Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a general-purpose algorithm (C4.5). Subjective human evaluation of the keyphrases generated by GenEx suggests that about 80% of the keyphrases are acceptable to human readers. This level of performance should be satisfactory for a wide variety of applications.'
altloc:
  - http://extractor.iit.nrc.ca/reports/IR2000.html
chapter: ~
commentary: ~
commref: ~
confdates: ~
conference: ~
confloc: ~
contact_email: ~
creators_id: []
creators_name:
  - family: Turney
    given: Peter
    honourific: ''
    lineage: ''
date: 2000
date_type: published
datestamp: 2001-09-13
department: ~
dir: disk0/00/00/17/97
edit_lock_since: ~
edit_lock_until: ~
edit_lock_user: ~
editors_id: []
editors_name: []
eprint_status: archive
eprintid: 1797
fileinfo: /style/images/fileicons/application_postscript.png;/1797/1/IR2000.ps|/style/images/fileicons/application_pdf.png;/1797/5/IR2000.pdf
full_text_status: public
importid: ~
institution: ~
isbn: ~
ispublished: pub
issn: ~
item_issues_comment: []
item_issues_count: 0
item_issues_description: []
item_issues_id: []
item_issues_reported_by: []
item_issues_resolved_by: []
item_issues_status: []
item_issues_timestamp: []
item_issues_type: []
keywords: 'machine learning, summarization, indexing, keywords, keyphrase extraction.'
lastmod: 2011-03-11 08:54:47
latitude: ~
longitude: ~
metadata_visibility: show
note: ~
number: 4
pagerange: 303-336
pubdom: FALSE
publication: Information Retrieval
publisher: Kluwer
refereed: TRUE
referencetext: |-
  Brandow, R., Mitze, K., and Rau, L.R. (1995). The automatic condensation of electronic publications
  by sentence selection. Information Processing and Management, 31 (5), 675-685.
  Breiman, L. (1996a). Arcing Classifiers. Technical Report 460, Statistics Department, University of
  California at Berkeley.
  Breiman, L. (1996b). Bagging predictors. Machine Learning, 24 (2), 123-140.
  Buntine, W. (1989). Stratifying samples to improve learning. In Proceedings of the IJCAI-89 Workshop
  on Knowledge Discovery in Databases. Detroit, Michigan.
  Carter, C., and Catlett, J. (1987). Assessing credit card applications using machine learning. IEEE
  Expert, Fall issue, 71-79.
  Catlett, J. (1991). Megainduction: Machine Learning on Very Large Databases. Ph.D. Dissertation,
  Basser Department of Computer Science, University of Sydney.
  Croft, W.B., Turtle, H., and Lewis, D. (1991). The use of phrases and structured queries in information
  retrieval. SIGIR-91: Proceedings of the 14th Annual International ACM SIGIR Conference on
  Research and Development in Information Retrieval, pp. 32-45, New York: ACM.
  Deming, W.E. (1978). Sample surveys: The field. In William H. Kruskal and Judith M. Tanur (Eds.),
  International Encyclopedia of Statistics. New York: Free Press.
  Edmundson, H.P. (1969). New methods in automatic extracting. Journal of the Association for Computing
  Machinery, 16 (2), 264-285.
  Fagan, J.L. (1987). Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison
  of Syntactic and Non-Syntactic Methods. Ph.D. Dissertation, Department of Computer Science,
  Cornell University, Report #87-868, Ithaca, New York.
  Feelders, A. and Verkooijen, W. (1995). Which method learns the most from data? Methodological
  issues in the analysis of comparative studies. Fifth International Workshop on Artificial Intelligence
  and Statistics, Ft. Lauderdale, Florida, pp. 219-225.
  Field, B.J. (1975). Towards automatic indexing: Automatic assignment of controlled-language indexing
  and classification from free indexing. Journal of Documentation, 31 (4), 246-265.
  Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., and Nevill-Manning, C.G. (1999). Domain-specific
  keyphrase extraction. Proceedings of the Sixteenth International Joint Conference on Artificial
  Intelligence (IJCAI-99), pp. 668-673. California: Morgan Kaufmann.
  Fraser, D.A.S. (1976). Probability and Statistics: Theory and Applications. Massachusetts: Duxbury
  Press.
  Freund, Y., and Schapire, R.E. (1996). Experiments with a new boosting algorithm. Machine Learning:
  Proceedings of the Thirteenth International Conference (ICML-96), pp. 148-156. California:
  Morgan Kaufmann.
  Ginsberg, A. (1993). A unified approach to automatic indexing and information retrieval. IEEE
  Expert, 8, 46-56.
  Grefenstette, J.J. (1983). A user’s guide to GENESIS. Technical Report CS-83-11, Computer Science
  Department, Vanderbilt University.
  Grefenstette, J.J. (1986). Optimization of control parameters for genetic algorithms. IEEE Transactions
  on Systems, Man, and Cybernetics, 16, 122-128.
  Gutwin, C., Paynter, G.W., Witten, I.H., Nevill-Manning, C.G., and Frank, E. (1999). Improving
  browsing in digital libraries with keyphrase indexes. Decision Support Systems. In press.
  Jang, D.-H., and Myaeng, S.H. (1997). Development of a document summarization system for effective
  information services. RIAO 97 Conference Proceedings: Computer-Assisted Information
  Searching on Internet, pp. 101-111. Montreal, Canada.
  Johnson, F.C., Paice, C.D., Black, W.J., and Neal, A.P. (1993). The application of linguistic processing
  to automatic abstract generation. Journal of Document and Text Management, 1, 215-241.
  Krovetz, R. (1993). Viewing morphology as an inference process. Proceedings of the Sixteenth Annual
  International ACM SIGIR Conference on Research and Development in Information Retrieval,
  SIGIR'93, 191-203.
  Krulwich, B., and Burkey, C. (1996). Learning user information interests through the extraction of
  semantically significant phrases. In M. Hearst and H. Hirsh, editors, AAAI 1996 Spring Symposium
  on Machine Learning in Information Access. California: AAAI Press.
  Krupka, G. (1995). SRA: Description of the SRA system as used for MUC-6. Proceedings of the Sixth
  Message Understanding Conference. California: Morgan Kaufmann.
  Kubat, M., Holte, R., and Matwin, S. (1998). Machine learning for the detection of oil spills in satellite
  radar images. Machine Learning, 30 (2/3), 195-215.
  Kupiec, J., Pedersen, J., and Chen, F. (1995). A trainable document summarizer. In E.A. Fox, P. Ingwersen,
  and R. Fidel, editors, SIGIR-95: Proceedings of the 18th Annual International ACM SIGIR
  Conference on Research and Development in Information Retrieval, pp. 68-73, New York: ACM.
  Leung, C.-H., and Kan, W.-K. (1997). A statistical learning approach to automatic indexing of controlled
  index terms. Journal of the American Society for Information Science, 48 (1), 55-66.
  Lovins, J.B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational
  Linguistics, 11, 22-31.
  Luhn, H.P. (1958). The automatic creation of literature abstracts. I.B.M. Journal of Research and
  Development, 2 (2), 159-165.
  Maclin, R., and Opitz, D. (1997). An empirical evaluation of bagging and boosting. Proceedings of the
  Fourteenth National Conference on Artificial Intelligence (AAAI-97), pp. 546-551. AAAI Press.
  Marsh, E., Hamburger, H., and Grishman, R. (1984). A production rule system for message summarization.
  In AAAI-84, Proceedings of the American Association for Artificial Intelligence, pp. 243-246.
  Cambridge, MA: AAAI Press/MIT Press.
  Mathieu, J. (1999). Adaptation of a keyphrase extractor for Japanese text. Proceedings of the 27th
  Annual Conference of the Canadian Association for Information Science (CAIS-99), Sherbrooke,
  Quebec, pp. 182-189.
  MUC-3. (1991). Proceedings of the Third Message Understanding Conference. California: Morgan
  Kaufmann.
  MUC-4. (1992). Proceedings of the Fourth Message Understanding Conference. California: Morgan
  Kaufmann.
  MUC-5. (1993). Proceedings of the Fifth Message Understanding Conference. California: Morgan
  Kaufmann.
  MUC-6. (1995). Proceedings of the Sixth Message Understanding Conference. California: Morgan
  Kaufmann.
  Muñoz, A. (1996). Compound key word generation from document databases using a hierarchical
  clustering ART model. Intelligent Data Analysis, 1 (1), Amsterdam: Elsevier.
  Nakagawa, H. (1997). Extraction of index words from manuals. RIAO 97 Conference Proceedings:
  Computer-Assisted Information Searching on Internet, pp. 598-611. Montreal, Canada.
  Paice, C.D. (1990). Constructing literature abstracts by computer: Techniques and prospects. Information
  Processing and Management, 26 (1), 171-186.
  Paice, C.D., and Jones, P.A. (1993). The identification of important concepts in highly structured technical
  papers. SIGIR-93: Proceedings of the 16th Annual International ACM SIGIR Conference on
  Research and Development in Information Retrieval, pp. 69-78, New York: ACM.
  Porter, M.F. (1980). An algorithm for suffix stripping. Program; Automated Library and Information
  Systems, 14 (3), 130-137.
  Quinlan, J.R. (1987). Decision trees as probabilistic classifiers. In Langley, P. (Ed.), Proceedings of
  the Fourth International Workshop on Machine Learning, pp. 31-37. California: Morgan Kaufmann.
  Quinlan, J.R. (1990). Probabilistic decision trees. In Y. Kodratoff and R.S. Michalski, (Eds.), Machine
  Learning: An Artificial Intelligence Approach, Volume III, pp. 140-152, California: Morgan Kaufmann.
  Quinlan, J.R. (1993). C4.5: Programs for machine learning. California: Morgan Kaufmann.
  Quinlan, J.R. (1996). Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference
  on Artificial Intelligence (AAAI-96), pp. 725-730. AAAI Press.
  Salton, G. (1988). Syntactic approaches to automatic book indexing. Proceedings of the 26th Annual
  Meeting of the Association for Computational Linguistics, pp. 120-138. New York: ACM.
  Salton, G., Allan, J., Buckley, C., and Singhal, A. (1994). Automatic analysis, theme generation, and
  summarization of machine-readable texts. Science, 264, 1421-1426.
  Soderland, S., and Lehnert, W. (1994). Wrap-Up: A trainable discourse module for information
  extraction. Journal of Artificial Intelligence Research, 2, 131-158.
  Sparck Jones, K. (1973). Does indexing exhaustivity matter? Journal of the American Society for
  Information Science, September-October, 313-316.
  Steier, A. M., and Belew, R. K. (1993). Exporting phrases: A statistical analysis of topical language.
  In R. Casey and B. Croft, editors, Second Symposium on Document Analysis and Information
  Retrieval, pp. 179-190.
  Turney, P.D. (1997). Extraction of Keyphrases from Text: Evaluation of Four Algorithms. National
  Research Council, Institute for Information Technology, Technical Report ERB-1051.
  Turney, P.D. (1999). Learning to Extract Keyphrases from Text. National Research Council, Institute
  for Information Technology, Technical Report ERB-1057.
  Whitley, D. (1989). The GENITOR algorithm and selective pressure. Proceedings of the Third International
  Conference on Genetic Algorithms (ICGA-89), pp. 116-121. California: Morgan Kaufmann.
relation_type: []
relation_uri: []
reportno: ~
rev_number: 14
series: ~
source: ~
status_changed: 2007-09-12 16:40:37
subjects:
  - comp-sci-lang
  - comp-sci-mach-learn
  - comp-sci-stat-model
succeeds: ~
suggestions: ~
sword_depositor: ~
sword_slug: ~
thesistype: ~
title: Learning algorithms for keyphrase extraction
type: journalp
userid: 2175
volume: 2