---
abstract: |-
  Keyphrases are useful for a variety of purposes,
  including summarizing, indexing, labeling,
  categorizing, clustering, highlighting, browsing, and
  searching. The task of automatic keyphrase extraction
  is to select keyphrases from within the text of a given
  document. Automatic keyphrase extraction makes it
  feasible to generate keyphrases for the huge number of
  documents that do not have manually assigned
  keyphrases. A limitation of previous keyphrase
  extraction algorithms is that the selected keyphrases are
  occasionally incoherent. That is, the majority of the
  output keyphrases may fit together well, but there may
  be a minority that appear to be outliers, with no clear
  semantic relation to the majority or to each other. This
  paper presents enhancements to the Kea keyphrase
  extraction algorithm that are designed to increase the
  coherence of the extracted keyphrases. The approach is
  to use the degree of statistical association among
  candidate keyphrases as evidence that they may be
  semantically related. The statistical association is
  measured using web mining. Experiments demonstrate
  that the enhancements improve the quality of the
  extracted keyphrases. Furthermore, the enhancements
  are not domain-specific: the algorithm generalizes well
  when it is trained on one domain (computer science
  documents) and tested on another (physics documents).
altloc: []
chapter: ~
commentary: ~
commref: ~
confdates: 9-15 August 2003
conference: Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03)
confloc: 'Acapulco, Mexico'
contact_email: ~
creators_id:
- 2175
creators_name:
- family: Turney
  given: Peter
  honourific: ''
  lineage: ''
date: 2003
date_type: published
datestamp: 2003-08-27
department: ~
dir: disk0/00/00/31/22
edit_lock_since: ~
edit_lock_until: ~
edit_lock_user: ~
editors_id: []
editors_name: []
eprint_status: archive
eprintid: 3122
fileinfo: /style/images/fileicons/application_pdf.png;/3122/1/NRC%2D46496.pdf
full_text_status: public
importid: ~
institution: ~
isbn: ~
ispublished: pub
issn: ~
item_issues_comment: []
item_issues_count: 0
item_issues_description: []
item_issues_id: []
item_issues_reported_by: []
item_issues_resolved_by: []
item_issues_status: []
item_issues_timestamp: []
item_issues_type: []
keywords: ~
lastmod: 2011-03-11 08:55:20
latitude: ~
longitude: ~
metadata_visibility: show
note: ~
number: ~
pagerange: 434-439
pubdom: FALSE
publication: ~
publisher: ~
refereed: TRUE
relation_type: []
relation_uri: []
reportno: ~
rev_number: 12
series: ~
source: ~
status_changed: 2007-09-12 16:48:30
subjects:
- comp-sci-stat-model
- comp-sci-lang
- comp-sci-mach-learn
succeeds: ~
suggestions: ~
sword_depositor: ~
sword_slug: ~
thesistype: ~
title: Coherent Keyphrase Extraction via Web Mining
type: confpaper
userid: 2175
volume: ~