---
abstract: |-
  Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. A limitation of previous keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. That is, the majority of the output keyphrases may fit together well, but there may be a minority that appear to be outliers, with no clear semantic relation to the majority or to each other. This paper presents enhancements to the Kea keyphrase extraction algorithm that are designed to increase the coherence of the extracted keyphrases. The approach is to use the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. The statistical association is measured using web mining. Experiments demonstrate that the enhancements improve the quality of the extracted keyphrases. Furthermore, the enhancements are not domain-specific: the algorithm generalizes well when it is trained on one domain (computer science documents) and tested on another (physics documents).
altloc: []
chapter: ~
commentary: ~
commref: ~
confdates: 9-15 August 2003
conference: Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03)
confloc: 'Acapulco, Mexico'
contact_email: ~
creators_id:
  - 2175
creators_name:
  - family: Turney
    given: Peter
    honourific: ''
    lineage: ''
date: 2003
date_type: published
datestamp: 2003-08-27
department: ~
dir: disk0/00/00/31/22
edit_lock_since: ~
edit_lock_until: ~
edit_lock_user: ~
editors_id: []
editors_name: []
eprint_status: archive
eprintid: 3122
fileinfo: /style/images/fileicons/application_pdf.png;/3122/1/NRC%2D46496.pdf
full_text_status: public
importid: ~
institution: ~
isbn: ~
ispublished: pub
issn: ~
item_issues_comment: []
item_issues_count: 0
item_issues_description: []
item_issues_id: []
item_issues_reported_by: []
item_issues_resolved_by: []
item_issues_status: []
item_issues_timestamp: []
item_issues_type: []
keywords: ~
lastmod: 2011-03-11 08:55:20
latitude: ~
longitude: ~
metadata_visibility: show
note: ~
number: ~
pagerange: 434-439
pubdom: FALSE
publication: ~
publisher: ~
refereed: TRUE
referencetext: |
  [Barzilay and Elhadad, 1997] Barzilay, R., and Elhadad, M. Using lexical chains for text summarization. In Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, 10-17, 1997.
  [Church and Hanks, 1989] Church, K.W., and Hanks, P. Word association norms, mutual information and lexicography. Proceedings of the 27th Annual Conference of the Association of Computational Linguistics, pp. 76-83, 1989.
  [Church et al., 1991] Church, K.W., Gale, W., Hanks, P., and Hindle, D. Using statistics in lexical analysis. In Uri Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pp. 115-164. New Jersey: Lawrence Erlbaum, 1991.
  [Domingos and Pazzani, 1997] Domingos, P., and Pazzani, M. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130, 1997.
  [Dumais et al., 1998] Dumais, S., Platt, J., Heckerman, D. and Sahami, M. Inductive learning algorithms and representations for text categorization. Proceedings of the Seventh International Conference on Information and Knowledge Management, 148-155. ACM, 1998.
  [Fayyad and Irani, 1993] Fayyad, U.M., and Irani, K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of 13th International Joint Conference on Artificial Intelligence (IJCAI-93), pp. 1022-1027, 1993.
  [Feelders and Verkooijen, 1995] Feelders, A., and Verkooijen, W. Which method learns the most from data? Methodological issues in the analysis of comparative studies. Fifth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, Florida, pp. 219-225, 1995.
  [Frank et al., 1999] Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., and Nevill-Manning, C.G. Domain-specific keyphrase extraction. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), pp. 668-673. California: Morgan Kaufmann, 1999.
  [Gutwin et al., 1999] Gutwin, C., Paynter, G.W., Witten, I.H., Nevill-Manning, C.G., and Frank, E. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 27, 81-104, 1999.
  [Halliday and Hasan, 1976] Halliday, M.A.K., and Hasan, R. Cohesion in English. London: Longman, 1976.
  [Jones and Paynter, 2001] Jones, S. and Paynter, G.W. Human evaluation of Kea, an automatic keyphrasing system. First ACM/IEEE-CS Joint Conference on Digital Libraries, Roanoke, Virginia, June 24-29, 2001, ACM Press, 148-156.
  [Jones and Paynter, 2002] Jones, S. and Paynter, G.W. Automatic extraction of document keyphrases for use in digital libraries: Evaluation and applications. Journal of the American Society for Information Science and Technology (JASIST), 53 (8), 653-677, 2002.
  [Leung and Kan, 1997] Leung, C.-H., and Kan, W.-K. A statistical learning approach to automatic indexing of controlled index terms. Journal of the American Society for Information Science, 48, 55-66, 1997.
  [Manning and Schütze, 1999] Manning, C.D., and Schütze, H. Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: MIT Press, 1999.
  [Morris and Hirst, 1991] Morris, J., and Hirst, G. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1), 21-48, 1991.
  [Turney, 1999] Turney, P.D. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.
  [Turney, 2000] Turney, P.D. Learning algorithms for keyphrase extraction. Information Retrieval, 2, 303-336, 2000.
  [Turney, 2001] Turney, P.D. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), Freiburg, Germany, pp. 491-502, 2001.
  [van Rijsbergen, 1979] van Rijsbergen, C.J. Information Retrieval. 2nd edition. London: Butterworths, 1979.
  [Witten et al., 1999] Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C. and Nevill-Manning, C.G. KEA: Practical automatic keyphrase extraction. Proceedings of Digital Libraries 99 (DL'99), pp. 254-256. ACM Press, 1999.
  [Witten et al., 2000] Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., and Nevill-Manning, C.G. KEA: Practical Automatic Keyphrase Extraction. Working Paper 00/5, Department of Computer Science, The University of Waikato, 2000.
relation_type: []
relation_uri: []
reportno: ~
rev_number: 12
series: ~
source: ~
status_changed: 2007-09-12 16:48:30
subjects:
  - comp-sci-stat-model
  - comp-sci-lang
  - comp-sci-mach-learn
succeeds: ~
suggestions: ~
sword_depositor: ~
sword_slug: ~
thesistype: ~
title: Coherent Keyphrase Extraction via Web Mining
type: confpaper
userid: 2175
volume: ~