Learning to Extract Keyphrases from Text

Turney, Peter (1999) Learning to Extract Keyphrases from Text. [Departmental Technical Report] (Unpublished)


Many academic journals ask their authors to provide a list of about five to fifteen key words, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a surprisingly wide variety of tasks for which keyphrases are useful, as we discuss in this paper. Recent commercial software, such as Microsoft's Word 97 and Verity's Search 97, includes algorithms that automatically extract keyphrases from documents. In this paper, we approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for this task. The third set of experiments examines the performance of GenEx on the task of metadata generation, relative to the performance of Microsoft's Word 97. The fourth and final set of experiments investigates the performance of GenEx on the task of highlighting, relative to Verity's Search 97. The experimental results support the claim that a specialized learning algorithm (GenEx) can generate better keyphrases than a general-purpose learning algorithm (C4.5) and the non-learning algorithms that are used in commercial software (Word 97 and Search 97).
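The framing in the abstract, where a document is a set of phrases to be labeled as positive or negative examples of keyphrases, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the stop-word list, the three-word phrase limit, the feature names, and the helper functions `candidate_phrases` and `labeled_examples` are all hypothetical. It does not reproduce GenEx or the C4.5 experiments; it only shows how labeled training examples for such a classifier might be built from author-supplied keyphrases.

```python
import re

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "for", "as", "we", "this"}


def candidate_phrases(text, max_len=3):
    """Return {phrase: (count, first_word_index)} for n-grams of 1..max_len words.

    Phrases that start or end with a stop word are skipped.
    """
    words = re.findall(r"[a-z]+", text.lower())
    stats = {}
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            phrase = " ".join(gram)
            count, first = stats.get(phrase, (0, i))
            stats[phrase] = (count + 1, first)
    return stats


def labeled_examples(text, author_keyphrases, max_len=3):
    """Build (phrase, features, label) triples for a supervised learner.

    A phrase is a positive example if it matches an author-supplied keyphrase
    (case-insensitive exact match), and a negative example otherwise.
    """
    gold = {k.lower() for k in author_keyphrases}
    n_words = max(1, len(re.findall(r"[a-z]+", text.lower())))
    examples = []
    for phrase, (count, first) in candidate_phrases(text, max_len).items():
        features = {
            "count": count,                             # occurrences in the document
            "relative_first_position": first / n_words, # earlier phrases score lower
            "length": len(phrase.split()),              # number of words in the phrase
        }
        examples.append((phrase, features, phrase in gold))
    return examples


if __name__ == "__main__":
    doc = ("Keyphrase extraction is a learning task. "
           "We treat keyphrase extraction as classification.")
    for phrase, features, label in labeled_examples(doc, ["keyphrase extraction"]):
        print(label, phrase, features)
```

The resulting triples could be passed to any general-purpose classifier, such as a decision tree learner. Exact string matching against the author's keyphrases is a simplification; a real system would probably normalize word forms first.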

Item Type: Departmental Technical Report
Keywords: machine learning, summarization, indexing, keywords, keyphrase extraction
Subjects: Electronic Publishing > Archives
Computer Science > Language
Computer Science > Machine Learning
Computer Science > Statistical Models
ID Code: 1802
Deposited By: Turney, Peter
Deposited On: 17 Sep 2001
Last Modified: 11 Mar 2011 08:54

