Corpus Linguistics

Author: Tony McEnery

Abstract

This paper discusses the matching of corpora to research questions. Programmes for annotating a corpus are examined, as well as the use of corpora in teaching. Some useful links are provided for those interested in using corpora.

Assembling a corpus

A language corpus (pl. corpora) is a collection of language data selected according to some organising principle. This organising principle is enshrined in the sampling frame which is used to select materials for the corpus. The materials gathered are typically stored in the form of machine readable texts to facilitate rapid searching of the data.

For example, the sampling frame may be newspaper materials of late twentieth century Britain. This sampling frame helps the builder of the corpus to decide what may or may not be included in the corpus. It also helps the eventual end user of the corpus: if there is a mismatch between the sampling frame of the corpus and the research question the user wishes to investigate, then the corpus will not allow the pursuit of that research question. So, for example, if a researcher looked at our putative corpus of late twentieth century newspapers and was interested in exploring academic discourse practices, the newspaper corpus would not allow that research question to be pursued directly.
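One way to picture a sampling frame is as a test applied to each candidate text. The following is a minimal sketch only: the Text record, its metadata fields, the sample titles and the 1970-1999 date range standing in for "late twentieth century" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Text:
    """Hypothetical metadata record for a candidate text."""
    title: str
    genre: str    # e.g. "newspaper", "academic"
    country: str
    year: int


def in_sampling_frame(text: Text) -> bool:
    """Assumed frame: British newspaper material dated 1970-1999."""
    return (
        text.genre == "newspaper"
        and text.country == "UK"
        and 1970 <= text.year <= 1999
    )


# Invented candidates: only the first one satisfies the frame.
candidates = [
    Text("Budget report", "newspaper", "UK", 1994),
    Text("Journal article on syntax", "academic", "UK", 1995),
    Text("Election coverage", "newspaper", "UK", 1959),
]

corpus = [t for t in candidates if in_sampling_frame(t)]
print([t.title for t in corpus])
```

The academic text is excluded by the frame, which is why a corpus built this way could not directly support a study of academic discourse.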

The nature of the research question

Assuming that the researcher's sampling frame and the corpus match, the ease with which a corpus can be used depends, at present, largely on the nature of the research question. Corpus retrieval tools, such as concordance programmes, are typically very good at searching for words and groups of words. Hence if one's research question can be expressed via simple lexical searches, the corpus will prove a relatively easy-to-use research tool. However, where linguistic processing is required in order to exploit a corpus, the research question may be much harder to pursue using corpus data. Some corpora include linguistic analyses, encoded in the corpus by means of a markup language such as SGML. Such corpora are called annotated corpora. Corpus annotation may aid the process of corpus exploitation. For example, consider a researcher who is interested in looking for all modal verbs in a corpus of modern British English. Rather than looking for all of the word forms associated with modal verbs in English, the researcher can instead look for all words which have the part of speech 'modal verb' associated with them. Corpus annotation comes in many forms, and in principle any linguistic analysis could be encoded in a corpus.
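The difference between searching by word form and searching by annotation can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any particular concordancer: the sample sentences, the tag labels (with "MD" standing for 'modal verb') and the function names are all invented here, and a real annotated corpus would store its tags in markup rather than in a Python list.

```python
# A tiny hand-tagged sample as (word, tag) pairs. The tag labels are an
# assumption for this sketch; "MD" is used here to mark a modal verb.
tagged = [
    ("You", "PRP"), ("must", "MD"), ("leave", "VB"), ("now", "RB"), (".", "."),
    ("She", "PRP"), ("can", "MD"), ("open", "VB"), ("the", "DT"),
    ("can", "NN"), (".", "."),
]


def find_by_form(tokens, forms):
    """Return positions of tokens whose word form is in `forms`."""
    return [i for i, (word, _) in enumerate(tokens) if word.lower() in forms]


def find_by_tag(tokens, tag):
    """Return positions of tokens annotated with `tag`."""
    return [i for i, (_, t) in enumerate(tokens) if t == tag]


def kwic(tokens, hits, width=2):
    """Format each hit as a key-word-in-context line."""
    lines = []
    for i in hits:
        left = " ".join(w for w, _ in tokens[max(0, i - width):i])
        right = " ".join(w for w, _ in tokens[i + 1:i + 1 + width])
        lines.append(f"{left} [{tokens[i][0]}] {right}")
    return lines


form_hits = find_by_form(tagged, {"must", "can"})  # also matches the noun "can"
tag_hits = find_by_tag(tagged, "MD")               # modal uses only

for line in kwic(tagged, tag_hits):
    print(line)
```

In this sample the word-form search returns three hits, one of which is the noun "can", while the tag search returns only the two modal uses. The tag search also spares the researcher from having to list every modal word form in advance.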

Annotating a corpus

What happens if a researcher finds a corpus with the right sampling frame but it is unannotated? Where the annotation is necessary to the pursuit of the research question, the researcher may add the annotations themselves or seek to add them using a computer program which can undertake the annotation automatically. At present, some forms of automated linguistic annotation, such as part-of-speech analysis, are available for a wide range of languages. These programs are typically quite accurate, reporting success rates of 90% or more. Other forms of annotation, such as automated word-sense analysis, are now becoming available also. However, for many forms of linguistic annotation, automated analysis will not be readily available in the foreseeable future. Hence, to pursue certain research questions with specific unannotated corpora, the researcher may have to invest a great deal of time in annotating the corpus.
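A success rate of this kind is commonly understood as the proportion of tokens that receive the correct tag when the program's output is compared with a hand-checked sample. The sketch below assumes that reading. The function name, the tag labels and the two short tag sequences are invented for illustration.

```python
def tagging_accuracy(gold, predicted):
    """Share of tokens whose automatic tag matches the hand-checked tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sample")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Invented example: five tokens, one of which the program tags wrongly.
gold = ["PRP", "MD", "VB", "DT", "NN"]
auto = ["PRP", "MD", "VB", "DT", "VB"]

accuracy = tagging_accuracy(gold, auto)
print(f"{accuracy:.0%}")
```

Checking a sample of the output in this way gives the researcher a rough idea of how much manual correction an automatically annotated corpus may still need.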

So far I have only considered monolingual corpora. Yet corpus data is becoming increasingly multilingual. More and more corpus data is becoming available in an ever wider range of languages. Corpora are also being developed which encode an original text and its translation into one or more other languages. These so-called parallel corpora are increasingly being used in contrastive language studies, language pedagogy and translation studies.
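To give a rough idea of how a parallel corpus might be queried, the sketch below stores sentence-aligned pairs and returns each source sentence containing a search word together with its translation. The data, the function name and the simple punctuation stripping are assumptions for illustration. Real parallel corpora need a separate alignment step, which is not shown here.

```python
# Invented sentence-aligned pairs: (English original, French translation).
parallel = [
    ("The cat sleeps.", "Le chat dort."),
    ("I like the cat.", "J'aime le chat."),
    ("It is raining.", "Il pleut."),
]


def aligned_hits(pairs, word):
    """Return (source, translation) pairs whose source contains `word`."""
    word = word.lower()
    hits = []
    for source, translation in pairs:
        # Naive tokenisation: split on whitespace, strip common punctuation.
        tokens = [t.strip(".,;:!?").lower() for t in source.split()]
        if word in tokens:
            hits.append((source, translation))
    return hits


for source, translation in aligned_hits(parallel, "cat"):
    print(f"{source}  |  {translation}")
```

A student or translator can then inspect how the search word is rendered in each aligned translation.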

Using corpora for teaching

The use of corpora in teaching allows students to test linguistic hypotheses against large bodies of naturally occurring language data. The capacity to do this may, for example, guide students in developing a description of some linguistic feature, studying language variation or contrasting two languages. Beyond linguistics, students of modern languages may find that corpora have a role to play in the language teaching classroom, where corpora can be used as a guide to curriculum planning or act as a resource for students.

The use of corpora in the classroom presents a challenge: does one teach students to exploit corpora so that they can undertake discovery learning, or does the teacher exploit the corpus in order to inform their own teaching? While taking both approaches in combination is a possibility, many teachers simply exploit corpora to teach. However, where teaching students to exploit corpora is the preferred option, the use of corpora is best introduced early in the curriculum, so that students can use corpora on their own initiative throughout their degree scheme. If teaching students to exploit corpora, one should at a minimum teach them how to select corpus data to match a research question and how to use corpus retrieval software to interrogate a corpus appropriately. It may also be desirable to teach students how to construct and annotate their own corpora.

Bibliography

There are now a number of basic textbooks covering corpus-based approaches to the study of language. Of the following suggestions, Biber et al. is of most interest to those wishing to pursue an approach to corpus linguistics based upon Biber's multi-feature/multi-dimension approach to corpus data. Kennedy is of most interest to those readers interested in the use of corpora in ELT. McEnery and Wilson is probably of most interest to those approaching corpora from computational linguistics, or readers with an interest in multilingual corpora. Meyer's book is slim, informative and a good introduction to English corpus linguistics. Finally, Stubbs' volume is a neo-Firthian account of the use of corpus data in linguistics.

Biber, D., S. Conrad & R. Reppen (1998) Corpus Linguistics: investigating language structure and use. Cambridge: CUP.

Kennedy, G.D. (1998) An Introduction to Corpus Linguistics. London: Longman.

McEnery, T. & A. Wilson (2001, 2nd ed.) Corpus Linguistics. Edinburgh: Edinburgh University Press.

Meyer, C. (2002) English Corpus Linguistics: an introduction. Cambridge: Cambridge University Press.

Stubbs, M. (1996) Text and Corpus Analysis: computer assisted studies of language and culture. Oxford: Blackwell.

Related links

Two generally useful URLs to follow are:
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/contents.htm
http://devoted.to/corpora
The first of these URLs is an on-line accompaniment to the McEnery & Wilson Corpus Linguistics textbook. The second is a collection of useful URLs covering a wide range of topics of interest to people using, or interested in using, corpora.

For those interested in exploring the role of corpora in teaching, Tim Johns' data driven learning page (http://web.bham.ac.uk/johnstf/timconc.htm) is a valuable resource. Michael Barlow's page (http://www.ruf.rice.edu/~barlow/corpus.html) is also valuable, both because it contains some information regarding teaching and language corpora and because it contains a host of links to corpora in a number of languages, amongst other things.

Mike Scott's homepage is a good place to visit to explore a popular concordancer, WordSmith (http://www.liv.ac.uk/~ms2928/homepage.html) while the possibility of using the Sara programme, released with the British National Corpus, is best explored by visiting http://www.hcu.ox.ac.uk/BNC/.
