Linguistics and the Arts and Humanities Data Service

Author: Martin Wynne

Abstract

The explosion of access to electronic texts and information about languages and cultures on the Internet offers wonderful new resources for linguists. However, the texts available often present themselves to the researcher as a bewildering choice of unfiltered data. The Oxford Text Archive (OTA) is centrally funded as the centre of expertise in the creation and use of electronic texts for languages, literature and linguistics in the UK academic community, as part of the Arts and Humanities Data Service (AHDS). This paper describes the ways in which the OTA (http://www.ota.ahds.ac.uk) is working to improve the service it provides for people working in linguistics in the UK Higher and Further Education communities. The AHDS is a UK national service funded by the Joint Information Systems Committee (JISC) and the Arts and Humanities Research Board (AHRB). Organised via an Executive at King's College London and five service providers from various Higher Education institutions, the AHDS aids the discovery, creation and preservation of digital collections in the arts and humanities.

This article was added to our website on 20/12/02 at which time all links were checked. However, we cannot guarantee that the links are still valid.


This paper was originally presented at the Setting the Agenda: Languages, Linguistics and Area Studies in Higher Education conference, 24-26 June 2002.

Linguistics and digital data in the UK

Linguistics is quite advanced in the use of information technology, and particularly in the exploitation of electronic texts for empirical study of language. However, while there are undoubtedly areas of high expertise in sub-fields such as corpus linguistics and in some branches of English language teaching, there is a need for the resources, techniques and best practices of these advanced users of digital resources to be disseminated more widely in the community, to those less experienced in the technical aspects.

The OTA does not see its role as an evangelist for the use of electronic data such as language corpora. The recent emergence of corpus linguistics is in general well recognised and understood and researchers appear to be aware of the opportunities open to them. A more useful role, and that which the OTA is trying to fill, is to help with the technical aspects of creating, finding, evaluating and using resources for researchers who are coming to this type of technology for perhaps the first time, or who remain relatively inexperienced in some aspect.

At the same time, in order to be able to advise on best practice, it is necessary for the OTA and its staff to keep up with and participate in the development of this practice, and to be at the cutting edge of new developments. While these two roles may be separated conceptually in order to clarify the situation, in practice the situation in relation to any given project is likely to be a rather more complex mixture of the two.

The basic situation in which electronic text resources are used (which the OTA mainly caters for now) is the following.

  1. Text corpora and electronic texts are stored in the archive.
  2. Users can consult the catalogue, identify what they are interested in using and then download it to their computer.
  3. The user then has to work out how to store and manipulate the files, find software tools to work with them, install these tools and work out how to get the tools to work with the corpus, probably by tweaking both corpus and software.

Such a model involves a lot of work and a lot of expertise, simply in order to get something simple like a concordance from a text or corpus. Given the advances in communications technology which now mean that such services could be carried out on-line, this model looks increasingly complex and old-fashioned.
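To make concrete what "something simple like a concordance" involves, the following is a minimal keyword-in-context (KWIC) sketch in Python. It is an illustration only, not a tool provided by the OTA: the function name `concordance`, the `width` parameter and the sample sentence are all assumptions introduced here, and the sketch treats the text as a plain string with no markup.

```python
import re


def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of keyword.

    Hypothetical helper for illustration: matches whole words,
    ignores case, and shows `width` characters of context each side.
    """
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, standing in for a downloaded file.
sample = (
    "The corpus is stored in the archive. "
    "A user downloads the corpus and then searches the corpus with a tool."
)

for line in concordance(sample, "corpus", width=20):
    print(line)
```

Even this small example presupposes a working Python installation and a text already stripped of encoding and markup, which is part of the expertise the download model demands of each user.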

More advanced, cutting-edge applications are being developed whereby texts, software to run on texts and analyses of texts are being delivered on-line. The static, carefully-crafted corpus is also under threat from the wealth of texts freely available on the Internet.

There is a complex (but not unusual) situation, where there are large sections of the linguistics communities starting to use digital data for the first time, and in many cases coming in with a low level of computational expertise. At the same time the more experienced and proficient practitioners are moving ahead and moving in new directions. The old 'corpus on the desktop' model is therefore both new and exciting for some sections of the community, and old hat for others. This can in some cases be explained as differential speed of take-up of new practices and techniques, but more often it is best seen as different technologies being appropriate for different sorts of tasks. While basic analysis of a corpus for simple concordances, collocations and statistical properties is not new or cutting edge, it is by no means obsolete, and the introduction of such techniques into new areas of language study, and for new purposes, is constantly bringing new insights. On the other hand, at the cutting edge, things have moved on and researchers are developing new ways to capture data and extract meaningful information from it by automatic means.
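As a rough illustration of the kind of basic collocation analysis mentioned above, the sketch below counts the words that occur within a fixed window around a node word. It reports raw co-occurrence counts only, not a statistical association measure; the function name `collocates`, the `span` parameter, the simple tokenisation and the sample text are assumptions made for this example.

```python
import re
from collections import Counter


def collocates(text, node, span=2):
    """Count words occurring within `span` tokens of each occurrence of node.

    Hypothetical helper: lower-cases the text and splits it into
    runs of letters and apostrophes, which is a deliberately crude tokeniser.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Tokens to the left and right of the node, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


# Invented sample text.
sample_text = "The text archive holds a text corpus and the text archive is free."

for word, freq in collocates(sample_text, "text", span=1).most_common():
    print(word, freq)
```

A realistic study would run this over a full corpus and normalise the counts against overall word frequencies, but the principle is the same.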

The OTA therefore finds itself in the comfortable position whereby it provides help from a fairly basic level in finding, creating and using language resources, while at the same time researching and developing cutting edge applications and services.

In the light of this situation, the following sections detail the services currently offered by the OTA, plus a short consideration of future plans.

The Archive

The Oxford Text Archive archives, preserves and distributes high-quality digital resources created by and useful for the UK academic community. The holdings amount to several thousand electronic resources in a variety of languages, and include electronic editions of works by individual authors, standard reference works, and a range of language corpora. All of the OTA holdings are distributed for free. The catalogue of holdings and information on depositing and downloading resources are all available at http://www.ota.ahds.ac.uk.

The OTA does not produce digital resources, and we rely upon deposits from the wider community as the primary source of high-quality materials.

The Oxford Text Archive is keen to increase its collection of language resources. There are several reasons why researchers might want to deposit their resources in the archive:

  • Preservation: there are procedures in place for the long-term preservation of resources, taking advantage of the facilities at Oxford University Computing Services for the physical preservation of the data, and following best practice in the digital archives community for the future usability of the data;
  • Resource Discovery: resources deposited with the OTA are entered in the archive's catalogue, so that more people can find out about them; we are also allowing the catalogue to be shared with other portals and search engines which serve people looking for academic resources;
  • Distribution: the resource is advertised and made available to others through the OTA website, subject to licensing conditions;
  • Non-exclusivity: depositing the resource with us will not in any way infringe the depositor's rights to do what they want with it.

Researchers who have resources which they may wish to deposit are encouraged to get in touch with the OTA.

Advisory services

The OTA can be consulted by UK academics for free on a range of topics concerning technical aspects of creating, developing and using electronic resources. As well as offering a general advisory service, we act as assessors for the Arts and Humanities Research Board (AHRB) on the technical aspects of research grant proposals. Researchers who are planning to make an application to the AHRB for funding for a project which involves the creation of an electronic resource can get in touch with OTA for advice on planning the project, making the proposal and completing the application form.

The OTA plays a key role in the AHDS Digitisation Workshops, and runs its own events to spread best practice in developing language resources.

Guides to Good Practice

The Arts and Humanities Data Service has produced a series of Guides to Good Practice, to help establish and disseminate best practice in the creation, preservation, distribution and use of electronic data in arts and humanities research. The following guides are produced by the Oxford Text Archive. Print versions are published by Oxbow Books, and all are available free on the Web at http://www.ota.ahds.ac.uk/.

Creating Electronic Resources, Alan Morrison, Michael Popham and Karen Wikander.

The aim of this guide is to take users through the basic steps involved in creating and documenting an electronic text or similar digital resource. Detailed guidance is given on the issues of document analysis, digitisation, markup, the use of SGML (Standard Generalized Markup Language), intellectual property rights, XML (eXtensible Markup Language) and the TEI (Text Encoding Initiative) guidelines, plus documentation and metadata. The final chapter offers a step-by-step summary of the important issues in an electronic text creation project. This title was published in 2000.

Developing Linguistic Corpora, Martin Wynne (ed.)

This Guide offers a practical overview of the key issues in designing and constructing a linguistic corpus. It covers issues relating to design, representativeness, metadata, annotation, multilingual text encoding, archiving and distribution. The guide outlines the choices that a researcher has to make when designing a corpus, capturing the data, encoding, storing and distributing the resource. The chapters deal with different stages of the corpus design and building process, and with different types of data, such as speech and multilingual corpora. The Guide is designed for those working with language who want to build or develop their own corpus data. It is also of interest to researchers who would profit from understanding how corpora are constructed, so that they can better evaluate the existing resources available to them. The reader is expected to know about language and linguistics, but no computational or corpus linguistic knowledge is assumed. The authors are Lou Burnard, Pernilla Danielsson, Patrick Hanks, Geoffrey Leech, Tony McEnery, John Sinclair, Wolfgang Teubert, Paul Thompson and Martin Wynne. This title is forthcoming in 2002.

Finding and Using Electronic Texts, Ylva Berglund, Greg Colley, Alan Morrison, Michael Popham, Rowan Wilson and Martin Wynne.

A practical guide to discovering, evaluating and using electronic text resources. Issues covered include:

  • a taxonomy of electronic text formats
  • a survey of available sources
  • how to construct queries for portals and search engines
  • a guide to using texts, covering reading, searching, printing, citing, sharing, copying, annotating and analysis
  • evaluating the 'fitness for purpose' of available texts.

This title is forthcoming in 2002.

Future plans

A new service under development at the OTA will allow users to make a virtual corpus tailored for their own needs.

In the traditional model, the corpus is carefully prepared, by taking a sample of the population of texts of which it aims to be representative, and is encoded and annotated in ways which make it amenable for linguistic research. The value and the reusability of the resource are therefore dependent on a bundle of factors, such as the validity of the design criteria, the quality and availability of the documentation, the quality of the metadata and the validity and generalisability of the research goals of the resource creator.

A more general model might be the archive of electronic texts whereby the user creates a collection of texts (the corpus) on an ad hoc basis according to the values of one or more metadata categories, such as 'all 17th century English fiction' or 'all Bulgarian newspaper texts'. In this way a large archive, which need not contain only corpora but may also hold other electronic texts, can be exploited as a corpus linguistic resource. It is considered that this type of virtual corpus would be of great value to the OTA's users, and a system that will make this possible is under development.
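The idea of selecting a virtual corpus by metadata values can be sketched as follows. This is not a description of the system under development at the OTA: the `TextRecord` fields, the function `select_virtual_corpus` and the four catalogue entries are hypothetical, chosen only to mirror the two example queries given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical catalogue entry; a real catalogue would hold more fields."""
    identifier: str
    language: str
    genre: str
    century: int


def select_virtual_corpus(catalogue, **criteria):
    """Return the records whose metadata match every given criterion."""
    return [
        record
        for record in catalogue
        if all(getattr(record, field) == value for field, value in criteria.items())
    ]


# Invented catalogue entries for illustration.
catalogue = [
    TextRecord("t1", "English", "fiction", 17),
    TextRecord("t2", "English", "drama", 17),
    TextRecord("t3", "Bulgarian", "newspaper", 20),
    TextRecord("t4", "English", "fiction", 19),
]

# 'All 17th century English fiction'
corpus = select_virtual_corpus(catalogue, language="English", genre="fiction", century=17)
print([record.identifier for record in corpus])
```

The usefulness of such a selection depends entirely on the consistency of the metadata across the archive, which is why the quality of cataloguing matters as much as the texts themselves.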

With the forthcoming first version the user will be able to select texts from the archive to make a virtual corpus.

Summary: Services offered by the OTA

Services to users:

  • free access to an extensive collection of high-quality electronic resources;
  • expert information on text availability and usefulness.

Services to data creators and depositors:

  • free archival and distribution management service;
  • facilitating long-term preservation and reusability of electronic resources;
  • promotion of good practice through publications and events.

Services to funding bodies and other agencies:

  • maximising investment in resource creation by ensuring usability and accessibility of resources;
  • providing expert technical advice to those seeking funding;
  • expertise in the development and application of key standards.