Transcription conventions

Here we describe some of the general decisions we have taken in transcribing spoken L2 French and L2 Spanish using the CHAT system developed by the CHILDES project. We also describe below some of the adaptations we have made to the CHAT system for L2 data. Detailed guides to the transcription of L2 French and L2 Spanish using CHAT conventions have been produced, and the most recent versions are available on request from the research team.


All transcripts and sound files have been anonymised to eliminate personal details so that individual learners are not identifiable.

Filenames and headers

Filenames include a single capital letter code for the task, a 3-digit code for the participant, a single lower case letter code for the data collection occasion, plus the initials of the researcher who administered the task. Thus, for example, the filename “O114bKMcM” refers to the Oral Interview task undertaken by Participant 114 during the second data collection round (Visit 1), and administered by researcher Kevin McManus.
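The naming scheme can be checked mechanically. The following is a minimal sketch in Python, assuming the four components always appear in the order described above; the function name parse_filename and the dictionary keys are our own illustrative choices, not part of CHAT or CLAN.

```python
import re

# Pattern built from the description above: one capital letter (task),
# three digits (participant), one lower-case letter (occasion),
# then the researcher's initials (letters, mixed case allowed).
FILENAME_RE = re.compile(r"^([A-Z])(\d{3})([a-z])([A-Za-z]+)$")


def parse_filename(name):
    """Split a transcript filename into its four components (hypothetical helper)."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    task, participant, occasion, researcher = match.groups()
    return {
        "task": task,
        "participant": participant,
        "occasion": occasion,
        "researcher": researcher,
    }


# Filename as it is spelled in the @Media header of the example transcript.
print(parse_filename("O114bKMcM"))
```

A filename that does not fit the pattern raises a ValueError, which makes stray or misnamed files easy to spot when processing a whole directory.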

Headers for all transcribed files follow CHAT conventions. Here is an example:

  @Begin
  @Languages: fra
  @Participants: 114 Participant, KMcM Investigator
  @ID: fra|langsnap|114||female|V1|interview|Participant||
  @ID: fra|langsnap|KMcM||male||interview|Investigator||
  @Media: O114bKMcM, audio
  @Date: 15-NOV-2011
  @Location: France
  @Situation: Oral interview
  @Transcriber: CD
  @Time Duration: 00:21:02
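For users who want to read these headers outside CLAN, the sketch below shows one way to collect them in Python. It assumes each header sits on a single line in the form "@Name: value"; the helper parse_headers is hypothetical, and the three @ID positions it prints (speaker code, sex, role) are read off the example above rather than taken from any project specification.

```python
def parse_headers(lines):
    """Collect '@Name: value' header lines into a dict of lists (hypothetical helper)."""
    headers = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("@"):
            continue
        # Split on the first colon only, so time values such as 00:21:02 survive.
        name, _, value = line[1:].partition(":")
        headers.setdefault(name.strip(), []).append(value.strip())
    return headers


sample = [
    "@Begin",
    "@Languages: fra",
    "@Participants: 114 Participant, KMcM Investigator",
    "@ID: fra|langsnap|114||female|V1|interview|Participant||",
    "@ID: fra|langsnap|KMcM||male||interview|Investigator||",
    "@Media: O114bKMcM, audio",
    "@Time Duration: 00:21:02",
]

headers = parse_headers(sample)

# Each @ID value is a pipe-separated record; empty slots stay as empty strings.
for value in headers["ID"]:
    fields = value.split("|")
    print(fields[2], fields[4], fields[7])
```

Headers that can occur more than once, such as @ID, are kept as lists so that no speaker record is overwritten.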

Orthographic Transcription

The data has been transcribed orthographically. This is necessary in order to run the morphosyntactic parsers provided by CHILDES/CLAN for French and Spanish on the completed transcripts. In the interests of automatic part-of-speech (POS) tagging, the transcription at times deviates somewhat from the actual phonological shape of the words produced by learners. However, other researchers interested in, e.g., L2 phonology can refer to the sound files and add their own level of coding to the transcripts provided.

Limited use of the Error Tier (%err)

We have not used an error tier consistently. It was not necessary for our research agenda, as the syntactic and morphological errors made by our L2 learners can be retrieved more systematically from the POS-tagged output. Where the tier has been used, it appears as in the following example:

*114: et euh la plupart de mes élèves sont des garçons parce+que c' est un lycée technique euh.
*114: et ils sont très difficiles .
*114: et donc c' est difficile de faire les exercices oraux [*] parce+que ils veulent toujours parler en français.
%err: oraux = orals
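Because the error tier is used only occasionally, it can be useful to list the places where it does occur. The sketch below is one possible approach in Python, assuming that a %err line always annotates the nearest preceding speaker line; the function errors_with_context is a hypothetical helper, not a CLAN command.

```python
def errors_with_context(lines):
    """Pair each %err tier with the main-tier utterance it follows (hypothetical helper)."""
    pairs = []
    last_utterance = None
    for line in lines:
        if line.startswith("*"):
            # Main tier: remember it in case an error tier follows.
            last_utterance = line
        elif line.startswith("%err:") and last_utterance is not None:
            pairs.append((last_utterance, line[len("%err:"):].strip()))
    return pairs


sample = [
    "*114: et ils sont très difficiles .",
    "*114: et donc c' est difficile de faire les exercices oraux [*] parce+que ils veulent toujours parler en français.",
    "%err: oraux = orals",
]

for utterance, error in errors_with_context(sample):
    print(error, "<-", utterance)
```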


All pauses are indicated with (.) and have not been timed.


Overlapping of speech turns in the written transcripts is indicated using standard CHAT conventions.

Mean Length of Utterance

The speech turns for the L2 learner(s) in every file have been separated into distinct utterances as per CHILDES conventions, so that MLU calculations can be carried out. However, this has not been done for the researcher speech turns, so MLU calculations on the researchers' utterances will not be accurate.
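As a rough illustration of why utterance segmentation matters, the Python sketch below computes a mean utterance length in orthographic words for a single speaker tier. It is a simplification under stated assumptions: it counts words rather than morphemes, it does not exclude retraced material, and it is not a substitute for CLAN's own MLU program. The function name mean_length_in_words is our own.

```python
import re


def mean_length_in_words(lines, speaker):
    """Mean number of word tokens per utterance for one speaker (hypothetical helper)."""
    prefix = f"*{speaker}:"
    lengths = []
    for line in lines:
        if not line.startswith(prefix):
            continue
        text = line[len(prefix):]
        # Drop bracketed codes such as [*] or [/] and untimed pauses (.)
        text = re.sub(r"\[[^\]]*\]|\(\.\)", " ", text)
        # Keep only tokens containing a letter; bare terminators like "." are discarded.
        tokens = [t for t in text.split() if re.search(r"[^\W\d_]", t)]
        lengths.append(len(tokens))
    return sum(lengths) / len(lengths) if lengths else 0.0


sample = [
    "*114: et ils sont très difficiles .",
    "*KMcM: d'accord .",
    "*114: oui (.) c' est [/] c' est ça .",
]

print(mean_length_in_words(sample, "114"))
```

Only the learner tier is measured here, in line with the caveat above that researcher turns have not been segmented into utterances.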

L2 adaptations

A number of codes have been added to the CHAT system for the specific purposes of second language research. These codes cover the following issues:

Use of L1 English (complete utterances and/or codeswitching at word or phrase level):
  Codeswitching at word or phrase level. This is marked by adding "@s:" followed by an individual code corresponding to the word class (e.g. noun (d), verb (v); see the Transcription Guidelines document for detail):
    *P63: y cómo se dice scuba@s:d diving@s:v ?
  Complete utterances. These are marked between square brackets starting with the code "^ eng:":
    *P04: [^ eng: I don't know what that means ].
Direct learner imitations of investigator utterances, marked with "@g" at the end of each imitated word:
  *P51: no están en el sol están en shade@s:d.
  *MJA: la sombra.
  *P51: la@g sombra@g.
Use by learners of indeterminate forms and idiosyncratic neologisms, marked with "@n" at the end of the word:
  *P54: um ehm detrás de lo eh pictura@n eh hay [/] hay un número de turistas .
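The L2 codes listed above can be retrieved with simple pattern matching. The following Python sketch is offered as an illustration only: the regular expressions are inferred from the examples in this section, and the names find_l2_markers, english_utterance and the labels in LABELS are hypothetical.

```python
import re

# word@s:<class> (codeswitch), word@g (imitation), word@n (neologism)
MARKER_RE = re.compile(r"(\w+)@(s:(\w+)|g|n)\b")
# Complete English utterances, e.g. [^ eng: ... ]
ENG_RE = re.compile(r"\[\^\s*eng:\s*(.*?)\s*\]")

LABELS = {"s": "codeswitch", "g": "imitation", "n": "neologism"}


def find_l2_markers(utterance):
    """Return (word, label, word_class) for each marked word; word_class is None unless @s:."""
    found = []
    for match in MARKER_RE.finditer(utterance):
        word, marker, word_class = match.groups()
        found.append((word, LABELS[marker[0]], word_class))
    return found


def english_utterance(utterance):
    """Return the text of a complete L1 English utterance, or None if there is none."""
    match = ENG_RE.search(utterance)
    return match.group(1) if match else None


print(find_l2_markers("*P63: y cómo se dice scuba@s:d diving@s:v ?"))
print(find_l2_markers("*P51: la@g sombra@g."))
print(english_utterance("*P04: [^ eng: I don't know what that means ]."))
```

Counting the tuples returned per file gives a quick profile of how often a learner falls back on English, imitates the investigator, or coins a form.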
