French Learner Language Oral Corpora flloc

CHILDES

Child Language Data Exchange System

This set of tools was originally conceived for first language acquisition data, but it has also been used, to a much more limited extent, by second language researchers. CHILDES tools have been used in more than 1300 published studies in fields ranging from computational linguistics, language disorders, narrative structures, literacy development and phonological analyses to adult sociolinguistics (for a useful introduction to CHILDES, see MacWhinney 1999).

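CHILDES consists of three integrated components:

  • CHAT, the transcription and coding conventions
  • CLAN, the programmes used to analyse the transcripts
  • the database of transcripts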

This introduction presents the tools only briefly; any researcher wishing to understand how they work will need to consult the manuals, which are also available in hardback (MacWhinney 2000).

The following section briefly outlines some of the defining characteristics of the transcription conventions and tagging programmes.

Headers

Every file has a set of ‘headers’ so that the computer can recognise the file. Anything that the researchers feel could potentially influence the findings (e.g. participants, elicitation task, date, researcher and transcriber) can be recorded here. Warnings are included in the file headers so that other researchers wishing to use the data know what decisions have been made. The headers used in each of the corpora in this database are specified for each corpus in the 'headers used' section. For an illustration, see the sample transcript in the following section.

Transcription conventions

The data is transcribed on to a main line as a set of standard language word forms. Each utterance is transcribed on to a separate line and starts with * followed by the speaker code; this line shows what was actually said, by contrast with lines starting with a % sign which contain linguistic tags. The CHAT manual (MacWhinney 2000) contains the codes that have been developed by various contributors addressing a wide variety of linguistic research agendas (including, for example, codes for Conversation Analysis and the analysis of written data). However, the system also allows new codes to be developed to address project-specific questions (e.g. see additional conventions).

A typical transcript will look like this:

@Begin
@Languages: fr
@Participants: L21 Subject, SAR Investigator
@ID: fr|flloc|L21||male|10F1||Subject|4 years, 1st FL, german|
@ID: fr|flloc|SAR|||||Investigator||
@Coder: EG
*SAR: [^ eng: student twenty one all right] .
*L21: le@n famille est arrivée dans le lac .
*SAR: très bien .
*L21: euh c' est un@n grand_mère +/.
*SAR: oui .
*L21: +, un@n mère et trois enfants .
*SAR: très bien .
*L21: et deux garçons et un@n fille .
*SAR: ok .
*L21: # euh deux enfants c' est à la pêche .
*SAR: pêche oui .
*L21: et le@n grand_mère et le garçon est paint@s:v .
*SAR: ok .
*L21: [la mère est] [//] # c' est une +/.
*SAR: [^ eng: xx read a book.] .
*L21: +, lire xx .
*SAR: lire ?
*L21: livre .
*SAR: lire un livre ok très bien .
*L21: # le@n grand_mère est # paint@s:v .
@end

For details of the meanings of the @ codes, please see 'additional conventions'.
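As a rough illustration of the layout described above, the following Python sketch separates the header lines (starting with @) from the main lines (starting with *) of a shortened copy of the sample transcript. The function name `parse_chat` is hypothetical and is not part of CLAN; the sketch assumes one header or utterance per line and ignores continuation lines.

```python
def parse_chat(text):
    """Split a CHAT-style transcript into headers and (speaker, utterance) pairs."""
    headers = []     # list of (name, value); @ID can repeat, so a dict is avoided
    utterances = []  # list of (speaker code, what was said)
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("@"):
            # Header: everything before the first colon is the header name.
            name, _, value = line[1:].partition(":")
            headers.append((name.strip(), value.strip()))
        elif line.startswith("*"):
            # Main line: speaker code, colon, then the utterance itself.
            speaker, _, said = line[1:].partition(":")
            utterances.append((speaker.strip(), said.strip()))
    return headers, utterances


# Shortened copy of the sample transcript above.
sample = """@Begin
@Languages: fr
@Participants: L21 Subject, SAR Investigator
@Coder: EG
*SAR: très bien .
*L21: euh c' est un@n grand_mère +/.
*SAR: oui .
*L21: +, un@n mère et trois enfants .
@end"""

headers, utterances = parse_chat(sample)
for speaker, said in utterances:
    print(speaker, "->", said)
```

Splitting on the first colon only means that colons inside an utterance, as in `[^ eng: ...]`, stay part of the utterance text.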

The following is a summary of the main codes used. For more detailed information on the coding system, please consult the CHILDES manual.

xxx unintelligible speech not treated as a word
xx unintelligible speech treated as a word
[?] best guess
( ) noncompletion of a word e.g. gar(çon)
[*] after certain mistakes on the main tier (see 'additional conventions')
# pause marked by silence
++ completion of utterance by learner’s partner or by researcher, e.g.
  • 32L: ils pêchent dans le # +...
  • ADR: ++ lac.
  • 32L: la mère regarde un livre.
+/. interruption by another speaker, with +, marking the speaker's self-completion of the interrupted utterance, e.g.
  • 22P: I go +/.
  • ADR: yes.
  • 22P: +, to school.
+//. self-interruption
+... trailing off
[/] retracing with NO change, e.g. <dans la parc> [/] dans la parc
[//] retracing with a change, e.g. <dans la parc> [//] dans le parc
_ joins the parts of a compound, e.g. grand_mère
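To show how the codes summarised above sit alongside the words of an utterance, here is a minimal Python sketch that strips them from a main line and returns the remaining words. The function `plain_words` is hypothetical and covers only the codes listed in this summary; CLAN applies its own, more complete rules.

```python
import re


def plain_words(utterance):
    """Return the words of a main line with the summarised codes removed."""
    # Drop retraced material together with its marker: <...> [/] or <...> [//]
    text = re.sub(r"<[^>]*>\s*\[/{1,2}\]", " ", utterance)
    # Drop any remaining bracketed codes such as [?] or [*]
    text = re.sub(r"\[[^\]]*\]", " ", text)
    words = []
    for token in text.split():
        # Skip pauses, linkers and terminators (++ +, +/. +//. +...) and
        # xxx, which the summary says is not treated as a word.
        if token == "#" or token.startswith("+") or token == "xxx":
            continue
        word = token.split("@")[0]                       # un@n -> un
        word = word.replace("(", "").replace(")", "")    # gar(çon) -> garçon
        word = word.rstrip(".?!")                        # trailing punctuation
        if word:
            words.append(word)
    return words


print(plain_words("# le@n grand_mère est # paint@s:v ."))
print(plain_words("<dans la parc> [//] dans le parc ."))
```

Note that `xx` is kept because the summary treats it as a word, and that compounds joined with `_` stay as a single token.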

Dependent tiers

In addition to the main line or tier, there can be multiple ‘dependent tiers’ that provide ancillary information. These tiers are preceded by a % sign to indicate that they are strings of tags. Researchers can decide how many dependent tiers are appropriate for their own purposes. In the Linguistic Development and Progression corpora, we have used a %err tier (minimally, to code some errors; see 'additional conventions'), a %mor tier (morphosyntax) and a %com tier (for any additional comments), though researchers using our data in the future are free to add other coding tiers depending on their interests.
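The relationship between a main line and its dependent tiers can be sketched in Python as follows. The function `group_tiers` is hypothetical, the utterance is shortened, and the %com text is invented purely for illustration; the sketch assumes each tier occupies a single line.

```python
def group_tiers(lines):
    """Attach each dependent tier (%...) to the main line (*...) that precedes it."""
    records = []
    for line in lines:
        if line.startswith("*"):
            speaker, _, text = line[1:].partition(":")
            records.append({"speaker": speaker.strip(),
                            "text": text.strip(),
                            "tiers": {}})
        elif line.startswith("%") and records:
            # Tier name is the text between % and the first colon, e.g. mor, err, com.
            name, _, value = line[1:].partition(":")
            records[-1]["tiers"][name.strip()] = value.strip()
    return records


lines = [
    "*48L: il y a une famille .",
    "%mor: pro:subj|il&MASC&_3S pro:y|y v:poss|avoir&PRES&3SV det|une&FEM&SING n|famille&_FEM .",
    "%com: invented comment, for illustration only",
    "*EMA: oui.",
]

records = group_tiers(lines)
for record in records:
    print(record["speaker"], sorted(record["tiers"]))
```

Keeping the tiers in a dictionary per utterance means that further tiers can be added later without changing the structure.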

CLAN can carry out lexical, morphosyntactic, discourse and phonological analyses, amongst others, depending on how the data has been coded. Searches for complex strings can then be carried out on the output of any of these analyses. CLAN programmes such as FREQ, KWAL and COMBO can give, for example, the frequency and linguistic context of interlanguage features, by searching for specific words, combinations of words and strings of particular morphological codes or ‘error’ codes. FREQPOS does a frequency analysis by sentence position and MLU calculates the mean length of utterance. In addition, the results of one analysis can be ‘piped’ through another analysis, allowing multiple analyses. For full details of the range of CLAN programmes, consult the manual.
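To give a feel for the kind of result that a frequency count or a mean length of utterance produces, here is a small Python sketch. It is not a reimplementation of FREQ or MLU: the function names are hypothetical, the length is counted in words (CLAN's own MLU may count morphemes and applies its own exclusion rules), and the word lists are hand-cleaned versions of utterances from the sample transcript.

```python
from collections import Counter


def word_frequencies(utterances, speaker):
    """Count how often each word occurs in one speaker's utterances."""
    counts = Counter()
    for who, words in utterances:
        if who == speaker:
            counts.update(words)
    return counts


def mean_length_in_words(utterances, speaker):
    """Average number of words per utterance for one speaker (0.0 if none)."""
    lengths = [len(words) for who, words in utterances if who == speaker]
    return sum(lengths) / len(lengths) if lengths else 0.0


# (speaker, words) pairs, cleaned by hand from the sample transcript.
utterances = [
    ("L21", ["et", "deux", "garçons", "et", "un", "fille"]),
    ("SAR", ["ok"]),
    ("L21", ["livre"]),
    ("SAR", ["lire", "un", "livre", "ok", "très", "bien"]),
]

freq = word_frequencies(utterances, "L21")
mlu_words = mean_length_in_words(utterances, "L21")
print(freq.most_common(3))
print(mlu_words)
```

Restricting both functions to one speaker mirrors the common need to analyse the learner's speech separately from the investigator's.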

The following short excerpt illustrates a transcript after morphosyntactic tagging and disambiguating:

*48L: il y a une famille et ils sont en vacances.
%mor: pro:subj|il&MASC&_3S pro:y|y v:poss|avoir&PRES&3SV det|une&FEM&SING n|famille&_FEM conj|et pro:subj|ils&MASC&_3P v:exist|être&PRES&3PV prep:art|en n|vacance&_FEM-_PL .
*EMA: oui.
*48L: euh près de Loch Ness et il y a une grand+mère les trois enfants et euh le mère de les enfants euh.
%mor: co|euh adv:place|près prep|de n:prop|Loch n:prop|Ness conj|et pro:subj|il&MASC&_3S pro:y|y v:poss|avoir&PRES&3SV det|une&FEM&SING n|grand+mère&_FEM det|les&PL adj|trois&_PL n|enfant-_PL conj|et co|euh det|le&MASC&SING n|mère&_FEM prep|de det|les&PL n|enfant-_PL co|euh.
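The tags in the excerpt above follow a visible pattern: a part-of-speech label, a | sign, a stem, and then features introduced by & or -. The Python sketch below splits tags on that pattern. The function names are hypothetical, and treating & and - alike as feature separators is a simplifying assumption that would mis-split any stem that itself contains a hyphen.

```python
import re


def parse_mor_token(token):
    """Split one %mor tag into part of speech, stem and features."""
    pos, _, rest = token.partition("|")
    # Assumption: everything after the first & or - is a feature.
    parts = re.split(r"[&-]", rest)
    return {"pos": pos, "stem": parts[0], "features": parts[1:]}


def parse_mor_tier(tier):
    """Parse every tag on a %mor tier, skipping items without a | sign."""
    return [parse_mor_token(token) for token in tier.split() if "|" in token]


tier = ("pro:subj|il&MASC&_3S pro:y|y v:poss|avoir&PRES&3SV "
        "det|une&FEM&SING n|famille&_FEM .")
tokens = parse_mor_tier(tier)
for token in tokens:
    print(token["pos"], token["stem"], token["features"])
```

Once the tags are in this form, a search for a string of morphological codes reduces to filtering the list, for example keeping only entries whose features include `FEM`.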

Full details of all the programmes available are in the Manuals (MacWhinney 2000, volume 1), or in the online manual, http://childes.psy.cmu.edu/manuals/CLAN.pdf

Viewing files correctly

All the files are encoded in UTF-8 (8-bit Unicode Transformation Format). If accented characters are not rendered correctly when you view a transcript file from this site, try changing the character encoding in your browser to Unicode (UTF-8). In Mozilla's Firefox browser, this change can be made via "View -> Character Encoding"; in Microsoft's Internet Explorer, go to "View -> Encoding".
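The same issue arises when transcripts are read by a script rather than a browser. The following Python sketch shows what happens to an accented word from the sample transcript when UTF-8 bytes are decoded with the wrong encoding; Latin-1 is used here only as an example of a mismatched encoding.

```python
# "très bien" as it would be stored in a UTF-8 encoded transcript file.
raw = "très bien".encode("utf-8")

correct = raw.decode("utf-8")     # decoded with the encoding the file uses
garbled = raw.decode("latin-1")   # decoded with a mismatched encoding

print(len(raw))      # è occupies two bytes in UTF-8
print(correct)
print(garbled)
```

When opening a transcript file directly, passing `encoding="utf-8"` to Python's built-in `open()` avoids relying on the platform's default encoding.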

References

MacWhinney, B. 1999, The CHILDES System, in Handbook of Child Language Acquisition, Academic Press, pp. 457-494.

MacWhinney, B. 2000, The CHILDES project: tools for analyzing talk. Volume 1: Transcription format and programs. Volume 2: The database. 3rd ed. Lawrence Erlbaum.