CHILDES
Child Language Data Exchange System
This set of tools was originally conceived for first language acquisition data, but it has also been used, to a much more limited extent, by second language researchers. CHILDES tools have been used in more than 1300 published studies in areas ranging from computational linguistics, language disorders, narrative structures and literacy development to phonological analyses and adult sociolinguistics (for a useful introduction to CHILDES, see MacWhinney 1999).
CHILDES consists of three integrated components:
- The database (Talkbank, http://childes.psy.cmu.edu/data/), consisting primarily of child speech recordings and transcriptions, but also including some language disorder data and bilingual data. It is a condition of using CHILDES tools that any data becomes part of the Talkbank database, and is thus easily available in anonymised form for an international research audience.
- CHAT (Codes for the Human Analysis of Transcripts) is the set of transcription procedures: a system of notation and coding developed to be compatible with the analysis programmes. This ‘tagging system’ is now XML compatible. The manual containing all the transcription conventions is regularly updated and is available at http://childes.psy.cmu.edu/manuals/CHAT.pdf.
- CLAN (Computerized Language Analysis) is a set of 38 computer programs designed to carry out data analyses. It includes morphosyntactic taggers for 12 languages, as well as sophisticated search tools that can directly interrogate the output of any of its programs; for example, searches can be run directly on the morphosyntactic output of any batch of files. It is designed to recognise the tagging conventions of CHAT, and its manual is available at http://childes.psy.cmu.edu/manuals/CLAN.pdf.
This brief introduction to CHILDES is only meant to present the tools succinctly; any researcher wishing to understand how the tools work will need to consult the manuals, which are also available in hardback (MacWhinney 2000).
The following sections outline some of the defining characteristics of the transcription conventions and tagging programmes.
Headers
Every file has a set of ‘headers’ so that the computer can recognise the file. Anything that the researchers feel could potentially influence the findings (e.g. participants, elicitation task, date, researcher and transcriber) can be recorded here. Warnings are included in the file headers so that other researchers wishing to use the data know what decisions have been made. The headers used in each of the corpora in this database are specified for each corpus in the 'headers used' section. For an illustration, see the sample transcript in the following section.
Transcription conventions
The data is transcribed on to a main line as a set of standard language word forms. Each utterance is transcribed on to a separate line and starts with * followed by the speaker code; this line shows what was actually said, by contrast with lines starting with a % sign which contain linguistic tags. The CHAT manual (MacWhinney 2000) contains the codes that have been developed by various contributors addressing a wide variety of linguistic research agendas (including, for example, codes for Conversation Analysis and the analysis of written data). However, the system also allows new codes to be developed to address project-specific questions (e.g. see additional conventions).
A typical transcript will look like this:
@Begin
@Languages: fr
@Participants: L21 Subject, SAR Investigator
@ID: fr|flloc|L21||male|10F1||Subject|4 years, 1st FL, german|
@ID: fr|flloc|SAR|||||Investigator||
@Coder: EG
*SAR: [^ eng: student twenty one all right] .
*L21: le@n famille est arrivée dans le lac .
*SAR: très bien .
*L21: euh c' est un@n grand_mère +/.
*SAR: oui .
*L21: +, un@n mère et trois enfants .
*SAR: très bien .
*L21: et deux garçons et un@n fille .
*SAR: ok .
*L21: # euh deux enfants c' est à la pêche .
*SAR: pêche oui .
*L21: et le@n grand_mère et le garçon est paint@s:v .
*SAR: ok .
*L21: <la mère est> [//] # c' est une +/.
*SAR: [^ eng: xx read a book.] .
*L21: +, lire xx .
*SAR: lire ?
*L21: livre .
*SAR: lire un livre ok très bien .
*L21: # le@n grand_mère est # paint@s:v .
@End
For details of the meanings of the @ codes, please see 'additional conventions'.
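As a rough illustration of how the header lines and main lines described above can be told apart by their first character, here is a minimal Python sketch. It is not part of CLAN: the function names `parse_chat` and `participants` are hypothetical, the sample text is a shortened copy of the transcript above, and the sketch assumes that every header and utterance fits on a single line.

```python
# A shortened copy of the sample transcript shown above.
SAMPLE = """\
@Begin
@Languages: fr
@Participants: L21 Subject, SAR Investigator
@ID: fr|flloc|SAR|||||Investigator||
@Coder: EG
*SAR: très bien .
*L21: euh c' est un@n grand_mère +/.
*SAR: oui .
*L21: +, un@n mère et trois enfants .
"""


def parse_chat(text):
    """Split a transcript into header pairs and (speaker, utterance) pairs.

    Assumption: no header or utterance continues on a following line.
    """
    headers = []     # (name, value) pairs; a list because @ID can repeat
    utterances = []  # (speaker code, main-line text) pairs
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("@"):
            name, _, value = line[1:].partition(":")
            headers.append((name.strip(), value.strip()))
        elif line.startswith("*"):
            speaker, _, utterance = line[1:].partition(":")
            utterances.append((speaker.strip(), utterance.strip()))
        # Lines starting with % (dependent tiers) are ignored in this sketch.
    return headers, utterances


def participants(headers):
    """Map each speaker code in the @Participants header to its role."""
    roles = {}
    for name, value in headers:
        if name == "Participants":
            for entry in value.split(","):
                parts = entry.split()
                if len(parts) >= 2:
                    roles[parts[0]] = parts[-1]
    return roles


headers, utterances = parse_chat(SAMPLE)
roles = participants(headers)
```

Headers are kept as a list of pairs rather than a dictionary because the same header (here @ID) can occur once per participant.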
The following is a summary of the main codes used. For more detailed information on the coding system, please consult the CHILDES manual.
xxx | unintelligible speech not treated as a word |
xx | unintelligible speech treated as a word |
[?] | best guess |
( ) | noncompletion of a word e.g. gar(çon) |
[*] | after certain mistakes on the main tier (see 'additional conventions') |
# | pause marked by silence |
++ | completion of utterance by learner’s partner or by researcher |
+, | interruption or self-completion of utterance, e.g. 22P: I go +/. ADR: yes. 22P: +, to school. |
+//. | self-interruption |
+... | trailing off |
[/] | retracing with NO change - <dans la parc> [/] dans la parc |
[//] | retracing with a change - <dans la parc> [//] dans le parc |
grand_mère | the _ sign must be used for compounds |
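To show how the codes summarised above separate what was said from the annotation around it, the following Python sketch reduces a main line to its plain words. It is only an approximation for illustration: the function name `clean_main_line` is hypothetical, only the codes listed in the table are handled, and the decision to strip everything after an @ sign is an assumption (the @ codes themselves are project-specific).

```python
import re

TERMINATORS = {".", "?", "!"}


def clean_main_line(utterance):
    """Return the plain words of a main-line utterance as a list."""
    # Retracing: drop the retraced material in <...> [/] or <...> [//]
    # and keep the repetition or correction that follows it.
    text = re.sub(r"<[^>]*>\s*\[/{1,2}\]", " ", utterance)
    # Drop any remaining bracketed code such as [?] or [*].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    words = []
    for token in text.split():
        if token == "#":            # pause
            continue
        if token == "xxx":          # unintelligible, not treated as a word
            continue
        if token.startswith("+"):   # ++  +,  +/.  +//.  +...
            continue
        if token in TERMINATORS:
            continue
        # Noncompletion: gar(çon) is counted as the full word garçon.
        token = token.replace("(", "").replace(")", "")
        # Assumption: anything after @ is a project-specific marker.
        token = token.split("@", 1)[0]
        if token:
            words.append(token)
    return words


retraced = clean_main_line("<dans la parc> [//] dans le parc .")
paused = clean_main_line("# euh deux enfants c' est à la pêche .")
```

Note that xx is kept and xxx is dropped, following the distinction made in the table; compounds such as grand_mère stay as a single word.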
Dependent tiers
In addition to the main line or tier, there can be multiple ‘dependent tiers’ that provide ancillary information. These tiers are preceded by a % sign to indicate that they are strings of tags. Researchers can decide how many dependent tiers are appropriate for their own purposes. In the Linguistic Development and Progression corpora, we have used a %err tier (minimally, to code some errors; see additional conventions), a %mor tier (morphosyntax) and a %com tier (for any additional comments), though researchers using our data in the future are free to add other coding tiers depending on their interests.
CLAN can carry out lexical, morphosyntactic, discourse and phonological analyses, amongst others, depending on how the data has been coded. Searches for complex strings can then be carried out on the output of any of these analyses. CLAN programmes such as FREQ, KWAL and COMBO can give, for example, the frequency and linguistic context of interlanguage features, by searching for specific words, combinations of words and strings of particular morphological codes or ‘error’ codes. FREQPOS does a frequency analysis by sentence position and MLU calculates the mean length of utterance. In addition, the results of one analysis can be ‘piped’ through another analysis, allowing multiple analyses. For full details of the range of CLAN programs, consult the manual.
The following short excerpt illustrates a transcript after morphosyntactic tagging and disambiguation:
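The kinds of measures mentioned above can be pictured with a small Python sketch. This is not how CLAN is implemented and it does not reproduce the options of FREQ, KWAL or MLU; the function names are hypothetical, the utterances are taken from the sample transcript with codes already removed by hand, and length is counted in words, which may differ from the units CLAN's MLU uses.

```python
from collections import Counter

# (speaker, words) pairs from the sample transcript, codes removed by hand.
utterances = [
    ("SAR", ["très", "bien"]),
    ("L21", ["et", "deux", "garçons", "et", "un", "fille"]),
    ("SAR", ["ok"]),
    ("L21", ["euh", "deux", "enfants", "c'", "est", "à", "la", "pêche"]),
]


def word_frequencies(utterances, speaker):
    """Count word forms for one speaker (a FREQ-like tally)."""
    counts = Counter()
    for who, words in utterances:
        if who == speaker:
            counts.update(word.lower() for word in words)
    return counts


def mean_length_of_utterance(utterances, speaker):
    """Average number of words per utterance for one speaker."""
    lengths = [len(words) for who, words in utterances if who == speaker]
    return sum(lengths) / len(lengths) if lengths else 0.0


def keyword_in_context(utterances, keyword):
    """Return every utterance containing the keyword (a KWAL-like search)."""
    return [(who, " ".join(words)) for who, words in utterances if keyword in words]


learner_counts = word_frequencies(utterances, "L21")
learner_mlu = mean_length_of_utterance(utterances, "L21")
hits = keyword_in_context(utterances, "pêche")
```

Because each function takes and returns plain Python values, the result of one can be fed to another, which loosely mirrors the idea of piping one analysis through the next.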
*48L: | il y a une famille et ils sont en vacances. |
%mor: | pro:subj|il&MASC&_3S pro:y|y v:poss|avoir&PRES&3SV det|une&FEM&SING n|famille&_FEM |
*EMA: | oui. |
*48L: | euh près de Loch Ness et il y a une grand+mère les trois enfants et euh le mère de les enfants euh. |
%mor: | co|euh adv:place|près prep|de n:prop|Loch n:prop|Ness conj|et pro:subj|il&MASC&_3S pro:y|y v:poss|avoir&PRES&3SV det|une&FEM&SING n|grand+mère&_FEM det|les&PL adj|trois&_PL n|enfant-_PL conj|et co|euh det|le&MASC&SING n|mère&_FEM prep|de det|les&PL n|enfant-_PL co|euh. |
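The tags in the excerpt above follow a visible pattern: a part-of-speech label, a | sign, the stem, and then features introduced by & or -. The following Python sketch splits a tag along those lines. It is inferred from this excerpt alone and is not the full %mor grammar from the manual; the function name `parse_mor_token` is hypothetical, and feature labels (including the leading underscore in labels such as _FEM) are kept as they appear without being interpreted.

```python
import re


def parse_mor_token(token):
    """Split one %mor tag such as 'det|une&FEM&SING' into its parts."""
    # A final utterance terminator may be attached to the last tag.
    token = token.rstrip(".?!")
    pos, _, rest = token.partition("|")
    # The stem runs up to the first & or -; the remainder holds the features.
    match = re.match(r"([^&-]+)(.*)", rest)
    stem = match.group(1) if match else rest
    tail = match.group(2) if match else ""
    features = [feature for feature in re.split(r"[&-]", tail) if feature]
    return {"pos": pos, "stem": stem, "features": features}


# First %mor line of the excerpt above.
MOR_LINE = (
    "pro:subj|il&MASC&_3S pro:y|y v:poss|avoir&PRES&3SV "
    "det|une&FEM&SING n|famille&_FEM"
)

parsed = [parse_mor_token(token) for token in MOR_LINE.split()]
pos_tags = [item["pos"] for item in parsed]
```

Once the tags are split in this way, a search for a string of morphological codes amounts to matching on the `pos` and `features` values rather than on the surface words.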
Full details of all the programmes available are in the manuals (MacWhinney 2000, volume 1) or in the online manual, http://childes.psy.cmu.edu/manuals/CLAN.pdf.
Viewing files correctly
All the files are encoded in UTF-8 (8-bit Unicode Transformation Format). If the accents are not rendered correctly when you view a transcript file from this site, try changing your browser's character encoding to Unicode (UTF-8). In Mozilla's Firefox browser this change can be made via "View -> Character Encoding"; in Microsoft's Internet Explorer, go to "View -> Encoding".
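The same consideration applies when a transcript is read by a script rather than viewed in a browser: the encoding should be stated explicitly instead of relying on the system default. A minimal Python sketch follows; the function name `read_transcript` and the file name `sample.cha` are hypothetical, and the temporary file only serves to make the example self-contained.

```python
import os
import tempfile


def read_transcript(path):
    """Read a transcript file, decoding it explicitly as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Write one line of the sample transcript to a temporary file, then read it back.
line = "*L21: le@n famille est arrivée dans le lac .\n"
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.cha")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(line)
    text = read_transcript(path)
```

If a file is decoded with a different encoding, accented characters such as the é in arrivée are typically the first to appear garbled.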
References
MacWhinney, B. 1999. The CHILDES System. In Handbook of Child Language Acquisition. Academic Press, pp. 457-494.
MacWhinney, B. 2000. The CHILDES Project: Tools for Analyzing Talk. Volume 1: Transcription Format and Programs. Volume 2: The Database. 3rd ed. Lawrence Erlbaum.