Additional conventions
Standard CHAT transcription conventions have been used, supplemented with the additional conventions described below. Transcription was orthographic; error (%err) and commentary (%com) tiers have been used; and all data has been anonymised.
In this section, we describe some of the general decisions we took when transcribing French interlanguage oral data, as well as some of the adaptations we made to the CHILDES system for L2 data. Many of these decisions were dictated by our research agenda in both the Linguistic Development and the Progression projects, and by our wish to use the automatic morphosyntactic tagger. Because the tagger has to recognise the words entered on the main tier in order to tag them, any mispronounced form is entered as the target French word followed by an error code [*]. As a result, the transcription sometimes deviates from the actual phonological shape of the words produced by learners. We do not regard this as a serious problem, as researchers interested in, for example, phonology can listen to the soundfiles as they read the transcripts and add their own level of coding.
General Decisions
Orthographic Transcription
The data has been transcribed orthographically. This is necessary in order to use the French morphosyntactic tagger on the completed transcripts, as it will not recognise non-words.
Limited use of the Error Tier (%err)
We have not used an error tier consistently, as it was not necessary for our research agenda: syntactic and morphological errors can be retrieved more systematically from the %mor output. As mentioned above, when using the MOR programme it is important that all words are recognised by the tagger so that they can be analysed morphosyntactically.
However, in some instances the word produced deviated considerably from the target but could nonetheless be recognised easily from the context. In these cases we used the error tier, as exemplified below:
*L32: il y a un monstre [*] dans le lac
%err: montre = monstre
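For readers who want to collect such error annotations outside CLAN, the following is a minimal Python sketch. It assumes, as the example above suggests, that the left-hand side of the %err line is the form the learner produced and the right-hand side is the target. The function name parse_err_tier is hypothetical and not part of CLAN.

```python
def parse_err_tier(line):
    """Split a '%err: produced = target' line into (produced, target).

    Assumption: the form the learner produced stands left of '=' and
    the target word stands right of it, as in the example above.
    Returns None for lines that are not %err tiers or have no '='.
    """
    if not line.startswith("%err:"):
        return None
    body = line[len("%err:"):]
    produced, separator, target = body.partition("=")
    if not separator:
        return None
    return produced.strip(), target.strip()


print(parse_err_tier("%err: montre = monstre"))
```

The sketch handles only one error per tier line; anything more elaborate would need to be adapted to the transcripts in question.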
Pauses
All pauses in the Linguistic Development and Progression corpora have been indicated with # and have not been timed.
Overlapping
We do not show overlapping of interlocutors in the written transcripts, as this can be heard in the digital soundfiles.
Mean Length of Utterance
In every file, the researcher's speaker turns have not been separated into distinct utterances as per CHILDES conventions, so any MLU calculation for the researchers will not be accurate. MLU can, however, be calculated on the learner tier.
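As an illustration of restricting the calculation to the learner tier, here is a rough Python sketch. It counts words rather than morphemes, so it is not a substitute for CLAN's own mlu command. The function name, the rule of dropping bracketed codes and pause markers, and the second learner line in the sample are assumptions made for the example.

```python
import re


def mean_length_of_utterance(lines, speaker):
    """Mean number of words per main-tier line for one speaker code.

    Assumptions: each main-tier line is one utterance, bracketed codes
    such as [*] or [^eng: ...] are not words, and untimed pauses (#)
    and utterance terminators are not words either.
    """
    prefix = f"*{speaker}:"
    lengths = []
    for line in lines:
        if not line.startswith(prefix):
            continue  # skip other speakers and dependent tiers
        text = re.sub(r"\[[^\]]*\]", " ", line[len(prefix):])
        words = [w for w in text.split() if w not in {"#", ".", "?", "!"}]
        if words:
            lengths.append(len(words))
    return sum(lengths) / len(lengths) if lengths else 0.0


transcript = [
    "*L32: il y a un monstre [*] dans le lac",
    "%err: montre = monstre",
    "*ADR: oui",
    "*L32: le # chien",  # invented line, for illustration only
]
print(mean_length_of_utterance(transcript, "L32"))
```

Only lines for the requested learner code contribute, so the undivided researcher turns do not distort the figure.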
Adaptations
A number of codes were added to the CHAT system for the specific purpose of second language research, particularly early interlanguage grammars.
We would like to acknowledge Christophe Parisse's expert guidance in making some of these adaptations to the French MOR programme.
Imitations
If a participant repeats exactly what has been said by the researcher, or by another participant in the case of pair tasks, the repetition is coded as follows:
*N32: [^eng: how do you say he goes]
*ADR: il va
*N32: il@g va@g au cinema
@g is added after every repeated word. @g has been added to the special form marker file (sf.cut) in the French MOR programme. It ensures that the imitation is not tagged by the French morphosyntactic tagger, as this could give misleading information about the learner's current grammar.
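Anyone analysing the main tier directly, without MOR, may want to exclude imitations in the same spirit. A minimal Python sketch, assuming words are separated by whitespace and that the marker always appears as the suffix @g; the function name drop_imitations is hypothetical.

```python
def drop_imitations(utterance):
    """Return the utterance without the words marked as imitations (@g)."""
    kept = [word for word in utterance.split() if not word.endswith("@g")]
    return " ".join(kept)


print(drop_imitations("il@g va@g au cinema"))
```

For the example above, only the learner's own contribution (au cinema) remains.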
Use of English (whole utterances)
In early emerging second language grammars, code-switching between English and French occurs consistently. So that the French MOR programme ignores the English, we coded whole utterances as follows:
*SAR: [^eng: yes you begin by asking questions]
*P43P: [^eng: how do you say dog?]
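To locate such utterances in a transcript, for instance to count how often a learner resorts to English, something like the following Python sketch could be used. It assumes that the whole main-tier line consists of a single [^eng: ...] code, as in the examples above; the function name english_utterance is hypothetical.

```python
import re

# Matches a main-tier line whose entire content is one [^eng: ...] code.
ENGLISH_UTTERANCE = re.compile(
    r"^\*(?P<speaker>\w+):\s*\[\^eng:\s*(?P<text>[^\]]*)\]\s*$"
)


def english_utterance(line):
    """Return (speaker, English text) for a whole-utterance code, else None."""
    match = ENGLISH_UTTERANCE.match(line)
    if match is None:
        return None
    return match.group("speaker"), match.group("text").strip()


print(english_utterance("*P43P: [^eng: how do you say dog?]"))
```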
Use of a single English word to complete a French Phrase
If an English word has been used to complete a French phrase, the word is coded according to its word class as follows:
Word class | Code |
Noun | @s:d |
Adjective | @s:a |
Adverb | @s:adv |
Preposition | @s:pre |
Verb | @s:v |
Pronoun | @s:pro |
Determiner | @s:det |
Conjunction | @s:con |
For example:
*L28: il achète le skirt@s:d
These forms are then analysed by the morphosyntactic tagger as ‘English N’, ‘English V’, ‘English A’ etc. Without the marker, the tagger would simply ignore them and produce output that does not correspond to the learner’s grammar (in this example, it would suggest that the learner’s grammar allows a determiner to be followed by nothing, as the tagger would not recognise ‘skirt’).
These special form markers have been added to the sf.cut file in MOR, and they have also been added to the depfile in CLAN (so that the files pass through CHECK).
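The code table above can also be applied outside CLAN, for example to list the English insertions in a transcript by word class. The following Python sketch is offered under that assumption. The mapping is copied from the table; the names ENGLISH_WORD_CODES and english_insertions are hypothetical.

```python
import re

# Codes as listed in the table above (suffix after "@s:").
ENGLISH_WORD_CODES = {
    "d": "noun",
    "a": "adjective",
    "adv": "adverb",
    "pre": "preposition",
    "v": "verb",
    "pro": "pronoun",
    "det": "determiner",
    "con": "conjunction",
}

MARKED_WORD = re.compile(r"^(?P<word>[^@\s]+)@s:(?P<code>[a-z]+)$")


def english_insertions(utterance):
    """Return (word, word class) pairs for English words marked with @s:."""
    found = []
    for token in utterance.split():
        match = MARKED_WORD.match(token)
        if match and match.group("code") in ENGLISH_WORD_CODES:
            found.append((match.group("word"), ENGLISH_WORD_CODES[match.group("code")]))
    return found


print(english_insertions("il achète le skirt@s:d"))
```

Tokens with a code that is not in the table are ignored rather than guessed at.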
Indeterminate forms
In beginner datasets, it is often difficult to determine which form a learner intended, as learners often produce something very approximate. Four such indeterminate forms occur consistently in our data, and we coded them as follows:
Definite articles which sound like something between le and la: le@n
Indefinite articles which sound like something between un and une: un@n
First person subject pronoun which sounds like something between je and j’ai: je@n
A verb form which sounds like something between a and est: a@n
These forms have been added to the neo.cut file (see below), and are analysed by the parser as e.g. definite article, without specifying the gender.
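A researcher interested in how often these indeterminate forms occur could tally them directly from the main tier. The Python sketch below assumes the four forms are written exactly as listed above. The sample utterances are invented for illustration, and the names INDETERMINATE_FORMS and count_indeterminate_forms are hypothetical.

```python
from collections import Counter

# The four indeterminate forms listed above, with a short description.
INDETERMINATE_FORMS = {
    "le@n": "definite article, between le and la",
    "un@n": "indefinite article, between un and une",
    "je@n": "first person subject pronoun, between je and j'ai",
    "a@n": "verb form, between a and est",
}


def count_indeterminate_forms(utterances):
    """Count occurrences of each indeterminate form across utterances."""
    counts = Counter()
    for utterance in utterances:
        for token in utterance.split():
            if token in INDETERMINATE_FORMS:
                counts[token] += 1
    return counts


# Invented utterances, for illustration only.
print(count_indeterminate_forms(["je@n mange le@n pomme", "le@n chien a@n faim"]))
```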
Neologistic verb endings
Our learners also used neologistic verb forms, which were usually non-finite. Each of these new forms is written on the main tier and then added to a neo.cut file, which is saved as part of the MOR lexicon.
For example:
prener {[scat neo:v:inf]} “prendre”
This form is transcribed as prener on the main tier and analysed by the tagger as neo:v:inf (neologism:verb:infinitive).
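To show how such an entry breaks down, the following Python sketch parses a line in the shape printed above: the learner's form, the category in {[scat ...]}, and the quoted target word. It assumes only this layout, not the full syntax of MOR lexicon files, and the function name parse_neo_entry is hypothetical.

```python
import re

# Layout assumed from the example above: form {[scat a:b:c]} “target”
NEO_ENTRY = re.compile(
    r'^(?P<form>\S+)\s+\{\[scat\s+(?P<scat>[^\]]+)\]\}\s+'
    r'[“"](?P<target>[^”"]*)[”"]\s*$'
)


def parse_neo_entry(line):
    """Return (form, category parts, target) for one entry, or None."""
    match = NEO_ENTRY.match(line.strip())
    if match is None:
        return None
    category_parts = match.group("scat").strip().split(":")
    return match.group("form"), category_parts, match.group("target")


print(parse_neo_entry('prener {[scat neo:v:inf]} “prendre”'))
```

Splitting the category on the colon gives the three parts glossed above: neologism, verb, infinitive.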
Changes to MOR
Adding words to the MOR lexicon
We have also added a number of words, particularly nouns, to the MOR lexicon (for example, le shopping, le jogging, le badminton, le t_shirt), so that they can be recognised and therefore tagged by the parser.
Changes to the MOR programme
Because of some problematic MOR outputs, we are modifying the French MOR and POST programmes quite substantially and retraining them on our large database. This work is still in progress, and we are aware that there are still anomalies in the public tagged files. Researchers interested in this work on improved versions of the French MOR and POST programmes should contact us.