The adventure of international learner corpora: Implications and applications


The advent of learner corpora on the linguistics scene some twenty years ago made it possible, for the first time, to examine large quantities of learner language (or ‘interlanguage’) in a systematic way. Since then, learner corpora have continued to be collected and exploited all over the world, and they have undoubtedly secured their place in the study of interlanguage. While some learner corpora have been compiled by individual teams (e.g. the HKUST Learner Corpus) or publishing houses (e.g. the Cambridge Learner Corpus), the compilation of large, non-commercial learner corpora containing data produced by learners from different mother tongue backgrounds almost necessarily requires international collaboration among several teams, each contributing a subcorpus of data produced by students in its own country. The Centre for English Corpus Linguistics (CECL), at the University of Louvain, Belgium, has initiated two large-scale international projects aimed at collecting learner corpus data. The International Corpus of Learner English (ICLE) brings together written essays produced by higher intermediate to advanced EFL students from almost twenty different mother tongue backgrounds. The Louvain International Database of Spoken English Interlanguage (LINDSEI), which may be described as ICLE’s talkative sister, contains transcripts of informal interviews with higher intermediate to advanced EFL students from eleven mother tongue backgrounds. Both projects are the result of a collaborative effort among teams of linguists around the world.
In this talk, I will present these two corpora, with special emphasis on the recently completed LINDSEI corpus, and outline the challenges inherent in such large-scale international projects. One such challenge is that all the international teams must adhere to strict corpus design criteria so that the subcorpora are comparable. In the LINDSEI project, this involved, among other things, following identical transcription guidelines. Beyond the challenges of compiling large international learner corpora, I will also explore ways in which such corpora can be exploited. In particular, the design of ICLE and LINDSEI is such that they can easily be compared with one another, which opens up a whole series of possibilities for investigating the impact of medium on learner productions. Case studies will be reviewed which illustrate the differences that may exist between written and spoken learner English. The presentation will also show how each of the two corpora may be exploited individually to study written or spoken interlanguage and, ultimately, to produce pedagogical materials for learners of English. In this respect, I will, for example, describe the collaboration between the CECL and Macmillan Education, which resulted in 100 ‘Get it right’ boxes and an ‘Improve your writing skills’ section in the second edition of the Macmillan English Dictionary for Advanced Learners. Finally, I will briefly introduce some new learner corpora that the CECL has started compiling in collaboration with several universities internationally.