Creating corpora from spoken legacy materials: variation and change meet corpus linguistics

Contrasting the aims and methodologies of corpus linguists and variationists, Charles Meyer writes that the latter ‘have been more interested in spoken language’ and ‘have tended to collect data for private use and have not generally made public their data sets’ (2006: 169). Since the advent of sociolinguistics in the 1960s, individual scholars and research teams have been amassing recordings of spoken data, often for the purpose of investigating variation across a limited number of linguistic features. Surprisingly little of this material has, however, been made accessible to the wider community of scholars. As John Widdowson points out, ‘much of this data remains hidden and inaccessible, scattered in numerous, often obscure, repositories’ (2003: 81). What is more, these valuable legacy materials are often kept in inadequate storage facilities and on obsolescent media, which puts them at risk of being lost forever.
The Newcastle Electronic Corpus of Tyneside English (NECTE) was created with the aid of a Resource Enhancement Grant from the then AHRB, with the primary objective of ‘rescuing’ legacy materials from the Tyneside Linguistic Survey, collected c.1969, and creating an accessible corpus by combining these with more recent data from the Phonological Variation and Change project, collected c.1994. More specifically, the resultant corpus was designed to be of use to as wide a range of end-users as possible and was therefore made available in a number of formats: sound, phonetic transcription, orthographic transcription and grammatical mark-up. The challenges posed by this project, and the ways in which the project team overcame them, will be the main focus of this paper, and should provide useful pointers to anybody intending to create a corpus of spoken language, whether from legacy materials or from newly collected data.
The topics to be covered are: (i) the ethical and legal issues involved in making accessible data that were collected in an era before ethics review or the UK’s 1998 Data Protection Act; (ii) the challenges involved in gathering metadata and digitising ‘old’ audio material; (iii) standards of transcription and mark-up. Finally, there will be some discussion of plans to process other ‘legacy’ materials, and of progress made towards developing common standards, as set out in Kretzschmar et al. (2006).