TEI-conformant Structural Markup of a Trilingual Parallel Corpus in the ECI Multilingual Corpus 1

1. Overview of the ECI Corpus

1.1. Brief History and Acknowledgements


In this paper we provide an overview of the ACL European Corpus Initiative (ECI) Multilingual Corpus 1 (ECI/MC1). In particular, we look at one subcorpus of the ECI/MC1, the trilingual corpus of International Labour Organisation reports, and discuss the problems involved in TEI-compliant structural markup and preliminary alignment of this large corpus. We discuss gross structural alignment down to the level of text paragraphs. We see this as a necessary first step in corpus preparation before detailed (possibly automatic) alignment of texts is possible. We try to generalise our experience with this corpus to illustrate the process of preliminary markup of large corpora which in their raw state can be in an arbitrary format (e.g. printers' tapes, proprietary word-processor formats); noisy (not fully parallel, with structure obscured by spelling mistakes); full of poorly documented formatting instructions; and structured only implicitly, with the structure present but far from explicit. We illustrate these points by reference to other parallel subcorpora of ECI/MC1. We attempt to define some guidelines for the development of corpus annotation toolkits which would aid this kind of structural preparation of large corpora.

The ECI arose from a concern shared by a number of European researchers in computational linguistics that waiting for fully funded support for the collection and distribution of non-English corpus material would mean waiting too long. This concern crystallised into action, modelled on the Association for Computational Linguistics (ACL) Data Collection Initiative, following a meeting in Pisa sponsored by the Network for European Reference Corpora (NERC) in 1992. The original call for contributions to the ECI described it as follows:

The European Corpus Initiative was founded to oversee the acquisition and preparation of a large multi-lingual corpus to be made available in digital form for scientific research at cost and without royalties.
We believe that widespread easy access to such material would be a great stimulus to scientific research and technology development as regards language and language technology. We support existing and projected national and international efforts to carefully design, collect and publish large-scale multi-lingual written and spoken corpora, but also believe it will be some time before the scientific and material resources necessary to bring these projects to fruition will be found. In the interim, a small and rapid effort to collect and distribute existing material can serve to show the way. No amount of abstract argument as to the value of corpus material is as powerful