Word Usage Examples in an Electronic Dictionary

This paper describes a project in which the Tanaka corpus of matched Japanese-English sentence pairs has been linked to the WWWJDIC online Japanese-English dictionary. The process of linking the corpus is described in detail, as well as an analysis of the word coverage, and the editing of the corpus to remove some of the errors it contains. The paper concludes that the Tanaka corpus can successfully provide a source of example sentences for a Japanese-English dictionary.

1. Dictionary Examples and the Electronic Corpus

The practice of incorporating sentences or sentence fragments as part of a dictionary entry appears to have originated with Latin and Greek dictionaries compiled in the 16th and 17th centuries, where such passages served as citations from classical texts establishing the provenance of the words. The incorporation of such citations was adopted in later major English dictionaries by lexicographers such as Johnson and Webster, and is now regarded as an essential feature of authoritative mono-lingual dictionaries.

The development of comprehensive bilingual dictionaries from the mid-19th century, and more recently mono-lingual “learner’s dictionaries”, extended this practice to include selected or composed examples illustrating the usage of the words. Such examples are considered to be an essential component of such dictionaries. In one English-Japanese dictionary [1] in the author’s possession, the body of each entry consists entirely of parallel English and Japanese sentences utilizing the headwords.

The development of extensive electronic corpora such as COBUILD [2] and BNC [3] has brought corpus linguistics to a prominent position in lexicography. In the context of learners’ or bilingual dictionaries, such corpora tend to be used as an aid to the construction of examples, rather than as a direct source.
Landau [4] comments that “What a corpus can do above all else even when it cannot provide verbatim examples that can be used in a dictionary is to give examples at the right level of complexity and in a framework that is typical so that the lexicographer can devise examples that are not silly, stilted, or clearly artificial.” One of the editors of Taishukan’s Unabridged Genius English-Japanese Dictionary, Kosei Minamide [5], writing about corpora and the examples used in that dictionary, states “Such corpora is (sic) liable to drown us in data”, and adds “Because of the complicated problems concerning copyright and the extreme difficulty of finding entirely suitable examples in the corpus, we had most of the illustrative examples invented by native speakers.”

There are no reported cases of electronic corpora being used directly for the provision of dictionary examples. The difficulty of using such corpora for this purpose can be seen from examination of some of the text samples from the online COBUILD collection for the word swimming:

    against Douglas Stern, Doug Stern’s Swimming Clinic Inc., the United States
    no-touch sex with clothes on [p] swimming especially nude [p] smiling [p]
    induced cloud or magical blackness swimming in the air; it was simply
    likely to keep busy playing games, swimming, jeeping, or making crafts such as
    historic feat of winning Olympic swimming medals 12 years apart. Janet
    Silk Cup Derby (Hickstead) 1435b Swimming: National Champs & Euro Trials ( End.
    The quieter spots and the best swimming on one-mile Long Bay beach are at
    suitable physical exercise such as swimming or cycling. He will find that any
    in such a way that you feel you are swimming outdoors in an open-air pavilion.
    sun-splashed conservatory even a swimming pool. An unforgettably exotic or
    and telephone. There’s an indoor swimming-pool, sauna, solarium and

Clearly only one or two extracts in this sample contain useful material for example sentences, and in both cases some rewriting would be appropriate. It is only one sample, but it supports the views of Landau and Minamide.

2. Project Background

When the author began compiling a Japanese-English dictionary file as part of the EDICT [6] project in 1991, there were immediate calls from users of the file and software for example sentences to be associated with the dictionary entries. The initial dictionary file format did not readily allow for the inclusion of such examples, so a structure for them was implemented, involving a simple marker in the text of the English translation which indicated the availability of further explanatory information and examples in a linked adjunct file. As the early stages of the EDICT project benefited from considerable voluntary effort, a call was made for the preparation and submission of examples and other explanatory material. None was forthcoming; it appeared that while the user community had sufficient interest and enthusiasm to submit lexical material, preparation of examples was not such a high priority.

In 1999 the JMdict project, which involved an expanded dictionary structure, was launched. From the beginning of the project it was intended to incorporate example sentences within entries, with elements reserved in the DTD for this purpose.

    <!ELEMENT sense (stagk*, stagr*, xref*, ant*, gram*, field*, misc*, gloss*, example*, s_inf*)>
    …..
    <!ELEMENT example (#PCDATA)>

Although the structure allowed for examples, there was no ready source which could be employed, and no voluntary contributions were forthcoming.

3. The Tanaka Corpus

As reported at the PACLING2001 conference in a paper on the compilation of multilingual corpora [7], Professor Yasuhito Tanaka at Hyogo University had assembled over several years a collection of over 200,000 Japanese-English sentence pairs.
The technique he employed was to encourage a number of students each to enter approximately 300 items, drawn from instructional texts and other available sources. The resulting corpus, which he stated was in need of considerable editing, was placed in the public domain.

At the 2002 Papillon Workshop, Professor Christian Boitet provided a copy of the corpus to participants, with a view to it possibly being used as the foundation for a set of examples within the Papillon dictionary project. The author examined the corpus and concluded that it did indeed have excellent potential for providing such examples, but that it also had a large number of errors which would eventually need correction.

It was decided to conduct a trial in which the corpus would be used to provide usage examples for entries in the author’s WWW Japanese-English dictionary server (WWWJDIC) [8]. The broad purpose of the trial was:

a. to determine if such a sentence collection could effectively be used to provide example sentences in an electronic dictionary application. This of course extends into such matters as:
   i. experimentation with techniques for achieving the integration of a dictionary and a corpus of example sentences;
   ii. detection and resolution of related problems, such as the capability to handle issues of polysemy and homonymy.
b. to determine if the Tanaka corpus could be edited to an adequate standard in a timely and cost-effective manner.

4. Initial Processing

As provided, the corpus was a text file with alternating Japanese and English sentences. After code conversion, the sentence pairs were aggregated into tab-delimited single lines to aid sorting and inspection. It was immediately apparent that there were a large number of duplicate or near-duplicate pairs, differing only by such things as punctuation, or spelling errors in the English portion.
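
The aggregation step can be illustrated with a minimal Python sketch, assuming the input simply alternates Japanese and English lines. The function names, the small punctuation table and the sample sentences are illustrative assumptions and are not taken from the project. The sketch also shows one way that pairs whose Japanese sentences differ only in punctuation might be collapsed, here by keeping the first pair seen.

```python
import unicodedata

# Hypothetical table: ASCII punctuation mapped to full-width (JIS) forms.
# A real table would need to cover more characters than these four.
ASCII_TO_JIS = str.maketrans({".": "。", ",": "、", "?": "？", "!": "！"})


def pair_lines(lines):
    """Join alternating Japanese/English lines into (japanese, english) pairs."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    return list(zip(cleaned[0::2], cleaned[1::2]))


def dedupe_by_japanese(pairs):
    """Harmonize punctuation, then keep the first pair for each Japanese sentence."""
    seen = set()
    kept = []
    for japanese, english in pairs:
        japanese = japanese.translate(ASCII_TO_JIS)
        # NFKC folds full-width Latin letters and punctuation back to ASCII.
        english = unicodedata.normalize("NFKC", english)
        if japanese not in seen:
            seen.add(japanese)
            kept.append((japanese, english))
    return kept


def to_tab_delimited(pairs):
    """Render each pair as a single tab-delimited line for sorting and inspection."""
    return ["\t".join(pair) for pair in pairs]


if __name__ == "__main__":
    sample = [
        "彼は泳いでいる。", "He is swimming.",
        "彼は泳いでいる.", "He is swiming.",   # near-duplicate: punctuation and spelling differ
        "水泳は楽しい。", "Swimming is fun.",
    ]
    for line in sorted(to_tab_delimited(dedupe_by_japanese(pair_lines(sample)))):
        print(line)
```

Keying the check on the Japanese sentence alone means that a misspelt English variant may be the one that survives, so any such pass would need to be followed by manual correction.
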
After some simple harmonization of the punctuation, mainly consisting of ensuring that the punctuation in the Japanese sentences used “JIS” characters, and in the English sentences used ASCII characters, occurrences of examples which duplicated another example with regard to the Japanese sentence were removed. Whilst this may on occasions have removed an example with a “correct” English sentence in favour of an incorrect sentence, it was considered that this could be corrected at a later stage. The removal of this variety of duplicated example reduced the file from an initial 203,000 sentence pairs to approximately 183,000.

Further inspection at this stage revealed that a considerable number of errors and near-duplicates remained; however, it was considered that the file was in a state that permitted at least a trial of its application to the role of providing example sentences for a dictionary. Further editing could, and did, take place in parallel with the implementation of the dictionary association.

5. Linking Examples to Dictionary Entries

The process of associating example sentences with dictionary entries, had it followed the same approach as with printed dictionaries (which was also the approach allowed for in the JMdict data structure), would have meant selecting one or two sentence pairs for each of approximately 20,000 words, and embedding them in the appropriate part of the dictionary database. This approach clearly has a number of problems:

a. it would inevitably limit the number of examples available for each word, when the corpus often contained a much larger number;
b. it would lead to the breaking-up of the corpus;
c. it would significantly increase the size of the dictionary file, and not all applications of the file can, or would, use the examples;
d. the task of selecting, editing and moving the example pairs would be very large.

Instead, an approach was adopted that achieved the same effect, i.e. the association of examples with dictionary entries, but which avoided the problems outlined above. The approach involved:

a. leaving the corpus intact, thus enabling continued editing and revision;
b. establishing dynamic links as required from dictionary entries to the sentence(s) that contained the entries’ headwords.

Given the size of the file, it was not considered efficient to search it each time a link was required. Also the fact that many of the words involved were verbs, adjectives, etc.,