Designing Lexical Databases

Linguists and lexicographers are latecomers to the field of database applications. Some early projects used design features that were influenced by the lay-out of conventional printed dictionaries. But a linguistic analysis of underlying lexical information structures leads to a more adequate design. Examples from the SHAKESPEARE DICTIONARY project illustrate some design options for lexical databases.

1. More than twelve years ago, LAURENCE URDANG based his assessment of the technological potential in lexicography on the premise “that it is easier for a linguist to learn to use computers than for a computer specialist to be taught linguistics” (1975: 282). This may be even more valid today in the age of personal computing. Gone is the monopoly of ‘closed shop’ mainframe computers and with it much of the mystique of the high priests attending them. But it has also become obvious that there are limits to the level of expertise which can be achieved by a non-specialist. Linguists and lexicographers are beginning to use computers in more advanced applications where professional advice and expert design are indispensable. The design of lexical databases is such an area, and there is much for a non-specialist to learn. Equally important, though, is the necessity of explicit linguistic models and of rigid linguistic concepts to start with. Lexical database design without an explicit structural theory of lexical relations is a waste of time and effort. The growing availability of database systems is thus also a chance for advances in the theoretical understanding of the lexicon of natural languages.

Beginning with the early 1980’s, lexical symposia and conferences had an ample share of papers reporting on ongoing research which used the database concept in a variety of ways. In 1981 VAN DYKE PARUNAK reported on “Data Base Design for Biblical Texts” (1982), and NAGAO et al. on “An Attempt to Computerize Dictionary Data Bases” (1982).
At the same conference a University of Bonn group (BRUSTKERN and HESS 1982) presented “The Bonnlex Lexicon System”, which two years later evolved into a “Cumulated Word Data Base for the German Language” (BRUSTKERN and SCHULZE 1983a). This list could easily be extended, but important design issues can be discussed using these specific database projects as examples.

Linguists and lexicographers are latecomers to the field of database applications. Database software has been available since the early 1960’s. The early 1970’s brought a wide variety of commercial products and a consolidation on the conceptual side, which ultimately led to standardization, elaborated design philosophies, and specifications of “normal forms”. At that time lexicographers still used the concept of an archive when talking about new technologies, as did BARNHART (1975), CHAPMAN (1975), and LEHMANN (1975) at the 1972 International Conference on Lexicography in English. Similarly, in the late 1970’s, we witnessed preparations for a Stanford Computer Archive of Language Materials.

[LEXICOGRAPHICA 4/1988, Designing Lexical Databases, p. 61]

There is nothing wrong with the idea of an archive. But a database is something different. By now, the expression database should only be used as a technical term. Perhaps data bank may be used instead of database when talking about files of data, or archives in a loose sense. The Association for Literary and Linguistic Computing may have had this clarification in mind when naming its specialist group “Structured Data Bases”. Apparently there is a current fashion, shared with other social sciences, to talk about “data bases”. A typical symptom is special interest conferences, such as the 1985 International Conference on Data Bases in the Humanities and Social Sciences, whose contributions offer a fairly representative picture of current developments.

2.
A convenient and viable context for discussing databases has been established by the “Interim Report from the Study Group on Data Base Management Systems” (ANSI/X3/SPARC 1975), or, for short, the ANSI SPARC Report. This report proposes a multilevel database architecture and a corresponding terminology. The terminology allows a discussion of database issues regardless of specific systems or implementations. It is therefore possible to disregard the more technical problems of particular database types and to concentrate on the underlying design problems. As always, good design guarantees a straightforward implementation.

The report distinguishes three classes of users and introduces three levels of schemata: external, conceptual, and internal. A schema provides a concise description of a database. A conceptual schema of a database uses just three primitive notions: entity, attribute, and relationship. An entity is a discrete concept, event, person, place, or thing of interest to the database system. An attribute is elementary information describing an entity. Each attribute is defined by the set of admissible attribute values. Each occurrence of an entity can therefore be viewed as a set of attribute values. One attribute has to be unique for each entity occurrence and serves as its primary identifier. The set of attributes may be further subdivided into groups. A relationship is a binary mapping between entities. Its definition establishes both the range and the domain in the set of entity occurrences. All this can, of course, be elaborated in greater detail and precision.

A linguist, and especially a structural linguist, will have recognized the basic notion of a structure, which underlies the ANSI SPARC terminology. Structural linguistics thus seems to offer a congenial perspective on data and it may be a natural candidate for database applications. A cursory glance at lexical database proposals shows a slightly different picture. NAGAO et al.
(1982) start with a “dictionary description”. In Figure 1 their structural diagram is reproduced. It is explicitly said to be “merely a simple tree structure” (1982: 63). In the diagram there are four levels with “Headword” on top followed by seven items on the level below. One of these (“Part of Speech”) dominates five items on the next level. Here again, one item (“Japanese Translations”) dominates the final bottom level of four items. From a linguistic point of view such a structural design is rather unusual. But linguistic considerations do not seem to have played a role in the design. Instead, the authors simulate the conventional lay-out and typesetting arrangements of printed bilingual dictionaries. It is indeed a widespread dictionary usage to print one “Headword” in bold type and then use special symbols, such as the tilde, to refer to the headword, or part of it, thus saving space for the treatment of further lexical items with the same spelling. The authors very faithfully copied this conventional lay-out. Even minor details (“the pronunciation may change depending on the part of speech” NAGAO et al. 1982: 53) become part of the design. But should such a “Headword” and its dependencies be a serious candidate for a database entity? Are the reasons that led dictionary publishers to accept certain lay-out techniques at all relevant for an electronic database? These questions seem not to have been raised. The whole design is thus a paradigm case of an ‘imitation design’, where a new technology replicates design features of an older technology.

[LEXICOGRAPHICA 4/1988, p. 62, H.-Joachim Neuhaus]

The basic misunderstanding is the false identification of a mere presentation in a printed dictionary with an underlying lexical information structure: “[…] the description for a dictionary entry acquires a certain structure and the various parts of the dictionary descriptions are related to each other.
In the printed dictionaries, these relationships are expressed implicitly in linearized form. […] in order to utilize these relationships in computer programs, we have first to identify them in the printed versions, and then to reorganize them appropriately so that the programs can manage them effectively. We call this form of translation from the printed versions to computer-oriented formats data translation.” (NAGAO et al. 1982: 53)

Such a “data translation” does not create a database in any interesting sense. It is probably much more convenient to use the printed dictionary itself, all the more so since the only example of retrieving database information which the authors discuss is “Headword” look-up: “[…] the average time for retrieving and displaying the results on the screen is 320 msec. About half of the time is spent on the display control… To access a headword from its spelling requires only 3 msec. The remainder is spent on retrieving the other records such as pronunciation, P.O.S. idioms, etc.” (NAGAO et al. 1982)

One major difference between a printed dictionary and a lexical database is the direct access to data other than “Headwords”. One may, for example, start with a given phonetic transcription and be interested in the corresponding orthographic form or forms. Traditional lexicography has produced such “phonetic” dictionaries (e.g. MICHAELIS and JONES 1913) by simple inversion. But a database would presuppose an analysis of the relationship involved.

3. Database relationships are formally classified according to conditionality, connectivity, and cardinality. Do all entity occurrences participate in the relationship? Is there a one-to-one, one-to-many, or many-to-many relationship between the two respective entity types? How many relations are expected? These questions should get definite answers.
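
The three classification questions, and the access from a phonetic transcription to its orthographic forms, can be illustrated with a small sketch. The Python fragment below is only an illustration under stated assumptions: the sample spellings and the ASCII pseudo-transcriptions are invented placeholders, and the names `build_indexes`, `connectivity`, and `is_conditional` are hypothetical and do not belong to any of the systems discussed here.

```python
from collections import defaultdict

# Hypothetical sample relationship between two entity types:
# (orthographic form, phonetic transcription). The transcriptions are
# rough ASCII placeholders chosen for illustration only.
PAIRS = {
    ("lead", "li:d"),
    ("lead", "led"),
    ("led", "led"),
    ("cat", "kat"),
}


def build_indexes(pairs):
    """Index the relationship in both directions, so that a look-up can
    start from a spelling or from a transcription."""
    by_spelling = defaultdict(set)
    by_sound = defaultdict(set)
    for spelling, sound in pairs:
        by_spelling[spelling].add(sound)
        by_sound[sound].add(spelling)
    return by_spelling, by_sound


def connectivity(pairs):
    """Classify the relationship as one-to-one, one-to-many, or many-to-many."""
    by_spelling, by_sound = build_indexes(pairs)
    spelling_has_many = any(len(s) > 1 for s in by_spelling.values())
    sound_has_many = any(len(s) > 1 for s in by_sound.values())
    if spelling_has_many and sound_has_many:
        return "many-to-many"
    if spelling_has_many or sound_has_many:
        return "one-to-many"
    return "one-to-one"


def is_conditional(pairs, all_spellings):
    """Conditionality: True if some spelling occurrence does not take part
    in the relationship at all."""
    participating = {spelling for spelling, _ in pairs}
    return bool(set(all_spellings) - participating)


if __name__ == "__main__":
    _, by_sound = build_indexes(PAIRS)
    print(sorted(by_sound["led"]))   # spellings reached from a transcription
    print(connectivity(PAIRS))       # connectivity of the sample data
    print(len(PAIRS))                # cardinality: number of related pairs
    print(is_conditional(PAIRS, {"lead", "led", "cat", "dog"}))
```

In this sketch the inverted index is derived from the relationship itself, not from a re-sorted copy of headword entries, which is the point of analysing the relationship before fixing a design.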
In the hierarchical design that runs from “Headword” through “Part of Speech” to “Pronunciation”, cases such as [ˈɪntədɪkt] versus [ˌɪntəˈdɪkt], one being a verb and the other a noun, probably served as a model. Here we have one “Headword” item related to two “Part of Speech” items, each of which is again related to a “Pronunciation” item. But a database design