Formal Representation of Language Structures

Building treebanks is a prerequisite for various experiments and research tasks in the area of NLP. Under a recently awarded grant, we are developing (i) a formal definition of a (dependency-based) tree, and (ii) a midsize treebank based on this definition. The annotated corpus is designed to have three layers: morphosyntactic (linear) tagging, syntactic dependency annotation, and tectogrammatical annotation. The project is being carried out jointly at the authors’ institutes.

1 The Current State and Motivation

Recent decades have seen a shift towards expressing linguistic knowledge in ways which allow its verification and processing by formal means. Tools originating in mathematics, logic and computer science have been applied to human language to model its structure and functioning. Various aspects of different languages are being described within formally defined frameworks proposed by a number of interacting linguistic theories. The proposals deal with various levels of linguistic description, from the level of sounds (phonetics) up to the level of meaning. Partial grammars and lexicons now exist for many languages within various formal frameworks, and collections of linguistic analyses of text and speech are being accumulated for use in both theoretical research and applications.

Besides approaches relying on symbolic means and ‘rationalist’ efforts, which result in language models consisting of grammar rules and lexical entries, alternative methods employ statistics computed from input text or its analysis to produce a stochastic model. However, a common and crucial issue cutting across all types of enterprise in this domain is the need to adopt or design an adequate formal representation of language structures in order to accommodate relevant linguistic knowledge in its relation to the actual language data.

There are a number of tasks which typically require a soundly defined formal representation of language structures:

1. analysis (parsing) of input text or speech into a representation, tagging of text or speech collections;
2. synthesis (generation) of output text or speech from a representation;
3. mapping of one representation onto another, i.e., transfer (typically in machine translation systems).

These are the elementary tasks which are parts of many natural language processing applications, some of which are listed below:

• machine translation systems;
• natural language interfaces to knowledge bases, question answering systems;
• automatic abstracting and knowledge acquisition systems;
• automatic acquisition of linguistic data and its integration into a language model.

1 Grant of the Grant Agency of the Czech Republic No. 405/96/0198, which has now become an integral part of a newly awarded long-term grant of the same agency, No. 405/96/K214.

When a linguistic description is implemented on computers, the usual goal is to parse sentences and produce representations of their analyses, thereby verifying the framework, the linguistic theory and the description itself. Another way to obtain a (morphological and syntactic) analysis of sentences is to apply statistical methods to large samples of (already analyzed) texts, so that a new text can be processed afterwards, with some degree of linguistic analysis performed on the basis of the data acquired in the ‘learning’ phase. Both kinds of effort converge, and their increasing potential is reflected in the growing amount of text and speech data analyzed to varying degrees for various purposes.

Formal representations of language structures which have been proposed by different linguistic theories and/or used in natural language processing applications reflect their context in many respects, and suitable candidates for an intended more general use are difficult to find.
This is due to various aspects of their design, such as (i) specific theoretical commitment, (ii) limited expressive power, seen in partial coverage of language phenomena and restriction to certain levels of linguistic analysis, (iii) difficulties in expressing relationships between different levels of analysis, (iv) hard-wired reliance on some characteristics of a certain language or language group and the resulting difficulty in adapting the framework to a typologically different language, and, finally, (v) application-specificity. For example, it is difficult to express a full-fledged syntactic analysis of a ‘free word-order’ language by means of the word-class labels and constituent brackets used for tagging (mostly English) texts. Although it is not likely that a single framework could become a universally accepted vehicle of linguistic knowledge, we believe that a higher degree of generality and flexibility can be achieved for the benefit of both theoretical studies and application-oriented projects.

2 Characteristics of a Satisfactory Solution

From the conceptual point of view, an adequate design of formal representation should be able to express linguistic facts related to the following levels of description:

1. level of phonetics, phonology, graphemics: specification of phonemes, stress and prosodic patterns, etc.;
2. level of morphology: morphemes, morphological categories;
3. level of syntax: syntactic categories, syntactic structure (trees);
4. level of (linguistic) meaning: disambiguation of lexical meaning, specification of underlying structure and function, communicative dynamism and topic-focus articulation, anaphora resolution.

There are several important features that should be reflected in the design to make it really useful:

• It should be possible to describe a language structure in all its aspects simultaneously, i.e., to be able to relate facts from all levels of linguistic analysis in a straightforward fashion.
At the same time, the design should permit access to specific aspects of the description without other aspects intervening. Thus, a user interested only in syntactic structure should be able to filter out any other information.
• If a certain aspect of linguistic description can be structured and viewed differently depending on theoretical commitments, the design should provide an option to derive the desired way of presenting the linguistic facts from a common representation. Thus, both phrase-structure and dependency trees could be derived from the description.
• The design should be capable of accommodating typologically different languages without substantial modifications. In particular, it should provide space for stating the relation between word-order variations and higher levels, and for the interplay between morphology and syntax in the case of complex expressions.
• A related requirement concerns the possibility of expressing links between parallel structures and their analyses in different languages. This feature is important if parallel bi- or multilingual data are to be analyzed and studied as contrastive language structures.
• The design should provide space for as few or as many linguistic facts concerning a language structure as it is possible or practical to collect or express. This feature would permit the stepwise integration of text or speech samples with their analyses, possibly starting with a bare text/speech string and leaving some levels unspecified.
• It should be possible to represent at least some linguistic facts in an underspecified form. Wherever possible, an option to use a quantitative measure should accompany such cases. Disjunctions restricted to local domains, underspecified descriptions and weights could be the means to meet this requirement.
• The formal representation should be convertible to another format, as required by an application or desired by another specification covering compatible conceptual issues.
• The design should be flexible in the sense that it should impose as few inherent restrictions on its extension and modification as possible.

3 Background, Methods and Problems

Without attempting to preview the results, the following points can be made to sketch the starting point, the outlines of the goal, and the path towards its achievement:

1. The project will be able to profit from theoretical results and practical experience gained in the field of formal description of natural language at our sites. The fruitful results concerning word-order variations and their relation to meaning, as well as the richness of syntactic studies based on a dependency-oriented model, both widely acknowledged and faithful to the high standards of the Prague School linguistic tradition, provide a wealth of stimulating material. At both sites, a number of application-oriented research projects have, at least in some respects, tackled the problems of an adequate representation of language structures. The projects include machine translation, natural language interfaces to knowledge bases, automatic abstracting, automatic knowledge acquisition from texts, and grammar checking.

2. The smallest piece of information (typically, a linguistic category) is expressed as an attribute and its value (i.e., a ‘feature’). A collection (conjunction) of such pairs is used to describe a linguistic object (typically an aspect of the linguistic description of a word or a collocation). Such a collection allows for partial information (underspecification) and can enter into more complex structures, where some attribute values are not atoms but structures. Through the recursive nature of such a representation, linguistic structures of arbitrary complexity can be described. Two or more attributes can share a single value, which is a possible way to implement relations between linguistic facts at different levels of description.
As structures of this type have become a kind of standard in modern linguistic research, the issues of compatibility with other approaches will be substantially simplified on many levels.
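
The attribute-value representation described in point 2 can be illustrated with a minimal sketch. It is not the project's actual formalism: the attribute names (`form`, `morphology`, `syntax`, `agr`, `subject_agr`) and the helper `unify` are hypothetical and chosen only for illustration. Nested Python dictionaries stand in for feature structures, a missing attribute stands for underspecified information, and two attributes pointing at the same dictionary object stand for a shared value.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Combine two descriptions, or return None if they are incompatible.

    Atoms (strings) are compatible only when equal. Structures (dicts) are
    combined attribute by attribute; an attribute missing on one side is
    treated as underspecified and is simply taken from the other side.
    Simplification: the result does not track value sharing.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for attr, b_val in b.items():
            if attr in result:
                merged = unify(result[attr], b_val)
                if merged is None:
                    return None  # clash somewhere below this attribute
                result[attr] = merged
            else:
                result[attr] = b_val
        return result
    return a if a == b else None


# One value shared by two attributes at different levels of description.
# It starts out underspecified: the number is not yet known.
agreement = {"person": "3"}
word = {
    "form": "sleeps",
    "morphology": {"pos": "verb", "agr": agreement},
    "syntax": {"function": "predicate", "subject_agr": agreement},
}

# Adding information to the shared value is visible from both levels.
agreement["number"] = "sg"

# Two partial descriptions combine into a more specific one ...
combined = unify({"morphology": {"pos": "verb"}},
                 {"morphology": {"agr": {"number": "sg"}}})
# ... while conflicting atomic values cannot be combined.
clash = unify({"pos": "verb"}, {"pos": "noun"})
```

Because `word["morphology"]["agr"]` and `word["syntax"]["subject_agr"]` are the same object, a fact stated once holds at both levels, which is one simple way to realise the value sharing mentioned above. A fuller implementation would also need to preserve such sharing under unification, which this sketch leaves out.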