Recent developments in Machine Translation: a review of the last five years

multilevel tree representations which combine syntactic, logical and semantic relationships), lexical transfer (lexical substitution with some structural changes), structural transfer (tree transduction), syntactic generation, morphological generation (trees to strings). The long-term aim of the GETA project is a multilingual system producing ‘good enough’ results, i.e. accepting the need for post-editing. The system is essentially, like Eurotra, a linguistics-oriented system; it does not claim to use any ‘deep understanding’ or ‘intelligence’, and hence no AI-type explicit ‘expertise’ is incorporated in GETA-ARIANE, although the possibility of grafting on an ‘expert’ error correction mechanism was investigated by Boitet and Gerber (1986). However, unlike other linguistics-based systems, ARIANE extends translation analysis to sequences of several sentences or paragraphs, in order to deal with problems of anaphora and tense/aspect agreement. For practical production the system permits optional pre-editing, primarily the marking of lexical ambiguities; post-editing can be done using the REVISION program developed for ARIANE-78. It is a mainframe batch system with no human interaction during processing. However, Zajac (1986) has investigated an interactive analysis module for GETA, somewhat on the lines of Tomita’s research at Carnegie-Mellon (Tomita 1986). One important development has been the refinement of the theoretical basis, particularly the clarification of the distinction and the relationship between dynamic and static grammars in the system. Static grammars (or SCSG ‘structural correspondence static grammars’) record the correspondences between NL strings and their equivalent interface structures in a formalism which is neutral with respect to analysis and synthesis. 
The processes of analysis and generation are handled by ‘dynamic grammars’ written in appropriate ‘special languages’ (SLLPs or Special Languages for Linguistic Programming): ATEF for morphological analysis, ROBRA for structural analysis, structural transfer and syntactic generation, EXPANS for lexical transfer, and SYGMOR for morphological generation. (The distinction between ‘static’ and ‘dynamic’ grammars is now found in many advanced transfer systems; the GETA project has been a leading force in this theoretical development.) Equally important have been the improvements to the research environment, in tools for the development of systems, such as ATLAS for lexicographic work and VISULEX for viewing complex dictionary entries. Such tools are components of a ‘linguistic workstation’ for MT research (an idea also being developed by the Saarbrücken and the Kyoto groups, sect. 15 and below). Within this environment the work of the Calliope project has taken place: the compilation of the static grammars for English and French during 1983-84, their corresponding dynamic grammars, and the substantial lexicographic work. The Grenoble group has always encouraged and supported other MT projects using GETA software, and thereby helped to train MT researchers. ARIANE is regarded above all as “an integrated programming environment” for the development and building of “a variety of linguistic models, in order to test the general multilingual design and the various facilities for lingware preparation…” The ARIANE software has been tested on an impressive range of languages, often in small-scale experiments (Vauquois and Boitet 1985/1988; Hutchins 1986: 247-8; Boitet 1987a), but sometimes in larger projects, e.g. the English-Malay project mentioned elsewhere in this survey. The largest GETA-ARIANE system has been for Russian-French translation, which built upon previous experience with CETA. 
Since 1983 this system has been extensively and regularly tested in an experimental ‘translation unit’; large corpora of text have been translated, including some 200,000 running words during one 18-month period (Boitet 1987b). Another large-scale system was the German-French system developed by Guilbaud and Stahl, using the same generator programs as in the Russian-French system. Its principal features were the attention given to morphological derivation and inflection, and the restriction of structural analysis almost wholly to morphological and syntactic data, with little or no use of semantic information. The system has been described by Guilbaud (1984/1987), but there has been little development of the system since 1984 (Boitet 1987b). The most important practical application of a large-scale system has, however, been through GETA’s involvement in the French national computer-assisted translation project (NCATP). Launched in November 1983 (after a preparatory stage in 1982-83, the ESOPE project), the Calliope project has been financed 50% from public funds (administered by the Agence d’Informatique) and 50% from private sources. One source has been B’VITAL, founded in 1984 by the Grenoble group, which is responsible for the machine-readable dictionaries and for the ‘static grammars’ (Joscelyne 1987). Another has been Sonovision, which was to provide the aeronautics terminology for the major French-English system Calliope-Aero. After a demonstration of a prototype of Calliope-Aero at Expolangues in February 1986, it was decided also to develop an English-French system for the translation of computer science and data processing materials, Calliope-Info. In addition to these MT systems, both of them batch systems, the project was also to produce a translator’s workstation (Calliope-Revision, organised around a Bull Questar 400 microcomputer) for preparing and post-editing texts and for access to remote term banks, with OCR and desk-top publishing facilities included. 
This was essential if the systems were to be fully integrated into an industrial documentation environment. However, given the expected delays, there have been plans by SG2 (one of the backers) to develop a terminology aid with split-screen word processing, Calliope-Manuel. Whatever the commercial feasibility of the Calliope project, which came to a formal end in February 1987 (Boitet 1987a), the experience will no doubt be put to good use by the GETA project, in particular the experience of dealing with complex dictionaries and the type of scientific and technical sublanguage presented by aeronautics. Boitet (1986), for example, mentions the successful treatment of complex noun phrases (e.g. la jonction bloc frein et raccord de tuyauterie) and complex adjectival phrases (e.g. comprise entre les deux index noir). Other problems did not occur in the sublanguage and thus were put aside, e.g. interrogatives, relative clauses introduced by dont, imperatives, certain comparatives, nominal groups which do not consist only of nouns, and so forth. The NCATP has had other consequences. It stimulated the conversion of ARIANE-85 to run on the IBM PC AT (with a minimum 20MB hard disk), adequate for MT development but not for a production system. It also encouraged the writing of new software in a French dialect of LISP (Boitet 1986; Boitet 1987a, 1987b), with the aim of creating a fully multilingual system with a single ‘special language’ for processing strings and trees (TETHYS). Clearly, GETA has continued to advance the boundaries of MT research.

14. While GETA is the main MT research centre in France, there are other MT projects in Nancy and Poitiers. At Nancy, Chauché (1986; Rolf & Chauché 1986) continues his research, begun at Grenoble, on algorithms for tree manipulation which are suitable for MT systems. Tests of the algorithms have been applied to Spanish-French and Dutch-French experiments (in collaboration with Rolf of Nijmegen University). 
From Poitiers, Poesco (1986) reports a small-scale knowledge-based MT experiment for translating Rumanian texts on three-dimensional geometry into French. The ATN parser produces a conceptual frame-slot representation from which the generator devises a ‘plan’ for producing TL output. The restricted language system TITUS, designed for multilingual treatment of abstracts in the textile industry, has expanded in its latest version TITUS IV (Ducrot 1985) in order to deal with a wider range of subjects and to allow somewhat freer expression of contents. As elsewhere, there is commercial interest in translators’ workstations: Cap Sogeti Innovations is proposing a “language engineering workshop”, providing ‘intelligent’ language tools, a dedicated multilingual word processor, a natural language knowledge base, a technical summary writer, and a ‘text analyzer’ which will produce abstract meaning representations. Details are necessarily vague at present (Joscelyne 1987). Attitudes to MT in France are most likely to be changed by the provision of MT services on Minitel. The availability of Systran has already been mentioned (sect. 1 above). Other services include a number of dictionaries and term banks: the Harrap French and English slang dictionary, the Dictionary of Industries, Normaterm (the term bank of the French standards organisation AFNOR), the DAICADIF lexicon for telecommunications, and (next year) FRANTEXT, the historical dictionary Trésor de la Langue Française.

15. The largest and most long-established MT group in Germany is based at Saarbrücken. It began in the mid-1960s with research on Russian-German translation, sponsored from 1972 to 1986 by the Deutsche Forschungsgemeinschaft. The SUSY project expanded into a multilingual system, based on the transfer approach, with the source languages German, Russian, English, French, and Esperanto, and the target languages German, English and French. 
Detailed descriptions of the latest version SUSY II as at the end of 1984 are given by Maas (1984/1987) and by Blatt et al. (1985), and summarised by Hutchins (1986: 233-239). The most recent developments of MT research at Saarbrücken are to be found in Zimmermann et al. (1987).