Italian-Arabic linguistic tools


This paper describes our participation in the research project “Corpus bilingue Italiano – Arabo” (Bilingual Italian – Arabic corpus), funded by law 488/92. The purpose of the project is to develop linguistic tools and resources for bilingual Italian/Arabic corpora; its background and starting point are tools already developed by the Computational Linguistics Institute. As far as IT tools are concerned, the project consists of four basic elements: a) a morphological engine for the Arabic language; b) an alignment system for Italian and Arabic parallel texts; c) an automatic tagging system for Italian and Arabic texts; d) access tools (and the related query systems) for the texts of the bilingual corpora at each text-processing step.

Introduction

In the framework of the comprehensive “Linguistica Computazionale: ricerche monolingui e multilingui” (Computational Linguistics: monolingual and multilingual research) project, funded by law 488/1999, the Istituto di Linguistica Computazionale has taken part in the study and development of tools and resources for the Arabic language, as part of the “Corpus bilingue Italiano – Arabo” (Italian – Arabic bilingual corpus) objective. This objective involves the development of a bilingual linguistic work environment, consisting of Italian and Arabic tools and resources, with special attention to contrastive aspects. Bilingual corpora are innovative research tools that work by comparing the languages and/or cultures concerned; they are essential for developing computer-assisted teaching methods and for acquiring much of the knowledge on which the most promising multilingual IT applications are based (translation aids, information retrieval, data mining, etc.).
The objective has been developed in co-operation with the Istituto Universitario Orientale of Naples and the “Dipartimento di Scienze Storiche del Mondo Antico” of Pisa University, which took care of its linguistic aspects, while we developed all its software features.

Linguistic tools:
- Textual analysis procedures
- Morphological engines
- Taggers
- Aligner

Linguistic resources:
- Monolingual reference corpora
- Automatic lexicons
- Bilingual aligned corpora
- Tagged corpora

As a background contribution, the Istituto di Linguistica Computazionale provided the PiSystem, an integrated linguistic analysis system developed by Eugenio Picchi, which has become the standard for many projects based on the study and analysis of different types of texts. Its basic engine is the DBT (Data Base Testuale, Textual Data Base) system for the analysis and use of textual resources. The PiSystem features used in the project were its existing Italian modules: PiMorfo (Italian morphological engine), PiTagger (automatic Italian morpho-syntactic disambiguator) and Synchro (a procedure for the automatic “synchronisation” of parallel texts, already used in Italian-English and Italian-Latin bilingual applications). These tools have also been the basis for developing the corresponding features of an Italian-Arabic bilingual system.

The project as a whole involves the development of the following linguistic resources:
- generic corpus (8 million words)
- aligned parallel corpus (4 million words)
- tagged corpus (2 million words)
- morphological lexical resources (20,000 entries)

The Arabic textual analysis system and relevant “query system”

The 256-position (8-bit) encoding defined by the ISO 8859-6 (Arabic) charset has been used throughout the project, for potential interchange with other partners, for the acquisition of existing texts and materials, and for the development of software tools.
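As a small illustration of what a single-byte Arabic encoding implies in practice, the sketch below round-trips a word through ISO 8859-6. It relies on Python's built-in `iso8859_6` codec as a stand-in; the project's own encoding tables are not shown in the paper, and the helper names `to_iso_8859_6` and `from_iso_8859_6` are hypothetical.

```python
# Sketch only: Python's standard "iso8859_6" codec stands in for the
# project's own encoding tables, which the paper does not describe.

def to_iso_8859_6(text: str) -> bytes:
    """Encode Arabic text; each character becomes exactly one byte."""
    return text.encode("iso8859_6")


def from_iso_8859_6(raw: bytes) -> str:
    """Decode ISO 8859-6 bytes back into a Unicode string."""
    return raw.decode("iso8859_6")


word = "\u0643\u062a\u0627\u0628"  # kaf, ta, alif, ba: "book"
raw = to_iso_8859_6(word)

print(len(word), len(raw))           # same length: one byte per letter
print(from_iso_8859_6(raw) == word)  # the round trip is lossless
```

The charset stores letters in their abstract form only; choosing the initial, middle, final or isolated shape is left to the display layer.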
The Arabic alphabet is composed of 28 letters, which take different shapes depending on their position (initial, middle, final or isolated), since the letters (except a group of six) have to be linked to each other to make words. An extremely important decision was to adopt a single encoding system both for the acquisition and entry of linguistic materials and for internal representation and processing. Given the bilingual nature of the project, and in order to use the materials and tools regardless of the availability of native Arabic computers and operating systems, the strategy chosen was to develop a proprietary system for interacting with Arabic materials, i.e. a system that can be used interactively through the keyboard and that gives a correct representation even without a specialised Arabic computer or operating system (the development environment is Windows). The keyboard keys were mapped to the Arabic alphabet following the layout of a standard Arabic keyboard (Fig. 1). Each program was provided with two input methods: the keyboard mapping just described, for normal typing, and a virtual keyboard operated with the mouse for composing text, queries in particular.

The DBT (Data Base Testuale, Textual Data Base) system was the basic tool used in the Arabic language project. Although this system was already equipped to manage a whole series of non-Latin alphabets, it required substantial changes in order to work properly on Arabic texts. It can display all or part of the text, search for words, calculate frequencies, define search functions with several words associated in different ways using logic operators, retrieve all the contexts that fulfil specific search conditions, generate ordered concordances, define specific conditions for concordance generation, search by regular phrases, etc.
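Two of the query functions listed above, frequency counts and ordered concordances, can be sketched in a few lines. This is a minimal illustration and not the DBT implementation: it assumes whitespace tokenisation, and the names `frequencies` and `concordance` are hypothetical.

```python
from collections import Counter


def frequencies(text: str) -> Counter:
    """Count how often each whitespace-separated token occurs."""
    return Counter(text.split())


def concordance(text: str, keyword: str, width: int = 2) -> list:
    """Return (left context, keyword, right context) rows for every
    occurrence of keyword, ordered by the right-hand context."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return sorted(rows, key=lambda row: row[2])


sample = "the cat sat on the mat and the dog sat on the rug"
print(frequencies(sample).most_common(2))
for left, key, right in concordance(sample, "sat", width=1):
    print(f"{left:>10} [{key}] {right}")
```

Because the functions operate on Unicode strings, the same sketch accepts Arabic text once it has been decoded; right-to-left display is a separate rendering concern.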
The Arabic-alphabet version of DBT preserves the characteristics of the language (such as displaying text from right to left) and has been instructed, through special descriptive tables, on how to read the input text encoding, both for proper display on screen and in print and for determining the proper alphabetical order. These resources have been designed to comply with the ISO 8859-6 standard.

Figure 1: data-entry keyboard
Figure 2: working session using Arabic DBT query system

Morphological engine

The morphological engine has been designed to perform a double function: on the one hand, to generate the inflexion, i.e. to automatically generate all the forms of an Arabic entry (including their morpho-syntactic classification); on the other hand, to perform morphological analysis, which goes back from a form to the entry (or entries) to which that form belongs and identifies its potential, theoretically valid, morpho-syntactic classifications. To develop this component, we had to:
1. Define the encoding system to be used for representing lexical data; define the composition, size and structure of the Lemmario (entries dictionary); define the encoding system, syntax and structure of the “morphological rules” file;
2. Identify groups of entries having the same morphological behaviour and draw up morphological rules based on the defined encoding and syntax;
3. Develop a “Lemmario” file and enter suitable inflexion codes in it;
4. Develop software modules for the development and management of supporting files (Lemmario and inflexion rules);
5. Develop software modules for generation and