Evaluating Translation Quality as Input to Product Development

In this paper we present a corpus-based method to evaluate the translation quality of machine translation (MT) systems. We start with a shallow analysis of a large corpus and gradually focus attention on the translation problems. The method constitutes an efficient way to identify the most important grammatical and lexical weaknesses of an MT system and to guide development towards improved translation quality. The evaluation described in this paper was carried out as a cooperation between an MT technology developer, Sail Labs, and the Computational Linguistics group at the University of Zurich.

1. Different types of evaluation for different purposes

Sail Labs carries out various types of translation quality (TQ) evaluations (absolute, comparative, text-based and sentence-based) and uses different methods (glass-box and black-box evaluations, pre- and post-release, using linguistic test suites and real text corpora). Most of these evaluations are made from a developer's rather than a user's point of view. Please note that the evaluations reported here were carried out with earlier product versions and their results were used in the development of Sail Labs' current MT technology. For this reason, the concrete results included in this paper (statistics and phenomena) do not reflect the current status of Sail Labs' technology.

Before designing an evaluation method it is crucial to answer the following questions (King, 1997):
• What is the purpose of the evaluation?
• What exactly is being evaluated?

In this paper we focus on a TQ evaluation that answers the question: in which linguistic areas does the evaluated MT system have the most problems? The purpose of our evaluation is thus to identify the most costly grammatical and lexical weaknesses, so that by concentrating development on these areas we can most effectively improve the TQ of our systems. We did not want to evaluate the overall TQ of our systems, but rather to pinpoint the problems found in the worst translations.

2. Our Evaluation Method: 'Survival of the Weakest'

We chose a corpus-based approach because we wanted to measure the performance of the MT system with minimal user involvement (e.g., no prior adaptation of the texts and no lexical coding of unknown words). This means that we checked the 'performance' rather than the 'competence' of the system (Falkedahl, 1998). We were not merely interested in determining which linguistic problems the system could handle, and to what degree, but rather in which areas the system encounters the most severe problems when translating real texts.

To achieve a realistic distribution of linguistic phenomena, it would be best to use a collection of test sentences covering the various linguistic phenomena in proportion to their frequency of occurrence in corpus texts. Such a collection, however, is very difficult if not impossible to obtain, and constructed test suites for linguistic phenomena whose real occurrence frequency is unknown would be of little use to us. For these reasons, we selected texts from the Internet and from a corpus CD and considered all phenomena occurring in the test corpus. To this end, the corpus must be big enough to yield representative frequencies of linguistic phenomena. Another advantage of real texts is that they contain interactions between various linguistic phenomena, which is a further important aspect in evaluating the performance of a system.

The evaluation method described here adopts, on the whole, a black-box approach.
The advantage of this is that the evaluation can be outsourced to an institution not involved in the system development, which ensures a more objective evaluation. After the final step of the black-box evaluation, the external evaluators from the University of Zurich passed the results to the Sail Labs system developers, who carried out the more time-consuming glass-box evaluation using standard methods (e.g., isolating the suspected phenomenon, tracing the grammar rules, etc.).

In the black-box evaluation we applied a four-step filtering mechanism, in which each step narrowed down the set of sentences passed on to the next step according to certain criteria. This allowed us to start the evaluation with an extensive data set while continually reducing it for the more costly subsequent steps. Each metric and its rating scale were defined in written form, where possible with reference to quantitative assignment criteria (e.g., a sentence is bad if more than half of it is not understandable). From our experience with other evaluation projects, and as reported by Sparck Jones and Galliers (1995), it is crucial to define the evaluation criteria and the values for the text and sentence ratings in as much detail as possible. Cross-checking and regular discussions among the evaluators helped to ensure that the metrics were applied consistently and that the subjectivity of the ratings was kept to a minimum.

The final result of the evaluation is a list of grammatical and lexical errors with their respective frequencies within the set of worst translations. This list documents the causes of the most frequent and severe translation problems in the corpus of real texts.
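To make the bookkeeping behind this filtering concrete, the sketch below shows one way the progressively narrowed data sets and the final error-frequency list could be represented; the individual steps are described in the following subsections. It is an illustration only: in the evaluation itself all ratings and error labels were assigned manually by human evaluators, and the data structures, field names and thresholds used here are hypothetical.

```python
from collections import Counter

# Hypothetical, hand-assigned evaluation data for two translated texts; in the
# real evaluation these scores and error labels come from human evaluators.
texts = [
    {"id": "t1", "text_score": 1.6, "sentences": [
        {"id": "s1", "meaning": 2, "errors": ["verb position", "unknown word"]},
        {"id": "s2", "meaning": 8, "errors": []},
    ]},
    {"id": "t2", "text_score": 2.7, "sentences": [
        {"id": "s3", "meaning": 9, "errors": []},
    ]},
]

# Step 1: keep only the texts rated 'generally not good' (average below 2).
bad_texts = [t for t in texts if t["text_score"] < 2]

# Step 2: within those texts, keep only the worst-translated sentences
# (low 'preservation of meaning' scores; the cut-off here is a placeholder).
worst_sentences = [s for t in bad_texts for s in t["sentences"] if s["meaning"] <= 3]

# Final result: grammatical and lexical error types ranked by their frequency
# within the set of worst translations.
error_frequencies = Counter(e for s in worst_sentences for e in s["errors"])
for error, count in error_frequencies.most_common():
    print(f"{count:4d}  {error}")
```

The more costly later steps then only operate on the ever smaller set of remaining units, which is what keeps the overall evaluation effort manageable.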
2.1. Selection of test material

For each language direction we selected between 100 and 140 texts, totalling approximately 5,500 to 6,000 sentences (translation units), mainly from the Internet and partly from the ACL/ECI Multilingual Corpus CD1. We chose texts from various subject areas but with little specialised terminology, a) to ensure a good general understanding of the topics by the evaluators, and b) because we develop general-purpose MT technology. The texts were short, in order to obtain a broad variety of texts for a given corpus size, and contained sentences of varying linguistic style (simple and complex, short and long sentences, listings and other non-sentence structures). Texts were taken from different domains to suit the purposes of the particular evaluation. We used general texts as well as texts from data processing, the car industry, economics, medicine, biology, geography and geology, recreation and sports, linguistics, and art and literature. Where available, the texts were translated using the systems' relevant terminology lexica. In order to capture the different linguistic styles pertinent to particular domains, we selected texts that served various functions (newspaper, manual, Internet, dialogue). Since we used the texts as we found them and made no pre-evaluation changes, we decided to exclude texts with severe and multiple spelling errors or slang: we were not interested in evaluating the robustness of the MT system when facing bad input, but rather its performance with relatively well-formed texts.

2.2. Step 1: Evaluate TL texts after translation with the MT system

After translation with our MT system, the target language (TL) texts were evaluated to identify bad translations. The source language (SL) sentences were not taken into consideration in this step, as we wanted an evaluation of the generated TL text as a standalone text. The texts were evaluated according to the following three parameters:
• understandability (the amount of information that is understood by the reader);
• grammaticality (syntactically ill-formed sentences and incorrect morphology);
• lexical correctness (the number of unknown, i.e. untranslated, words and the suitability of the chosen words in the given context, not with regard to the SL sentence).

These three parameters were chosen to capture the various purposes a machine translation may serve (information translation or input for post-editing). Each criterion was rated on a 3-point scale:
1. Bad
2. Neither bad nor good
3. Good

The rating for the three parameters was done paragraph-wise to ensure that each paragraph contributed equally to the overall score. The average was then computed for the whole text, which introduced decimal scores. Each text was evaluated by three persons to reduce subjectivity. All texts evaluated as generally not good (average point value below 2) progressed to evaluation step 2.

The results were documented extensively, including valuable additional information on the overall quality of the translated texts in the various subject areas. We documented the grades for the three criteria for each text and computed the averages across subject areas. Presuming that even texts that are translated well contain their share of badly translated sentences, excluding these texts decreases the set of badly translated sentences available for the subsequent steps. For more accurate data on the frequency of problematic phenomena, we could have skipped this step and evaluated all sentences immediately. However, we designed this step to exclude understandable texts with many well-translated sentences in order to maximise the relevance of the problems contained in the remaining sentences. This also has the advantage of excluding texts from certain genres that are generally translated well.

2.3. Step 2: Evaluate individual sentences

In step 2, the goal was to identify, within the 'bad' texts, those sentences that are translated worst. This time the SL sentences were taken into account for the assessment of the TL sentences, to enable a more informed evaluation. This step was carried out for approximately 3,500 to 4,000 translation units per language direction. We used two metrics, each rated on a 10-point scale:
• Preservation of meaning: Is the meaning of the TL sentence the same as the meaning of the SL sentence?
7–10 points (Good): the meaning of the SL and the TL sentence is about the same.
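To illustrate how the paragraph-wise ratings of step 1 turn into text-level scores and feed the threshold above, here is a minimal sketch with invented scores from three evaluators; it assumes, as one plausible reading, that the 'below 2' threshold applies to the average over all three criteria. The real ratings were, of course, assigned manually.

```python
from statistics import mean

# Hypothetical paragraph-wise ratings for one translated text: each of the
# three evaluators scores every paragraph on the 3-point scale
# (1 = bad, 2 = neither bad nor good, 3 = good) for each criterion.
# All names and numbers below are illustrative, not taken from the paper.
ratings = {
    "understandability":   [[2, 1, 2], [2, 2, 1], [1, 2, 2]],  # one inner list per evaluator
    "grammaticality":      [[1, 1, 2], [2, 1, 1], [1, 2, 1]],
    "lexical correctness": [[2, 2, 1], [2, 1, 2], [1, 2, 2]],
}

# Average over paragraphs for each evaluator, then over the three evaluators;
# this averaging is what introduces the decimal scores mentioned above.
criterion_scores = {
    criterion: mean(mean(paragraph_scores) for paragraph_scores in per_evaluator)
    for criterion, per_evaluator in ratings.items()
}
text_average = mean(criterion_scores.values())

for criterion, score in criterion_scores.items():
    print(f"{criterion}: {score:.2f}")
print(f"text average: {text_average:.2f}")

# Texts rated 'generally not good' (average point value below 2)
# progress to the sentence-level evaluation of step 2.
if text_average < 2:
    print("-> text passes on to step 2")
```

With these illustrative numbers the text average comes out at about 1.56, so the text would pass on to the sentence-level evaluation of step 2.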