Dialectal Arabic Telephone Speech Corpus: Principles, Tool Design, and Transcription Conventions

This paper presents the experience gained at LDC in the collection and transcription of a corpus of conversational telephone speech in dialectal Arabic. It covers the following: (a) Arabic language background; (b) objectives, principles, and methodological choices of dialectal Arabic transcription; (c) conceptualization and design features of LDC’s ‘Arabic Multi-Dialectal Transcription Tool’ (AMADAT); and (d) a brief description of the conversational Levantine Arabic transcription guidelines and annotation conventions.

1.0 Introduction: Arabic Linguistic Background

The Arabic language is a ‘linguistic continuum’ (Hymes, 1973) with two major poles: a standard Arabic, the language of most written and formal spoken discourse, and a collection of related Arabic dialects, which are mainly spoken and which present significant phonological, morphological, syntactic, and lexical differences among themselves and when compared to the standard written forms. This situation, usually referred to as ‘diglossia’ (Ferguson, 1959), presents some challenging issues for Arabic spoken language technologies, including corpus creation to support Speech-to-Text (STT) systems, since the spoken Arabic dialects are not officially written and have no standardized writing, in spite of growing but still relatively small and not wholly conventionalized web activities. A significant amount of linguistic variation occurs and produces many variant forms which are difficult to identify and regroup.

1.1 Arabic Dialectal Variation

The diglossic situation described above mainly represents a significant linguistic distance between all Arabic dialects and the ‘fusha,’ commonly identified as ‘Modern Standard Arabic’ or MSA, though the latter term does not cover all features of the former. This linguistic distance is characterized by substantial linguistic variation, mostly phonological, morphological, and lexical.
Arabic dialectal variation is significant not only between major dialects (for example, Egyptian, Levantine, Gulf, and Maghreb) but also between the regional variants of a major dialect (for example, Northern and Southern Levantine). Sound change has occurred in all Arabic dialects. In Levantine Arabic (LA), for instance, the sound /q/ is pronounced /q/ but also /’/, /g/ and /k/. The glottal stop is mostly deleted in medial and word-final position, with compensatory lengthening of the preceding vowel in word-internal position (ra?s ‘head’ becomes ra:s and bi?r ‘a well’ becomes bi:r). Moreover, interesting cases of chain shifts with counterfeeding rule interactions also occur: MSA fa?r ‘mouse’ goes to dialectal fa:r, while MSA faqr ‘poverty’ goes to faqr but also to fa?r, a form that now means ‘poverty’ although it earlier meant ‘mouse’. An important consequence of chain shifts is the multiplication of lexical ambiguity in the language. The complexity of the above situation is compounded by the existence of significant differences between the sound changes of the various Arabic dialects. In Egyptian Arabic, for instance, MSA /θ/ becomes both /t/ and /s/ while /g/ is used to replace /j/ and /?/ to replace /q/. In Sudanese Arabic, MSA /q/ is replaced by /g/ and the uvular [ɣ]. All of the above creates a considerable amount of confusion which needs to be addressed and taken into account in any dialectal transcription task.

1.2 Pertinent Linguistic Features and the Dialectal Arabic Transcription Challenge

The description of Arabic dialect differences above, which does not even consider linguistic variation conditioned by age, gender, urbanity, rurality or style, shows the complexity of any speech-to-text (STT) transcription task. It also predicts the challenges facing any linguistic transcription methodology which seeks to closely represent sound features while still capturing the distinctions that matter to native speakers.
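The counterfeeding interaction described in Section 1.1 can be made concrete with a small ordered-rule sketch. This is only an illustration under stated assumptions: the ASCII notation (`?` for the glottal stop, `:` for vowel length) follows the examples above, the function names `derive` and `shifted_homographs` are hypothetical, word-final glottal stop deletion is not modeled, and /q/ is always mapped to the glottal stop even though the text notes that faqr also survives unchanged.

```python
import re

VOWELS = "aiu"

# Ordered rules, each a (pattern, replacement) pair.
# Rule 1: a word-medial glottal stop after a vowel is deleted and the vowel lengthened.
# Rule 2: /q/ is realized as a glottal stop.
# Rule 1 runs BEFORE rule 2, so a glottal stop created from /q/ is never
# deleted: this ordering is what makes the interaction counterfeeding.
RULES = [
    (re.compile(rf"([{VOWELS}])\?(?=\w)"), r"\1:"),
    (re.compile(r"q"), "?"),
]


def derive(msa_form):
    """Apply the ordered rules to an MSA form and return the dialectal form."""
    form = msa_form
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def shifted_homographs(lexicon):
    """Find dialectal forms that coincide with a different MSA form.

    Returns (surface form, dialectal gloss, gloss of the identical MSA form).
    """
    clashes = []
    for msa_form, gloss in lexicon.items():
        surface = derive(msa_form)
        if surface != msa_form and surface in lexicon:
            clashes.append((surface, gloss, lexicon[surface]))
    return clashes


# Toy lexicon built only from the examples given in the text.
LEXICON = {"fa?r": "mouse", "faqr": "poverty", "ra?s": "head", "bi?r": "a well"}

if __name__ == "__main__":
    for msa_form, gloss in LEXICON.items():
        print(f"{msa_form} -> {derive(msa_form)} ({gloss})")
    print(shifted_homographs(LEXICON))
```

On this toy lexicon the only clash reported is fa?r, which is the dialectal outcome of ‘poverty’ and at the same time the MSA form for ‘mouse’. A purely sound-based transcription inherits this kind of ambiguity.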
In building a conversational Levantine Arabic corpus, a Romanized orthography-based transcription can bypass the issues of phonemic sound shifts and the resulting variation by, for example, giving a faithful rendering of Levantine pronunciation characteristics. However, such a Romanized transcription would be machine readable and usable only for, and within the framework of, a single dialect system: LA. A Romanized transcription output will necessarily lead to the following tasks: (a) a long LA-related disambiguation process, (b) a comprehensive LA-specific lexicon and grammar, and (c) significantly longer annotator training periods for better familiarization with transcription symbols.

Looking around us for examples of speech-to-text transcription practices which have been successfully used to support speech technologies (not just among linguists), one may ponder the wisdom of an orthography designed to write different spoken dialects (or different variants of one of them) more similarly than they sound, roughly as English orthography does for world Englishes. This idea may seem far-fetched, but the Arabic language continuum is similar in many ways to the English one and presents the following potentially useful features: (a) there exists an important core of mutual intelligibility between MSA and the dialects, (b) there is a high level of similarity in morphological form and syntactic structure, and (c) there is also a significant common lexical core in spite of important semantic differentiation features. To support the above claim, one can assert the existence of an ‘underlying’ MSA cognate base with close structural similarities. The internalized knowledge of this base by educated and even semi-literate Arabs and Arabic speakers is a resource available in the MSA writing and reading community in the Arab region and all over the world.
Finally, there also exists a standard MSA graphemic knowledge base which could be put to good use to help with dialectal Arabic speech-to-text transcription. So, the question may very well be: How can we harness the native speaker’s knowledge of Arabic orthographic conventions and of the linguistic MSA common core to complete a quick, easy, and low-cost Speech-to-Text transcription of Dialectal Arabic?

2.0 Principles, Objectives, and Methodology of Dialectal Arabic Transcription

2.1 Objectives of Dialectal Arabic Transcription

Our transcription specifications were developed in the context of a common task technology evaluation program in which the primary goal is the improvement of speech-to-text technologies and in which system building makes use of statistical machine learning techniques. In such an environment, large volumes of data with high quality human annotation are desirable both as training material for learning algorithms and as evaluation material for final systems. The speech for this project comes from the Linguistic Data Consortium’s Fisher Levantine Arabic project, in which more than 9400 speakers of the Northern, Southern and Bedwi dialects of Levantine Arabic (involving Jordan, Lebanon, and to a lesser extent Syria and the Palestinian territories) were recruited to participate in one to three telephone calls. Calls are up to ten minutes in duration and subjects speak to each other about assigned topics. A robot operator initiates most calls, though subjects local to the robot operator may dial a toll-free number to initiate calls. Calls are recorded digitally from the native telephone network, and subjects are compensated for each successful call in which they participate. To date, LDC has collected 1670 calls from 1802 speakers.
Because our goal is to produce transcripts which, first and foremost, support the development of STT systems, we adopt the following principles for a transcription system, in order of priority: (a) friendly to writers and readers: easy to learn to write and read; (b) lexically consistent: a given spoken form is always written the same way; (c) lexically distinctive: different spoken forms will always be written differently; and (d) acoustically consistent: transcription should represent pronunciation.

2.1.1 Rationale for and Advantages of an MSA-based Strategy for Dialectal Arabic Transcription

The advantages of an MSA-based strategy for dialectal Arabic transcription come from the fact that, while writing, Arabs benefit from their knowledge of stable MSA sounds and forms which keep to standard orthographic writing conventions. Arabs can read the same MSA words with the same or a closely similar level of recognition and comprehension. When faced with the task of writing down a dialectal Arabic speech form, native Arabs use their knowledge of the ‘underlying’ sounds of the Arabic word in order to transcribe its MSA-reconstructed form with Arabic script letters. Native Arab transcribers’ knowledge of the Arabic language and their familiarity with the rules of Arabic script constitute the basis of a strategy for the transcription of Arabic dialects. This strategy uses practical MSA-based orthographic conventions and a reasonable reliance on MSA to produce an acceptable output and guarantee a high rate of consistency and an easy retrieval of meaning structures. A significant advantage of this strategy is that native transcribers do not need to go through long training periods to learn difficult and often complex symbols.

2.1.2 Pitfalls of an MSA-based Strategy for Dialectal Arabic Transcription

An MSA-based Arabic orthographic script transcription faces three major challenges. The first is that there is little or no evidence of a dialectal Arabic text corpus with stable MSA-based writing conventions. In a concordance generated from a corpus of newswire written primarily in MSA, Levantine dialectal forms were found which attest to written colloquial Arabic communication.
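A concordance search of the kind mentioned above can be sketched as a simple keyword-in-context lookup. The paper does not say how its concordance was produced, so everything here is an assumption made for illustration: the function name `concordance`, the whitespace tokenization, the three-word list of common Levantine colloquial forms, and the invented sample sentence.

```python
def concordance(tokens, targets, width=2):
    """Return (left context, keyword, right context) for every target hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token in targets:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical list of colloquial forms to look for (not taken from the paper).
DIALECTAL_FORMS = {"شو", "هيك", "مش"}

# Invented sample sentence standing in for newswire text.
SAMPLE = "وقال المتحدث مش معقول هيك الوضع"

if __name__ == "__main__":
    for left, keyword, right in concordance(SAMPLE.split(), DIALECTAL_FORMS):
        print(f"{left} [{keyword}] {right}")
```

Running such a lookup over a large, mostly MSA corpus yields each colloquial form with its neighbouring words, which makes it possible to inspect how writers actually spell dialectal items.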