SynTagRus – a deeply annotated corpus of Russian

The Russian dependency treebank, SynTagRus, is a subcorpus of the National Corpus of the Russian Language and, at the time of writing (Spring 2013), contains over 52,000 sentences (roughly 770,000 words). It is supplied with several types of annotation. First, it contains comprehensive morphological and syntactic annotation. The latter is presented in the form of a full dependency tree that uses about 75 distinct dependency labels. Second, part of SynTagRus carries lexical semantic annotation: in every case of word sense ambiguity, the concrete lexical meaning is identified and explicitly marked. So far, over 10,000 SynTagRus sentences have been fully tagged for word senses, and the number is constantly growing. Third, a part of SynTagRus is annotated for collocate Lexical Functions (in Igor Mel’čuk’s sense). SynTagRus is freely available for research and educational purposes.

1 Introductory Remarks

Tagged corpora are primarily intended to provide a basis for linguistic research in all areas of vocabulary and grammar (including changes occurring in the language throughout its history). There are two significantly different areas of such research. On the one hand, there are traditional linguistic studies that need large amounts of textual material; this demand is much easier to meet if good, deeply tagged corpora are available. On the other hand, modern computational linguistics has itself become an eager user of such corpora, which are increasingly used as training sets in machine learning. As a result of such learning, computer programs improve their ability to extract from new texts the sophisticated types of data contained in the training texts. Generally speaking, the deeper the level of corpus annotation, the more advanced the types of information that can be learned from the corpus.
We describe the SynTagRus corpus, developed by the Laboratory of Computational Linguistics (LCL) of the Institute for Information Transmission Problems of the Russian Academy of Sciences and supplied with several types of annotation: morphological, syntactic, lexical semantic and lexical functional. This corpus serves both areas of linguistic research, traditional and computational.

¹ This work has been partially supported by the Russian Foundation for Basic Research (grant No. 13-06-00756) and a grant from the Russian Foundation for the Humanities (13-04-00343a).

2 Text Corpora of Russian

The largest corpus resource available for Russian is the so-called National Corpus of the Russian Language (“Национальный корпус русского языка”), abbreviated as НКРЯ or NCRL and available through the portal www.ruscorpora.ru. It combines several independently created subcorpora that contain different texts and are supplied with different types of annotation:

1) the main corpus, provided with morphological annotation, which comprises over 300 million words from written texts of a variety of genres, starting from the 18th century. Since Russian is a morphologically rich language, with large paradigms for nouns, adjectives and verbs, the annotation is in most cases morphologically ambiguous: a word may have more than one set of morphological tags, corresponding to different parts of speech and/or morphological features. The main corpus contains a subcorpus of texts with resolved morphological ambiguity, which contains over 7 million words. Lexical ambiguity is not resolved independently; a word can be considered lexically disambiguated only if its senses belong to different parts of speech, as in the adjective slepoj ‘blind’ vs. the noun slepoj ‘blind person’. The corpus now contains partial semantic annotation presented in the form of simple semantic features of words.
2) the syntactic corpus SynTagRus, provided with morphological and syntactic annotation. The corpus comprises about 770,000 words (over 52,000 sentences) and provides a syntactic dependency structure for every sentence. The corpus is fully disambiguated, both morphologically and syntactically, so that every word is supplied with one part-of-speech tag and a unique set of morphological features, while every sentence has one and only one dependency tree structure. Lexical semantic and lexical functional annotation exists in the standalone version of the corpus and is not displayed on the NCRL site.
3) the newspaper corpus, built on the same principles as the main corpus and comprising articles published since 2000 by seven mass media outlets (four newspapers published in Moscow and three electronic mass media), which now contains over 170 million words.
4) several aligned parallel corpora (English-Russian, Russian-English, German-Russian, Ukrainian-Russian, Russian-Ukrainian, Belarusian-Russian, Russian-Belarusian, and multilingual).
5) a dialectal corpus, composed of samples of dialect speech from various regions of Russia, presented in quasi-standard orthography. The corpus disregards phonetic variation but demonstrates morphological, lexical and syntactic peculiarities of regional and dialectal usage.
6) a poetry corpus, primarily comprising Russian poetic works of the 18th and 19th centuries, supplemented by works of a number of 20th-century poets. In addition to non-disambiguated morphological tagging built on the same principles as that of the main corpus, the texts are provided with information on the poetic meters used in them.
7) an educational corpus, intended for learners of Russian, which offers disambiguated morphological tagging for simple prose texts.
8) a corpus of Spoken Russian, which consists of transcripts of samples of public and private oral speech of the 20th and 21st centuries as well as film transcripts.
The corpus is supplied with morphological and partial semantic annotation.
9) an accentological corpus (a corpus of the history of Russian word stress).
10) a Church Slavonic corpus, which comprises modern liturgical texts of the 19th and 20th centuries, as well as older religious and biblical texts. The corpus enables word search in three orthographic systems.
11) a multimedia corpus, composed of fragments of films released since 1930 and presented as video files, audio files and textual transcripts, as well as lists of gestures present in these fragments.

3 SynTagRus Treebank

The Russian dependency treebank, SynTagRus, developed and maintained by the LCL (Boguslavsky et al. 2000, Apresjan et al. 2005), currently contains over 52,000 sentences (roughly 770,000 words) belonging to texts from a variety of genres (contemporary fiction, popular science, newspaper, magazine and journal articles dated between 1960 and 2013, texts of online news, etc.) and is steadily growing. It is an integral but fully autonomous part of the Russian National Corpus developed in a nationwide research project and can be freely consulted on the Web. Since Russian, like other Slavic languages, has a relatively free word order, SynTagRus adopted a dependency-based annotation scheme, in many respects parallel to that of the Prague Dependency Treebank (Hajič et al., 2001). So far, SynTagRus is the only corpus of Russian supplied with comprehensive morphological and syntactic annotation. The latter is presented in the form of a full dependency tree provided for every sentence. In the dependency tree, nodes represent words annotated with parts of speech and morphological features, while arcs are labeled with syntactic dependency types. There are about 75 distinct dependency labels in the treebank, half of which are taken from Igor Mel’čuk’s Meaning ⇔ Text Theory (see e.g. Mel’čuk, 1988).
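The tree structure just described (word nodes carrying morphological tags, labeled arcs, exactly one tree per sentence) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the corpus's actual storage format: the class `Node`, the functions `root_of` and `is_projective`, and the part-of-speech names are hypothetical. The three-word fragment and its labels (опред, агент) follow the dependency types listed for Fig. 1; the arcs themselves are our reading of those definitions, not a copy of the figure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    index: int             # 1-based position of the word in the sentence
    form: str              # word form as it appears in the text
    pos: str               # part of speech (illustrative names, not corpus tags)
    feats: str             # morphological features (illustrative)
    head: Optional[int]    # index of the head node; None marks the root
    label: Optional[str]   # dependency label on the arc coming from the head


def root_of(nodes: List[Node]) -> Node:
    """Return the single root, raising if the nodes do not form one tree."""
    roots = [n for n in nodes if n.head is None]
    if len(roots) != 1:
        raise ValueError("a sentence must have exactly one root")
    by_index = {n.index: n for n in nodes}
    for n in nodes:
        seen = set()
        cur = n
        while cur.head is not None:      # climb towards the root
            if cur.index in seen:
                raise ValueError("cycle detected")
            seen.add(cur.index)
            cur = by_index[cur.head]
    return roots[0]


def is_projective(nodes: List[Node]) -> bool:
    """True if every word lying between a head and its dependent
    is itself a descendant of that head (assumes a well-formed tree)."""
    by_index = {n.index: n for n in nodes}

    def dominated_by(idx: int, ancestor: int) -> bool:
        cur = by_index[idx]
        while cur.head is not None:
            if cur.head == ancestor:
                return True
            cur = by_index[cur.head]
        return False

    for n in nodes:
        if n.head is None:
            continue
        lo, hi = sorted((n.index, n.head))
        for k in range(lo + 1, hi):
            if not dominated_by(k, n.head):
                return False
    return True


# Fragment "устанавливаемых нефтяными компаниями" ('set by oil companies').
fragment = [
    Node(1, "устанавливаемых", "participle", "PRES,IMPERF,PASS,PL,GEN", None, None),
    Node(2, "нефтяными", "adjective", "PL,INSTR", 3, "опред"),
    Node(3, "компаниями", "noun", "PL,INSTR", 1, "агент"),
]

print(root_of(fragment).form, is_projective(fragment))
```

Storing a single `head` index per node is what guarantees that each word has at most one incoming arc; the projectivity test is kept separate because a treebank may legitimately contain trees that fail it.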
Fig. 1 below is a sample dependency structure for the sentence:

Наибольшее возмущение участников митинга вызвал
most[NEUT,SG,ACC] indignation[SG,ACC] participant[PL,GEN] meeting[SG,GEN] cause[PAST,PERF,,SG,MASC,GEN]

продолжающийся рост цен на бензин,
continue[PART,PRES,,IMPERF,,SG,MASC,NOM] growth[SG,NOM] price[PL,GEN] on[PREP] petrol[SG,ACC]

устанавливаемых нефтяными компаниями
set[PART,PRES,,IMPERF,,PASS,PL,GEN] oil-Adj[PL,INSTR] company[PL,INSTR]

‘It was the continuing growth of petrol prices set by oil companies that caused the greatest indignation of the participants of the meeting’.

Fig. 1. A syntactically tagged sentence.

Dependency types used in Fig. 1 include:
1. предик (predicative), which, prototypically, represents the relation between the verbal predicate as head and its subject as dependent;
2. 1-компл (first complement), which denotes the relation between a predicate word as head and its direct complement as dependent;
3. агент (agentive), which introduces the relation between a predicate word (verbal noun or verb in the passive voice) as head and its agent in the instrumental case as dependent;
4. квазиагент (quasi-agentive), which relates any predicate noun as head to the word that fills its first syntactic valency as dependent, if that word does not qualify as the noun’s agent;
5. опред (modifier), which connects a noun head with an adjective/participle dependent if the latter serves as an adjectival modifier to the noun;
6. предл (prepositional), which accounts for the relation between a preposition as head and a noun as dependent.

Dependency trees in SynTagRus may contain non-projective dependencies. Normally, one token of the sentence (roughly, a word taken from space to space) corresponds to one node in the dependency tree. There are, however, a noticeable number of exceptions, the most important of which are the following:
1. compound words like пятидесятиэтажный ‘fifty-storied’, стопятидесятипятимиллиметровый ‘one hundred fifty-five millimeter wide’, where one token corresponds to two or more nodes;
2. so-called phantom nodes for the representation of hard cases of ellipsis, which do not correspond to any particular token in the sentence; for example, я купил рубашку, а он галстук ‘I bou