CORPUS LINGUISTICS IN FINLAND: A RESOURCE SURVEY

Finnish corpus linguistics, and computational linguistics in general, has a long tradition, which gives it standing in the international community and has produced solid results in various areas. The first electronic corpus projects appeared in the 1960s, as in many other countries [1, 2]. From the start, this work in Finland was closely tied to the development of original text-processing software, and was marked by close international links and a commitment to current topics in lexicography and grammatical description ([3-6]; see also the round-table material on corpus linguistics in “Korpuslingvistiikan työpaja I: Korpukset ja ohjelmat”, pp. 126-134 of [7]). A major feature of computational linguistics in Finland has been its close connection with the development of end-user products, including collaboration with commercial firms [8; 1, pp. 50-54 and 62-64].

This paper is informational in character. Its specific purposes are to give Russian linguists an overview of the main computational linguistic resources in Finland and to determine the scope for their use of these resources. For each existing corpus, its current location is indicated; in some cases this amounts to indicating where it was created and initially stored. The characteristics of each corpus are indicated by listing the sources of detailed description (on the Internet and/or in print) from which full information can be obtained.

Many of the resources described below provide remote access to the files. Most of the servers run under Unix, which generally requires a Unix client program on the user’s machine (e.g., F-Secure SSH Client). I do not discuss the technical and organizational aspects of access in detail and merely state that almost all of the resources are available for free use for research and teaching purposes. In most cases, this requires permission from the administrator or owner of the corpus.
Contact information is given on the corresponding Internet sites or in articles on the topic.

One terminological remark is important; it concerns the definition of what counts as a corpus. The term is used in several senses and often loosely, so that there is a general tendency to call any collection of texts in digital format an electronic corpus. On the other hand, the term corpus has recently been used increasingly not for text as such (in English, running text) but for linguistic material specially selected according to certain principles. “So a corpus in modern linguistics, in contrast to being simply any body of text, might more accurately be described as a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration” [9, p. 24]. However, despite the spread of the new approach, old corpora (i.e., simply electronic texts) still retain their linguistic value in many areas. This depends on the substantial differences across languages in the quantity and quality of the work done. For example, the most recent text collections in English pose substantially more complicated tasks (various types of annotation, parallel corpora, speech recordings in electronic form, and so on). On the other hand, for many modern languages there are as yet no simple, well-balanced representative corpora, let alone annotated ones. Special and equally difficult problems arise in creating any corpus of ancient texts.