In the specification of the Conference aims, the following keywords appear in the LREC materials: availability of language resources, methods for evaluation of resources, comparing different approaches to a given problem, choosing the best solution, etc. To meet these goals, we present here an overview of the state of the art in Czech part-of-speech (PoS) tagging. We concentrate on the problems of data creation and availability, then discuss the results we obtained when using various methods to tag texts written in a highly inflectional language, and finally conclude with an outline of future perspectives. [1]

1 Natural Language Processing

One of the meanings of the headword "process" is "the analysis (of information) using a computer". That is exactly what we mean by natural language processing (NLP): an analysis of language information using a computer. However, the computer alone is not good enough. We need an electronic database covering written and spoken language resources. [2] The starting points for NLP are building a structured corpus and annotating the corpus according to the needs of further processing. A corpus is a vast, electronically processed collection of language texts that makes as explicit as possible the variety of information the texts implicitly provide.

If we look at any NLP conference proceedings from the 1980s and 1990s, we can see at first sight that the vast majority of the processed languages are English, French, German, Italian and Spanish. Why are there so few contributions on the processing of typologically different languages, e.g. Slavic ones? There are many reasons for that, but the key reason lies in the absence of the main resource of NLP, corpora, for these languages.

[1] The results described herein have been obtained within various projects sponsored by the Czech Grant Agency (405/96/K214, 405/95/0190), by the Ministry of Education project No. VS96151, by the Charles University Grant Agency project No. 39/94 and by the individual grant of the OSF/HESP No. 195/1995.
[2] In what follows, we will concentrate on the processing of written language.

2 Czech Language Processing

One of the main tasks of the most recent Czech Language Processing (CzLP) project in the Czech Republic, the so-called "Integrated Project: Czech in the Age of Computers" (started in 1996), is an investigation of present-day Czech based on contemporary methods and techniques of computational linguistics. This task includes, among other things, the development of a part-of-speech (PoS) tagging system. This is not a trivial task, given that most of the existing tagging systems have been developed for languages typologically different from Czech.

3 Czech PoS Tagging

3.1 Language Resources

How are the NLP terms PoS tagging, morphological disambiguation, morphological annotation and morphological analysis related? Morphological analysis of a given wordform yields all of its possible morphological annotations. For illustration, take the wordform "zdi" ('walls'). One group of its morphological annotations corresponds to a feminine noun: the genitive, dative, vocative and locative singular, or the nominative and accusative plural of the same word. Another annotation corresponds to the singular imperative of a verb, and so on.

Each morphological category (case, gender, number, ...) may take one of a set of possible values (e.g. gender: masculine animate, masculine inanimate, neuter, feminine). The morphological annotations of a wordform represent the combinations of morphological categories admissible for the particular part-of-speech class. For automatic processing of the morphological analysis, it is very useful to mark the values of morphological categories and the part-of-speech classes explicitly with short codes (gender: masculine animate (M), masculine inanimate (I), neuter (N), feminine (F); part of speech: nouns (N), verbs (V), ...).
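The coding idea described above can be illustrated with a minimal sketch. It is not the morphological analyser used in the project: the toy lexicon, the function names and the tag layout (part of speech, gender, number, case) are assumptions made for illustration. The gender and part-of-speech codes are those given in the text; the number codes S/P and the case numbers 1 to 7 (the conventional Czech case ordering) are assumed here. Only the nominal readings of "zdi" are covered; the verbal reading is left out.

```python
from typing import List, NamedTuple

# Codes given in the text.
POS_CODES = {"noun": "N", "verb": "V"}
GENDER_CODES = {
    "masculine animate": "M",
    "masculine inanimate": "I",
    "neuter": "N",
    "feminine": "F",
}
# Assumed codes: singular/plural letters and the conventional Czech case numbering.
NUMBER_CODES = {"singular": "S", "plural": "P"}
CASE_CODES = {
    "nominative": "1", "genitive": "2", "dative": "3", "accusative": "4",
    "vocative": "5", "locative": "6", "instrumental": "7",
}


class NounReading(NamedTuple):
    """One morphological annotation of a noun wordform."""
    gender: str
    number: str
    case: str


def noun_tag(reading: NounReading) -> str:
    """Concatenate the category codes into a single tag string."""
    return (POS_CODES["noun"] + GENDER_CODES[reading.gender]
            + NUMBER_CODES[reading.number] + CASE_CODES[reading.case])


# Hypothetical toy lexicon holding only the nominal readings of "zdi" listed in the text.
TOY_LEXICON = {
    "zdi": [
        NounReading("feminine", "singular", "genitive"),
        NounReading("feminine", "singular", "dative"),
        NounReading("feminine", "singular", "vocative"),
        NounReading("feminine", "singular", "locative"),
        NounReading("feminine", "plural", "nominative"),
        NounReading("feminine", "plural", "accusative"),
    ],
}


def analyse(wordform: str) -> List[str]:
    """Return every tag the toy lexicon allows for the wordform."""
    return [noun_tag(r) for r in TOY_LEXICON.get(wordform, [])]


print(analyse("zdi"))
```

Each reading becomes one short tag, so an ambiguous wordform is simply a list of candidate tags from which later processing has to choose.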
Afterwards we can rewrite the morphological annotations of the wordform "zdi" mentioned above in the following way: NFS[2,3,5,6], NFP[1,4], VM. The task called morphological disambiguation, or PoS tagging, uses the context of the given wordform in the input text to select the correct tag from the list of all possible tags.

For the experiments described herein, we have used two different corpora: one "old" (texts from the 1960s and early 1970s), and one "new" (smaller in volume, but containing modern Czech and technically compatible with our new morphological analysis system). Because these two resources are technically incompatible, we performed different experiments on them (see sect. 3.2 for the experiments using the "old" corpus and sect. 3.3 for the experiments using the "new" one).

Czech Corpus (CC "old")

Thanks to the enthusiasm of a group of people from the Institute of Czech Language, our main working material, the written and spoken Czech Corpus (CC), was created during the 1970s. The main motivation for building CC was to obtain quantitative characteristics of present-day Czech. The corpus includes newspaper, magazine and scientific texts. The quantitative research (Těšitelová et al.) has concentrated, among other things, on the frequency of part-of-speech classes and the frequency of morphological categories and syntactic phenomena. For these purposes, CC was manually tagged both morphologically and syntactically. The format of CC, at that time the only tagged corpus available for a Slavic language, is exemplified in Table 1.

TOKEN    POS TAG    LEMMA    SYNT. TAG    ORDER (POSITION)