|
Lecturer(s)
|
-
Pořízka Petr, PhDr. Ph.D.
|
|
Course content
|
The course deals with developing small corpora for linguistic and literary purposes according to requirements and criteria defined by the developer. Building a corpus requires both philological knowledge and an understanding of the technical aspects, which are discussed in turn: (1) Format: character encoding (ASCII, ANSI, and Unicode) and data format (structured, e.g. XML, vs. unstructured, i.e. plain text ".txt"). (2) Annotation (= metadata): external vs. internal. (3) Tools: data processing (including loading data into a corpus manager) and corpus data mining (query language, annotation). The course uses freely available software tools for building and managing corpora (freeware, GNU GPL, or open-source projects). Attention is also paid to automatic data processing (segmentation: tokenization and conversion to vertical format, format conversion, etc.). From a methodological point of view, data and metadata are strictly distinguished, and the types of annotation (technical, structural, and linguistic) are discussed. Finally, the collection and processing of data are presented, together with the specifics of written vs. spoken language.
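The tokenization and vertical-format step mentioned above can be illustrated with a minimal sketch. It assumes a simple regular-expression tokenizer and a one-token-per-line layout wrapped in a `<doc>` structure; the names `tokenize` and `to_vertical` and the `id` attribute are hypothetical and are not taken from the course materials or from any particular corpus manager.

```python
import re

# Assumption: a token is either a run of word characters (Unicode letters,
# digits, underscore) or a single non-space punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split plain text into a list of word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def to_vertical(text, doc_id):
    """Return the text in a one-token-per-line (vertical) layout.

    The <doc> wrapper is a hypothetical structural annotation; real tools
    define their own structures and attributes.
    """
    lines = ['<doc id="{}">'.format(doc_id)]
    lines.extend(tokenize(text))
    lines.append("</doc>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_vertical("Ahoj, světe!", "d1"))
```

A real pipeline would also need to handle abbreviations, numbers with separators, and tokens that look like markup (for example a literal "<"), which this sketch does not attempt.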
|
|
Learning activities and teaching methods
|
|
Monologic Lecture (Interpretation, Training), Dialogic Lecture (Discussion, Dialog, Brainstorming), Work with Text (with Book, Textbook), Demonstration
|
|
Learning outcomes
|
The aim of these lessons is to acquaint participants with the basic concepts, methods, and tools of corpus linguistics and to prepare them to work with corpora, which in recent years have become one of the fundamental tools for the scientific study of language.
- Knowledge of basic corpus data mining methods
- Ability to build a small corpus of language data
- Ability to interpret corpus data
|
|
Prerequisites
|
unspecified
|
|
Assessment methods and criteria
|
Analysis of Activities (Technical Works), Seminar Work
- active work in lessons
- work on a student's project (homework or exercises)
|
|
Recommended literature
|
-
Pražský akademický korpus (http://ufal.mff.cuni.cz/rest/CAC/doc/cac-guide/cz/html).
-
Pražský závislostní korpus (http://ufal.mff.cuni.cz/pdt2.0/index-cz.html).
-
Ústav Českého národního korpusu (http://ucnk.ff.cuni.cz).
-
Baker, P. - Hardie, A. - McEnery, T. A Glossary of Corpus Linguistics. Edinburgh 2006.
-
Benko, V. a kol. (2019). Webové korpusy Aranea. Bratislava.
-
Čermák - Klímová - Petkevič. Studie z korpusové lingvistiky. Praha 2000.
-
Čermák, F. - Blatná, R. (eds.). Jak využívat Český národní korpus. Praha 2005.
-
Čermák, F. - Blatná, R. Korpusová lingvistika: Stav a modelové přístupy. Praha 2006.
-
Čermák, F. - Křen, M. (eds.). (2004). Frekvenční slovník češtiny. Praha.
-
Čermák, F. (ed.). Frekvenční slovník mluvené češtiny. Praha 2007.
-
Kol. aut. (2007). Průvodce českým akademickým korpusem 1.0. Praha.
-
Kol. Manuál práce s ČNK (wikidokumentace).
-
McEnery, T. - Wilson, A. Corpus Linguistics: An Introduction. Edinburgh 2001.
-
Pořízka, P. (2014). Tvorba korpusů a vytěžování jazykových dat (metody, modely, nástroje). Olomouc.
-
Šonková, J. (2008). Morfologie mluvené češtiny: Frekvenční analýza. Praha.
-
Wynne, M. (ed.). (2005). Developing Linguistic Corpora: A Guide to Good Practice. Oxford.
|