The Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772). The final result
Dorota Adamiec, Renata Bronikowska, Włodzimierz Gruszczyński, Emanuel Modrzejewski, Aleksandra Wieczorek
Institute of Polish Language, Polish Academy of Sciences
Plan of the presentation: basic information about the corpus; content and size; stages of development and tools; searching the corpus; plans for the future
Project factsheet
title: Electronic corpus of 17th and 18th century Polish texts (up to 1772)
acronym: KORBA (korpus barokowy, "baroque corpus")
funding: Polish Ministry of Science and Higher Education, National Programme for the Development of Humanities grant (contract number 0036/NPRH2/H11/81/2012)
duration: 2013-2018
coordinating body: Institute of Polish Language, Polish Academy of Sciences
cooperation: Institute of Computer Science, Polish Academy of Sciences
principal investigator: Włodzimierz Gruszczyński
content: >700 texts, c. 13.5M tokens
Chronological representation of texts (chart). Periods in the legend: 1601-1650, 1651-1700, 1701-1750, 1751-1772, other period. Shares shown: 4%, 15%, 38%, 15%, 28%.
Geographical representation of texts
Types of texts in the corpus
epic: 8.7%
lyric: 8.7%
drama: 1.8%
syncretic texts: 4.2%
press releases & leaflets: 1.5%
scientific-didactic texts: 24.4%
persuasive texts: 17.8%
factual literature: 21.3%
official & secretarial texts: 7.4%
letters: 1.8%
Bible: 2.4%
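The percentages above can be turned into rough token counts. The sketch below assumes that the shares refer to tokens (the chart axis running up to 3,500,000 suggests this) and that the corpus total is the approximate 13.5M tokens given in the factsheet; the names `SHARES`, `TOTAL_TOKENS` and `estimated_tokens` are illustrative only.

```python
# Assumption: shares are percentages of the c. 13.5M corpus tokens.
TOTAL_TOKENS = 13_500_000

SHARES = {
    "epic": 8.7,
    "lyric": 8.7,
    "drama": 1.8,
    "syncretic texts": 4.2,
    "press releases & leaflets": 1.5,
    "scientific-didactic texts": 24.4,
    "persuasive texts": 17.8,
    "factual literature": 21.3,
    "official & secretarial texts": 7.4,
    "letters": 1.8,
    "Bible": 2.4,
}


def estimated_tokens(shares, total):
    """Convert percentage shares into approximate token counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


counts = estimated_tokens(SHARES, TOTAL_TOKENS)

if __name__ == "__main__":
    # Print text types from largest to smallest estimated size.
    for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{name:30s} {n:>10,d}")
```

Because the total is itself approximate, the resulting figures are estimates, not the exact counts behind the chart.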
Metadata
ID: BohJProg
title: Prognostyk Zły czy Dobry Komety Roku 1769 y 1770
author: Jan Bohomolec
printing house: Drukarnia J.K.M. i Rzeczypospolitej w Kollegium Societatis Jesu
place of publication: Warszawa
region: Mazowsze
type of text: prose
literary type: scientific-didactic text
genre: tractate
topic: astronomy
ironic: no
date of publication: 1770
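One way to picture such a metadata record in code is a small typed structure. This is only a sketch: the class name `TextMetadata`, the field names and the helper `in_period` are hypothetical and do not reflect how KORBA actually stores its metadata; the values are those of the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMetadata:
    # Hypothetical field names mirroring the labels on the slide.
    text_id: str
    title: str
    author: str
    printing_house: str
    place_of_publication: str
    region: str
    type_of_text: str
    literary_type: str
    genre: str
    topic: str
    ironic: bool
    date_of_publication: int

    def in_period(self, start, end):
        """Return True if the publication year lies in [start, end]."""
        return start <= self.date_of_publication <= end


record = TextMetadata(
    text_id="BohJProg",
    title="Prognostyk Zły czy Dobry Komety Roku 1769 y 1770",
    author="Jan Bohomolec",
    printing_house="Drukarnia J.K.M. i Rzeczypospolitej w Kollegium Societatis Jesu",
    place_of_publication="Warszawa",
    region="Mazowsze",
    type_of_text="prose",
    literary_type="scientific-didactic text",
    genre="tractate",
    topic="astronomy",
    ironic=False,
    date_of_publication=1770,
)
```

With records in this shape, selecting texts by period, region or genre becomes a simple filter over a list of `TextMetadata` objects.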
From the old edition to the corpus: old edition -> transliteration -> transcription -> lemmatization & annotation
From the old edition to the corpus: an example
transliteration: Tám hándluią kupcy
transcription: Tam handlują kupcy
annotation: [tam:adv] [handlować:fin:pl:ter:imperf] [kupiec:subst:pl:nom:manim1]
literal translation: There trade dealers
Source: Giovanni Botero, Relacje powszechne, part I, transl. Paweł Łęczycki, Kraków 1609, p. 189.
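The bracketed annotation in this example follows a `lemma:part-of-speech:features` pattern. A minimal sketch of how such strings could be split into structured tokens is shown below; it assumes exactly the bracket-and-colon notation seen on the slide (which may differ from the corpus's internal format), and the names `Token` and `parse_annotation` are illustrative.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    lemma: str
    pos: str
    features: tuple


# One bracketed group such as [kupiec:subst:pl:nom:manim1].
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def parse_annotation(line):
    """Split '[lemma:pos:feat:...]' groups into Token tuples.

    Assumes every group has at least a lemma and a part of speech,
    and that lemmas contain no colon.
    """
    tokens = []
    for body in TAG_RE.findall(line):
        lemma, pos, *features = body.split(":")
        tokens.append(Token(lemma, pos, tuple(features)))
    return tokens


example = "[tam:adv] [handlować:fin:pl:ter:imperf] [kupiec:subst:pl:nom:manim1]"
tokens = parse_annotation(example)
```

Here `tokens[1]` holds the lemma `handlować` with the part of speech `fin` and the features `pl`, `ter`, `imperf`.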
Transliteration and structure annotation
Conversion to TEI XML
Transcription
Transcription is based on rules that use regular expressions, applied by means of the transcriber (https://bitbucket.org/jsbien/pol).

Left context | Before replacement | Right context | After replacement | Example before replacement | Example after replacement
.* | é | .* | e | potém, któré | potem, które
.* | th | .* | t | theatrum, Lutherani | teatrum, Luterani
^ | rown | .* | równ | rownego, rowność | równego, równość
^ | iako | $ | jako | iako | jako
A | y | $ | j | bardziey, zwyczay | bardziej, zwyczaj

Symbols:
.* any string of characters (also empty)
^ the beginning of a word
$ the end of a word
A any vowel (set defined in the template)
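The context rules listed above can be imitated with ordinary regular expressions. The following sketch is not the transcriber's own implementation: it is a simplified re-creation in Python, applied word by word, and the vowel set `VOWELS` is an assumption (the real set is defined in the transcriber's template). It handles lower-case rule strings only, as in the examples.

```python
import re

# Assumption: a plausible vowel set; the real one lives in the template.
VOWELS = "aąeęioóuy"

# (left context, before replacement, right context, after replacement)
RULES = [
    (".*", "é", ".*", "e"),
    (".*", "th", ".*", "t"),
    ("^", "rown", ".*", "równ"),
    ("^", "iako", "$", "jako"),
    ("A", "y", "$", "j"),
]

# Map the slide's context symbols onto regex fragments.
LEFT = {".*": "", "^": "^", "A": f"(?<=[{VOWELS}])"}
RIGHT = {".*": "", "$": "$"}

COMPILED = [
    (re.compile(LEFT[left] + re.escape(before) + RIGHT[right]), after)
    for left, before, right, after in RULES
]


def transcribe_word(word):
    """Apply every rule, in order, to a single word."""
    for pattern, after in COMPILED:
        word = pattern.sub(after, word)
    return word


def transcribe(text):
    """Transcribe each word of the text, leaving punctuation untouched."""
    return re.sub(r"\w+", lambda m: transcribe_word(m.group()), text)
```

For instance, `transcribe("bardziey, zwyczay")` yields `"bardziej, zwyczaj"`: the final `y` is replaced only because a vowel precedes it and the word ends there, while the `y` inside `zwyczay` is left alone.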
Morphological analyzer Korbeusz (CORBEVS). Example sentence: Zaczém przez dwie godzinie z nieprzyjacielem strzelali się [ ].
Manual annotation and lemmatization: Anotatornia 2
Taggers: Concraft & Toygger. Workflow: manual annotation (0.5M tokens) -> tagger training -> automatic annotation of the whole corpus (13.5M tokens). The manually annotated corpus is not included in the whole corpus; the same texts were annotated automatically.
Plans for the future: enlargement of the corpus by 12M tokens; extension up to the end of the 18th century; improvement of tools (transcriber, morphological analyzer, tagger); integration with The Electronic Dictionary of the 17th-18th c. Polish; application of new tools (syntactic parsers)
Thank you!