49th International Philological Conference (IPC 2020) in Homage to Professor Ludmila Verbitskaya (1936-2019)

Tagging Historic Bulgarian texts: Experiments and Challenges

Petya Nacheva Osenova
Sofia University "St. Kliment Ohridski"
Tsvetana Ivanova Dimitrova
Senior Assistant
Institute for Bulgarian Language, Bulgarian Academy of Sciences

Keywords, abstract

Bulgarian language, diachronic texts, tagging.


The paper focuses on the automatic morphological analysis of a collection of Bulgarian text fragments excerpted from 17th-century texts (the so-called damaskins). Since, to our knowledge, no tagger for historical texts is currently freely available for the task, we employed the BTB Processing Pipe, a linguistic processing pipeline with a tokenizer for Bulgarian and a POS tagger trained on Modern Bulgarian media texts and literature using the MATE Tool.
First, we ran the tool over the original texts, which produced a large number of errors, starting as early as the tokenization level and continuing at the POS level. We then normalized the texts so that in the resulting texts: a) no diacritics were present; b) symbols that do not exist in the present-day alphabet were replaced with their ‘successors’ (such as о, ѡ, ѻ, ꙫ = о [о]; etc.); c) abbreviations of frequently written words and letters under a titlo were expanded (бь = богь). We deleted all small ers (ь) at word endings, while within the word we replaced them with schwa (ъ).
We performed a manual error analysis to see whether normalization improved the automatic analyses. The tagger uses some constraints, and its accuracy is 0.8494 when all analyses are considered.
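The normalization steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the order of the steps, and the tiny letter and abbreviation tables (containing only the examples quoted above) are assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical tables holding only the examples quoted in the text.
LETTER_MAP = str.maketrans({
    "\u0461": "о",  # ѡ (omega)
    "\u047b": "о",  # ѻ (round omega)
    "\ua66b": "о",  # ꙫ (monocular o)
})
ABBREVIATIONS = {"бь": "богь"}  # form left after the titlo is stripped


def strip_diacritics(text):
    """Drop combining marks (accents, titlo), keeping precomposed letters such as й."""
    # NFC first, so that и + combining breve becomes the single letter й
    # and is not damaged when the remaining combining marks are removed.
    composed = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in composed if not unicodedata.category(ch).startswith("M")
    )


def normalize_word(word):
    """Apply letter replacement, abbreviation expansion, and the er rules to one word."""
    word = word.translate(LETTER_MAP)
    word = ABBREVIATIONS.get(word, word)
    # Assumed order: expand abbreviations first, then treat the ers.
    if word.endswith("ь"):
        word = word[:-1]            # word-final small er is deleted
    return word.replace("ь", "ъ")   # word-internal small er becomes ъ


def normalize_text(text):
    """Normalize every word in a text, leaving punctuation and spacing untouched."""
    text = strip_diacritics(text)
    return re.sub(r"\w+", lambda m: normalize_word(m.group(0)), text)
```

Under these assumptions, an abbreviated form written with a titlo is first stripped of the mark, then expanded, and finally loses its word-final er before being passed to the tokenizer and tagger.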
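As a reminder of what a token-level accuracy figure of this kind measures, here is a minimal sketch, assuming accuracy is the share of tokens whose predicted tag matches the manually verified one. The function name and the toy tags in the usage note are hypothetical and do not reproduce the experiment's data.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy over an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)
```

For example, `tagging_accuracy(["N", "V", "C", "R"], ["N", "V", "C", "A"])` gives 0.75, since three of the four toy tags agree.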

As expected, errors are mostly found with proper nouns, imperatives, and case-inflected nouns and adjectives, while there is very good recognition of the syntactic functional elements of closed POS classes, such as conjunctions, subjunctions, prepositions, invariant relativisers (що, щото, дето), as well as pronouns.
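A per-category breakdown of this kind can be obtained by grouping tagging errors by the gold tag. The following is a minimal sketch under that assumption; the function name and the tag labels in the usage note are hypothetical and are not taken from the BTB tagset.

```python
from collections import Counter


def error_rate_by_gold_tag(gold_tags, predicted_tags):
    """Map each gold tag to the share of its tokens that received a wrong tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    totals = Counter(gold_tags)
    errors = Counter(g for g, p in zip(gold_tags, predicted_tags) if g != p)
    # Counter returns 0 for tags that were never mistagged.
    return {tag: errors[tag] / totals[tag] for tag in totals}
```

With the toy input `gold = ["PROPN", "PROPN", "CONJ", "PREP"]` and `predicted = ["NOUN", "PROPN", "CONJ", "PREP"]`, the proper-noun category shows an error rate of 0.5 while the closed-class categories show 0.0.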