Converting raw transcripts into an annotated and turn-aligned TEI-XML corpus: the example of the Corpus of Serbian Forms of Address

Authors

  • Dolores Lemmenmeier-Batinić, University of Zurich, Department of Slavonic Languages and Literatures, Switzerland

DOI:

https://doi.org/10.4312/slo2.0.2021.1.123-144

Keywords:

spoken Serbian, language biographical interviews, forms of address, data re-usability

Abstract

This paper describes the procedure of building a TEI-XML corpus of spoken Serbian starting from raw transcripts. The corpus consists of semi-structured interviews, which were gathered with the aim of investigating forms of address in Serbian. The interviews were thoroughly transcribed according to the GAT transcription conventions. However, the transcription was carried out without tools that would check the validity of the GAT syntax or align the transcripts with the audio recordings. In order to offer this resource to a broader audience, we resolved the inconsistencies in the original transcripts, normalised the semi-orthographic transcriptions and converted the corpus into a TEI format for transcriptions of speech. Further, we enriched the corpus by tagging and lemmatising the data. Lastly, we aligned the corpus turns to the corresponding audio segments by using a forced-alignment tool. In addition to presenting the main steps involved in converting the corpus to the XML format, this paper also discusses current challenges in the processing of spoken data and the implications of data re-use for transcriptions of speech. The corpus can be used for studying Serbian from the perspective of interactional linguistics, for investigating the morphosyntax, grammar, lexicon and phonetics of spoken Serbian, for studying disfluencies, and for testing models for automatic speech recognition and forced alignment. The corpus is freely available for research purposes.
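To make the target representation more concrete, the following minimal Python sketch shows how one annotated, time-anchored turn might be encoded with elements from the TEI module for transcriptions of speech. It is an illustration under stated assumptions, not the encoding used in the corpus: the function names `build_turn` and `build_timeline`, the speaker label `INT`, the sample tokens, the Universal POS values and the time offsets are all hypothetical. Only the TEI element and attribute names (`u`, `w`, `timeline`, `when`, `who`, `start`, `end`, `lemma`, `pos`, `interval`, `since`) are taken from the TEI Guidelines.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace


def build_turn(speaker, tokens, start_id, end_id):
    """Return a TEI <u> element for one turn (hypothetical helper).

    tokens is a list of (form, lemma, pos) triples; start_id and end_id
    are the xml:id values of <when> anchors in a timeline.
    """
    u = ET.Element(
        f"{{{TEI_NS}}}u",
        {"who": f"#{speaker}", "start": f"#{start_id}", "end": f"#{end_id}"},
    )
    for form, lemma, pos in tokens:
        w = ET.SubElement(u, f"{{{TEI_NS}}}w", {"lemma": lemma, "pos": pos})
        w.text = form
    return u


def build_timeline(offsets):
    """Return a TEI <timeline> with one <when> per offset in seconds.

    The first offset is treated as the origin; later anchors are
    expressed as intervals since that origin.
    """
    timeline = ET.Element(
        f"{{{TEI_NS}}}timeline", {"unit": "s", "origin": "#T0"}
    )
    for i, seconds in enumerate(offsets):
        attrs = {f"{{{XML_NS}}}id": f"T{i}"}
        if i > 0:
            attrs["interval"] = f"{seconds:.2f}"
            attrs["since"] = "#T0"
        ET.SubElement(timeline, f"{{{TEI_NS}}}when", attrs)
    return timeline


# Invented sample data: a two-word turn spanning anchors T0 and T1.
timeline = build_timeline([0.0, 1.5])
turn = build_turn(
    "INT",
    [("dobar", "dobar", "ADJ"), ("dan", "dan", "NOUN")],
    "T0",
    "T1",
)
print(ET.tostring(timeline, encoding="unicode"))
print(ET.tostring(turn, encoding="unicode"))
```

In a pipeline like the one described, the token triples would come from a tagger and lemmatiser and the offsets from a forced aligner; here both are hard-coded so that the sketch stays self-contained.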


Published

01.07.2021 — Updated on 06.07.2021

How to Cite

Lemmenmeier-Batinić, D. (2021). Converting raw transcripts into an annotated and turn-aligned TEI-XML corpus: the example of the Corpus of Serbian Forms of Address. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 9(1), 123–144. https://doi.org/10.4312/slo2.0.2021.1.123-144 (Original work published July 1, 2021)