JANES v0.4: Corpus of Slovene User-Generated Content

  • Darja Fišer, Faculty of Arts, University of Ljubljana
  • Tomaž Erjavec, "Jožef Stefan" Institute
  • Nikola Ljubešić, Faculty of Arts, University of Zagreb; "Jožef Stefan" Institute
Keywords: corpus construction, computer-mediated communication, user-generated content, Internet Slovene, non-standard Slovene

Abstract

The paper presents the current version of Janes, a corpus of Slovene netspeak that contains tweets, forum posts, news comments, blogs and blog comments, and user and talk pages from Wikipedia. First, we describe the harvesting procedure for each data source and provide a quantitative analysis of the corpus. Next, we present the automatic and manual procedures for enriching the corpus with metadata: user type, gender and region, as well as text sentiment and standardness level. Finally, we give a detailed account of the linguistic annotation workflow, which includes tokenization, sentence segmentation, rediacritisation, normalization, morphosyntactic tagging and lemmatization.


Published
2016-09-27
How to Cite
Fišer, D., Erjavec, T., & Ljubešić, N. (2016). JANES v0.4: Corpus of Slovene User-Generated Content. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 4(2), 67-99. https://doi.org/10.4312/slo2.0.2016.2.67-99