JANES v0.4: Corpus of Slovene User-Generated Content

Darja Fišer, Tomaž Erjavec, Nikola Ljubešić

Abstract


This paper presents the current version of Janes, a corpus of Slovene netspeak that contains tweets, forum posts, news comments, blogs and blog comments, and user and talk pages from Wikipedia. First, we describe the harvesting procedure for each data source and provide a quantitative analysis of the corpus. Next, we present the automatic and manual procedures used to enrich the corpus with metadata, such as the user's type, gender and region, and the text's sentiment and standardness level. Finally, we give a detailed account of the linguistic annotation workflow, which includes tokenization, sentence segmentation, rediacritisation, normalization, morphosyntactic tagging and lemmatization.

Keywords


corpus construction; computer-mediated communication; user-generated content; Internet Slovene; non-standard Slovene

Full Text:

PDF (Slovene)

DOI: http://dx.doi.org/10.4312/slo2.0.2016.2.67-99


Copyright (c) 2016 Darja Fišer, Tomaž Erjavec, Nikola Ljubešić

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Ljubljana University Press, Faculty of Arts and Trojina, Institute for Applied Slovene Studies
(Znanstvena založba Filozofske fakultete Univerze v Ljubljani in Trojina, zavod za uporabno slovenistiko) 

Online ISSN: 2335-2736