Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets

Maja Miličević, Nikola Ljubešić

Abstract


In this paper we discuss the parallel manual normalisation of samples extracted from Croatian and Serbian Twitter corpora. We describe the datasets, outline the unified guidelines provided to annotators, and present a series of analyses of standard-to-non-standard transformations found in the Twitter data. The results show that closed part-of-speech classes are transformed more frequently than open classes, that the most frequently transformed lemmas are auxiliary and modal verbs, interjections, particles and pronouns, that character deletions are more frequent than insertions and replacements, and that more transformations occur at the word end than in other positions. Croatian and Serbian are found to share many, but not all, transformation patterns; while some of the discrepancies can be ascribed to structural differences between the two languages, others appear to be better explained by extralinguistic factors. The produced datasets and their initial analyses can be used for studying the properties of non-standard language, as well as for developing language technologies for non-standard data.
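To make the notion of a character-level standard-to-non-standard transformation concrete, the following is a minimal sketch of how deletions, insertions and replacements, and their rough position in the word, could be extracted from a pair of forms. It is not the procedure used in the paper: it relies on Python's `difflib.SequenceMatcher`, whose alignment is not guaranteed to be a minimal edit script, and the function name `transformations`, the three-way start/middle/end position scheme, and the word pairs are all illustrative assumptions rather than material from the datasets.

```python
from collections import Counter
from difflib import SequenceMatcher


def transformations(standard, nonstandard):
    """Return (operation, position) pairs that turn `standard` into `nonstandard`.

    Operations are difflib's 'delete', 'insert' and 'replace' tags; the position
    is a coarse label relative to the standard form (an assumed scheme).
    """
    ops = []
    n = len(standard)
    matcher = SequenceMatcher(None, standard, nonstandard, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged stretch, nothing to record
        if i1 == 0:
            position = "start"
        elif i2 == n:
            position = "end"
        else:
            position = "middle"
        ops.append((tag, position))
    return ops


# Illustrative (standard, non-standard) pairs, not drawn from the paper's data.
pairs = [("nešto", "nesto"), ("rekao", "reko"), ("kako", "kak")]

op_counts = Counter()
position_counts = Counter()
for standard, nonstandard in pairs:
    for op, position in transformations(standard, nonstandard):
        op_counts[op] += 1
        position_counts[position] += 1

print(op_counts)
print(position_counts)
```

Aggregating such counts over a whole normalised dataset would give the kind of operation-type and position distributions summarised above; a proper analysis would more likely use an explicit Levenshtein alignment.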

Keywords


computer-mediated communication; CMC corpora; Twitter; normalisation

DOI: http://dx.doi.org/10.4312/slo2.0.2016.2.156-188

Copyright (c) 2016 Nikola Ljubešić, Maja Miličević

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Ljubljana University Press, Faculty of Arts and Trojina, Institute for Applied Slovene Studies
(Znanstvena založba Filozofske fakultete Univerze v Ljubljani in Trojina, zavod za uporabno slovenistiko) 

Online ISSN: 2335-2736