Overcoming Legal Limitations in Disseminating Slovene Web Corpora

Tomaž Erjavec, Jaka Čibej, Darja Fišer


Web texts are becoming increasingly relevant sources of information, and web corpora are useful both for corpus linguistic studies and for the development of language technologies. Even though web texts are directly accessible, which substantially simplifies the collection procedure, the compilation of web corpora is still complex, time-consuming and expensive. It is crucial that such endeavours are not needlessly repeated, which is why the created corpora must be made easily and widely accessible both to researchers and to a wider audience. While this is logistically and technically straightforward, legal constraints such as copyright, privacy and terms of use severely hinder the dissemination of web corpora. This paper discusses the legal conditions in this area, gives an overview of current practices, and proposes a range of mitigation measures, using the Janes corpus of Slovene user-generated content as an example, in order to ensure free and open dissemination of Slovene web corpora.


web texts; corpus dissemination; copyright; privacy; free and open access


DOI: http://dx.doi.org/10.4312/slo2.0.2016.2.189-219



Copyright (c) 2016 Tomaž Erjavec, Jaka Čibej, Darja Fišer

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Ljubljana University Press, Faculty of Arts and Trojina, Institute for Applied Slovene Studies
(Znanstvena založba Filozofske fakultete Univerze v Ljubljani in Trojina, zavod za uporabno slovenistiko) 

Online ISSN: 2335-2736