Overcoming Legal Limitations in Disseminating Slovene Web Corpora

Tomaž Erjavec, Jaka Čibej, Darja Fišer


Web texts are becoming increasingly relevant sources of information, and web corpora are useful both for corpus linguistic studies and for the development of language technologies. Even though web texts are directly accessible, which substantially simplifies the collection procedure, the compilation of web corpora is still complex, time-consuming and expensive. To avoid duplicating such efforts, it is necessary to make the created corpora easily and widely accessible both to researchers and to a wider audience. While this is logistically and technically straightforward, legal constraints such as copyright, privacy and terms of use severely hinder the dissemination of web corpora. This paper discusses the legal conditions in this area, gives an overview of current practices and, using the Janes corpus of Slovene user-generated content as an example, proposes a range of mitigation measures to ensure free and open dissemination of Slovene web corpora.


web texts; corpus dissemination; copyright; privacy; free and open access


DOI: http://dx.doi.org/10.4312/slo2.0.2016.2.189-219


Copyright (c) 2016 Tomaž Erjavec, Jaka Čibej, Darja Fišer

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

