Automatization of lexicographic work

  • Iztok Kosem, Trojina, Institute for Applied Slovene Studies
  • Polona Gantar, Fran Ramovš Institute of the Slovenian Language, ZRC SAZU
  • Simon Krek, Jožef Stefan Institute, Artificial Intelligence Laboratory; Faculty of Social Sciences, University of Ljubljana
Keywords: automatic extraction, Slovene Lexical Database, proposal for a dictionary of contemporary Slovene, Word Sketches, GDEX


A new approach to lexicographic work, in which the lexicographer is seen more as a validator of the choices made by the computer, was recently envisaged by Rundell and Kilgarriff (2011). In this paper, we describe an experiment using such an approach during the creation of the Slovene Lexical Database (Gantar and Krek, 2011). The corpus data, i.e. grammatical relations, collocations, examples, and grammatical labels, were automatically extracted from the 1.18-billion-word Gigafida corpus of Slovene. The evaluation of the extracted data consisted of comparing the time spent writing a manual entry with the time spent writing a (semi-)automatic entry, and of identifying potential improvements in the extraction algorithm and in the presentation of data. An important finding was that the automatic approach was far more effective than the manual approach, without any significant loss of information. Based on our experience, we propose a slightly revised version of the approach envisaged by Rundell and Kilgarriff, in which the validation of data is left to lower-level linguists or crowd-sourcing, whereas high-level tasks such as meaning description remain the domain of lexicographers. Although such an approach reduces the scope of the lexicographer's work, it also makes it possible to bring the content to users more quickly.


