Named entity recognition in Slovene text

  • Tadej Štajner: Jožef Stefan Institute, Artificial Intelligence Laboratory; The Jožef Stefan International Postgraduate School
  • Tomaž Erjavec: Jožef Stefan Institute, Department of Knowledge Technologies; The Jožef Stefan International Postgraduate School
  • Simon Krek: Jožef Stefan Institute, Artificial Intelligence Laboratory; Faculty of Social Sciences, University of Ljubljana
Keywords: named entity extraction, natural language processing, Slovene language tools

Abstract

This paper presents the design and implementation of a named entity extractor for the Slovene language, based on machine learning. The extractor is a supervised model based on Conditional Random Fields and is trained on the ssj500k annotated corpus of Slovene. The corpus, which is available under a Creative Commons CC-BY-NC-SA licence, is annotated with morphosyntactic tags as well as named entities for people, locations, organisations, and miscellaneous names. The paper discusses the influence of morphosyntactic tags, lexicons, and conjunctions of features of neighbouring words. An important finding of this investigation is that morphosyntactic tags benefit named entity extraction. Using all the best-performing features, the recognizer reaches a precision of 74% and a recall of 72%. It performs best on personal and geographical named entities, followed by organisations, but performs poorly on miscellaneous entities, since this class is very diverse and consequently difficult to predict. A further major contribution of the paper is showing the benefits of splitting the class of miscellaneous entities into organisations and other entities, which in turn improves performance even on personal and organisational names. The software developed in this research is freely available under the Apache 2.0 licence at http://ailab.ijs.si/~tadej/slner.zip, while development versions are available at https://github.com/tadejs/slner.
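
The three feature families named in the abstract (morphosyntactic tags, lexicons, and conjunctions of neighbouring-word features) can be illustrated with a minimal sketch. This is not the slner implementation: the function names, the toy lexicon, and the example tag strings are all assumptions made for illustration. It only shows the kind of per-token feature dictionaries that a Conditional Random Fields toolkit could consume.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon; the lexicons used in the paper are not reproduced here.
PERSON_LEXICON: Set[str] = {"tadej", "tomaž", "simon"}

# A token is a (word form, morphosyntactic tag) pair.
Token = Tuple[str, str]


def token_features(sentence: List[Token], i: int) -> Dict[str, object]:
    """Build an illustrative feature dictionary for the token at position i."""
    word, msd = sentence[i]
    feats: Dict[str, object] = {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "msd": msd,                  # full morphosyntactic tag
        "msd_pos": msd[:1],          # first character, read here as a coarse category
        "in_person_lexicon": word.lower() in PERSON_LEXICON,
    }
    if i > 0:
        _, prev_msd = sentence[i - 1]
        feats["prev_msd"] = prev_msd
        # Conjunction of the previous and current token's tags.
        feats["prev_msd|msd"] = prev_msd + "|" + msd
    else:
        feats["BOS"] = True          # beginning of sentence
    if i < len(sentence) - 1:
        _, next_msd = sentence[i + 1]
        feats["next_msd"] = next_msd
        # Conjunction of the current and next token's tags.
        feats["msd|next_msd"] = msd + "|" + next_msd
    else:
        feats["EOS"] = True          # end of sentence
    return feats


def sentence_features(sentence: List[Token]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token of one sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Illustrative sentence; the tag strings are assumed, not taken from ssj500k.
EXAMPLE: List[Token] = [
    ("Tadej", "Npmsn"),
    ("živi", "Vmpr3s"),
    ("v", "Sl"),
    ("Ljubljani", "Npfsl"),
]

if __name__ == "__main__":
    for token, feats in zip(EXAMPLE, sentence_features(EXAMPLE)):
        print(token[0], feats)
```

Each sentence becomes a list of dictionaries, one per token, which would be paired with a label sequence (person, location, organisation, miscellaneous, or none) to train a sequence model.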

References

Štajner, T., Erjavec, T., Krek, S. (2013): Razpoznavanje imenskih entitet v slovenskem besedilu. Slovenščina 2.0, 1 (2): 58–81.
Published
2013-12-01
How to Cite
Štajner, T., Erjavec, T., & Krek, S. (2013). Named entity recognition in Slovene text. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 1(2), 58–81. https://doi.org/10.4312/slo2.0.2013.2.58-81