Models for predicting the inflectional paradigm of Croatian words

  • Jan Šnajder, University of Zagreb, Faculty of Electrical Engineering and Computing, Text Analysis and Knowledge Engineering Lab
Keywords: computational morphology, paradigm prediction, machine learning, feature selection, Croatian language

Abstract

Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists lemmas and their paradigms. However, a real-life morphological analyzer must also properly handle out-of-vocabulary words. We address the task of predicting the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a classifier to predict whether a candidate lemma-paradigm pair is correct, based on a number of string- and corpus-based features. The candidate lemma-paradigm pairs are generated using a handcrafted morphology grammar. Our aim is to examine the machine learning aspect of the problem: we test a comprehensive set of features and evaluate the classification accuracy using different feature subsets. We show that satisfactory classification accuracy (92%) can be achieved with an SVM using a combination of string- and corpus-based features. On a per-word basis, the F1-score is 53% and the accuracy is 70%, which outperforms a frequency-based baseline by a wide margin. We discuss a number of possible directions for future research.
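
The pipeline outlined in the abstract (generate candidate lemma-paradigm pairs, then describe each pair with string- and corpus-based features) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two toy paradigms, the names `PARADIGMS`, `generate_candidates`, `inflect`, and `features`, and the three features are hypothetical and far simpler than the handcrafted grammar and the feature set used in the paper. The classifier itself is omitted; the resulting feature vectors would be passed to a supervised model such as an SVM.

```python
# Toy paradigms (hypothetical, NOT the paper's grammar). Each entry maps a
# paradigm name to (lemma ending, endings of all inflected forms).
PARADIGMS = {
    "N-fem-a": ("a", ["a", "e", "i", "u", "o", "om"]),
    "N-masc-0": ("", ["", "a", "u", "e", "om"]),
}


def generate_candidates(word):
    """Return all (lemma, paradigm) pairs that could have produced `word`."""
    candidates = set()
    for name, (lemma_end, endings) in PARADIGMS.items():
        for end in endings:
            # The word must end in this ending and leave a non-empty stem.
            if word.endswith(end) and len(word) > len(end):
                stem = word[: len(word) - len(end)]
                candidates.add((stem + lemma_end, name))
    return sorted(candidates)


def inflect(lemma, paradigm):
    """Generate every word form of `lemma` under the given toy paradigm."""
    lemma_end, endings = PARADIGMS[paradigm]
    stem = lemma[: len(lemma) - len(lemma_end)]
    return [stem + end for end in endings]


def features(lemma, paradigm, corpus_counts):
    """Compute one string-based and two corpus-based features for a pair."""
    forms = inflect(lemma, paradigm)
    attested = [f for f in forms if corpus_counts.get(f, 0) > 0]
    return {
        "lemma_length": len(lemma),                              # string-based
        "lemma_attested": int(corpus_counts.get(lemma, 0) > 0),  # corpus-based
        "attested_ratio": len(attested) / len(forms),            # corpus-based
    }


if __name__ == "__main__":
    # Made-up corpus frequencies, for illustration only.
    counts = {"žena": 5, "žene": 3, "ženu": 2, "ženom": 1}
    for lemma, paradigm in generate_candidates("ženu"):
        print(lemma, paradigm, features(lemma, paradigm, counts))
```

In a supervised setting, each such feature dictionary would be paired with a label stating whether the lemma-paradigm pair is correct, and a classifier would be trained on those labeled examples.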

References

Šnajder, J. (2013): Models for predicting the inflectional paradigm of Croatian words. Slovenščina 2.0, 1 (2): 1–34.
Published
2013-12-01
How to Cite
Šnajder, J. (2013). Models for predicting the inflectional paradigm of Croatian words. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 1(2), 1–34. https://doi.org/10.4312/slo2.0.2013.2.1-34