Models for predicting the inflectional paradigm of Croatian words
AbstractMorphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists the lemmas and their paradigms. However, a real-life morphological analyzer must also be able to handle properly the out-of-vocabulary words. We address the task of predicting the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a classifier to predict whether a candidate lemma-paradigm pair is correct based on a number of string- and corpus-based features. The candidate lemma-paradigm pairs are generated using a handcrafted morphology grammar. Our aim is to examine the machine learning aspect of the problem: we test a comprehensive set of features and evaluate the classification accuracy using different feature subsets. We show that satisfactory classification accuracy (92%) can be achieved with SVM using a combination of string- and corpus-based features. On a per word basis, the F1-score is 53% and accuracy is 70%, which outperforms a frequency-based baseline by a wide margin. We discuss a number of possible directions for future research.
Copyright (c) 2013 Jan Šnajder
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
All content of Slovenščina 2.0 is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Slovenščina 2.0 applies the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license to all published material. Under this license, authors retain ownership of the copyright for their content, but allow anyone to download, reuse, reprint, modify, distribute, copy, remix, transform and/or build upon the content for any purpose, even commercial, as long as the original authors and source are cited. No permission is required from the authors or the publishers. Appropriate attribution can be provided by simply citing the original article. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. For any reuse or redistribution of a work, users must also make clear the license terms under which the work was published.
No separate publishing agreements are signed between the author and the publisher. Authors retain copyright and the publishing rights of their work without any restrictions.
Authors are permitted and encouraged to post the journal’s published version of the work online (e.g., in institutional repositories, on their own websites), with an acknowledgement of its initial publication in Slovenščina 2.0.