MODELS FOR PREDICTING THE INFLECTIONAL PARADIGM OF CROATIAN WORDS

Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists the lemmas and their paradigms. However, a real-life morphological analyzer must also handle out-of-vocabulary words properly. We address the task of predicting the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a classifier to predict whether a candidate lemma-paradigm pair is correct based on a number of string- and corpus-based features. The candidate lemma-paradigm pairs are generated using a handcrafted morphology grammar. Our aim is to examine the machine learning aspect of the problem: we test a comprehensive set of features and evaluate the classification accuracy using different feature subsets. We show that satisfactory classification accuracy (92%) can be achieved with SVM using a combination of string- and corpus-based features. On a per word basis, the F1-score is 53% and accuracy is 70%, which outperforms a frequency-based baseline by a wide margin. We discuss a number of possible directions for future research.


I N T R O D U C T I O N
Morphological analysis plays an important role in many natural language processing applications. Typical morphological analysis tasks include the recognition of morphologically related words, stemming and lemmatization, segmentation of words into morphemes, and the labeling of morphemes with the grammatical features they express. Inflectionally rich languages, such as Slavic languages, are notoriously challenging for morphological analysis as they are highly fusional and abound with morphological syncretisms. For such languages, the word-and-paradigm approach to morphology (Hockett 1954) seems to be the only reasonable option. In traditional grammar, an inflectional paradigm is "a set of all the inflected forms that a lexeme assumes" (Aronoff and Fudeman 2011).
A paradigm is typically represented as a table (in general, an n-dimensional array, where n is the number of features) in which each cell corresponds to a particular combination of grammatical features (cf. Table 1). Paradigms with identical patterns of inflection can be grouped together and for each such group a single paradigm can be chosen as an exemplary paradigm. Calder (1989) was among the first to use paradigms in a computational model of morphology. In his work, and most subsequent work related to paradigmatic morphology, the word "paradigm" is used in a more technical sense (which we adopt here) to denote a formal description of an inflectional pattern.
Morphological analysis of inflectionally rich languages typically relies on some sort of morphological lexicon, which lists the stems or lemmas (the canonical forms of lexemes) and their associated paradigms. However, the unavoidable problem of lexicon-based morphological analysis is the limited lexicon coverage.
A real-life morphological analyzer must be able to deal in a satisfactory manner with out-of-vocabulary words. In paradigmatic morphology, this means being able to predict the correct inflectional paradigm of a given word-form.
In this article we address the task of predicting the lemma and the correct inflectional paradigm (the description of an inflectional pattern) of unknown Croatian words. We frame this as a supervised machine learning problem: we train a model that decides which lemma and paradigm are correct based on a number of string- and corpus-based features. The model is used to disambiguate the output of a morphology grammar. Given an unknown word-form as input, we first generate the candidate lemma-paradigm pairs using the morphology grammar, and then use the classifier to decide which pair is correct. This is in contrast to most earlier approaches, which use handcrafted scoring functions to decide on the correct paradigm. The aim of this article is to examine the machine learning aspect of the problem: what the relevant features are and how well we can do on this classification task. We carry out feature analysis and evaluate the classification accuracy using different feature subsets. We show that a satisfactory level of accuracy can be achieved with a combination of string- and corpus-based features. Although our focus is on the Croatian language, we believe our results are applicable to other languages, especially Slavic languages.
The rest of the article is structured as follows. In the next section we give a brief overview of related work. In Section 3 we define the problem of paradigm prediction, while in Section 4 we describe the features used for building the models. In Section 5 we analyze the features, evaluate the classification accuracy, and discuss the results. Section 6 concludes the article and outlines future work.

R E L A T E D W O R K
Much work on paradigm prediction comes from research in part-of-speech (POS) tagging and the related task of POS guessing (Mikheev 1997; Kupiec 1992). The problem has also been addressed in the context of rule-based machine translation (Esplá-Gomis et al. 2011). However, most work seems to address paradigm prediction in relation to (semi-)automatic lexicon acquisition (Oliver 2003; Tadić and Fulgosi 2003; Oliver and Tadić 2004; Clement et al. 2004; Sagot 2005; Forsberg et al. 2006; Hana 2008; Šnajder et al. 2008; Adolphs 2008; Lindén 2009; Kaufmann and Pfister 2010; Esplá-Gomis et al. 2011). The basic idea is to first use a lemmatizer to obtain the lemmas and paradigms for each word-form from a corpus. Because of grammar ambiguity, this usually results in a number of possible candidates. Thus, the next step is to disambiguate the output of the morphology grammar by assessing the plausibility of each lemma-paradigm pair. This is most commonly done by generating the corresponding word-forms and analyzing their corpus frequencies. An incorrect lemma-paradigm pair is likely to produce linguistically invalid word-forms that will not be attested in the corpus, and in this case a suitably designed corpus-based scoring function can be used to decide which paradigm is correct. Some approaches use the web as an additional source of information (Oliver and Tadić 2004; Cholakov and Van Noord 2009). Moreover, some approaches use word-form properties to decide on the correct paradigm: Forsberg et al. (2006) use handcrafted constraints, while Segalovich (2003) guesses the stems and the paradigms based on morphological similarity. Lindén (2009) uses both corpus-based features and lexicon-based information to learn analogical relations with which lemmas and paradigms of unknown words can be predicted. It is also possible to use context-based information when analyzing the word-forms from a corpus (Kaufmann and Pfister 2010). More recent approaches use machine learning to predict the stem and the morphosyntactic features (Kaufmann and Pfister 2010).
Another line of research that has addressed the problem of paradigm induction is unsupervised morphology learning (Hammarström and Borin 2011). Unsupervised morphology learning aims to discover morphology descriptions from unannotated data, for the purpose of, inter alia, deriving language descriptions, bootstrapping morphological analyzers, and modeling language acquisition. The seminal work is that of Goldsmith (2001), who extracts sets of stems and affixes (so-called signatures), the latter bearing resemblance to paradigms, based on the minimum description length principle. In other work paradigms are typically induced by clustering the word-forms from a corpus and analyzing their endings (Nakov et al. 2004; Oliver 2003; Monson et al. 2008), possibly within a probabilistic framework (Chan 2006; Dreyer and Eisner 2011).
In this work we do not consider the problem of unsupervised paradigm induction, but instead address the task of paradigm prediction in a supervised setting. We are interested in building good models for paradigm prediction, assuming that the training data is available. Our work focuses on the machine learning aspect of the problem: we test a comprehensive set of features and carry out a detailed evaluation of the models.

P R O B L E M D E F I N I T I O N
The problem of predicting inflectional paradigms of (unknown) words can be formulated as follows: given a word-form w, determine its stem s and the corresponding inflectional paradigm p. For example, given word-form vojnika (genitive singular/accusative singular/genitive plural form of the noun vojnik (soldier)), we wish to determine that its stem is vojnik and that its paradigm p is the one shown in Table 1. The corresponding paradigm is the one which, when used with stem s, generates the valid word-forms of s, including word-form w itself. The stem s and the paradigm p are tied together in the sense that s functionally depends on p: in other words, given w, the inflectional paradigm (possibly ambiguously) determines the stem of w.1 For example, if we know that p is the paradigm of vojnika, we also know that the stem of vojnika is vojnik. Likewise, the stem and the inflectional paradigm (possibly ambiguously) determine the lemma l. Thus, the problem of paradigm prediction actually amounts to determining, for a given word-form w, its lemma l and the associated inflectional paradigm p. In what follows, we call a pair (l, p), consisting of lemma l and inflectional paradigm p, a lemma-paradigm pair, or an LPP for short.
1 Ambiguity arises in the presence of a non-bijective transformation from a stem to a word-form, which gives rise to a non-functional inverse transformation back from the word-form to the stem. A typical example in Croatian inflectional morphology is the class of morphologically conditioned stem-internal changes that replace two or more distinct phonemes with one identical phoneme. Cases in point are the palatalization alternations k/č (vojnik→vojniče) and c/č (stric→striče). For details, please refer to Šnajder (2010).
We call an LPP (l, p) correct if (1) the lemma l is valid (it is an existing word of the language and it is indeed a lemma) and (2) the paradigm p is the correct paradigm for l; otherwise we call the LPP incorrect.
The difficulty in determining the correct inflectional paradigm arises from the fact that for most word-forms there are many candidate LPPs: a large number of possible stems can be combined with many paradigms defined for a language.
It should be emphasized that this will be the case even when using a handcrafted morphology grammar. A morphology grammar can, of course, narrow down the space of possibilities, but it cannot completely resolve the ambiguity because the question of which stems combine with what paradigms is ultimately a lexical one. Thus, in order not to discard a possibly valid hypothesis, a morphology grammar will have to overgenerate. In view of this, the problem of paradigm prediction is typically approached in two steps: (1) generation of LPP hypotheses admissible by the grammar and (2) the selection of the correct LPP based on grammar-external evidence. Note that, due to homography, some word-forms will have more than one correct LPP. The selection of correct LPPs is typically accomplished using some heuristic scoring mechanisms. Alternatively, as we do in this article, selection can be framed as a classification problem.
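The two-step view can be illustrated with a toy grammar. The sketch below is only an illustration under stated assumptions: the paradigm identifiers P1 and P2, their endings, and the tiny tag set are hypothetical and far simpler than a real Croatian grammar (in particular, stem changes are ignored). It shows how even a small grammar overgenerates candidate LPPs for a single word-form.

```python
# Toy paradigms (hypothetical, not the grammar used in the article):
# each paradigm maps a morphological tag to an ending.
PARADIGMS = {
    "P1": {"Nmsn": "", "Nmsg": "a", "Nmsd": "u"},
    "P2": {"Nfsn": "a", "Nfsg": "e", "Nfsd": "i"},
}
# Tag of the lemma (citation) form within each toy paradigm.
LEMMA_TAG = {"P1": "Nmsn", "P2": "Nfsn"}


def wfs(lemma, paradigm):
    """Generate all word-forms of `lemma` under `paradigm` (empty set if inapplicable)."""
    endings = PARADIGMS[paradigm]
    lemma_ending = endings[LEMMA_TAG[paradigm]]
    if not lemma.endswith(lemma_ending):
        return set()
    stem = lemma[: len(lemma) - len(lemma_ending)]
    return {stem + ending for ending in endings.values()}


def lm(word_form):
    """Return every (lemma, paradigm) candidate whose paradigm can produce `word_form`."""
    candidates = set()
    for pid, endings in PARADIGMS.items():
        for ending in endings.values():
            if word_form.endswith(ending) and len(word_form) > len(ending):
                stem = word_form[: len(word_form) - len(ending)]
                candidates.add((stem + endings[LEMMA_TAG[pid]], pid))
    return candidates


# Step 1: the grammar proposes several hypotheses for one word-form.
for candidate in sorted(lm("vojnika")):
    print(candidate)
```

Step 2, the selection among these candidates, needs evidence from outside the grammar, which is what the classifier described in the remainder of the article is meant to provide.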

L P P c a n d i d a t e g e n e r a t i o n
The first step in paradigm prediction is the generation of LPP candidates using a morphology grammar (an inflectional morphology model). We assume that the grammar is generative (capable of generating word-forms given a lemma) and reductive (capable of reducing a word-form to a stem); consequently, by compositionality we assume that the grammar is capable of lemmatizing a given word-form. We can abstract this functionality with two functions: wfs(l, p), which generates the word-forms of lemma l under paradigm p, and lm(w), which returns the candidate LPPs for word-form w. [...] adjectives. These words could be covered by the adjective paradigms but would need to be additionally disambiguated; we leave this for future work.
In HOFM, the morphological tags are encoded as MULTEXT-East descriptors (Erjavec et al. 2003).3 MULTEXT-East encodes values of morphosyntactic attributes in a single string, using positional encoding. Each attribute is represented by a single letter at a predefined position, while non-applicable attributes are represented by hyphens. HOFM omits the values of those features that cannot be deduced solely at the morphological level, such as noun type (common/proper) or animacy of nouns and adjectives. For example, descriptor "Nmsn" denotes a word-form that is a masculine noun in singular nominative case, but whose type and animacy features are unknown.4 As regards the verbs, the current version of HOFM encodes the complete paradigms of main verbs, except the aorist, imperfect, and passive forms. In HOFM, the passive forms are covered by adjectival paradigms, i.e., the passive participle (which is used for building both passive verb forms and adjective forms) is considered as part of the adjectival paradigm.5 On the other hand, the aorist and imperfect forms were left out because they are rather uncommon in contemporary texts.6 HOFM also accounts for doublets (morphological variants with identical grammatical features), quite common for Croatian adjectives (e.g., the -og/oga and -om/omu/ome allomorphs in crvenog/crvenoga and crvenom/crvenomu/crvenome, respectively) and nouns with stem changes (e.g., tvrtki/tvrtci).
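To make the positional encoding concrete, the following sketch decodes a reduced noun descriptor such as "Nmsn". It assumes, purely for illustration, that the reduced descriptor lists category, gender, number, and case in that order; the exact layout used by HOFM is not specified here, and the function name `decode_noun` is hypothetical.

```python
# Assumed attribute order after the category letter (an illustrative
# assumption, not the documented HOFM layout).
NOUN_POSITIONS = ["gender", "number", "case"]

VALUES = {
    "gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "number": {"s": "singular", "p": "plural"},
    "case": {
        "n": "nominative", "g": "genitive", "d": "dative", "a": "accusative",
        "v": "vocative", "l": "locative", "i": "instrumental",
    },
}


def decode_noun(descriptor):
    """Decode a reduced positional noun descriptor into a feature dictionary."""
    if not descriptor.startswith("N"):
        raise ValueError("not a noun descriptor: " + descriptor)
    features = {"category": "noun"}
    for attribute, code in zip(NOUN_POSITIONS, descriptor[1:]):
        if code != "-":  # a hyphen marks a non-applicable or unknown attribute
            features[attribute] = VALUES[attribute][code]
    return features


print(decode_noun("Nmsn"))
```

Because every attribute has a fixed position, decoding is a simple lookup per character, and a hyphen simply leaves the corresponding attribute unset.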

L P P c l a s s i f i c a t i o n
In the second step, given candidate LPPs generated by the grammar, we wish to decide which one is correct. In a supervised setting, the problem may be cast as (1) multiclass classification (choosing one LPP among many candidate LPPs), (2) multilabel classification (choosing a number of LPPs among many candidate LPPs),9 or (3) binary classification (deciding for each LPP from candidate LPPs whether it is correct). The problem with (1) and (2) is that the set of possible classes cannot be straightforwardly defined. More precisely, in these cases each class should correspond to a single LPP (not a single paradigm, as a single paradigm can occur in different LPPs), so one should come up with a way of representing these without actually encoding the lemma itself (e.g., by encoding the paradigm and the stem transformation). Another problem is that not all such classes would be admissible by the grammar, so one would need to find a way to include that information as well (at a cost of increased complexity, this information could be fed to the classifier as a binary feature). An additional, albeit less significant, problem with (1) is that it does not account for homographs (the cases in which a single word-form has more than one correct LPP).10 Approach (3), in which binary decisions are made for each LPP candidate, does not suffer from either of these problems and we shall adopt it here.
For classification, we use the support vector machine (SVM) (Vapnik 1999) with a radial basis function (RBF) kernel. The SVM algorithm tends to outperform other machine learning algorithms on a variety of learning problems. The RBF kernel implicitly defines an infinite-dimensional feature space, and is thus a good choice for problems for which the number of instances is much larger than the number of features, which will be the case here. We use the LIBSVM implementation of the SVM algorithm (Chang and Lin 2011).
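For readers unfamiliar with the RBF kernel, the sketch below computes its standard form, k(x, y) = exp(-gamma * ||x - y||^2), in plain Python. It is an illustration of the kernel function only, not of the LIBSVM interface; the function name and the example values of gamma are arbitrary choices.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical feature vectors have similarity 1; similarity decays with distance.
print(rbf_kernel([1.0, 0.0], [1.0, 0.0], gamma=0.5))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```

The width parameter gamma controls how quickly similarity decays with distance, which is why it is normally tuned together with the SVM regularization parameter.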
As a source of training data, we use the semi-automatically acquired inflectional lexicon from Šnajder et al. (2008).11 The lexicon was acquired from articles comprising the newspaper section of the Croatian National Corpus totaling 20 million word-form tokens (Tadić 2002). The lexicon contains 68,465 manually verified LPPs for Croatian nouns, adjectives, and verbs. We will use a fraction of this data for training and testing. It should be noted that the distribution of LPPs in the lexicon with respect to the paradigms is very uneven; the ten least frequent paradigms appear only 40 times in the lexicon, whereas the ten most frequent paradigms appear over 50,000 times.

F E A T U R E S
Given an LPP candidate generated by the grammar, we compute a set of features based on which the LPP can be classified as either correct or incorrect. At this point we make no attempt to define a minimal set of features; instead, we use features that are easily computable and can be intuitively justified. We distinguish between two groups of features: string-based and corpus-based.

S t r i n g -b a s e d f e a t u r e s
The string-based features are based on the orthographic properties of the lemma or the stem. The intuition behind this is that incorrect LPPs tend to generate ill-formed (or somewhat odd-formed) stems and lemmas. For example, there is no adjective in Croatian that ends in -kč; an LPP that would generate such a stem could be discarded immediately. In fact, many paradigms defined in traditional grammar books are conditioned on the stem ending, requiring that it belongs to a certain group of phonemes or that it forms a consonant group.
Similarly, there are paradigms that are applicable only to one-syllable stems. With string-based features we aim to capture this information in an implicit and less strict way.
11 Alternatively, we could have used the Croatian Morphological Lexicon (Tadić and Fulgosi 2003), but this would have been less straightforward because this lexicon uses a different set of paradigms from HOFM.
We use a set of eleven string-based features:
1. EndsIn - the ending character of the stem;
2. EndsInCgr - a binary feature indicating whether the word-form ends in a consonant group (two consecutive consonants);
3. EndsInCons - a binary feature indicating whether the word-form ends in a consonant;
4. EndsInNonPals - a binary feature indicating whether the word-form ends in a non-palatal (v, r, l, m, n, p, b, f, t, d, s, z, c, k, g, [...]

Notice that some features overlap. For example, the OneSyllable feature is a stripped-down version of the NumSyllables feature. While in general a more expressive feature is preferred, a feature may be too expressive and cause the model to overfit the training data. To account for this, the standard approach is to first consider all plausible features, some of which might overlap, and then perform feature analysis to filter out the redundant features. We turn to feature analysis in Section 5.2.
The features StemSuffixProb and LemmaSuffixProb can be seen as soft conditions on stem and lemma endings, respectively. The intuition is that, if a stem or lemma ends in a suffix that is highly probable for a particular paradigm, then this paradigm is likely to be the correct one. Conversely, if a stem or a lemma ends in a suffix that has rarely or never been observed for a stem or a lemma of a given paradigm, it is very likely that the stem or the lemma is ill-formed and does not belong to that particular paradigm. We obtain these probabilities as maximum likelihood estimates from the morphological lexicon used for training. As an example, consider Table 2, which shows the five most frequent stem suffixes for noun paradigms N01 and N04 and adjective paradigm A06.12 The probability distributions are quite different for the three paradigms. Incidentally, for paradigm N04 suffix -nik accounts for more than 51% of suffixes. Returning to our earlier example from Section 3.1, we can use this information as strong evidence that LPP (vojnik, N04) is correct and that LPPs (vojnik, N01) and (vojnik, A06) are both incorrect.
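A maximum likelihood estimate of this kind reduces to relative frequency counting. The following sketch is a minimal illustration, assuming a fixed suffix length of three characters and a tiny made-up lexicon; the suffix length actually used for the features, and the function names, are assumptions.

```python
from collections import Counter, defaultdict


def suffix_probabilities(lexicon, suffix_len=3):
    """Estimate P(suffix | paradigm) by relative frequency.

    `lexicon` is an iterable of (stem, paradigm_id) pairs.
    """
    counts = defaultdict(Counter)
    for stem, paradigm_id in lexicon:
        counts[paradigm_id][stem[-suffix_len:]] += 1
    return {
        pid: {suffix: c / sum(counter.values()) for suffix, c in counter.items()}
        for pid, counter in counts.items()
    }


def stem_suffix_prob(probs, stem, paradigm_id, suffix_len=3):
    """Look up the estimate; unseen suffixes or paradigms get probability 0."""
    return probs.get(paradigm_id, {}).get(stem[-suffix_len:], 0.0)


# Made-up toy lexicon, for illustration only.
toy_lexicon = [
    ("vojnik", "N04"), ("radnik", "N04"), ("putnik", "N04"), ("vrag", "N04"),
    ("izvor", "N01"), ("ekran", "N01"),
]
probs = suffix_probabilities(toy_lexicon)
print(stem_suffix_prob(probs, "vojnik", "N04"))
print(stem_suffix_prob(probs, "vojnik", "N01"))
```

On real data one might prefer smoothed estimates so that unseen suffixes do not receive exactly zero probability; whether smoothing was applied is not stated here.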

C o r p u s -b a s e d f e a t u r e s
The second group of LPP features, the corpus-based features, are calculated based on the frequencies of word-forms attested in the corpus. The general idea is that a correct LPP should have more of its word-forms attested in the corpus than an incorrect LPP. Instead of only looking at total counts of attested word-forms, as proposed by Šnajder et al. (2008), one can also look at the distributions of attested word-forms across the morphological tags. The intuition behind this is that every inflectional paradigm has its own distribution of morphological tags, and that a correct LPP will generate word-forms that obey such a distribution. For instance, in the case of a noun paradigm, we can expect a genitive word-form to be far more frequent than a vocative word-form. Hence, an LPP that generates more vocative word-forms than genitive word-forms is unlikely to be correct.
12 Paradigm N01 describes the so-called "type a" declension of masculine nouns, such as izvor (source) and ekran (screen). Paradigm N04 is similar to N01, except that it applies to stems ending in k/g/h, which undergo a stem change, as exemplified by Table 1. Paradigm A06 describes the inflection of qualificative adjectives with comparative suffix -iji, such as star (old) and loš (bad).
Slovenščina 2.0, 2 (2013)
In [...] paradigms is rather different, although all tags have non-negative probabilities. The third and sixth columns show the tag distribution conditioned on both the paradigm and the lemma l = vojnik. From this we can see that, for example, if vojnik is paired with the paradigm N01, the genitive case forms would have a high probability of 0.77, while the probability of the same forms at the paradigm level is only 0.27. Overall, the tag distribution of N04 seems to provide a better fit to lemma vojnik than the tag distribution of N01. The similarity of distributions P(t|p) and P(t|l, p) can be measured in a number of ways, one of them being the Jensen-Shannon divergence. In this particular case, the Jensen-Shannon divergence is much larger for paradigm N01 than for N04, providing supporting evidence that N04 is the correct paradigm.
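The Jensen-Shannon divergence between two tag distributions can be computed as in the sketch below. This is the standard definition (the average Kullback-Leibler divergence of each distribution from their mixture, here with base-2 logarithms); the example distributions are made up and are not the values from the article's tables.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q from their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Made-up distributions over three tags, for illustration only.
paradigm_level = [0.5, 0.3, 0.2]      # plays the role of P(t | p)
good_fit = [0.45, 0.35, 0.2]          # P(t | l, p) for a plausible LPP
poor_fit = [0.05, 0.15, 0.8]          # P(t | l, p) for an implausible LPP

print(js_divergence(paradigm_level, good_fit))
print(js_divergence(paradigm_level, poor_fit))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), and, unlike the Kullback-Leibler divergence, it is symmetric and always finite.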
We use the following nine corpus-based features:
1. LemmaAttested - a binary feature indicating whether the lemma is attested in the corpus, i.e., #(l, C) > 0;
2. Score0 - the number of corpus-attested word-form types generated by the [...]
9. Score7 - the cosine similarity between the aforementioned distributions: [...]

We computed the above features on the Vjesnik newspaper corpus, spanning years 1999 through 2009 and totaling about 400K word-form types and about 55M word-form tokens. Stop words (function words, including all closed-class words) and words occurring fewer than three times in the corpus were filtered out to reduce the noise.
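Two of the simpler corpus-based quantities can be sketched as follows: a count of corpus-attested word-form types and the cosine similarity between two tag distributions. This is a sketch under assumptions: the function names are hypothetical, the corpus is represented as a plain frequency dictionary, and the distributions compared are assumed to be P(t|p) and P(t|l, p).

```python
import math


def attested_types(word_forms, corpus_freq):
    """Count the distinct word-forms that occur at least once in the corpus."""
    return sum(1 for w in set(word_forms) if corpus_freq.get(w, 0) > 0)


def cosine_similarity(p, q):
    """Cosine of the angle between two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


# Made-up corpus frequencies, for illustration only.
corpus_freq = {"vojnik": 120, "vojnika": 340, "vojniku": 15}
forms = ["vojnik", "vojnika", "vojniku", "vojniče"]

lemma_attested = corpus_freq.get("vojnik", 0) > 0   # corresponds to #(l, C) > 0
print(lemma_attested, attested_types(forms, corpus_freq))
print(cosine_similarity([0.5, 0.3, 0.2], [0.45, 0.35, 0.2]))
```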

O t h e r f e a t u r e s
Besides the string- and corpus-based features, we also use the following two features:
1. ParadigmId - a categorical (multinomial) feature denoting the LPP's inflectional paradigm;
2. POS - the part-of-speech of the LPP's inflectional paradigm (noun, adjective, or verb).
The intuition behind the ParadigmId feature is that we expect a functional dependence to exist between the paradigm and the values of other features, and having ParadigmId as a feature allows the model to exploit this dependence. For example, it is reasonable to expect that the EndsInCons feature is relevant only for a subset of paradigms that are applicable to stems ending in a consonant. Similarly, we can expect the Score2 feature to be less indicative for adjectival paradigms: the proportion of corpus-attested word-form types will generally be lower for adjectives than for other parts-of-speech because comparative and superlative word-forms are less frequent in the corpus.13 The same line of reasoning holds for the POS feature.
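Categorical features such as ParadigmId and POS are commonly passed to an SVM as binary indicator vectors. The sketch below shows one such encoding; the helper names and the short category lists are hypothetical, and the exact encoding used for the experiments is an assumption.

```python
def one_hot(value, categories):
    """Return a binary indicator vector with a 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


def encode_lpp(paradigm_id, pos, numeric_features, paradigm_ids, pos_tags):
    """Concatenate the indicator vectors with the remaining numeric features."""
    return one_hot(paradigm_id, paradigm_ids) + one_hot(pos, pos_tags) + list(numeric_features)


# Illustrative category inventories (a real grammar defines many more paradigms).
paradigm_ids = ["N01", "N04", "A06"]
pos_tags = ["noun", "adjective", "verb"]

print(encode_lpp("N04", "noun", [0.51, 1.0], paradigm_ids, pos_tags))
```

Encoding the paradigm as indicators lets a nonlinear model learn paradigm-specific interactions, such as a feature mattering only for some paradigms.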
In this section we turn to the evaluation of the paradigm prediction models. The purpose of evaluation is twofold: apart from determining how accurately we can predict the inflectional paradigms, we also wish to analyze what features are most useful for this task. We continue by first describing the data set, followed by feature analysis and evaluation of classification accuracy.

D a t a s e t
We compiled the data set for training and testing from the aforementioned inflectional lexicon from Šnajder et al. (2008). We sampled from the lexicon 5,000 LPPs for training and 5,000 LPPs for testing, with at least one attested word-form in the corpus. Because the distribution of paradigms is very uneven, we used stratified sampling with respect to the inflectional paradigms. Furthermore, we ensured that there is no LPP that appears in the test set, but does not appear in the training set, as otherwise the probability distributions would be undefined.
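Stratified sampling with respect to paradigms can be sketched as below: each paradigm contributes to the sample in proportion to its share of the lexicon. The function is a hypothetical illustration; in particular, keeping at least one item per paradigm and the rounding scheme are assumptions, so the sample size is only approximately n in general.

```python
import random
from collections import defaultdict


def stratified_sample(lpps, n, seed=0):
    """Sample roughly `n` LPPs so that each paradigm keeps its lexicon proportion.

    `lpps` is a list of (lemma, paradigm_id) pairs.
    """
    rng = random.Random(seed)
    by_paradigm = defaultdict(list)
    for lpp in lpps:
        by_paradigm[lpp[1]].append(lpp)
    sample = []
    for items in by_paradigm.values():
        # Proportional quota, with at least one item per paradigm (an assumption).
        quota = max(1, round(n * len(items) / len(lpps)))
        sample.extend(rng.sample(items, min(quota, len(items))))
    return sample


# Made-up, heavily skewed lexicon: 90 LPPs of one paradigm, 10 of another.
toy = [("lemma%d" % i, "N01") for i in range(90)] + [("lemma%d" % i, "N04") for i in range(90, 100)]
picked = stratified_sample(toy, 10)
print(len(picked), sum(1 for _, pid in picked if pid == "N04"))
```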
Table 4 shows the distributions and coverage of noun, adjective, and verb paradigms in the test set. The distributions follow a power law; the five most frequent paradigms for each part-of-speech account for over 77% of types in the data set and cover over 75% of word-form tokens in the corpus. Nouns make up the majority of the lexicon (67%), followed by adjectives (22.6%), and verbs (10.4%). In the corpus, however, the proportion of verbs (25.8%) is larger than that of adjectives (19.2%), again with a clear prevalence of nouns (55%).
To generate the negative training and testing instances, we proceeded as follows. For each LPP, we generate all word-forms using the function wfs (cf. Section 3.1). Then, for all corpus-attested obtained word-forms, we generate the candidate LPPs using the function lm, and filter out those LPPs that exist in the lexicon. This generates a large number of incorrect LPPs, from which we again sample 5,000 for training and 5,000 for testing. Thus we end up with 10,000 LPPs (5,000 correct and 5,000 incorrect) in both the training and test set. Given the number of classes and features (a total of 146 binary-encoded features), the amount of training data ought to be sufficient; a larger training set would unnecessarily increase the time required for training. Notice that the training set contains correct and some incorrect LPPs for each sampled word-form, while the test set contains LPPs obtained from word-forms that did not appear in the training set. Also notice that the number of positive and negative instances is artificially balanced; a realistic data set would contain about 17 incorrect LPPs for each correct LPP. We chose to balance the data set because SVM tends to perform poorly on imbalanced data sets (Wu and Chang 2003).
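The procedure for generating negative instances can be sketched as follows. The functions `wfs` and `lm` are passed in as parameters; the stub versions below, the paradigm identifier N10, and the toy data are hypothetical stand-ins for the grammar, used only to make the sketch runnable.

```python
def negative_lpps(lexicon, corpus_vocab, wfs, lm):
    """Collect candidate LPPs of attested word-forms that are absent from the lexicon."""
    negatives = set()
    for lemma, paradigm_id in lexicon:
        for word_form in wfs(lemma, paradigm_id):
            if word_form in corpus_vocab:  # only corpus-attested word-forms
                negatives.update(c for c in lm(word_form) if c not in lexicon)
    return negatives


# Stub grammar functions over made-up data, for illustration only.
def toy_wfs(lemma, paradigm_id):
    return {"vojnik", "vojnika"} if (lemma, paradigm_id) == ("vojnik", "N04") else set()


def toy_lm(word_form):
    table = {
        "vojnika": {("vojnik", "N04"), ("vojnika", "N10")},
        "vojnik": {("vojnik", "N04"), ("vojnik", "N01")},
    }
    return table.get(word_form, set())


lexicon = {("vojnik", "N04")}
print(negative_lpps(lexicon, {"vojnika"}, toy_wfs, toy_lm))
```

Treating every candidate that is missing from the lexicon as incorrect presumes the lexicon is complete for the sampled word-forms; homographs not covered by the lexicon would be mislabeled under this assumption.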

F e a t u r e a n a l y s i s
Some of the features we defined are redundant or perhaps irrelevant for LPP prediction. Because in absolute terms the number of features is not large, we need not perform feature analysis in order to reduce this number. Instead, the purpose of our feature analysis is to gain insight into what features are useful for paradigm prediction.
For feature analysis we used the open source tool Weka (Hall et al. 2009). We used three univariate filtering methods: information gain (IG), gain ratio (GR), and the RELIEF method. The univariate filtering methods determine the relevance of features based on the intrinsic properties of the data; a statistical test is applied to each individual feature in order to determine its importance, features are ranked accordingly, and a desired number of top-ranked features is then chosen. Among the three considered methods, RELIEF (Kira and Rendell 1992; Kononenko 1994) is probably the most efficient. RELIEF works by iteratively estimating the feature weights based on their ability to discriminate between neighboring instances in the input space.14

Table 5 summarizes the feature analysis results. We list feature rankings obtained on the training set, with the first five ranks shown in bold. The first two methods produced similar rankings: among string-based features, suffix probabilities are ranked the highest, and among corpus-based features, feature Score5 is often ranked high, while ranks of other features vary. There are a number of features that are low-ranked (rank > 10) by each of the three methods: the five EndsIn* features, NumSyllables, OneSyllable, StemLength, Score1, Score3, and POS. Individually, these features seem to be less relevant for paradigm prediction, according to the methods we used.
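As an illustration of univariate ranking, the sketch below computes the information gain of a discrete feature with respect to the class label, i.e., the reduction in label entropy after splitting on the feature's values. It is a from-scratch illustration rather than Weka's implementation, and it assumes the feature is already discrete (numeric features would first need to be discretized).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the expected label entropy after splitting on the feature."""
    n = len(labels)
    by_value = {}
    for value, label in zip(feature_values, labels):
        by_value.setdefault(value, []).append(label)
    conditional = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - conditional


# Made-up example: the first feature separates the classes, the second does not.
labels = [1, 1, 0, 0]
print(information_gain([1, 1, 0, 0], labels))
print(information_gain([1, 0, 1, 0], labels))
```

Gain ratio normalizes this quantity by the entropy of the feature itself, which counteracts the bias of information gain toward features with many distinct values.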
The univariate methods do not measure the dependencies between the features, thus they cannot detect feature redundancy. We therefore also analyzed the features using two multivariate feature subset selection (FSS) methods: correlation-based feature selection (CFS) (Hall 1998) and consistency subset selection (CSS) (Liu and Setiono 1996), both with greedy forward search as the optimization method. Table 5 shows the optimal subset selection obtained with each of these methods. Notice that both selected subsets contain both string- and corpus-based features.
14 More recently, Sun and Li (2006) have shown that RELIEF is less heuristic than initially thought, and in fact solves a margin optimization problem based on the nearest neighbor classifier.

C l a s s i f i c a t i o n a c c u r a c y
We conducted two experiments to evaluate the classification accuracy of our models. In the first experiment, we evaluate the binary classification accuracy, which is in line with how we formulated the problem of paradigm prediction. In the second experiment, we consider a more realistic setting and evaluate classification accuracy on a per word basis.

BINARY CLASSIFICATION ACCURACY
In the first experiment, we trained eight models using different feature subsets. We optimized the parameters of each model separately using 5-fold cross-validation on the training set. Table 6 shows classification accuracy on the test set. The reliability of probability estimates used for some of the corpus-based features depends on the frequencies of word-forms in the corpus. In a realistic setting, the unknown words tend to be less frequent in the corpus. To analyze how models would perform in such cases, we evaluated on three frequency bands: all LPPs, LPPs for which the frequency of word-forms in the corpus is less than or equal to 100 (rare words, accounting for 66% of the test set) and less than or equal to 10 (very rare words, accounting for 22% of the test set). The performance baseline is the majority class in each test set.
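The per-band evaluation can be sketched as follows, assuming each test instance carries its gold label, the predicted label, and the corpus frequency of its word-form. The function names and the toy instances are hypothetical; the thresholds 100 and 10 are the ones given in the text.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def majority_baseline(gold):
    """Accuracy obtained by always predicting the most frequent gold label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)


def evaluate_band(instances, max_freq=None):
    """Accuracy and majority baseline on instances with frequency <= max_freq.

    `instances` is a list of (gold, predicted, word_form_frequency) triples;
    `max_freq=None` selects all instances.
    """
    band = [(g, p) for g, p, f in instances if max_freq is None or f <= max_freq]
    gold, predicted = zip(*band)
    return accuracy(gold, predicted), majority_baseline(gold)


# Made-up predictions, for illustration only.
instances = [(1, 1, 5), (0, 1, 8), (1, 1, 50), (0, 0, 500)]
for threshold in (None, 100, 10):
    print(threshold, evaluate_band(instances, threshold))
```

Recomputing the baseline within each band matters because the class balance may differ between frequency bands.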
As expected, the maximum accuracy of about 92% was achieved when using all features. Interestingly, in this case the classification accuracy does not decrease much on rare or very rare word-forms. Using only string- or corpus-based features gives worse performance than when using both kinds of features. Furthermore, as expected, using only corpus-based features decreases the performance on rare words. As regards the models with selected feature subsets, all perform above the baseline except the one obtained with CSS. The RELIEF method seems to have selected a very good subset of features; a model with only five features (ParadigmId, EndsIn, LemmaSuffixProb, Score5, and Score2) performs only slightly worse than the model using the full set of 22 features.
[...] this set, we generated the incorrect LPPs by choosing at random one word-form from the set wfs(l, p) and applying to it the lm(w) function to obtain all its LPP candidates. In this way we obtain for each word-form its correct LPP and all its incorrect LPPs.15 The final data set contains 17,111 LPPs. On this set we compute the model performance in terms of the standard information retrieval measures of precision (P), recall (R), and (micro-averaged) F1-score (van Rijsbergen 1979). A precision score of 100% would mean that no incorrect LPP has been classified [...]
15 Here we ignore the fact that homographs have more than one correct LPP. This only marginally affects the precision scores.
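For reference, precision, recall, and F1 over binary LPP decisions can be computed as in the sketch below, where the decisions of all word-forms are pooled before counting (micro-averaging). The function name and the example labels are illustrative.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 for the positive class (label 1 = correct LPP)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up pooled decisions over the candidate LPPs of several word-forms.
gold = [1, 0, 0, 1]
predicted = [1, 1, 0, 0]
print(precision_recall_f1(gold, predicted))
```

On a realistic, imbalanced candidate set, precision is the measure most affected by overgeneration, since every incorrect candidate accepted by the classifier counts as a false positive.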
Choice of corpus. In this work we used a newspaper corpus, and it is possible that this choice has an effect on the overall prediction accuracy. As noted by one reviewer, a newspaper corpus is not likely to contain many aorist and imperfect verb forms, which would add significantly to homography. Although the grammar we used does not model the aorist and imperfect verb forms, the argument still applies to vocative noun forms, which would also add to homography but are likewise underrepresented in newspaper corpora. Although this issue deserves further examination, from a practical viewpoint it is indeed reasonable to use a corpus that minimizes homography, as this will improve the overall prediction accuracy.
Choice of grammar. Perhaps a more interesting question is the choice of the grammar. What might be of particular importance for paradigm prediction in Croatian is the modeling of verbs, more concretely the question of how to treat adjectival and adverbial participles (cf. footnote 5). In the current implementation of the HOFM grammar, the adjectival participles are not included in the verb paradigm. However, including them in the verb paradigm might allow for better prediction of verb paradigms. We leave this issue for future work.
Another issue is the level of grammar ambiguity. HOFM defines applicability conditions for many paradigms; a grammar that does not define such conditions would overgenerate more, leading to a decrease in precision. This suggests that paradigm prediction performance depends on the specific grammar used and perhaps does not readily generalize across different grammars.
Training set selection. Another issue that we did not address is the size and diversity of the training set. Often a large morphological lexicon is not available, and one wishes to use paradigm prediction to acquire such a lexicon. Related to this is the question of how many instances per paradigm we need to train a good classifier. The active learning framework provides a way to minimize the number of training instances and hence reduce the manual labeling effort.
Active learning may also be combined with ranking-based classification to speed up the annotation process.
Semi-automatic lexicon acquisition. Probably the most interesting application of paradigm prediction is semi-automatic lexicon acquisition. In this setting, confidence-ranked lists of LPP candidates are presented to an expert, who then identifies the correct LPP, which ideally should be ranked first. Here it would make sense to evaluate paradigm prediction as a ranking task. There are also a number of other factors that should be considered, such as the presence of noise in the corpus (i.e., words for which no correct LPP exists and which should be rejected), the treatment of proper names, and the workflow parameters (e.g., in what order the word-forms should be processed, whether the model is updated based on the input from the expert, etc.).
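A ranking-based evaluation of this kind could be sketched as follows. The sketch assumes the classifier exposes a confidence score for each candidate LPP and uses mean reciprocal rank as the measure; both the metric choice and the scores shown are illustrative assumptions, not results or design decisions from this work.

```python
def rank_candidates(scored):
    """Sort candidate LPPs by descending classifier confidence.

    scored: list of ((lemma, paradigm), confidence) pairs for one word-form.
    """
    return [lpp for lpp, _ in sorted(scored, key=lambda x: x[1], reverse=True)]


def reciprocal_rank(ranked, correct):
    """1/rank of the correct LPP, or 0.0 if it is not among the candidates."""
    for rank, lpp in enumerate(ranked, start=1):
        if lpp == correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(items):
    """items: list of (scored_candidates, correct_lpp), one per word-form."""
    if not items:
        return 0.0
    return sum(reciprocal_rank(rank_candidates(s), c) for s, c in items) / len(items)


# Hypothetical confidences for two word-forms; lemmas and scores are made up.
items = [
    ([(("vojnik", "N01"), 0.9), (("vojnika", "N04"), 0.4)], ("vojnik", "N01")),
    ([(("vojnik", "N01"), 0.3), (("vojnika", "N04"), 0.6)], ("vojnik", "N01")),
]
mrr = mean_reciprocal_rank(items)
```

A value of 1.0 would mean the expert always finds the correct LPP at the top of the list; lower values indicate how far down the list the expert has to look on average.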
Other evaluation scenarios. There are a couple of other evaluation scenarios that may be considered. The first is evaluation in the context of rule-based tagging (e.g., constraint-grammar-based tagging, as described by Peradin and Šnajder (2012)), in which the goal is to disambiguate ambiguous morphosyntactic tags rather than ambiguous paradigms (the former is probably an easier task in most cases). Related to this is a setting in which corpus-based information is not available (e.g., on-the-fly tagging), and one must choose the correct paradigm using only string-based and possibly context-based features. Yet another interesting evaluation scenario is the acquisition of inflectional lexicons from a list of lemmas, which is obviously an easier task than the one we addressed here because the level of grammar ambiguity is lower.

C O N C L U S I O N A N D P E R S P E C T I V E S
Being able to determine the inflectional paradigm of an unknown word is important for morphological analysis of highly inflectional languages. We have addressed the problem of paradigm prediction for Croatian words as a binary classification task over the output of a morphology grammar. We defined a number of string- and corpus-based features and trained different SVM models on selected subsets of these features. The highest accuracy (about 92%) was achieved using the complete set of 22 features. Only slightly worse performance can be obtained with a subset of just five features (the paradigm label, two string-based features, and two corpus-based features). Degradation in classification performance on infrequent words is minimal. When evaluated on a per-word basis, the all-features model achieves an F1-score of 53% and an accuracy of 70%, outperforming the frequency-based baseline by a wide margin. The models perform best on adjectives and worst on verbs.
This work provides a basis for further research. Our first priority will be to apply paradigm prediction to semi-automatic lexicon acquisition and carry out a comprehensive task-based evaluation in this setting. From a machine learning perspective, we will consider using additional features, such as part-of-speech tags and capitalization features.
or h);
5. EndsInPals: a binary feature indicating whether the word-form ends in a palatal (lj, nj, ć, đ, č, dž, š, ž, or j);
6. EndsInVelars: a binary feature indicating whether the word-form ends in a velar (k, g, or h);
7. LemmaSuffixProb: the probability $P(s_l \mid p)$ of lemma $l$ having a 3-letter suffix $s_l$ given inflectional paradigm $p$;
8. StemSuffixProb: the probability $P(s_s \mid p)$ of stem $s$ having a 3-letter suffix $s_s$ given inflectional paradigm $p$;
9. StemLength: the number of characters in the stem;
10. NumSyllables: the number of syllables in the stem;
11. OneSyllable: a binary feature indicating whether NumSyllables equals 1.
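Several of the string-based features listed above could be computed along the following lines. This is a minimal sketch under stated assumptions: syllables are approximated by counting vowel letters (ignoring syllabic r), the palatal list is assumed to include đ, the suffix probability is a relative-frequency estimate over a list of (lemma, paradigm) pairs, and the function names and the toy lexicon are hypothetical.

```python
from collections import Counter

# Assumption: the palatal set includes "đ"; digraphs are listed explicitly.
PALATALS = ("lj", "nj", "dž", "ć", "đ", "č", "š", "ž", "j")
VELARS = ("k", "g", "h")
VOWELS = set("aeiou")


def string_features(word_form, stem):
    """Return a dict with a few of the string-based features."""
    # Assumption: one syllable per vowel letter (syllabic r is ignored).
    num_syllables = sum(1 for ch in stem.lower() if ch in VOWELS)
    return {
        "EndsInPals": word_form.lower().endswith(PALATALS),
        "EndsInVelars": word_form.lower().endswith(VELARS),
        "StemLength": len(stem),
        "NumSyllables": num_syllables,
        "OneSyllable": num_syllables == 1,
    }


def suffix_probs(lexicon, n=3):
    """Relative-frequency estimate of P(suffix | paradigm).

    lexicon: list of (lemma, paradigm) pairs.
    Returns a dict mapping (suffix, paradigm) to a probability.
    """
    pair_counts = Counter((lemma[-n:], p) for lemma, p in lexicon)
    paradigm_counts = Counter(p for _, p in lexicon)
    return {(s, p): c / paradigm_counts[p] for (s, p), c in pair_counts.items()}


# Toy lexicon; the paradigm assignments are for illustration only.
toy_lexicon = [("vojnik", "N01"), ("radnik", "N01"), ("stol", "N01")]
probs = suffix_probs(toy_lexicon)
features = string_features("vojnik", "vojnik")
```

The same `suffix_probs` helper can serve both LemmaSuffixProb and StemSuffixProb if it is given lemmas or stems, respectively; unseen suffixes are simply absent from the returned dict, so a real system would need some smoothing or a default value.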

Table 1: Inflectional paradigm of the Croatian noun vojnik (soldier). Stem-internal changes (due to sibilarization and palatalization) are shown in bold.

Table 2: Five most frequent 3-letter stem suffixes for noun paradigms N01 and N04 and adjective paradigm A06 (estimates from a sample of the morphological lexicon from Šnajder et al. (2008)).
Columns: Suffix ($s_s$), $P(s_s \mid N)$; Suffix ($s_s$), $P(s_s \mid N)$; Suffix ($s_s$), $P(s_s \mid A)$

Table 3: Distribution of morphological tags for noun paradigms N01 and N04 in the corpus and the distributions of tags generated by two LPPs for lemma l = vojnik. The bottom row shows the Jensen-Shannon divergence between the two pairs of paradigm and LPP distributions.
In what follows, we use #(w, C) to denote the number of occurrences of word-form w in corpus C. Let T(p) denote the set of morphological tags of inflectional paradigm p. Furthermore, let P(t|p) denote the probability distribution of morphological tag t conditioned on the inflectional paradigm p, and let P(t|l, p) denote the probability of morphological tag t generated by LPP (l, p). We obtain these distributions as maximum likelihood estimates using the LPPs from the inflectional lexicon L and word-form frequencies from corpus C:

$$P(t \mid p) = \frac{\displaystyle\sum_{\substack{(l,p') \in L;\; p' = p \\ (w,t') \in \mathit{wfs}(l,p);\; t' = t}} \#(w, C)}{\displaystyle\sum_{\substack{(l,p') \in L;\; p' = p \\ (w,t') \in \mathit{wfs}(l,p)}} \#(w, C)}$$
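The estimate of P(t|p) and the Jensen-Shannon divergence mentioned in the caption of Table 3 can be sketched as follows. The sketch assumes the lexicon is a list of (lemma, paradigm) pairs, that `wfs` is a callable returning (word-form, tag) pairs, and that corpus counts are held in a dict; the base-2 logarithm, the tag labels, and the toy counts are likewise assumptions made for illustration.

```python
import math
from collections import defaultdict


def tag_distribution(lexicon, wfs, corpus_counts, paradigm):
    """Maximum likelihood estimate of P(t | p) from corpus frequencies.

    lexicon: list of (lemma, paradigm) pairs.
    wfs: callable (lemma, paradigm) -> iterable of (word_form, tag) pairs.
    corpus_counts: dict mapping word_form to #(w, C).
    """
    totals = defaultdict(float)
    for lemma, p in lexicon:
        if p != paradigm:
            continue
        for word_form, tag in wfs(lemma, p):
            totals[tag] += corpus_counts.get(word_form, 0)
    z = sum(totals.values())
    return {t: c / z for t, c in totals.items()} if z else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two tag distributions."""
    tags = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in tags}

    def kl(a):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Toy data: hypothetical tags and counts for two forms of "vojnik".
def toy_wfs(lemma, paradigm):
    return [(lemma, "sg-nom"), (lemma + "a", "sg-gen")]


toy_lexicon = [("vojnik", "N01")]
toy_counts = {"vojnik": 3, "vojnika": 1}
dist = tag_distribution(toy_lexicon, toy_wfs, toy_counts, "N01")
```

With the toy counts, the nominative singular tag receives probability 0.75 and the genitive singular tag 0.25. The distribution P(t|l, p) for a single LPP can be obtained from the same function by passing a one-entry lexicon, and the two distributions can then be compared with `js_divergence`.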

Table 4: Frequency-sorted lists of noun, adjective, and verb paradigms from the test set. Cov.% denotes the proportion of word-form tokens in the corpus covered by each paradigm.

Table 5: Feature selection analysis with univariate filtering (Ranking) and multivariate feature subset selection (FSS).

Table 6: Paradigm classification accuracy (%) for models with different feature subsets, for three different frequency bins of the word-forms.