COMBINING AVAILABLE DATASETS FOR BUILDING NAMED ENTITY RECOGNITION MODELS OF CROATIAN AND SLOVENE

.


I N T R O D U C T I O N
Named entity recognition and classification (NERC), nowadays often called Slovenščina 2.0, 2 (2013) [36] just named entity recognition (NER) is a subtask of the information extraction task.It aims to locate and classify text elements into predefined categories, and is regularly applied on more complex natural language processing problems, using statistical or rule-based models.State-of-the-art systems tend to be open-domain and language independent.This paper presents efforts in creating NER models for Croatian and Slovene language, available for free academic and non-academic use.Besides performing initial experiments on datasets developed from our side, we experiment with multiple recently published datasets for both languages as well.Thereby we receive a clearer picture of the underlying phenomena and manage to publish models of greater robustness and higher accuracy than those built on our previous datasets only.
The tool we are using to build the models is the Stanford Named Entity Recognizer (StanfordNER), nowadays a frequently used tool for NER.It is an implementation of Conditional Random Fields (CRF) sequence models and is available under GNU GPL license and free for academic use (Finkel et al. 2005).Besides many feature extractors that come with this tool, it is designed to work with the clustering method proposed by (Clark 2003) which combines standard distributional similarity with morphological similarity to cover infrequent words for which distributional information alone is unreliable.This paper is structured as follows: in Section 2 we give an overview of related work, in Section 3 we present the datasets used in our research.Section 4 gives the experimental setup, section 5 the results of our initial experiments and section 6 the results of the experiments on additional datasets.We lay out the main conclusions in section 7.

R E L A T E D W O R K
To our knowledge, there has been some effort in developing NER systems for South Slavic languages which were mostly rule-based.New statistic based Slovenščina 2.0, 2 (2013) [37] approaches have recently emerged.
A rule-based system for Croatian described in (Bekavac 2005) uses regular grammars for recognition and classification of names over annotated texts.
The system contains a module for sentence segmentation, lexicon of common words, specialized lists of names and transducers for automatic recognition of certain word forms.
A statistical approach described in (Bošnjak 2007) uses a semi-supervised method based on lists of names and entity extraction system.
For Serbian a rule-based system (Vitas, Pavlović-Lažetić 2008) based on lexical recognition is developed.The authors point out certain differences between English and Serbian language that make the task of building a successful system for Serbian challenging, as well as all the other Slavic languages which require a more thorough preparation of the system due to rich inflection.
None of the presented systems are available for academic usage which hinders researchers investigating tasks that require NER as a preprocessing step.One of the main intentions of our research is to improve this situation.In the process of building a good NER system, features are considered as important as the selection of machine learning algorithms.The aim is to find an optimal set of features that will ensure the highest system accuracy with minimum complexity in classifier building.Several NER approaches use a very large number of features (Mayfield et al. 2003), but the inclusion of additional features after a certain point can even yield worse results.
In this paper we use Stanford NER property files obtained in our previous research (Filipić et al. 2012) along with the findings about best performing settings for Croatian and Slovene which include POS, MSD and distributional similarity features.The only work we are aware of that examines the usage of distributional features in Stanford NER is (Faruqui, Padó 2010).The paper describes the process of building and optimizing NER models for German and Slovenščina 2.0, 2 (2013) [38] by using distributional features F1 is improved for 6% in-domain and 9% outof-domain.
The latest approach in Croatian NER research, called CroNER (Glavaš et al. 2012), is based on supervised learning using CRF.Observed classes include personal names, ethnics, percentages, locations, organizations, dates, and monetary and temporal expressions.The analysis of the Vjesnik corpus containing 310,000 tokens, reaches (exact) F1 measure of 87.42%.The system uses gazetteer-based features for personal names, ethnicity, city, state, street and organization names, and a rich set of lexical and morphological features specific for Croatian.Authors also defined a special named entity type to cover instances of possessive adjectives.According to the authors, two different methods for document-level consistency of NE labels are implemented: postprocessing rules (hard consistency constraint) and a two-stage CRF (soft consistency constraint).Post-processing rules are hand-crafted patterns designed to extract or re-label named entities omitted or misclassified by the CRF model.Two-stage CRF aims to consolidate NE label predictions on document and corpus level by employing a second CRF model that uses features computed from the output of the first CRF model.NER model for Slovene (Štajner et al. 2012), developed using Mallet (McCallum 2002), also implements a CRF supervised learning algorithm.
Research has been made on Slovene SSJ500k 1 corpus annotated with morphosyntactic tags and three named entity classes.It has shown that the inclusion of morphosyntactic tag features benefits named entity extraction.
The system reaches precision of 77% and recall of 76%, having stronger performance on personal and geographical named entities than on other entities, since the class of other entities (everything not being person or location) is very diverse and difficult to predict.

C O R P O R A
Two initial datasets used in the first batch of our experiments, one Croatian (HR) and one Slovene (SL) were built during a student project (Filipić et al. 2012) from data taken from specific Internet domains from the Croatian and Slovene web corpora hrWaC and slWaC (Ljubešić, Erjavec 2011).The Croatian corpus (HR) contains 59,212 tokens taken from four different Internet domains covering two general newspaper portals, nacional.hrand jutarnji.hr,one ICT portal bug.hr and the business news portal poslovni.hr.The data was annotated during a student project in which data diversity was given special emphasis.The Slovene corpus (SL) is at almost two thirds the size of the Croatian one, containing 37,032 tokens from one general news portal rtvslo.si.
While selecting the Slovene data the main goal was to build a usable dataset with limited annotation capacities.
We obviously live in exciting times for natural language processing in both Croatia and Slovenia because after finishing our initial batch of experiments, three additional datasets -two for Croatian and one for Slovene -have emerged, all published under quite permissive licenses.
The additional Croatian datasets include the SETimes and Vjesnik corpora.
SETimes is a newspaper domain corpus consisting of general news articles written in Croatian language, originally extracted from the "Southeast European Times" web portal.It contains 178,982 tokens and has the highest density of named entities in the text.The Vjesnik corpus contains a collection of two main text domains -internal affairs and other text domains, evenly distributed between culture, foreign affairs and other news, lifestyle and sports.The Corpus contains 104,494 tokens.Text collection was performed by using a custom crawler, texts were further processed, i.e., cleaned, sentence split and tokenized by using Apache OpenNLP2 tools and POS/MSD annotated using the CroTag MSD-tagger (Agić et al. 2008).Related [40] experiments with Croatian NER using the Vjesnik corpus are described in (Agić, Bekavac 2013).
The additional resource obtained for Slovene language is a part of the SSJ500k corpus.It is available under the Creative Commons CC-BY-NC-SA license.It is manually lemmatized and morphosyntactically tagged, and in part dependency parsed and annotated for named entities.Named entities are classified into four classes: names of persons, locations and organizations and other entities.In our research we use only the part of the corpus annotated with named entities which is 118,609 tokens in size.The amount of data in all datasets for both languages is given in Tables 1 and 2  To be able to use POS information on unseen Croatian data, we trained a model for the HunPos tagger (Halácsy et al. 2007) from the initial Croatian dataset.We performed a simple test of the resulting model by dividing the dataset into training and test set with a 9:1 ratio.Accuracy obtained on the test set was 95.1%.We publish the tagger trained on all available data along with the NER models and the benchmark datasets.To our knowledge, this is the first freely available part-of-speech tagger for Croatian.4 The expected difference in diversity of the initial datasets can be clearly observed from the amount of annotated named entities for each corpus.First of all, although the SL corpus has 37% less textual material than HR, it has just 6% less named entities showing a higher density of named entities one [42] would expect from a straightforward newspaper dataset.Furthermore, when we look at the type of named entities, we can observe that the SL dataset contains many more person names and slightly more locations while the Croatian dataset contains more organization names and named entities labelled with the miscellaneous category.This data confirms our assumption that the Croatian dataset is much more diverse and will thereby present a harder task for supervised classification in the initial experiments.
The additional Croatian datasets are newspaper datasets and show a nonsurprising high named entity density in the text.The density is even higher than the one in the initial Croatian dataset regardless of the fact that additional datasets do not contain the miscellaneous category.
The additional Slovene dataset is rather named-entity-sparse having half of the initial Slovene dataset density.This should not come as a surprise since the SSJ dataset is a reference corpus and therefore does not contain only newspaper texts as all other datasets do.
We  We built additional test sets once we obtained the additional datasets by adding similar amount of information from each dataset to the joint test set.
The Croatian test set includes 6,730 tokens from the Vjesnik corpus, and 6,736 tokens from the SETimes corpus along with the previously mentioned initial HR test set.Since the MISC category is not present in the additional Croatian datasets, the extended Croatian dataset naturally does not contain that category.Slovene additional test set contains 6,981 tokens from SSJ corpus, along with the initial SL test dataset.The MISC category is retained in that test set since both datasets contain it.
For calculating distributional similarity of tokens from large monolingual corpora, portions of hrWaC and slWaC web corpora were used.For Croatian we built a 100Mw corpus and for Slovene a 50Mw corpus, both containing data from large news portals.

E X P E R I M E N T A L S E T U P
Since different annotation levels on initial Croatian and Slovene datasets were available, in the first batch of experiments we evaluated different settings for each language on the HR and SL corpora.Besides part-of-speech information for both languages, on SL data MSD and lemma information was present as well.
Slovenščina 2.0, 2 (2013) [44] On HR data we experimented with POS information ("POS"), distributional information ("DISTSIM") calculated from 10Mw, 50Mw and 100Mw corpora while on Slovene data we experimented with POS, MSD ("MSD") and lemma ("LEMMA") information and distributional information obtained from 10Mw and 50Mw corpora.Thereby we performed 8 initial experiments on HR data and 11 initial experiments on SL data (we eliminated the experiments varying with availability of lemma information once it proved to be non-informative).
All the experiments were performed on development sets of both datasets via Obtaining additional datasets for both languages at a later point enabled us to perform an additional batch of experiments and re-examine our findings in the initial experiments.We built additional test sets containing left-out information from all datasets as described in the previous section.We performed calculations on the few most promising settings from the initial batch of experiments.The important difference between the two languages in the second batch of experiments is that the Croatian dataset does not contain the miscellaneous category while the Slovene dataset does.
Finally we compared results obtained with different amounts of annotated data on both languages with the best performing settings to identify the gain we can expect from adding more annotated data.

R E S U L T S O F T H E F I R S T B A T C H O F E X P E R I M E N T S
The results obtained by 5-fold cross-validation on both development sets are presented for Croatian in Figure 1 and for Slovene in Figure 2. The results of each cross-validation are averaged by calculating their harmonic mean.
Regarding the statistical significance of the results, we perform a one-tailed paired t-test over pairs of results we find interesting.On Croatian results we can observe already in the second experiment (POS) that basic morphological information in this simple setting improves F1 for 4.5% (p = 0.002).Our third experiment (DISTSIM 10M) shows that using distributional information Slovenščina 2.0, 2 (2013) [46] obtained from a 10 million token corpus improves the result as much as the part-of-speech information with similar significance (p = 0.005).By combining both features we improve our results for 8.5%, more significantly in comparison with using only one of those features (p < 0.001).By calculating distributional information on five and ten times more data we get improvements of 2% and 3% when not using part-of-speech information and 1% and 2% when using part-of-speech information.The differences between neighbouring corpus sizes (10 and 50; 50 and 100) are not statistically significant, but the differences between using 10Mw and 100Mw corpora are (p = 0.007).We see a steady rise in performance as the unlabelled monolingual corpus size increases, motivating us to perform similar calculations on much larger datasets in the future.
The results on Slovene data regarding the categories present in Croatian data are rather similar backing up those findings.There are two types of information in Slovene data we did not have for Croatian -MSD and lemma.
By using MSD and not only POS information the results do improve for additional 1%, but statistically insignificantly (p = 0.21).On the contrary, by adding lemma information to the MSD decreases the result significantly for [47] 5.5% (p = 0.007).One could expect such an outcome since lemmatization performs worst on named entities.Adding more distributional information by moving from a 10Mw to a 50Mw corpus we achieve an improvement of 5% which is even steeper than the one obtained on Croatian data, now highly significant (p < 0.001).This could be explained by the higher simplicity and similarity of this dataset to the monolingual corpus used for distributional similarity calculation, pointing to a conclusion that for datasets of narrower domains additional data sources such as this one give more improvement.We can observe on both datasets that, when using distributional similarity from larger corpora, including additional features like POS or MSD accounts for a lower increase in the results.
When comparing results on Croatian and Slovene datasets one observes right away that the results on Slovene data are much better although the size of the dataset is below half the size.This can be traced back to the fact that the Slovene dataset contains a narrower domain, has a higher vocabulary transfer and a higher amount of named entities like person and location which are considered easier to recognize and classify.On the other hand the resulting Slovenščina 2.0, 2 (2013) [48] Croatian module is expected to be more robust and should perform better on different domains.evaluating five models built on different data on five different evaluation sets.
We chose two settings per dataset for final testing on the left-out test set.The first one uses distributional information, but leaves out the need for morphological annotation of the data while the second one uses both distributional and morphological information.We present the results of precision, recall, F1, true positives and false positives and negatives by category in Table 5.
The number of false negatives shows to be on both datasets and settings higher than the number of false positives with higher percentage of precision than recall as a direct consequence.On Slovene data the best performing categories are PERS, LOC, ORG and then MISC.On Croatian data LOC tends to perform best, ORG and PERS forming a tie and MISC being traditionally the worst category.The somewhat unexpected order of category performance on the Croatian dataset can probably be followed to the wider domain of that dataset.

R E S U L T S O F T H E S E C O N D B A T C H O F E X P E R I M E N T S
In the second batch of experiments we used the secondary test sets consisting of left-out parts of all datasets used for training the models.We experimented with using part of speech and distributional information since these features showed to be most promising in the first batch of experiments.An additional reason not to include full MSD information is the result of an experiment where we assessed the usability of the MSD information on larger datasets such as the SETimes corpus which is partially manually and partially automatically annotated with full morphosyntactic information.We used 5fold cross-validation on the dataset and the differences between using POS and MSD were consistently below 1%.This goes in line with our overall findings in the initial batch of experiments where additional linguistic features were losing importance when increasing the amount of annotated data.
The results for specific datasets are given in Table 6.The distributional Slovenščina 2.0, 2 (2013) [50] information proves to be of greater importance than the part-of-speech information on all datasets.Combining those two does just slightly improve the results when compared to using only distributional information.This is a very usable finding since it enables us to build final models that do not rely on part-of-speech information and thereby do not require such pre-processing.Distributional similarity shows better performance on smaller datasets, with the difference on HR being 9.95%, on the Vjesnik dataset 7.53% and on the largest and densest SETimes dataset just 4.35%.POS features seem to lose on their significance when using bigger datasets as well.
The SSJ corpus, although almost 4 times bigger than the SL corpus, shows just slightly greater performance in recognizing named entities.The reason for this result is the low density of named entities in that dataset showing that newspaper corpora are better reference corpora for annotating and modelling this phenomenon.
The overall lower results on the Slovene datasets can be followed to their smaller size, inclusion of the miscellaneous category and the lower usefulness of the SSJ dataset for the task at hand.
The harmonic mean of all F1 measures on all corpora actually sums up our main findings -POS information has a small positive impact while DIST information has a very significant impact by improving the result for 7 points.
Combining those two does not yield enough improvement to justify the preprocessing step of part-of-speech tagging.Detailed results on the combined corpora are in Table 7. Combined . All corpora were tagged using the IOB2 standard following the CoNLL 2003 annotation guidelines3 where each row represents a token in the text with its linguistic annotation and designated predefined named entity category.IOB2 labels show whether a word is at the beginning (B), inside (I) or outside (O) of a named entity.Both initial datasets (HR and SL) were annotated with the four traditional categories -location (LOC), organization (ORG), person (PERS) and miscellaneous (MISC).The additional Slovene dataset (SSJ) contains the same three categories while the two additional datasets in Croatian (SETimes and Vjesnik) have only the basic three categories annotated -location (LOC), organization (ORG) and person (PERS).Possessive adjectives indicating named entities are additionally annotated in the SETimes, Vjesnik and SSJ datasets as it is the case in the initial HR and SL datasets (Filipić et al. 2012).Basic part-of-speech (first letter of the Multext-East MSD) (Erjavec et al. 2003) on the HR corpus was manually annotated since related work shows that these features are useful for the task.Slovene SL corpus was MSD tagged and lemmatized with the freely available ToTaLe tagger (Erjavec et al. 2005) trained on JOS corpus data (Erjavec et al. 2010).
Figure 3 which depicts the F1 evaluation result as a function of the number of annotated named entities in each dataset.This plot shows the possible impact of adding more data for training the model as well.

Figure 3 :
Figure 3: F1 measure as a function of the number of named entities.OnCroatian datasets we can observe a typical logarithmic behaviour with obvious room for improvement by annotating even more data.The Slovene datasets, when compared to the Croatian ones, are all small and near to each other regarding the amount of annotated named entities and are obviously still in the strong growth phase so building a larger Slovene dataset should be an even higher priority than for Croatian.The slower rise of the Slovene learning curve can be followed to two specificities of those datasets -inclusion of the miscellaneous category and smaller density of named entities providing a smaller amount of positive examples in the training set and leaving more room for errors in the test set.

Table 1 :
Size of the Croatian corpora and the number of annotated named entities.

Table 2 :
Size of the Slovene corpora and the number of annotated named entities.
divided the initial corpora into development and test sets by shuffling documents and producing test sets of similar size for both languages.The decision to build test sets of similar size was guided by the idea of publishing those test sets as benchmark datasets for both languages.For that reason the HR development set contains 53,142 tokens while the SL one contains 29,686 tokens, i.e. 56% of the amount of Croatian data.An additional insight into the features and thereby specificities of the two initial datasets is given by calculating vocabulary transfer between identical portions of development and test sets.The numbers are given in Table3.Vocabulary transfer is calculated as token and type percentage of named entities in the test set being already present in the development set.Two interesting properties can be observed here.First of all, the Slovene vocabulary transfer is higher than the Croatian one pointing at the expected lower content diversity of Slovene data.Secondly, there is almost no between token and type transfer on Croatian data showing that the diversity of named entities is really high.Namely, this points to the fact that almost none of the named entities from the development set present in the test set appears more than once in the Croatian test set which is not the case in Slovene data. difference

Table 3 :
Vocabulary transfer for initial corpora on identical portions of development and test set.

Table 4 :
First 20 elements of sample clusters obtained with Clark's tool on the 100Mw Croatian and 50Mw Slovene corpus.** The Croatian cluster contains exclusively country and city names, and the Slovene cluster contains first names of people of both Slovene and English origin.
5-fold cross-validation that takes into account document borders.By respecting document borders we were trying to keep the vocabulary transfer as low as possible and thereby obtain the most realistic results, i.e., differences between different experimental settings.Distributional similarity was calculated by using Clark's cluster_neyessen tool (Clark 2003) with default settings (numberStates=5, frequencyCutoff=5, iterations= 10).The number of resulting clusters was set on best-performing values in (Faruqui, Padó 2010), i.e., for 10Mw corpora 100 clusters and for 50Mw and 100Mw corpora 400 clusters were built.First twenty elements of example clusters calculated from the Croatian 100Mw and Slovene 50Mw corpora are given in Table 4.The Croatian cluster contains exclusively country and city names in the locative (or dative) case.The Slovene cluster contains first names of people in the nominative case of both Slovene and English origin.These examples show very clearly how the cluster ID can be used as a very informative feature in the supervised training procedure.

Table 6 :
F1 test results on all available corpora based on secondary test sets.