DEFINING COLLOCATION FOR SLOVENIAN LEXICAL RESOURCES

In this paper, we define the notion of collocation for the purpose of its use in machine-readable language resources, which will be used in the creation of electronic dictionaries and language applications for Slovene. Based on theoretical and lexicographically-driven studies, we define collocation as a lexical phenomenon characterised by three key aspects: statistical, syntactic, and semantic. We take lexicographic relevance as a point of departure for defining collocations within the typology of word combinations, as well as for distinguishing them from free combinations. Free combinations are (frequent) syntactically valid word combinations without lexicographic value, and consequently there is no need for the description of their meaning or syntactic role. Next, we distinguish collocations from all multiword lexical units (compounds, phraseological units and lexico-grammatical units) using the lexicographic view that multiword lexical units, whose meaning is not a sum of their parts, require a description of their meaning whereas collocations do not. In the final part, we return to the three aspects of collocation and their role in automatic extraction of collocational information from corpora. The semantic criterion, or dictionary relevance, of extracted collocations has particularly exposed the problem of semantically broad collocates such as certain types of adverbs, adjectives and verbs, and words that feature in different syntactic roles (e.g. pronouns and adjuncts). We discuss a particular issue of collocations related to proper names and the decisions about their inclusion in the dictionary based on the evaluation of lexicographers.

INTRODUCTION
The inclusion of collocations in machine-readable language resources, which are used in the creation of electronic dictionaries and language applications, requires a detailed, yet general enough, definition of the notion of collocation.
It is important that such a definition can be applied in the development of language technologies as well as in language description, in our case in the compilation of the Dictionary of Modern Slovene (Gorjanc et al., 2017). The majority of studies that describe collocation as a lexically relevant phenomenon mention three key aspects: (i) statistical, which defines collocation as a statistically significant combination of two or more words, (ii) syntactic, which expects certain syntactic relations between words, and (iii) semantic, which presupposes that a collocation has a specific communicative role. The latter aspect has made collocations, since their "beginnings" (Firth, 1957; Altenberg, 1991; Sinclair, 1991), a lexical phenomenon that is lexicographically relevant and especially important for non-native speakers of a language (Palmer, 1933).
Considering these established notions of collocations, our paper has two aims.
Firstly, we want to identify characteristics that define collocations as lexically relevant units. By this we mean that collocations are observed as an important part of lexis and worth including into language resources, intended for the creation of dictionaries, language tools and further computer processing (Klemenc et al., 2017). Secondly, we want to define collocations within all types of word combinations, especially in terms of their syntactic and semantic characteristics, which is important when considering their "place" in the dictionary database as well as their description aimed at human users.
The paper is structured as follows. First, the basic notions that describe collocation as a lexically relevant phenomenon are presented. Considering that collocation is a combination of at least two words, it means that we need to consider its relation to all types of word combinations, taking into account the specifics of lexicographic workflow and automatic data extraction from corpora. In Section 3, we describe a typology developed in the compilation of Slovene Lexical Database (Gantar, 2015), which distinguishes between different types of lexicographically relevant multiword units. Next, we present parameters for automatic extraction of collocation candidates from the corpus, and discuss problematic points discovered during the evaluation. Automatically extracted collocation candidates that were deemed as bad or not relevant are divided into four groups according to their nature: problems in corpus annotation, problems related to statistical criteria, problems related to syntactic criteria, and problems related to semantic criteria (or dictionary relevance).
We conclude the paper by discussing steps for improving automatic extraction of collocations from corpora, and offering some solutions for the presentation of collocations as dictionary units.

COLLOCATION AS A LEXICAL PHENOMENON
In the study of collocations, approaches differ depending on how general or narrow the definition of collocation is intended to be, and on the purpose of the definition, for example the inclusion of collocations in a dictionary. Although different approaches focus on different characteristics of collocations according to their purpose (different types of dictionaries, language learning, natural language processing etc.), their definitions of collocation revolve around three criteria: statistical, syntactic and semantic.

Statistical criterion
One of the key characteristics when defining collocation is its statistical value, which must be higher than random, or as Atkins and Rundell (2008, p. 302) state, collocation is "a recurrent combination of words, where one specific lexical item (the 'node') has observable tendency to occur with another (the collocate) with a frequency higher than chance". A great body of research exists on measuring collocation strength or collocativity (e.g., Berry-Rogghe, 1973; Church and Hanks, 1990; Church et al., 1991; Biber, 1993; Manning and Schütze, 1999; Evert, 2004; Gries, 2013). […] Evert (2009), namely that "different association measures will produce entirely different rankings of the collocates" (ibid., p. 1218) and "there is no ideal association measure for all purposes" (ibid., p. 1236).
As will be shown in the next sections, testing of automatic extraction of collocations for dictionary-making purposes has shown that the statistical criterion needs to be combined with semantic and syntactic characteristics of collocations. This is evidenced by findings such as that statistically relevant collocations are usually syntactically more flexible (Gantar et al., 2019) and that collocations containing semantically very general collocates, which are often also very frequent, are semantically less informative and consequently lexicographically less relevant.
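The point that different association measures can rank the same collocates differently can be illustrated with a minimal sketch. It compares pointwise mutual information (Church and Hanks, 1990) with logDice, the measure mentioned in the extraction section of this paper. All counts and collocate names below are invented for illustration and are not taken from any corpus.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information in bits: log2 of observed over expected co-occurrence."""
    return math.log2((f_xy * n) / (f_x * f_y))

def log_dice(f_xy, f_x, f_y):
    """logDice: 14 plus log2 of the Dice coefficient 2*f_xy / (f_x + f_y)."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Hypothetical counts for a single node word (assumptions, not real data).
N = 1_000_000      # corpus size in tokens
NODE_FREQ = 1_000  # frequency of the node
candidates = {
    # collocate: (co-occurrence frequency with the node, frequency of the collocate)
    "rare_collocate": (5, 10),
    "common_collocate": (200, 5_000),
}

def rank(score):
    """Return the collocates ordered from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)

by_pmi = rank(lambda c: pmi(candidates[c][0], NODE_FREQ, candidates[c][1], N))
by_log_dice = rank(lambda c: log_dice(candidates[c][0], NODE_FREQ, candidates[c][1]))

print(by_pmi)       # PMI favours the rare collocate
print(by_log_dice)  # logDice favours the frequent collocate
```

With these invented counts, PMI places the rare collocate first while logDice places the frequent one first, which is one reason a purely statistical criterion is hard to apply in isolation.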

Syntactic criterion
As evident from various definitions (Moon, 1998; Hausmann, 1989; Kilgarriff et al., 2004; Seretan, 2010; Baldwin and Kim, 2010; Fellbaum, 2015), collocations are also defined by the syntactic relations in which they occur, as well as by their internal syntactic relationships. It is worth noting that not all word combinations are possible or syntactically correct, and not all (frequent) syntactically correct word combinations are collocations (see also Section 3.1 on the distinction between collocations and free word combinations). Therefore, when considering syntactic criteria in defining collocation, one must also consider the number of elements and their lexical value (semantic or grammatical word classes [1] versus functional and modificational word classes), and relatedly also the order of elements in the collocation. Namely, the syntactic nature of word combinations allows for element insertion (e.g. *organizirati mizo 'to organize a table' → organizirati okroglo mizo 'to organize a round table') and adaptation to the context with opening valency positions (tekmovalni del 'competition part' → tekmovalni del programa 'competition part of the programme').
[1] The expression grammatical collocation can also be found in the literature (cf. Benson et al., 1986).
As a result, automatic extraction of lexically relevant collocations from the corpus warranted a careful description of syntactic structures (see Section 4 for more).

Semantic criterion
The semantic criterion is the most important criterion for distinguishing collocations from multiword lexical units and is at the same time the most difficult to specify. While statistical and syntactic criteria are more generally accepted, the body of research on collocations uses one of the two basic approaches when considering their lexical characteristics. The first approach sees collocations as a separate type of phraseological units which is partly or completely (semantically and syntactically) fixed and has become established through regular contextual use. This definition includes especially so-called "phraseological" or "strong" collocations which are limited in lexical choice of its components (Halliday, 1966;Cowie, 1981;Sinclair, 1991), and are a relevant part of mental lexicon.
An example of a phraseological collocation, as put forward by Halliday, is the expression strong tea. While the same meaning could be conveyed by the roughly equivalent powerful tea, this expression is considered excessive and awkward by native English speakers. On the other hand, there are approaches that define collocations more broadly, i.e. as word combinations that are not limited or exclusive but rather allow longer (open) lists of collocates (e.g. herbal/camomile/peppermint/sage tea). Atkins and Rundell (2008, p. 167) define collocations as "… salient phrases in corpus citations [that] yet seem to have no idiomatic meaning" and "… a significantly frequent grouping of words whose meaning is quite transparent" (ibid., p. 223).
In general it can thus be said that collocations found in general dictionaries are not treated as lexical units that require an explanation of their meaning. Collocations are included in dictionaries because they typically disambiguate meanings of polysemous words (e.g. king crown; Czech crown; dental crown) or because their widespread use makes them typical of natural language (pitch black, thick fog; but not *thick black). Their use is sometimes not only language-specific but also culture-specific (take a walk). We have thus selected the semantic criterion, or more specifically the lexicographer's decision about the semantic transparency of a word combination and consequently its inclusion among lexical units, as the point of departure of our typology of multiword lexical units. In our typology, presented in the following sections, collocations are excluded from the narrower phraseological framework, which is especially important for their role in the dictionary database.

COLLOCATIONS IN RELATION TO OTHER WORD COMBINATIONS
The fact that a collocation is always a combination of at least two (usually lexical) words requires that we define its relationship to other frequent word combinations (free combinations) that represent certain syntactic combinations but usually do not feature in dictionaries. At the same time, collocations need to be defined in terms of their relationship to different kinds of word combinations that behave like lexical units (i.e. multiword lexical units), and thus require a semantic description or occupy some pragmatic and communicative role (see Figure 1).

Figure 1: Collocations in word combination typology.

Collocations and free combinations
In our dictionary-driven typology, collocations are distinguished from so-called "free" word combinations mainly on the basis of their lexicographic relevance. Certain word combinations can be very frequent but do not disambiguate meanings and contain delexicalised words, and are consequently semantically less informative. For example, free combinations such as in pri tem ('and then'), nisem vedel ('I didn't know'), ta način ('this way') etc. are not considered lexical units. Considering all three aforementioned criteria, we can say that free combinations are, like collocations, often frequent word combinations, but differ from collocations in that they do not have any lexicographic value.
It should be noted that syntactic combinations that exhibit characteristics of free combinations can become lexicographically relevant units if they take on certain connective, modificational or discourse roles in the text. For example, combinations such as glede tega ('about this') or zaradi tega ('because of this') have the role of text connectors, whereas the combination samo malo ('only a little' or 'just a moment') in certain contexts has a special discourse or pragmatic role and can be considered a phraseological unit.

Collocations and multiword lexical units
In defining collocations in relation to multiword lexical units (MLUs), i.e. different multiword units that belong to the lexicon and in a dictionary, our main criterion is that MLUs need to exhibit some degree of idiomatic meaning or behaviour. From the perspective of being considered for dictionary inclusion and description, they need to fulfil the criterion that their "meaning is more than the sum of the parts" (Atkins and Rundell, 2008, p. 167). This semantic criterion is, of course, relative and exclusively lexicographic. The judgement of a lexicographer whether a certain word combination requires its own semantic description or not depends on the type of dictionary and its target user(s) (human or computer).
To be able to distinguish collocations from MLUs and determine their role in the dictionary database, we divided MLUs into three groups (Figure 2).
Phraseological units and compounds require semantic description. The third group consists of different types of lexico-grammatical units, such as light-verb constructions, that represent typical syntactic combinations in known syntactic and semantic roles. These units are not a standard part of dictionaries, but when they are included, they come with certain lexico-grammatical information.
Figure 2: Division of multiword lexical units.

Compounds
Compounds are a type of multiword lexical unit that requires a description in the dictionary, given that their meaning cannot be deduced from the meaning of each component. In other words, their meaning is more than a sum of their parts. The main characteristic that distinguishes compounds from phraseological units in our typology is that they as a whole do not have a metaphorical or expressive meaning; for example topla greda ('greenhouse' or 'green- […] In addition, compounds usually cannot be directly translated into another language, e.g. a direct translation of dnevna soba would be 'day room' rather than the actual translation 'living room'. Similarly, a compound in one language is not necessarily a compound or a multiword unit in another, e.g. stara mama in Slovene corresponds to grandmother in English. We are aware that languages such as German, Dutch and Norwegian are known for the high productivity of compounds written without space delimitation; however, in such cases the formal criterion of single-word vs. multiword structure already acts as the main criterion for distinguishing collocations from compounds. Also, compounds of terminological and semi-terminological nature are multiword lexical units that are of metaphorical origin, but their role is primarily denotative and not expressive, e.g. črna luknja ('black hole') as a space phenomenon. Such compounds can have a metaphorical meaning (among other meanings), which is consequently categorised in our typology under phraseological units.

Phraseological units
Phraseological units are also multiword lexical units with their own meaning.
However, unlike compounds, phraseological units have a metaphorical meaning (also called figurative or connotative meaning). From the communication perspective, this means that when using them, one wants to say something in a more noticeable or expressive manner, differently. Also, in language there is normally a more neutral term with a similar meaning, e.g. to make a mountain out of a molehill and exaggerate. We are therefore talking about phraseology (idiomatics) in its narrowest sense. It is worth pointing out that even within phraseological units we can find different types in terms of their structure and meaning, for example compound-like phraseological units (začarani

Lexico-grammatical units
Another group of word combinations that needs to be distinguished from collocations (and free combinations) are lexico-grammatical units, i.e. frequent multiword units that also contain grammatical and function words. Unlike collocations, the role of lexico-grammatical units in the text is that of sentence or text organisation, which makes them relevant for dictionaries and thus differentiates them from frequent free word combinations. Another characteristic of lexico-grammatical units is that they show statistically significant co-occurrence in certain syntactic relations and are accompanied by predictable syntactic roles in their context. hand -on the other hand').

COLLOCATION AS A DICTIONARY UNIT
So far, we have defined collocation as a lexical phenomenon, i.e. as a string of words which (a) is statistically relevant, (b) has a predefined syntactic structure and (c) needs to be semantically transparent and meaningful. We have also juxtaposed collocations with other word combinations, from free combinations on the one hand to multiword lexical units with their own meaning on the other. We now also need to consider the criterion of dictionary relevance. In this section, we present the statistical, syntactic and semantic criteria used when extracting collocations from a corpus with the aim of including them in a digital dictionary database for Slovene. Furthermore, we outline the parameters for the selection of those extracted collocation candidates that are suitable for inclusion in the Collocations Dictionary of Modern Slovene (Gorjanc et al., 2017).

Automatic extraction of collocation candidates
Automatic extraction of collocations from a corpus was conducted with the aim of creating a large digital dictionary database, with several satellite dictionary databases (Klemenc et al., 2017), including the database of the collocations dictionary. The extraction was done in two stages, with each stage consisting of several extraction-evaluation iterations. The methodological decision was that automatically extracted data would be used for the Collocations Dictionary of Modern Slovene and immediately presented to the users, followed by regular updates of entries after lexicographic analysis.

Statistical parameters
In the first stage of automatic extraction, collocation candidates were extracted from the Gigafida reference corpus for Slovene (Logar et al., 2012), using a sample of 2,500 lemmas from the Slovene Lexical Database. We used grammatical relations in the Sketch Engine tool (Kilgarriff et al., 2004), using the Sketch Grammar for Slovene, written especially with automatic extraction in mind (Krek, 2016). Moreover, good examples for each collocation were extracted using the GDEX tool and the configuration for Slovene (Kosem et al., 2011). The second iteration of the extraction was […] One additional step used in the second iteration was the inclusion of collocations with higher raw frequency. This was done because we found that logDice sometimes gives a low ranking to highly frequent and relevant collocations, which meant that the exported data, while focussing on statistically more relevant collocations, could include an insufficient number of collocations for highly frequent and polysemous words to represent all the senses. Consequently, we performed and merged two extractions (using the same maximum limit of collocations per grammatical relation), one with collocations ranked by logDice, and the second one with collocations ranked by raw frequency. As expected, there was often a significant overlap between the two lists.
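The merging of the two extractions described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (`collocate`, `log_dice`, `freq`), the function name and the sample values are hypothetical, and the actual export pipeline is not documented in this paper.

```python
def merge_extractions(candidates, max_per_relation):
    """Merge the top-N candidates by logDice with the top-N by raw frequency.

    `candidates` is a list of dicts for ONE grammatical relation of one lemma.
    The same limit is applied to both rankings, and duplicates are kept once.
    """
    by_score = sorted(candidates, key=lambda c: c["log_dice"], reverse=True)
    by_freq = sorted(candidates, key=lambda c: c["freq"], reverse=True)

    merged, seen = [], set()
    for cand in by_score[:max_per_relation] + by_freq[:max_per_relation]:
        if cand["collocate"] not in seen:  # overlap between the two lists
            seen.add(cand["collocate"])
            merged.append(cand)
    return merged


# Invented sample data (not taken from Gigafida).
sample = [
    {"collocate": "a", "log_dice": 10.0, "freq": 50},
    {"collocate": "b", "log_dice": 9.0, "freq": 20},
    {"collocate": "c", "log_dice": 6.0, "freq": 900},
    {"collocate": "d", "log_dice": 5.0, "freq": 10},
]
merged = merge_extractions(sample, max_per_relation=2)
print([c["collocate"] for c in merged])
```

In this toy case the frequent collocate `c` would be missed by a logDice-only export with a limit of two, but is recovered through the frequency-ranked list, while `a` appears in both lists and is kept only once.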

Syntactic structures
The first stage of automatic extraction of collocations used grammatical relations defined in the sketch grammar file in the Sketch Engine tool. The grammatical relations included syntactic structures that were identified during lexicographic analysis. Initially, 528 syntactic structures were used, with noun and verb structures being the most common, but syntactic structures with prepositions (and nouns in different cases) are also prevalent (Table 1), as is also the case in collocations dictionaries for other languages. It is noteworthy that in the word sketch, collocates under grammatical relations are listed as individual words and in lemma form. Thus, in a morphologically rich language like Slovene, the collocate and the headword often need to be put in the correct form to adequately reflect their use in a particular grammatical relation. This can be because of gender and/or number agreement of the headword and the collocate (rdeč -> rdeča jagoda; jesenski -> jesensko listje), or because the headword or the collocate needs to be in a certain case (e.g. olupiti jabolko (accusative); črv v jabolku (locative)). Moreover, additional elements (e.g. prepositions, conjunctions) were missing in relations with more than two elements; however, in such cases the third element was always found in the same form. We solved this issue by automatically postprocessing the extracted data: each element of the grammatical relation (headword, collocate, preposition) was automatically assigned its role in the collocation (using different tags) and written in the correct form (e.g. correct gender, case, number).

Semantic criteria
There were no specific semantic criteria set for the automatic extraction of collocations. We could say that the selection of grammatical relations already indirectly determined some semantics, as only lexical word classes (with the exception of prepositions and conjunctions in ternary grammatical relations, i.e. relations containing two lexical words and one function word) were used as collocation components. Also, the verb biti ('be') was excluded as a collocate in nearly all grammatical relations containing verbs. No other criteria were used, as we wanted to induce semantic criteria (and potentially other statistical and syntactic criteria) from the evaluation with the users.

Evaluation
Evaluation of the automatically extracted collocation data comprised three separate studies. The first one was conducted with dictionary users (students, translators etc.) on the initially extracted data for 2,500 lemmas (Krek et al., 2016), which were available online as the Database of the Collocations Dictionary. The focus was more on the interface features (layout of information, clarity etc.), but the study also included questions on the presentation of collocations and on the benefits and shortcomings of automatically extracted data.
The second study was done with lexicographers (and linguists) on the 35,989-lemma dataset, using the Pybossa platform. Lexicographers inspected 17,576 collocations in 143 different grammatical relations for 333 different lemmas (Pori and Kosem, 2018), with at least three lexicographers "voting" on each collocation. They were presented with the grammatical relation, the collocation and one example, and were given various options. The answers were grouped into Yes, No and I don't know; the Yes and No options had suboptions, e.g. Yes had the suboption that the collocation is OK but the form displayed is not, for example when the collocation should have been in plural. The first findings of the study, with focus on grammatical relations containing adverbs, were presented in Pori and Kosem (2018).
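One possible way of turning such votes into a single decision per collocation is a strict-majority rule, sketched below. The paper does not state how the votes were aggregated, so the rule, the label names and the function name are all assumptions made for illustration.

```python
from collections import Counter

def aggregate_votes(votes, min_votes=3):
    """Return the strict-majority label, or a marker when no decision is possible.

    `votes` is a list of labels such as "yes", "no" or "dont_know"
    (hypothetical names). `min_votes` mirrors the requirement that at least
    three evaluators vote on each collocation.
    """
    if len(votes) < min_votes:
        return "insufficient"
    label, count = Counter(votes).most_common(1)[0]
    # Require more than half of all votes, so ties are never resolved arbitrarily.
    if count * 2 > len(votes):
        return label
    return "undecided"


print(aggregate_votes(["yes", "yes", "no"]))
print(aggregate_votes(["yes", "no", "dont_know"]))
```

Keeping an explicit "undecided" outcome, rather than forcing a label, makes it easy to route low-agreement candidates (such as the proper-name cases discussed later) to further lexicographic review.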
The third study by Pori et al. (2020) […]
The findings of all three studies, which point to problems of automatic collocation identification and extraction and are relevant for this paper, can be divided into four interconnected topics:
• shortcomings related to corpus data,
• shortcomings related to syntactic criteria,
• shortcomings related to statistical criteria,
• shortcomings related to dictionary relevance.

Shortcomings related to corpus data
Many errors that occur during automatic extraction of collocations stem from problems in corpus annotation, i.e. lemmatisation (e.g. *piliti alkohol -> piti alkohol) and part-of-speech tagging, e.g. confusion between adjectives and adverbs (*težek do alkohola 'difficult to alcohol' -> težje do alkohola 'more difficult to get alcohol') or between adjectives and nouns (*premagati poljski 'beat Polish' -> premagati poljsko 'beat Poland') that share forms. The first stage of automatic extraction was conducted on the Gigafida corpus, which was automatically tagged using the JOS tagset, with the accuracy of tagging reaching 97.88% at lemma level, and 91.34% at the level of all morphosyntactic tags (Grčar et al., 2012). Also quite problematic for syntactic criteria were errors in the annotation of cases when the forms were the same, e.g. nominative and accusative of inanimate nouns, or genitive singular and nominative plural of feminine nouns.
Collocation identification was also influenced by certain linguistic decisions related to corpus annotation. For example, in hyphenated forms such as sladko-kisla omaka ('sweet-sour sauce'), each part of the hyphenated combination was annotated separately; thus, only collocations such as sladka omaka ('sweet sauce') and kisla omaka ('sour sauce') were extracted. Similarly, nominalised adjectives such as zaposleni ('the employed') were annotated as adjectives and thus not found in grammatical relations containing nouns.

Shortcomings related to syntactic criteria
The problems of corpus annotation also affected syntactic criteria, or better said, the quality of collocational output at different grammatical relations.
The sketch grammar is tagset-based, which means that grammatical relations must be defined via tags rather than e.g. syntactic relations identified by parsers. The aforementioned problems of incorrect case annotation therefore resulted in wrong grammatical relation attribution, e.g. *botrovati alkohol ('causes alcohol'; verb + noun accusative) rather than alkohol botruje ('alcohol causes'; noun nominative + verb). Similarly, adjectives could be incorrectly identified as attributive even when used only predicatively, e.g. *priložena miška ('included mouse') instead of miška je priložena ('mouse is included') or *kriv hormon ('responsible hormone') instead of hormoni so krivi ('hormones are responsible (for)'). Such combinations, while syntactically correct, do not form meaningful collocations, which means that the expected syntactic relation had to be more narrowly defined on the syntactic/tree level.
There were also cases when one grammatical relation was a limited version of another one, often resulting in duplication of collocations. For example, the collocation vulkanskega izvora ('of volcanic origin') was extracted in the grammatical relation adjective genitive + noun genitive; however, the genitive form was also included in the grammatical relation adjective + noun (agreement in all possible cases) as the collocation vulkanski izvor ('volcanic origin'). Yet, such collocations have different syntactic roles, as an attributive or subject/object respectively. Thus, it is important to define grammatical relations more narrowly in such cases.
The evaluation made it clear that certain grammatical relations contained much more noise, i.e. many more bad collocation candidates.
Whereas certain grammatical relations exhibited issues in general, with many different lemmas (e.g. noun + noun genitive), others were problematic only with certain types of lemmas (e.g. inanimate nouns in the grammatical relation verb + noun accusative). Furthermore, certain grammatical relations (e.g. verb + noun genitive) contained such an overwhelming percentage of noise that they were excluded from the collocations dictionary altogether. A problem with identifying good and bad collocations in certain grammatical relations, especially those with errors in case annotation, is that at first glance such collocations look good (e.g. izolirati bakterije 'isolate bacteria' in the relation verb + noun genitive, when it is in fact verb + noun accusative (in plural)); only when considering both their form and the grammatical relation they are found in can one discard them as bad. This is of course more problematic when lay users, who perhaps pay less attention to accompanying grammatical information, are confronted with automatically extracted data.

Shortcomings related to statistical criteria
We have already mentioned problems linked to the selection of the statistical measure for collocation, which led to the additional extraction of collocations ranked by raw frequency. Moreover, the parameters set for extraction had to be adjusted for different groups of lemmas according to their word class, grammatical relation, and corpus frequency. Despite these rather detailed criteria, problems were still observed at both ends of the frequency ranking, i.e. with very frequent and very rare lemmas. For very frequent lemmas, the lists of extracted collocations were often too short, especially in the most common grammatical relations, resulting in non-coverage of certain (still salient) senses of the words. In fact, in such cases, the maximum number of collocations was often the only criterion that came into effect, as the other limits (e.g. minimum collocation frequency) were never reached. A similar problem with omitted collocations was observed with very rare lemmas (i.e. those at the bottom end of our threshold of 400 hits in the corpus), but the reason was different; the problem occurred mainly because of collocation dispersion, i.e. there were many collocations in the grammatical relation belonging to the same semantic type (and representing the same sense), and while their joint frequency was very high, their individual frequency was below the minimum threshold and they were thus not extracted.
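The dispersion problem described above can be made concrete with a small sketch that flags semantic types whose collocates are individually below a minimum frequency but jointly above it. The semantic-type labels, the threshold value and all counts are hypothetical; the paper does not describe such a check as part of the extraction.

```python
from collections import defaultdict

MIN_FREQ = 4  # placeholder threshold, chosen only for this illustration

def dispersed_types(collocate_freq, semantic_type, min_freq=MIN_FREQ):
    """Return {semantic type: joint frequency} for types lost to dispersion.

    A type is flagged when none of its collocates reaches `min_freq`
    individually, although their summed frequency does.
    """
    joint = defaultdict(int)     # summed frequency per semantic type
    survivors = defaultdict(int) # number of collocates that pass on their own
    for collocate, freq in collocate_freq.items():
        sem = semantic_type[collocate]
        joint[sem] += freq
        if freq >= min_freq:
            survivors[sem] += 1
    return {
        sem: total
        for sem, total in joint.items()
        if total >= min_freq and survivors[sem] == 0
    }


# Invented counts for one rare lemma in one grammatical relation.
freqs = {"a": 3, "b": 3, "c": 2, "d": 10, "e": 1}
types = {"a": "fruit", "b": "fruit", "c": "fruit", "d": "tool", "e": "misc"}
print(dispersed_types(freqs, types))
```

Here the three "fruit" collocates would each be dropped by the individual threshold, although together they occur eight times; a check of this kind presupposes that collocates have already been assigned to semantic types, which is itself a non-trivial step.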
Additional issues that have come up during the evaluation were heavily linked to aforementioned errors in corpus annotation, and relatedly, errors in grammatical relation attribution. First and foremost, this includes collocation candidates that were always errors, and pushed down the ranking (and sometimes off the list of extracted data) other, good, collocations. However, there were also cases when syntactic problems were not absolute, i.e. the collocation was good but its statistics was misleading as the concordances included many incorrectly identified cases, in certain cases to the level where the number of good collocation examples was even below the minimum threshold of 4. For example, čakati nastop 'await a performance' is a good collocation in the verb + noun accusative structure, but examples contained many (incorrect) cases of nastop čaka 'a performance awaits'.
Collocation ranking is also interesting from the perspective of dictionary users. While an association measure seems the logical choice for ordering collocations in a dictionary, as it reflects the nature of collocation, our initial research (Arhar Holdt, in press) has shown that this is not in line with the expectations of users, who clearly prefer (or expect?) frequency. Further evidence that this problem is not trivial is the practice of some dictionaries (see e.g. Hudeček and Mihaljević, 2020) that avoid any mention of statistics and list collocations in alphabetical order (only). In our dictionary of collocations, we used logDice ranking as the default and made an option of switching to alphabetical ranking available to users.
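The difference between the ranking options discussed above can be made concrete with a small sketch. It assumes the commonly cited definition of logDice, 14 + log2(2·f(x,y) / (f(x) + f(y))); the headword frequency, the collocate names, and all counts are invented, and the function names are hypothetical rather than part of any existing tool.

```python
import math

F_HEADWORD = 1000  # invented corpus frequency of the headword


def log_dice(f_xy, f_x, f_y):
    """logDice as commonly defined: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Invented candidates: co-occurrence frequency and collocate frequency.
rows = [
    {"collocate": "b_word", "f_xy": 50, "f_y": 100000},
    {"collocate": "a_word", "f_xy": 20, "f_y": 200},
    {"collocate": "c_word", "f_xy": 30, "f_y": 3000},
]


def ranked(rows, order="logdice", f_headword=F_HEADWORD):
    """Return collocates ordered by logDice, raw frequency, or alphabet."""
    if order == "logdice":
        def key(r):
            return -log_dice(r["f_xy"], f_headword, r["f_y"])
    elif order == "frequency":
        def key(r):
            return -r["f_xy"]
    elif order == "alphabetical":
        def key(r):
            return r["collocate"]
    else:
        raise ValueError(f"unknown order: {order}")
    return [r["collocate"] for r in sorted(rows, key=key)]


for order in ("logdice", "frequency", "alphabetical"):
    print(order, ranked(rows, order))
```

With these invented counts the three orderings all differ: the most frequent collocate is ranked last by logDice because it is very frequent in the corpus overall, which is the kind of mismatch with user expectations noted above.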

Shortcomings related to dictionary relevance
The evaluation of automatically extracted collocational data from the perspective of dictionary relevance was conducted manually and with the aim of identifying criteria for the selection of collocations for our database, and for the presentation in the dictionary interface. We focussed mainly on determining the informative value of collocations (strong vs. weak collocations), the in- While these weak collocations were not considered relevant for the inclusion in the dictionary, they were still kept in the database because they met statistical and syntactic criteria and might be relevant for some other resource.
In fact, it is important to note that the record of all good (strong and weak) and bad collocation candidates should be kept in the database, and used for comparison in future automatic extractions, so that the duplication of work is avoided.
A very specific issue in terms of the dictionary relevance of collocation candidates was collocations related to proper names, i.e. collocations that are proper names themselves and often reflect something cultural or language-specific (e.g. Vesele Štajerke 'Happy Styrians', which is the name of a band), and collocations with a collocate that is a proper name (e.g. prestolnica Lombardije 'capital of Lombardy'). Such cases are not clear-cut, which was also evident from the level of (dis)agreement among the evaluators; while cases like Vesele Štajerke were seen as irrelevant for the collocations dictionary by all the evaluators, 11 prestolnica Lombardije showed less agreement, as many believed the collocation was relevant because it represented a highly salient and sense-indicative combination prestolnica + country/region. In sum, while there are good arguments for including these types of collocations in dictionaries (see e.g. Hudeček and Mihaljević, 2020), we decided to treat such collocations separately, as multiword named entities in the database.
Statistics is an essential part of collocation, and this goes beyond its constituent parts. A very important aspect of a collocation, not only in its identification but also in its presentation to dictionary users, is its predominant form. Two issues frequently raised during the evaluation were number for nouns and degree for adjectives. The semantic characteristics of several headwords either require or prefer a non-singular form (plural or dual), e.g. *stresti bonbon 'dispense bonbon' instead of stresti bonbone 'dispense bonbons', or finančna težava 'financial trouble' instead of finančne težave 'financial troubles'. Similarly, the typicality of a collocation can be limited to the adjective in a certain form, e.g. the superlative, as in *blizek sorodnik -> najbližji sorodniki 'closest relatives'. 12 All these collocations, if presented in the 'basic form', do not reflect typical use or even appear strange, which means that future extractions should take the predominant form into account. A similar approach is already used in the Sketch Engine word sketches in the form of the longest-commonest match (Kilgarriff et al., 2015); however, the feature still needs improving, as it does not always provide a result or often offers a sequence that is longer than the collocation. 13
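One simple way of taking the predominant form into account can be sketched as follows. This is an illustrative assumption, not the longest-commonest match algorithm and not our extraction procedure: the function name, the 0.5 share threshold, and the counts are invented. The sketch takes the value of one morphological feature (here, the number of the noun) for each corpus occurrence of a collocation candidate and reports the value that dominates, if any.

```python
from collections import Counter


def predominant_value(values, min_share=0.5):
    """Return (value, share) for the most frequent feature value,
    or (None, share) if its share is below min_share."""
    if not values:
        raise ValueError("no occurrences given")
    counts = Counter(values)
    value, freq = counts.most_common(1)[0]
    share = freq / len(values)
    if share >= min_share:
        return value, share
    return None, share


# Invented distribution of noun number over ten occurrences
# of a single collocation candidate.
noun_number = ["plural"] * 7 + ["singular"] * 2 + ["dual"] * 1

value, share = predominant_value(noun_number)
print(value, share)
```

A candidate for which a non-singular value dominates, as in this toy distribution, would then be presented in that form rather than in the 'basic form'. The same function could be applied to adjective degree.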

CONCLUSIONS
Collocations are a highly relevant type of word combination, defined by three types of criteria: statistical, syntactic and semantic. As shown in the paper, all three types are closely interlinked, and each brings different decisions and problems. Equally important for any dictionary project is defining collocations in relation to other word combinations, i.e. free combinations and multiword lexical units; as we pointed out, free combinations do not have any lexicographic value, whereas multiword lexical units do, but they also require a description, as their meaning is more than the sum of their parts. Knowing the typology in detail makes it easier to decide which category a candidate word combination belongs to.
Yet, as our evaluation of automatically extracted collocational data has shown, the practical application of a theoretical framework brings new challenges, associated with the quality of corpus annotation, the purpose of the dictionary, and the expectations and needs of dictionary users. The challenges are mainly twofold, with the common theme being the number of collocations. Firstly, there is the need to separate the wheat from the chaff, i.e. the good collocation candidates from the bad ones, which are caused by problems in corpus annotation or by the identification of collocations on the basis of part-of-speech tags. Secondly, there is the question of dictionary relevance, which cannot be decided by statistical measures for collocation identification (alone) but is mainly a semantic decision, driven by the target users of the dictionary.
What our experience has shown is that collocation is defined by statistical, syntactic, and semantic criteria; however, these criteria are not set in stone and cannot be generalized across the language (i.e. they cannot be the same for different types of words). Constant evaluation and improvement of the criteria is required. Slovenian, as a morphologically rich language, is particularly problematic as far as the syntactic criteria are concerned, and our efforts to improve the quality of automatic collocation identification are currently focused mainly on this area. Thus, we are testing the extraction of collocations from a parsed corpus, using 76 collocational structures that have been 'translated' from the definitions of grammatical relations for a part-of-speech tagged corpus. Initial results are promising, and this approach appears to solve a few existing problems (e.g. collocation form in terms of case and number as well as typicality, and the number of bad candidates), but is likely to require some fine-tuning.
We are not neglecting the statistical and semantic aspects, though. On the statistical level, we are exploring measures such as deltaP (Gries, 2013) to determine the symmetry of collocations, i.e. to establish which collocations are relevant for only one of their constituent parts. On the semantic level, we want to explore the characteristics of weak collocates and prepare stop lists, probably for different groups of lemmas. Most importantly, we are including all these activities in our efforts to compile a common digital database for Slovene, where collocations and all other word combinations will be available to the research community and creators of language resources.
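The idea of collocation symmetry mentioned above can be illustrated with a minimal sketch of deltaP computed from a 2×2 contingency table. The formulas follow the standard definition of the measure; the function name and the counts are invented for illustration and do not come from our data.

```python
def delta_p(a, b, c, d):
    """Directional association from a 2x2 contingency table.

    a: w1 with w2        b: w1 without w2
    c: w2 without w1     d: neither word

    Returns (deltaP(w2|w1), deltaP(w1|w2)).
    """
    forward = a / (a + b) - c / (c + d)   # how well w1 predicts w2
    backward = a / (a + c) - b / (b + d)  # how well w2 predicts w1
    return forward, backward


# Invented counts: w1 is rare and almost always occurs with w2,
# while w2 is frequent and occurs with many other words.
forward, backward = delta_p(a=80, b=20, c=920, d=98980)
print(round(forward, 3), round(backward, 3))
```

With these invented counts the first value is high and the second is close to zero, i.e. the combination would be relevant mainly under the first word as headword. A symmetric collocation would yield two similar values.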