Defining collocation for Slovenian lexical resources

  • Iztok Kosem, University of Ljubljana, Faculty of Arts, Slovenia; Jožef Stefan Institute, Ljubljana, Slovenia
  • Simon Krek, Jožef Stefan Institute, Ljubljana, Slovenia
  • Polona Gantar, University of Ljubljana, Faculty of Arts, Slovenia
Keywords: collocation, multiword lexical unit, word combination, Slovene, lexicography, dictionary database

Abstract

In this paper, we define the notion of collocation for use in machine-readable language resources that will serve in the creation of electronic dictionaries and language applications for Slovene. Based on theoretical and lexicographically driven studies, we define collocation as a lexical phenomenon characterized by three key aspects: statistical, syntactic, and semantic. We take lexicographic relevance as the point of departure for defining collocations within the typology of word combinations, as well as for distinguishing them from free combinations. Free combinations are (frequent) syntactically valid word combinations without lexicographic value; consequently, there is no need to describe their meaning or syntactic role. Next, we distinguish collocations from all multiword lexical units (compounds, phraseological units and lexico-grammatical units) using the lexicographic view that multiword lexical units, whose meaning is not the sum of their parts, require a description of their meaning, whereas collocations do not. In the final part, we return to the three aspects of collocation and their role in the automatic extraction of collocational information from corpora. The semantic criterion, or dictionary relevance of extracted collocations, has particularly exposed the problem of semantically broad collocates, such as certain types of adverbs, adjectives and verbs, and of words which feature in different syntactic roles (e.g. pronouns and adjuncts). We also discuss the particular issue of collocations related to proper names and the decisions about their inclusion in the dictionary, based on the evaluation of lexicographers.

Published
2020-08-10
Supporting Agencies
Slovenian Research Agency, Horizon 2020
How to Cite
Kosem, I., Krek, S., & Gantar, P. (2020). Defining collocation for Slovenian lexical resources. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 8(2), 1–27. https://doi.org/10.4312/slo2.0.2020.2.1-27