Size of corpora and collocations: The case of Russian
Keywords: collocations, Russian corpora, corpus size, corpus linguistics, statistical measures
With the arrival of information technologies in linguistics, compiling a large corpus, and a corpus of web texts in particular, has become a largely technical matter. These new opportunities have revived the question of corpus size, which can be formulated as follows: are larger corpora better for linguistic research or, more precisely, do lexicographers need to analyze larger sets of collocations? The paper reports experiments on collocation identification for low-frequency lexis using corpora of different sizes (1 million, 10 million, 100 million and 1.2 billion words). We selected low-frequency adjectives, nouns and verbs from the Russian Frequency Dictionary and tested the following hypotheses: 1) collocations of low-frequency lexis are better represented in larger corpora; 2) frequent collocations listed in dictionaries have few occurrences in small corpora; 3) statistical measures for collocation extraction behave differently in corpora of different sizes. The results indicate that corpora of under 100 million words are not representative enough for studying collocations, especially those with nouns and verbs. MI and Dice tend to extract less reliable collocations as corpus size increases, whereas t-score and Fisher’s exact test perform better on larger corpora.
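For readers unfamiliar with the four measures named above, the following is a minimal sketch of how they are commonly computed from co-occurrence counts. It uses the standard textbook definitions, which may differ in detail from the variants applied in the paper; the function names, argument names and example counts are illustrative assumptions, not taken from the study.

```python
import math


def association_scores(o11, f1, f2, n):
    """Common association measures for a word pair (assumed textbook forms).

    o11: observed co-occurrence frequency of the pair
    f1, f2: corpus frequencies of the two words
    n: corpus size (or total number of pairs)
    """
    expected = f1 * f2 / n  # co-occurrences expected under independence
    return {
        "MI": math.log2(o11 / expected),
        "t-score": (o11 - expected) / math.sqrt(o11),
        "Dice": 2 * o11 / (f1 + f2),
    }


def fisher_right_tail(o11, f1, f2, n):
    """One-sided Fisher's exact test: probability of at least o11
    co-occurrences under the hypergeometric null hypothesis.
    Exact integer arithmetic; practical only for modest counts."""
    total = math.comb(n, f2)
    tail = sum(
        math.comb(f1, k) * math.comb(n - f1, f2 - k)
        for k in range(o11, min(f1, f2) + 1)
    )
    return tail / total


# Hypothetical counts, chosen only to illustrate the calls.
scores = association_scores(o11=8, f1=16, f2=32, n=4096)
p_value = fisher_right_tail(o11=8, f1=16, f2=32, n=4096)
```

A smaller p-value from `fisher_right_tail` indicates stronger evidence of association, while for the other three measures larger values indicate stronger association.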
Lyashevskaya, O., & Sharoff, S. (2009). The Frequency Dictionary of Modern Russian based on the Russian National Corpus data [Chastotnyy slovar’ sovremennogo russkogo yazyka (na materialakh Natsional'nogo Korpusa Russkogo Yazyka)]. Moscow: Azbukovnik.
Macmillan English Dictionary for Advanced Learners. (2002). Macmillan Education.
Steinfeld, E. (1963). Frequency dictionary of the Contemporary Russian language [Chastotnyy slovar' sovremennogo russkogo literaturnogo yazyka]. Tallinn.
The British National Corpus, Version 3 (BNC XML Edition). (2007). Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium. Retrieved from http://www.natcorp.ox.ac.uk/ (1. 5. 2020)
The Russian National Corpus [Natsional’nyy korpus russkogo yazyka]. Retrieved from http://www.ruscorpora.ru (1. 5. 2020)
The Brown Corpus. Retrieved from http://korpus.uib.no/icame/manuals/brown/index.htm, https://www.sketchengine.eu/brown-corpus/ (1. 5. 2020)
Zasorina, L. (1977). Frequency dictionary of the Russian language [Chastotnyy slovar' russkogo yazyka]. Moscow: Russkiy yazyk.
Benko, V. (2014). Aranea: Yet Another Family of (Comparable) Web Corpora. Text, Speech and Dialogue. Proceedings of the 17th International Conference, TSD 2014, 8–12 September, 2014, Brno, Czech Republic. LNCS 8655 (pp. 257–264). Springer International Publishing Switzerland.
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.
Daudaravičius, V. (2010). The influence of collocation segmentation and top 10 items to keyword assignment performance. Computational Linguistics and Intelligent Text Processing. Proceedings of the 11th International Conference, CICLing 2010, 21–27 March, 2010, Iasi, Romania (pp. 648–660). Berlin: Springer.
Evert, S. (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut für maschinelle Sprachverarbeitung, University of Stuttgart. Available at http://purl.org/stefan.evert/PUB/Evert2004phd.pdf (20. 2. 2020)
Evert, S., Uhrig, P., Bartsch, S., & Proisl, T. (2017). E-VIEW-alation – a large-scale evaluation study of association measures for collocation identification. In I. Kosem et al. (Eds.), Electronic lexicography in the 21st century: Lexicography from Scratch. Proceedings of the eLex 2017 Conference, 19–21 September, 2017, Leiden, Netherlands (pp. 531–549). Leiden: Lexical Computing.
Khokhlova, M. (2010). Building Russian Word Sketches as Models of Phrases. In A. Dykstra & T. Schoonheim (Eds.), Proceedings of the XIV EURALEX International Congress, 6–10 July, 2010, Leeuwarden (pp. 364–371). Ljouwert: Fryske Akademy – Afûk.
Khokhlova, M. (2017). Big data and word frequency: Measuring the consistency of Russian corpora. Quantitative Approaches to the Russian Language (pp. 30–48). Routledge, Taylor & Francis.
Khokhlova, M. (2018a). Building a Gold Standard for a Russian Collocations Database. In J. Čibej et al. (Eds.), Lexicography in Global Contexts. Proceedings of the XVIII EURALEX International Congress (pp. 863–869). Ljubljana: Ljubljana University Press, Faculty of Arts.
Khokhlova, M. (2018b). Similarity between the Association Measures: A Case Study of Noun Phrases. In Proceedings of the 12th Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2018 (pp. 21–27). Brno: Tribun EU.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1, 7–36.
Pecina, P. (2009). Lexical Association Measures: Collocation Extraction. Prague: Institute of Formal and Applied Linguistics.
Piotrowski, R. G., Bektaev, K. B., & Piotrowskaya, A. A. (1977). Mathematical Linguistics [Matematicheskaya lingvistika]. Moscow: Vysshaya shkola.
Piperski, A. (2015). To be or not to be: Corpora as Indicators of (Non-)Existence. Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”, 1(14), 515–522.
Rychly, P. (2008). A lexicographer-friendly association score. Proceedings of the Second Workshop on Recent Advances in Slavonic Natural Language Processing RASLAN 2008 (pp. 6–9). Brno: Masaryk University.
Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing. Manchester, UK.
Sinclair, J. (2005). Corpus and Text — Basic Principles. In M. Wynne (Ed.), Developing Linguistic Corpora: a Guide to Good Practice (pp. 1–16). Oxford: Oxbow Books. Retrieved from http://users.ox.ac.uk/~martinw/dlc/chapter1.htm (1. 5. 2020)
Copyright (c) 2020 Maria Khokhlova, Vladimir Benko
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.