Updating the dictionary: Semantic change identification based on change in bigrams over time
Keywords: corpus statistics, bigrams, dictionary update, semantic change, Danish
We investigate a method for systematically updating a Danish monolingual dictionary with new semantic information on lemmas it already includes. The method rests on the hypothesis that variation in bigrams over time in a corpus might indicate a change in the meaning of one of the words, and it combines corpus statistics with manual annotation. The first step consists in measuring the collocational change in a homogeneous newswire corpus with texts from a 14-year span, 2005 through 2018, by calculating all the statistically significant bigrams. These are then applied to a new version of the corpus that is split into one sub-corpus per year. We then collect all the bigrams that do not appear at all in the first three years, but appear at least 20 times in the following 11 years. The output, a dataset of 745 bigrams considered to be potentially new in Danish, is double-annotated, and depending on the annotations and the inter-annotator agreement, the bigrams are either discarded or divided into groups of relevant data for further investigation. We then carry out a more thorough lexicographical study of the bigrams in order to determine the degree to which they support the identification of new senses and lead to revised sense inventories for at least one of the words. Furthermore, we study the relation between the revisions carried out, the annotation values and the degree of inter-annotator agreement. Finally, we compare the resulting updates of the dictionary with Cook et al. (2013), and discuss whether the method might lead to a more consistent way of revising and updating the dictionary in the future.
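The frequency filter described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the data layout (one dictionary of bigram counts per year) and the toy bigrams are hypothetical, and the threshold of 20 is read here as a total across the eleven later years, which is an assumption. The preceding step of selecting statistically significant bigrams is not shown.

```python
from collections import Counter

# Year ranges taken from the abstract: 2005-2007 as the baseline,
# 2008-2018 as the following eleven years.
BASELINE_YEARS = range(2005, 2008)
LATER_YEARS = range(2008, 2019)
MIN_LATER_COUNT = 20  # assumed to be a total over all later years


def candidate_new_bigrams(counts_by_year,
                          baseline_years=BASELINE_YEARS,
                          later_years=LATER_YEARS,
                          min_later_count=MIN_LATER_COUNT):
    """Return bigrams absent from the baseline years but frequent later.

    counts_by_year maps a year to a dict of {bigram: frequency}
    (a hypothetical layout chosen for this sketch).
    """
    baseline = Counter()
    for year in baseline_years:
        baseline.update(counts_by_year.get(year, {}))

    later = Counter()
    for year in later_years:
        later.update(counts_by_year.get(year, {}))

    # Keep a bigram only if it never occurs in the baseline years and
    # reaches the minimum total frequency in the later years.
    return sorted(
        bigram for bigram, count in later.items()
        if count >= min_later_count and baseline[bigram] == 0
    )


# Invented toy counts, for illustration only.
toy_counts = {
    2005: {("old", "phrase"): 30},
    2006: {("old", "phrase"): 25, ("rare", "early"): 1},
    2010: {("new", "phrase"): 12, ("old", "phrase"): 40, ("rare", "early"): 30},
    2015: {("new", "phrase"): 9, ("sparse", "phrase"): 5},
}

candidates = candidate_new_bigrams(toy_counts)
print(candidates)
```

In the toy data only one bigram passes: the others either occur in the baseline years or fall short of the threshold. A single occurrence in the baseline years is enough to exclude a bigram, which mirrors the "do not appear at all" condition.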
DDO = Den Danske Ordbog [The Danish Dictionary]. Retrieved from https://ordnet.dk/ddo (17. 2. 2020)
Macmillan = Macmillan English Dictionary. Retrieved from https://www.macmillandictionary.com/ (17. 2. 2020)
Korpus.dsl.dk = Language Technology Resources for Danish. Retrieved from https://korpus.dsl.dk/resources.html
Cook, P., Lau, J. H., Rundell, M., McCarthy, D., & Baldwin, T. (2013). A lexicographic appraisal of an automatic approach for detecting new word-senses. In Electronic lexicography in the 21st century: thinking outside the paper. Proceedings of the eLex 2013 conference (pp. 49–65). Tallinn, Estonia.
Lorentzen, H. (2004). The Danish Dictionary at large: Presentation, Problems and Perspectives. In G. Williams & S. Vessier (Eds.), Proceedings of the 11th EURALEX International Congress (pp. 285–294). Lorient, France.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Advances in neural information processing systems 26. Retrieved from https://arxiv.org/abs/1310.4546
Norling-Christensen, O., & Asmussen, J. (1998). The Corpus of The Danish Dictionary. Lexikos (Afrilex Series) 8, 223–242.
Pollak, S., Gantar, P., & Arhar Holdt, Š. (2019). What’s New on the Internetz? Extraction and Lexical Categorization of Collocations in Computer-Mediated Slovene. International Journal of Lexicography, 32(2), 184–206.
Řehůřek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks (pp. 46–50). Valletta, Malta: University of Malta.
Řehůřek, R. (2020). models.phrases – Phrase (collocation) detection. Retrieved from https://radimrehurek.com/gensim/models/phrases.html (17. 2. 2020)
Tahmasebi, N., Borin, L., & Jatowt, A. (2018). Survey of Computational Approaches to Lexical Semantic Change [Preprint at ArXiv 2018]. Retrieved from https://arxiv.org/abs/1811.06278
Traugott, E. C. (2017). Semantic Change. Oxford Research Encyclopedias [Online publication]. doi: 10.1093/acrefore/9780199384655.013.323
Copyright (c) 2020 Sanni Nimb, Nicolai Hartvig Sørensen, Henrik Lorentzen
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.