Encoding polylexical units with TEI Lex-0: A case study

  • Toma Tasovac, Belgrade Center for Digital Humanities, Serbia
  • Ana Salgado, New University of Lisbon, CLUNL, Portugal
  • Rute Costa, New University of Lisbon, CLUNL, Portugal
Keywords: TEI, lexicography, language resources, polylexical units, interoperability


The modelling and encoding of polylexical units, i.e. recurrent sequences of lexemes that are perceived as independent lexical units, is a topic that has not been covered in sufficient depth by the Guidelines of the Text Encoding Initiative (TEI), a de facto standard for the digital representation of textual resources in the scholarly research community. In this paper, we use the Dictionary of the Portuguese Academy of Sciences as a case study for presenting our ongoing work on encoding polylexical units using TEI Lex-0, an initiative aimed at simplifying and streamlining the encoding of lexical data with TEI in order to improve interoperability. We introduce the notion of macro- and microstructural relevance to differentiate between polylexicals that serve as headwords for their own independent dictionary entries and those that appear inside entries for different headwords. We develop the notion of lexicographic transparency to distinguish between units that are not accompanied by an explicit definition and units that are: the former are encoded as <form>-like constructs, whereas the latter become <entry>-like constructs, which can have further constraints imposed on them (sense numbers, domain labels, grammatical labels etc.). We codify the use of attributes on <gram> to encode different kinds of labels for polylexicals (implicit, explicit and normalised), and conclude that the interoperability of lexical resources would be significantly improved if dictionary encoders had access to an expressive but relatively simple typology of polylexical units.
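
The distinction between <form>-like and <entry>-like constructs can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `encode_polylexical`, the attribute values `polylexical`, `relatedEntry` and `lemma`, and the English sample unit are assumptions chosen for illustration only and should be checked against the TEI Lex-0 schema before use. The sketch applies one rule from the abstract: a unit without an explicit definition becomes a <form>, and a unit with a definition becomes an <entry> containing a <sense>.

```python
import xml.etree.ElementTree as ET


def encode_polylexical(text, definition=None):
    """Build a <form>-like or <entry>-like element for a polylexical unit.

    All attribute values below are illustrative placeholders, not values
    taken from the paper or guaranteed by the TEI Lex-0 schema.
    """
    if definition is None:
        # No explicit definition: encode as a <form>-like construct.
        form = ET.Element("form", {"type": "polylexical"})  # hypothetical value
        ET.SubElement(form, "orth").text = text
        return form

    # Explicit definition present: encode as an <entry>-like construct,
    # which can then carry senses, labels and other constraints.
    entry = ET.Element("entry", {"type": "relatedEntry"})  # hypothetical value
    form = ET.SubElement(entry, "form", {"type": "lemma"})  # hypothetical value
    ET.SubElement(form, "orth").text = text
    sense = ET.SubElement(entry, "sense")
    ET.SubElement(sense, "def").text = definition
    return entry


if __name__ == "__main__":
    # Sample unit invented for illustration; not drawn from the dictionary.
    undefined_unit = encode_polylexical("kick the bucket")
    defined_unit = encode_polylexical("kick the bucket", "to die")
    print(ET.tostring(undefined_unit, encoding="unicode"))
    print(ET.tostring(defined_unit, encoding="unicode"))
```

Keeping the decision in a single function makes the rule explicit: the presence or absence of a definition is the only input that changes the element type, while labels and sense numbers would be added to the <entry>-like result afterwards.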


Dicionário da Língua Portuguesa Contemporânea. (2001). João Malaca Casteleiro (Ed.), 2 vols. Lisboa: Academia das Ciências de Lisboa and Editorial Verbo.

Dictionnaire des Expressions et Locutions. (1993). Alain Rey and Sophie Chantreau (Eds.). Col. Les Usuels. Paris: Éd. Dictionnaires Le Robert.

Grande Dicionário Houaiss da Língua Portuguesa. (2015). Instituto António Houaiss Bloco Gráfico, Lda. Lisboa: Círculo de Leitores.

DARIAH WG = Lexical Resources and the H2020-funded European Lexicographic Infrastructure (ELEXIS). Retrieved from https://github.com/DARIAHERIC/lexicalresources/tree/master/Schemas/TEILex0 (23. 2. 2020)

TEI Consortium (Ed.) = TEI P5: Guidelines for Electronic Text Encoding and Interchange (2019). Version 3.5.0. [Last updated on 29th January 2019, revision 3c0c64ec4.] TEI Consortium. Retrieved from http://www.tei-c.org/Guidelines/P5/ (23. 2. 2020)

Atkins, B. T. S., & Rundell, M. (2008). The Oxford Guide to Practical Lexicography. Oxford: Oxford University Press.

Baldwin, T., & Kim, S. (2010). Multiword Expressions. In N. Indurkhya & F. J. Damerau (Eds.), Handbook of Natural Language Processing (2nd ed., pp. 267–292). Boca Raton, USA: CRC Press.

Bergenholtz, H., & Gouws, R. (2013). A Lexicographical Perspective on the Classification of Multiword Combinations. International Journal of Lexicography, 27(1), 1–24. doi: 10.1093/ijl/ect031

Calzolari, N., Fillmore, C. J., Grishman, R., Ide, N., Lenci, A., MacLeod, C., & Zampolli, A. (2002). Towards Best Practice for Multiword Expressions in Computational Lexicons. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2002) (pp. 1934–1940). Las Palmas, Canary Islands, Spain.

Considine, J. (2014). Academy Dictionaries 1600–1800. Cambridge, New York: Cambridge University Press.

Cowie, A. P. (1994). Phraseology. In R. E. Asher (Ed.), The Encyclopedia of Language and Linguistics (pp. 3168–3171). Oxford, UK: Pergamon.

Cowie, A. P. (Ed.). (1998). Phraseology: Theory, Analysis, and Applications. Oxford: Oxford University Press.

Fellbaum, C. (2016). Treatment of Multi-Word Units. In P. Durkin (Ed.), The Oxford Handbook of Lexicography (pp. 411–424). Oxford: Oxford University Press.

Fontenelle, T. (1997). Turning a Bilingual Dictionary into a Lexical-Semantic Database. Tübingen: Niemeyer.

Gantar, P., Colman, L., Parra Escartín, C., & Martínez Alonso, H. (2018). Multiword Expressions: Between Lexicography and NLP. International Journal of Lexicography, 32(2), 138–162. doi: 10.1093/ijl/ecy012

Hausmann, F. J. (1979). Un Dictionnaire des Collocations Est-Il Possible? Travaux de Linguistique et de Littérature, 17(1), 187–195.

ISO 24613-1 (2019). Language Resource Management — Lexical Markup Framework (LMF) — Part 1: Core Model. Genève: Organisation Internationale de Normalisation.

Jónsson, J. H. (2009). Lemmatisation of Multiword Lexical Units: Motivation and Benefits. In H. Bergenholtz, S. Nielsen & S. Tarp (Eds.), Lexicography at a Crossroads. Dictionaries and Encyclopedias Today, Lexicographical Tools Tomorrow (pp. 165–194). Bern: Peter Lang AG.

Kinable, D. (2015). Reflections on the Concept of a Scholarly Dictionary. Kernerman Dictionary News, 23, 11–2.

Lorentzen, H. (1996). Lemmatization of Multi-word Lexical Units: In Which Entry? In M. Gellerstam et al. (Eds.), Proceedings of the 7th EURALEX International Congress on Lexicography: Part I (pp. 415–421). Göteborg, Sweden: Göteborg University Department of Swedish.

McCrae, J. P., Tiberius, C., Khan, F., Kernerman, A., Declerck, T., Krek, S., Monachini, M., & Ahmadi, S. (2019). The ELEXIS interface for interoperable lexical resources. In I. Kosem, T. Zingano Kuhn, M. Correia, J. P. Ferreira, M. Jansen, I. Pereira, J. Kallas, M. Jakubíček, S. Krek & C. Tiberius (Eds.), Electronic Lexicography in the 21st Century: Smart Lexicography. Proceedings of the eLex 2019 Conference (pp. 417–433). Brno: Lexical Computing CZ, s.r.o. Retrieved from https://elex.link/elex2019/wp-content/uploads/2019/09/eLex_2019_37.pdf

Mel’čuk, I., Arbatchewsky-Jumarie, N., Iordanskaja, L., Mantha, S., & Polguère, A. (1984–1999). Dictionnaire Explicatif et Combinatoire du Français Contemporain. Recherches lexico-sémantiques, IV. Montréal: Les Presses de l’Université de Montréal.

Mel’čuk, I. (1998). Collocations and Lexical Functions. In A. P. Cowie (Ed.), Phraseology: Theory, Analysis, and Applications (pp. 23–54). Oxford: Oxford University Press.

Moon, R. (1998). Fixed Expressions and Idioms in English: A Corpus-Based Approach. Oxford: Clarendon Press.

Romary, L., & Tasovac, T. (2018). TEI Lex-0: A Target Format for TEI-Encoded Dictionaries and Lexical Resources. In Proceedings of the 8th Conference of Japanese Association for Digital Humanities (pp. 274–275). Retrieved from https://tei2018.dhii.asia/AbstractsBook_TEI_0907.pdf

Sailer, M., & Markantonatou, S. (2018). Multiword expressions: Insights from a multilingual perspective (Phraseology and Multiword Expressions): Vol. 1. Berlin: Language Science Press. doi: 10.5281/zenodo.1182583

Salgado, A., Costa, R., Tasovac, T., & Simões, A. (2019a). Improving the Consistency of Usage Labelling in Dictionaries with TEI Lex-0. Lexicography: Journal of ASIALEX, 6(2), 133–156. doi: 10.1007/s40607-019-00061-x

Salgado, A., Costa, R., & Tasovac, T. (2019b). TEI Lex-0 In Action: Improving the Encoding of the Dictionary of the Academia das Ciências de Lisboa. In I. Kosem, T. Zingano Kuhn, M. Correia, J. P. Ferreira, M. Jansen, I. Pereira, J. Kallas, M. Jakubíček, S. Krek & C. Tiberius (Eds.), Electronic Lexicography in the 21st Century: Smart Lexicography. Proceedings of the eLex 2019 Conference, 1–3 October, 2019, Sintra, Portugal (pp. 417–433). Brno: Lexical Computing CZ, s.r.o. Retrieved from https://elex.link/elex2019/wp-content/uploads/2019/09/eLex_2019_23.pdf

Simões, A., Almeida, J. J., & Salgado, A. (2016). Building a Dictionary using XML Technology. In Open Access Series in Informatics (OASIcs). 5th Symposium on Languages, Applications and Technologies (SLATE'16): Vol. 51 (pp. 14:1–14:8). Dagstuhl, Germany: Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

Svensén, B. (2009). A Handbook of Lexicography: The Theory and Practice of Dictionary Making. Cambridge: Cambridge University Press.

Tasovac, T., & Petrović, S. (2015). Multiple Access Paths for Digital Collections of Lexicographic Paper Slips. In I. Kosem, M. Jakubíček, J. Kallas & S. Krek (Eds.), Electronic Lexicography in the 21st Century: Linking Lexical Data in the Digital Age. Proceedings of the eLex 2015 Conference (pp. 384–396). Ljubljana/Brighton: Institute for Applied Slovene Studies and Lexical Computing Ltd. Retrieved from https://elex.link/elex2015/proceedings/eLex_2015_25_Tasovac+Petrovic.pdf

Zgusta, L. (1971). Manual of Lexicography. Prague: Academia; The Hague/Paris: Mouton.

Supporting Agencies
Portuguese national funding agency for science, research and technology, ELEXIS
How to Cite
Tasovac, T., Salgado, A., & Costa, R. (2020). Encoding polylexical units with TEI Lex-0: A case study. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 8(2), 28–57. https://doi.org/10.4312/slo2.0.2020.2.28-57