ENCODING POLYLEXICAL UNITS WITH TEI LEX-0: A CASE STUDY

The modelling and encoding of polylexical units, i.e. recurrent sequences of lexemes that are perceived as independent lexical units, is a topic that has not been covered adequately and in sufficient depth by the Guidelines of the Text Encoding Initiative (TEI), a de facto standard for the digital representation of textual resources in the scholarly research community. In this paper, we use the Dictionary of the Portuguese Academy of Sciences as a case study for presenting our ongoing work on encoding polylexical units using TEI Lex-0, an initiative aimed at simplifying and streamlining the encoding of lexical data with TEI in order to improve interoperability. We introduce the notion of macro- and microstructural relevance to differentiate between polylexicals that serve as headwords for their own independent dictionary entries and those which appear inside entries for different headwords. We develop the notion of lexicographic transparency to distinguish between those units which are not accompanied by an explicit definition and those that are: the former are encoded as <form>-like constructs, whereas the latter become <entry>-like constructs, which can have further constraints imposed on them (sense numbers, domain labels, grammatical labels etc.). We codify the use of attributes on <gram> to encode different kinds of labels for polylexicals (implicit, explicit and normalised), concluding that the interoperability of lexical resources would be significantly improved if dictionary encoders had access to an expressive but relatively simple typology of polylexical units.


INTRODUCTION
A polylexical unit can be defined as a stable and recurrent sequence of lexemes that is perceived as an independent lexical unit by the speakers of a language.
At the same time, scholars have long recognised that polylexical units are essential components of lexical resources (Svensén, 2009; Atkins and Rundell, 2008; Fontenelle, 1997; Hausmann, 1979; Mel'čuk et al., 1984-1999; Zgusta, 1971). When including a polylexical item in a dictionary, lexicographers decide on the degree of its lexical independence based on several criteria from different fields of knowledge, including statistics, semantics, morphosyntax, pragmatics and/or, broadly speaking, culture. This kind of lexicographic judgement, enacted through a particular editorial policy and influenced by the conventions of a given lexicographic tradition, necessarily leads to multiple ways of capturing, classifying and presenting lexicographic knowledge about polylexical units. The lack of a more general agreement within the lexicographic community makes the process of encoding dictionaries particularly challenging.
Unlike corpus linguists who try to describe linguistic evidence as it appears in recorded instances of genuine language use, or practising lexicographers who try to systematise their knowledge about words and their meaning by laying it out in dictionary articles, dictionary encoders work on formally representing the concrete lexicographic content of existing dictionaries. This is an important distinction to keep in mind in the context of what we are trying to achieve in this paper. When, in the rest of this paper, we discuss polylexical units, we will do so from the point of view of lexicographic data modelling, i.e. the process of explicitly marking up the structural hierarchies and the scope of particular textual components appearing in existing dictionary entries in order to convert them to electronic format as part of a lexicographic digitisation workflow (Tasovac and Petrović, 2015).
In other words, our starting point will be polylexical units as stable and recurrent sequences of lexemes that are perceived as independent lexical units by the lexicographers of a given dictionary. Our focus will be on how these linguistic phenomena appear on a printed dictionary page and at which level of the dictionary microstructure. Our main goal will be to explore how these phenomena can be formally described using the recommendations of the Text Encoding Initiative (TEI) in general, and TEI Lex-0 in particular.
The encoding of polylexical units in dictionaries is a topic that has not been covered adequately and in sufficient depth by the TEI, a de facto standard for the digital representation of textual resources in the scholarly research community. We will discuss the challenges and propose some solutions to this problem. We will also argue that a typology of polylexical units for dictionary encoding needs to be relatively general so that dictionary encoders can apply it in a straightforward fashion, especially given both the limited resources which are usually available for this kind of work and data interoperability as a worthy goal to pursue.
The terminology we use in this paper aims to be supra-theoretical, and consequently, as neutral as possible, hence our preference for "polylexical units".
We recognize, nonetheless, that the term "multiword expression" (MWE) is already widely used, including in the LMF standard, ISO 24613-1:2019. In this paper, we will, therefore, proceed as follows: when we refer to the linguistic structure of a lexical unit composed of two or more lexemes, we will use the term polylexical unit. In our discussion of TEI Lex-0, we will allow "MWE" as an attribute value in order to provide better alignment with LMF and because the TEI Lex-0 community has already used this term.
This article is organised as follows: in Section 2, the lexicographic treatment of polylexical units is explored based on the Dictionary of the Portuguese Academy of Sciences (DLPC) as a case study. A TEI Lex-0 representation of polylexical units in DLPC is discussed in Section 3; and, finally, in Section 4, we offer some concluding remarks and some recommendations about the future work needed in this area.

LEXICOGRAPHIC TREATMENT OF POLYLEXICAL UNITS
Dictionaries by design describe systematised knowledge about words and their meanings through typographic conventions that are imbued with meaning and affected by a long tradition: the use of bold typefaces to signal the lemma or headword in a dictionary article; the use of abbreviations (especially in print dictionaries) for grammatical features or usage labels (Salgado et al., 2019a); the numbering of senses and the use of different typefaces for different elements in the hierarchy (definitions, examples, etc.). Experienced dictionary users can become quite proficient at understanding and navigating the structure of the dictionary by interpreting the dictionary's typographic features and the way these features may differ from one dictionary to another.
Still, that kind of understanding, based on both knowledge and experience, is not something which can always be easily formalised.
Two main challenges affect the modelling of polylexical units in dictionaries, both of them related to the typographical constraints of print-based, general-language dictionaries:
1. In most general-language dictionaries, polylexical units do not appear as headwords, i.e., independent lexical units in the dictionary macrostructure, but rather as sub-units within entries that have a monolexical headword; and
2. Polylexical units in dictionaries are not always explicitly labelled as such: they may be typographically singled out, using a particular typeface, but they are not always accompanied by a label which identifies the given unit as a "collocation", an "idiom" or a "proverb".
The position of polylexical units in the dictionary and the benefits of lemmatisation have been discussed before (see Jónsson (2009) and Lorentzen (1996), for instance) but for our purposes, it is essential to note that when we suggest particular encodings of the Dictionary of the Portuguese Academy of Sciences, we will be following the structure and the conventions of that very dictionary.
That means that we will not be trying to flatten the hierarchy or to encode all polylexical units using the same set of tags. We will be encoding them as they appear within the structure imposed by the dictionary itself.
As for the lack of explicit labels for particular types of polylexical units, we will, in the subsequent sections, explain the extent to which the types can be deduced from the entry structure. We will, in the process, also consult the Introduction to the Dictionary, which to some degree explains the structure from the point of view of the dictionary editors.

DLPC as a case study
The Dicionário da Língua Portuguesa Contemporânea (DLPC) is a monolingual Portuguese dictionary published by Academia das Ciências de Lisboa (2001). As such, it is representative of the Academy tradition in European lexicography: large-scale and long-term dictionary projects, initiated and compiled by official national bodies established to record, maintain and promote authoritative accounts of language use (see Considine, 2014). It contains around 70,000 entries and was published in 2001 in two volumes, totalling 3880 pages. The PDF version of the printed dictionary was later converted into XML using a customised version of the P5 schema of the Text Encoding Initiative (TEI), while a custom-built dictionary writing system, using TEI as a data model in the backend, was developed to serve as an editing environment for the new and improved online edition of the dictionary (Simões et al., 2016). In addition, the DLPC is currently being converted to the TEI Lex-0 format for data interoperability purposes (Salgado et al., 2019b).
We selected DLPC as a case study in our ongoing work on developing guidelines for encoding polylexicals in TEI Lex-0 for two reasons: (1) as a monolingual scholarly dictionary of the Portuguese language, DLPC covers a wide range of polylexical units from collocations to strongly lexicalised expressions; and (2) because scholarly dictionaries, with their "pursuit of completeness concerning the entries relevant to subject matters" (see Kinable, 2015) typify detailed lexicographic information and elaborate microstructure, which can more often than not pose challenges in terms of consistent data modelling.
Given the lack of attention paid to the encoding of polylexical units in the TEI Guidelines, we thought it was essential to take a single but complex dictionary as a starting point for our exploration of the topic in this paper. It goes without saying that further comparative work will be needed to validate and improve our recommendations. But the proposed mechanisms for marking up polylexical units in DLPC at different levels of the dictionary microstructure should also be generally applicable to other dictionaries. While dictionaries may differ in terms of their "typographic view", i.e. page layout, column and line breaks, and their "editorial view", i.e. the sequential arrangement of individual tokens along with the use of specific font styles, punctuation and special symbols, they are more easily comparable in terms of their "lexical view", i.e. the underlying structure and the types of information units contained in them. While our focus on DLPC here is, above all, a matter of practicality, we will be using it as a springboard for illustrating broader encoding challenges.
Structurally speaking, we should distinguish two main types of polylexical items: 1. polylexical units which serve as headwords for their own independent dictionary entries; 2. polylexical units which appear inside entries for different headwords.
We will refer to the first category as the macrostructurally relevant polylexical units and the second as the microstructurally relevant polylexical units. The notion of relevance here is local: it refers only to the structure of the given dictionary. In the context of this particular dictionary and, more generally speaking, in the Portuguese orthographic tradition, hyphenation is treated as a mark of lexicalisation and non-compositional meaning, which leads to lexicographic treatment at the entry level. For instance, lugar-comum [commonplace] does not merely connote a common type of place [lugar comum]: the meaning of the hyphenated unit (an ordinary thing, a platitude or a cliché) cannot be obtained from its constituent parts. As such, it is considered, from the point of view of the lexicographer, headword material. Latin phrases, which are used in the Portuguese language, are included in the DLPC macrostructure as entries of their own because they cannot be easily ascribed to particular Portuguese headwords.

Microstructurally relevant polylexical units
The monolexical lemma descalçar [to remove], as shown in Figure 1, has four numbered senses. […] luvas and descalçar as meias. This is an example of lexicographic shorthand, typical of print dictionaries. In the given case, the user is expected to be able to decipher that the verb descalçar, in the given sense (removing something one is wearing), is typically used with objects such as shoes, boots or gloves.
This type of polylexical unit is classified as "co-ocorrente privilegiado" [privileged co-occurrent] in the Introduction to DLPC. The sets separated by the semicolon are described as "semantically and syntactically related blocks". It appears, however, that this rule is not always followed consistently because the two sets we described above are semantically and syntactically indistinguishable: the difference in the gender of the collocate (as botas vs. os sapatos) is of no relevance to the construction of this particular type of polylexical unit.

Lexicographically non-transparent polylexical units
In DLPC, the treatment of lexicographically non-transparent polylexical units follows a minimal entry-like structure in which the polylexical unit itself is set in boldface (similar to a lemma) and accompanied by a definition (or a pointer to a definition under a different entry). These units can themselves be divided into two further categories, based on the position they take up in the entry microstructure: 1. those that are attached to particular senses; and 2. those that appear at the end of the entry, following the description of individual senses.
Take, for instance, the following example (Figure […]).
The dictionary itself defines locução in its grammatical sense as a group of words that work, semantically and syntactically, as a whole, equivalent to a single word ("Grupo de palavras que funcionam, semântica e sintacticamente como um todo, que equivalem a um só vocábulo"). The same sense also includes several different types of expressions: adjectival, adverbial, conjunctive, prepositional and verbal. Rey and Chantreau (1993) underline the difference between lexical and grammatical phrases: "Locution […] est exactement 'manière de dire', manière de former le discours, d'organiser les éléments disponibles de la langue pour produire une forme fonctionnelle. C'est pourquoi on peut parler de 'locutions adverbiales' ou 'prépositives', alors que ces mots grammaticaux complexes ne seraient jamais appelés des 'expression'" (p. VI).
The entry for dura [duration], on the other hand, as shown in Figure 4, has two numbered senses followed by two polylexical units, ser de pouca dura [to be short-lived] and ser sol de pouca dura [lit. to be a sun that does not last, i.e., to be a nine days' wonder], without explicit labelling of the type of units that they are.
In DLPC proper, expressão idiomática has the domain label Linguistics and is defined as an expression that is peculiar to the language, usually because its meaning is not literal. The expressão fraseológica [phraseological expression] is not defined in the dictionary.

REPRESENTING POLYLEXICAL UNITS IN TEI LEX-0
TEI is a de facto standard for the digital encoding of all types of written texts, ranging from standard books and poems to less straightforward documents such as tables, mathematical formulae, cookery recipes or even music notation. It also defines how specific humanities resources, including morphologically annotated monolingual and parallel corpora, should be encoded.
Chapter 9 of the TEI Guidelines focuses specifically on the encoding of dictionaries and other types of lexical resources.
TEI Lex-0 (Romary and Tasovac, 2018) is a newer, stricter subset of TEI, which was launched in 2016 by the DARIAH Working Group on Lexical Resources. The goal of TEI Lex-0 is to establish a baseline encoding and a target format to facilitate the interoperability of heterogeneously encoded lexical resources. TEI Lex-0 should not be thought of as a replacement for the Dictionary Chapter in the TEI Guidelines but rather as a "format that existing TEI dictionaries can be unequivocally transformed to in order to be queried, visualised, or mined uniformly". In the context of the ELEXIS project, TEI Lex-0 has been adopted, together with OntoLex, as one of the baseline formats for the ingestion of existing dictionaries into the ELEXIS infrastructure (McCrae et al., 2019). While TEI Lex-0 is being developed, some of its best-practice recommendations are also changing the recommendations of the TEI Guidelines themselves.

Polylexical units in TEI Guidelines
The Dictionary Chapter of the TEI Guidelines is very sparse when it comes to recommendations for encoding polylexical units. The only mention of the adjective "multi-word" appears in the definition of the element <term>: "contains a single-word, multi-word, or symbolic designation which is regarded as a technical term" but this is not relevant for the encoding of polylexical units in general-purpose dictionaries.
TEI includes an element <colloc> (collocate), which is defined as containing "any sequence of words that co-occur with the headword with significant frequency" but, in a different example, "colloc" is used as an attribute value for the element <usg> (usage). It is precisely this type of ambiguity that TEI Lex-0 is trying to resolve.
[…] for nestable entry-like structures without the need to resort to <re>, a differently named element whose content model would be indistinguishable from <entry> itself. Eventually, the new content model of <entry>, which allows nesting, was adopted by TEI itself.

Encoding macrostructurally relevant polylexical units
In terms of modelling, polylexical units as headwords do not present any particular challenges for TEI Lex-0. Because they function as lemmas in dictionary entries, they need to be encoded with the required @type attribute on <form>. DLPC does not label them explicitly as polylexical, which is why previously, in Salgado et al. (2019b), the authors recommended that this information be encoded as a @type attribute on <entry>. At the time, the goal was to differentiate entries based on their headwords as monolexical, polylexical, affixes and abbreviations. Nevertheless, for lexicographic work with digital lexical resources, it is crucial not only to be able to extract all polylexical units but also to have the possibility to individualize them. That is why we need to go one step further and develop a mechanism for encoding different types of polylexical units.

    <entry xml:lang="pt" xml:id="decreto-lei" type="polylexicalUnit">
      <form type="lemma">
        <orth>decreto-lei</orth>
        <pron>dɨkrεtulˈɐj</pron>
      </form>
      <gramGrp>
        <gram type="mwe" value="composto"/>
        <gram type="pos" norm="NOUN">s.</gram>
        <gram type="gen">m.</gram>
      </gramGrp>
      <!--etc. -->
    </entry>

In Figure 5, the only addition to the encoding suggested in Salgado et al. (2019b) is the inclusion of <gram type="mwe" value="composto"/> to mark up the particular kind of polylexicality, even though this type of entry-level polylexical is not explicitly labelled as such. For a detailed explanation of how one can encode different types of polylexical units, regardless of whether the given dictionary uses explicit labels for them or not, see Section 3.4 in this paper.
The situation with Latin expressions is slightly different because they are explicitly labelled in DLPC as such. See Figure 6. […] we cannot say the same of loc. lat., which combines grammatical and etymological information. Therefore, we recommend that this label be modelled as two different components: an mwe label for loc. lat., which adequately represents the label of the source, and an etym element to explicitly mark up the language of origin.
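The two-component modelling recommended here might look as follows. This is a minimal sketch under stated assumptions, not the encoding used in the DLPC conversion: the headword a priori is a hypothetical example chosen only for illustration, and the use of @norm on <lang> with the value "la" is an assumption about how the language of origin could be made explicit.

```xml
<entry xml:lang="pt" xml:id="a-priori" type="polylexicalUnit">
  <form type="lemma">
    <!-- hypothetical headword, not checked against DLPC -->
    <orth xml:lang="la">a priori</orth>
  </form>
  <gramGrp>
    <!-- component 1: the label of the source, kept as printed -->
    <gram type="mwe" value="loc. lat.">loc. lat.</gram>
  </gramGrp>
  <!-- component 2: language of origin; the attribute and its value are assumptions -->
  <etym>
    <lang norm="la"/>
  </etym>
  <!-- etc. -->
</entry>
```

Keeping the two components apart means that the printed label survives unchanged in <gram>, while the etymological information can be queried independently of it.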

Encoding microstructurally relevant polylexical units
Microstructurally relevant polylexical units will be encoded differently in TEI Lex-0 depending on whether they are lexicographically transparent or not.
Only the non-transparent ones will require full markup within an <entry> construct.

Encoding lexicographically transparent polylexical units
Following from our discussion in Section 2.2.2.1, the TEI Lex-0 encoding of lexicographically transparent polylexical units in DLPC should meet the following requirements: 1. each set of polylexical units should be grouped together to represent the microstructure of the entry adequately; 2. each polylexical unit should be identifiable as such for easy retrieval; 3. the explicit label "+" should be used only where it occurs in the dictionary text, but the implicit positioning of the headword in the given polylexical unit should be marked up as well.
A sense-related non-transparent polylexical unit can be encoded in TEI Lex-0 within an <entry> construct. The type of the polylexical unit is indicated by the <gram> element, which is discussed in greater detail in the following section of this paper.
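The nesting mechanism can be sketched with the entry for dura discussed in Section 2. The sketch is illustrative only: the xml:id values are invented, definitions are left as placeholders rather than reproduced from DLPC, and the generic <gram type="mwe"/> is used because the source does not label these two units.

```xml
<entry xml:lang="pt" xml:id="dura">
  <form type="lemma">
    <orth>dura</orth>
  </form>
  <sense xml:id="dura.s1" n="1">
    <!-- definition of sense 1 as printed in the source -->
    <!-- a sense-related unit would be nested here, as a child of this sense -->
  </sense>
  <sense xml:id="dura.s2" n="2">
    <!-- definition of sense 2 as printed in the source -->
  </sense>
  <!-- units that follow the numbered senses are nested at entry level -->
  <entry xml:id="dura.e1" type="polylexicalUnit">
    <form type="lemma">
      <orth>ser de pouca dura</orth>
    </form>
    <gramGrp>
      <!-- generic marker; a specific value could be added as an implicit label -->
      <gram type="mwe"/>
    </gramGrp>
    <sense xml:id="dura.e1.s1">
      <def><!-- definition as printed in the source --></def>
    </sense>
  </entry>
  <entry xml:id="dura.e2" type="polylexicalUnit">
    <form type="lemma">
      <orth>ser sol de pouca dura</orth>
    </form>
    <gramGrp>
      <gram type="mwe"/>
    </gramGrp>
    <sense xml:id="dura.e2.s1">
      <def><!-- definition as printed in the source --></def>
    </sense>
  </entry>
</entry>
```

Because the nested unit is itself an <entry>, it can carry its own senses, domain labels and grammatical labels without any additional element being introduced.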

Encoding types of polylexical units
We saw above that some polylexical units in DLPC are explicitly labelled as such (for instance, loc. lat. or loc. adv.), but some are not (for instance, hyphenated compounds as headwords, or idiomatic expressions). TEI Lex-0 should provide a consistent but flexible mechanism for labelling types of polylexical units in dictionaries regardless of whether these labels exist explicitly in the dictionary source or not. We propose to encode this information using the existing TEI gramGrp/gram mechanism, in order to have the maximum flexibility to cover these three distinct types of labels:
1. implicit labels, i.e. labels which are not present on the dictionary page and whose value can only be deduced from the typographical properties of the unit or its position in the entry structure (for instance, compounds as headwords in DLPC);
2. explicit labels, i.e. labels which appear on the dictionary page (for instance, loc. adv. in DLPC);
3. normalised labels, i.e. normalised versions of either implicit or explicit labels, which can be used to improve the interoperability of the labels.
The consistent labelling of polylexical units in a dictionary can be achieved by adopting the following principles:
1. Any polylexical unit should be identified by the presence of a generic element-attribute combination: <gram type="mwe"/>. Without any further classification, <gram type="mwe"/> does not tell us anything about the specific type of the polylexical unit.
2. Explicit labels should be encoded as text nodes of <gram>.
3. Implicit labels should be placed in the @value attribute.
4. Normalised values should be placed in the @norm attribute.
In addition to being encoded as text nodes, explicit labels should, for the sake of consistency with implicit labels, also use the @value attribute. This is to avoid situations in which some labels are encoded as text and some as attributes. The consistent use of the @value attribute for both explicit and implicit labels will make it easier to retrieve all labels of a specific type regardless of how they are labelled in the text of the dictionary. Also, it is important to emphasize that the @value and @norm attributes should be kept conceptually distinct: the former should be used as a locally non-ambiguous identifier of both the explicit and implicit labels in a given dictionary; the latter, on the other hand, should be optionally used as a placeholder for a dictionary-independent classification of the local label.
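The principles above can be illustrated with a few <gramGrp> fragments. The first three combine values that appear in this paper (composto, loc. adv.); the @norm values in the last two are placeholders invented for illustration, since no normalised vocabulary has been agreed upon.

```xml
<!-- generic marker only: a polylexical unit of unspecified type -->
<gramGrp>
  <gram type="mwe"/>
</gramGrp>

<!-- implicit label: nothing printed on the page, so no text node -->
<gramGrp>
  <gram type="mwe" value="composto"/>
</gramGrp>

<!-- explicit label: printed text kept as a text node and repeated in @value -->
<gramGrp>
  <gram type="mwe" value="loc. adv.">loc. adv.</gram>
</gramGrp>

<!-- optional normalisation; both @norm values are hypothetical -->
<gramGrp>
  <gram type="mwe" value="composto" norm="compound"/>
</gramGrp>
<gramGrp>
  <gram type="mwe" value="loc. adv." norm="adverbialPhrase">loc. adv.</gram>
</gramGrp>
```

With this layout, an XPath expression such as //gram[@type='mwe']/@value returns the local label of every labelled unit, whether or not the label is printed in the dictionary, and //gram[@type='mwe'][not(@value)] finds units that still carry only the generic marker.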

CONCLUDING REMARKS
Our recommendations for encoding polylexical units using TEI Lex-0 show that TEI Lex-0 is fully capable of consistently marking up polylexical units as constituent parts of the dictionary macro- and microstructure, regardless of whether they appear as headwords in independent entries, or in nested entry-like structures inside entries for monolexical units. The use of nested <entry> elements to encode polylexical units inside dictionary entries is a robust mechanism which can take care of all kinds of lexicographic constraints imposed on the description of polylexical units (polysemy, domain labels, grammatical labels etc.), whereas the combination of the <gram> element and the attributes @type, @value and @norm can be used consistently to encode explicit, implicit and normalised versions of the labels.
In this paper, we focused on the formal representation of polylexical units as they appear on the page of a single dictionary because we wanted to document the process of translating lexicographic and typographic conventions from linear text strings to hierarchical, tree-like structures using the vocabulary and syntactic constraints of TEI Lex-0. While further comparative work will be needed to validate our recommendations on a larger sample, the process we described in this paper and the markup solutions we proposed are sufficiently abstract to serve as a basis for marking up the lexical view of polylexical items in various dictionaries, even though we can expect to see more pronounced differences in their editorial and typographic views. When it comes to designing and applying TEI Lex-0 markup to dictionary entries, the question of whether a dictionary is a paper dictionary, a retrodigitised one or a born-digital resource is of little consequence: what matters is that one can consistently identify, represent and validate all the microstructural elements in a given dictionary entry using a standardised vocabulary.
As we saw in the penultimate section of this paper, the interoperability of encoded lexical resources would be significantly improved if dictionary encoders had access to a typology of polylexical units that was both expressive and straightforward enough to apply when modelling lexical data.
It would be safe to say that very detailed typologies, like the one proposed by Bergenholtz (2013), which includes twenty different types of MWEs, would be challenging to implement in practice. That is why more work on the classification of polylexical items specifically for encoding purposes will be necessary.
One could argue that there is "no hope of finding a single classification or taxonomy of polylexical units that can be used for all purposes" (Sailer, 2018, p. vi), but a comparative study of multiple dictionaries in different languages would bring us one step closer to proposing, discussing and eventually agreeing on a sensible typology that could be used in the context of TEI Lex-0 as a set of attribute values for normalizing local lexicographic classifications. We hope to pursue this line of work in the future.