1. Introduction
TEI Lex-0 is both a technical specification and a set of community-based recommendations for encoding machine-readable dictionaries. It is rooted in the Guidelines of the Text Encoding Initiative (TEI) and delivered as a customization of the TEI schema.
Following the spirit of TEI Analytics, developed in the context of the MONK project (Zillig 2009), TEI Lex-0 aims at establishing a baseline encoding and a target format to facilitate the interoperability of heterogeneously encoded lexical resources. This is important both in the context of building lexical infrastructures as such (Ermolaev and Tasovac 2012) and in the context of developing generic TEI-aware tools such as dictionary viewers and profilers.
For the latest changes, see our revision history.
1.1. The community
Preliminary work for the establishment of TEI Lex-0 started in the Working Group "Retrodigitised Dictionaries" led by Toma Tasovac and Vera Hildenbrandt as part of the COST Action European Network of e-Lexicography (ENeL). Upon the completion of the COST Action, the work on TEI Lex-0 was taken up by the DARIAH Working Group "Lexical Resources". Currently, the work on TEI Lex-0 is also supported by the H2020-funded European Lexicographic Infrastructure (ELEXIS).
1.1.1. DARIAH Working Group
The DARIAH Working Group on Lexical Resources is a self-organized scholarly community working under the auspices of the pan-European Digital Research Infrastructure for Arts and Humanities (DARIAH-EU). The goals of the WG are:
- to explore, assess and recommend standard tools and methods for the creation, application and dissemination of born-digital and retro-digitized lexical resources (dictionaries, lexicons, thesauri, word lists etc.) as well as other, similar kinds of structured data (gazetteers, almanacs, encyclopaedias etc.); and
- to foster, develop and publicize digitally-enabled lexicographic research from a cross-disciplinary and transnational perspective.
The WG focuses on the application and explication of existing standards, both onomasiological (TMF, TBX and SKOS) and semasiological (LMF, TEI, and Ontolex); draws upon the expertise of various DARIAH partners who are active in this field; and collaborates with relevant external projects and associations, such as the European Lexicographic Infrastructure (ELEXIS) and CLARIN, in order to ensure the widest possible reach of the Working Group’s results.
At the same time, the WG pursues a strong research-driven agenda on the diversity of European lexicographic heritage. In addition to investigating pan-European vocabularies and multiple dimensions of lexical borrowing, the working group evaluates current practices and formulates guidelines on data enrichment and mutual linking of existing electronic dictionaries in view of their common European heritage.
WG Chairs

Laurent Romary is Directeur de Recherche at Inria, France (team ALMAnaCH). He received a PhD degree in computational linguistics in 1989 and his Habilitation in 1999. He carries out research on the modelling of semi-structured documents, with a specific emphasis on texts and linguistic resources. He has been active in standardisation activities with ISO, as chair of committee ISO/TC 37/SC 4 (2002-2014) and chair of ISO/TC 37 (2016-), and with the Text Encoding Initiative, as member (2001-2011) and chair (2008-2011) of its Technical Council. He also has a long-standing involvement in open science activities.

Toma Tasovac is Director of the Belgrade Center for Digital Humanities (BCDH) and DARIAH-EU. He was educated at Harvard University, Princeton University and Trinity College Dublin. His areas of interest include lexicography, data modeling, TEI, digital editions and research infrastructures. He previously served as the National Coordinator of DARIAH-RS and Chair of the National Coordinators' Committee at DARIAH-EU. Under Toma's leadership, BCDH has received funding from various national and international granting bodies, including Erasmus Plus and Horizon 2020.
DigiLex Blog
The working group runs a blog called DigiLex: Legacy Dictionaries Reloaded as a platform for sharing tips, raising questions and discussing methods for the creation of lexical resources.
1.1.2. ELEXIS
ELEXIS is an H2020-funded project which proposes to integrate, extend and harmonise national and regional efforts in the field of lexicography, both modern and historical, with the goal of creating a sustainable infrastructure which will (1) enable efficient access to high-quality lexical data in the digital age, and (2) bridge the gap between more advanced and lesser-resourced scholarly communities working on lexicographic resources.
1.1.3. Contributors
- Piotr Banski
- Jack Bowers
- Jesse de Does
- Katrien Depuydt
- Tomaž Erjavec
- Alexander Geyken
- Axel Herold
- Vera Hildenbrandt
- Mohamed Khemakhem
- Boris Lehečka
- Snežana Petrović
- Laurent Romary
- Ana Salgado
- Toma Tasovac
- Andreas Witt
1.1.4. The Rahtz Prize
In recognition of their work on TEI Lex-0, the DARIAH WG Lexical Resources was awarded the 2020 Rahtz Prize for TEI Ingenuity.
Members of the DARIAH Working Group Lexical Resources have made a valuable contribution to the Dictionaries Chapter of the TEI Guidelines. Their efforts and their expertise have been formidable and highly appreciated by the TEI Community for many years. — Martina Scholger, Chair of the TEI Technical Council
1.1.5. Meetings
The Working Group has organized a number of working meetings dedicated to the development of TEI Lex-0. These include:
- Toward Best Practice Guidelines for Encoding Legacy Dictionaries: An ENeL-DARIAH-PARTHENOS Expert Workshop. Preußische Staatsbibliothek, Berlin (17-19 November 2016).
- Overview of Retrodigitized Dictionaries and Best-Practice Guidelines For Encoding Legacy Dictionaries. ENeL Annual Meeting, Budapest (24 February 2017).
- TEI Lex-0 @DARIAH WG "Lexical Resources". Harnack Haus, Freie Universität Berlin (27 April 2017).
- TEI Lex-0 @DARIAH WG "Lexical Resources". Austrian Center for Digital Humanities, Austrian Academy of Sciences, Vienna (26 June 2017).
- TEI Lex-0: From Best-Practice Guidelines to a TEI Schema. DARIAH-EU Coordination Office, Berlin (2-3 May 2018). Funded by DARIAH-EU's Working Groups Funding Scheme and ELEXIS.
- TEI Lex-0 and Beyond: A Workshop. University of Ljubljana (16 July 2018). Funded by DARIAH-EU's Working Group Funding Scheme and ELEXIS.
- TEI Lex-0 Meeting. DARIAH-EU Coordination Office, Berlin (30 January 2019).
- Joint TEI Lex-0 / Ontolex-Lemon Meeting. Collocated with eLex 2019. Sintra, Portugal (4 October 2019). Funded by ELEXIS.
- Toward a TEI Lex-0 Publisher: A Workshop, DARIAH-EU Coordination Office, Berlin (16-17 December 2019). Funded by the Belgrade Center for Digital Humanities.
1.1.6. Training measures
TEI Lex-0 and best practices in lexical data modeling have been introduced to a large number of young scholars at various training events, including:
- Lexical Data Masterclass 2017. Co-organized by DARIAH, the Berlin Brandenburg Academy of Sciences (BBAW), Inria and the Belgrade Center for Digital Humanities, with the support of the German Ministry of Education and Research (BMBF), CLARIN and DARIAH-DE. For an overview, check out this blog post.
- Lexical Data Masterclass 2018. Co-organized by DARIAH, the Berlin Brandenburg Academy of Sciences (BBAW), Inria and the Belgrade Center for Digital Humanities, with the support of the German Ministry of Education and Research (BMBF), French Ministry for Higher Education, Research and Innovation (MESRI), ELEXIS, CLARIN and DARIAH-DE. For an overview, check out From Àbèsàbèsì to XPath on DigiLex.
- From Print to Screen: The Theory and Practice of Digitizing Dictionaries. Lisbon Summer School in Linguistics (2-6 July 2018).
- Encoding Dictionaries with TEI: A Masterclass. Lisbon Summer School in Linguistics (1-5 July 2019).
- DH Training Workshop: Digital Methods for Linguistic Investigation (13-15 November 2019). Organized by the Seminar für Semitistik und Arabistik, Freie Universität Berlin, with the support of the Alexander von Humboldt Foundation and Syncro Soft.
The European Digital Humanities Masterclass 2020 had to be postponed due to the COVID-19 pandemic.
1.2. The Guidelines
To what extent can we achieve consistent encoding within a given community of practice by following the TEI Guidelines? The topic is of particular importance for lexical data if we think of the potential wealth of content we could gain from pooling together the information available in the variety of highly structured, historical and contemporary lexical resources. The encoding possibilities offered by the Dictionaries Chapter in the Guidelines are too numerous and too flexible to guarantee sufficient interoperability and a coherent model for searching, visualising or enriching multiple lexical resources.
TEI Lex-0 should not be thought of as a replacement for the Dictionaries Chapter in the TEI Guidelines or as the format that must necessarily be used for editing or managing individual resources, especially in those projects and/or institutions that already have established workflows based on their own flavors of TEI. TEI Lex-0 should be primarily seen as a format that existing TEI dictionaries can be unequivocally transformed to in order to be queried, visualised, or mined in a uniform way. At the same time, however, there is no reason why TEI Lex-0 could not or should not be used as a best-practice example in educational settings or as a foundation of new TEI-based projects. This is especially true considering the fact that TEI Lex-0 aims to stay as aligned as possible with the TEI subset developed in conjunction with the revision of the ISO LMF (Lexical Markup Framework) standard (cf. Romary 2015).
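The idea of a target format can be illustrated with a minimal sketch. It assumes, for illustration only, that a source dictionary marks part of speech with the TEI element <pos> while the target encoding expresses the same information as <gram type="pos">. The sample entry and the helper names normalize_pos and parts_of_speech are hypothetical and are not part of the TEI Lex-0 specification or its tooling.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical source entry (not taken from a real dictionary) that uses <pos>.
SOURCE_ENTRY = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="cat" xml:lang="en">
  <form type="lemma"><orth>cat</orth></form>
  <gramGrp><pos>noun</pos></gramGrp>
  <sense xml:id="cat.1"><def>a small domesticated feline</def></sense>
</entry>
""".strip()


def normalize_pos(entry):
    """Rewrite every <pos> as <gram type="pos"> in place and return the entry.

    Assumption: the target encoding uses <gram type="pos"> for part of speech.
    """
    # Collect the matches first so that renaming does not interfere with iteration.
    for pos in list(entry.iter(f"{{{TEI_NS}}}pos")):
        pos.tag = f"{{{TEI_NS}}}gram"
        pos.set("type", "pos")
    return entry


def parts_of_speech(entry):
    """Uniform query: works on any entry that has already been normalized."""
    return [
        gram.text
        for gram in entry.iter(f"{{{TEI_NS}}}gram")
        if gram.get("type") == "pos"
    ]


entry = normalize_pos(ET.fromstring(SOURCE_ENTRY))
print(parts_of_speech(entry))
```

Once every source has been normalized in this way, a tool such as a dictionary viewer only needs the single query in parts_of_speech, regardless of how the original resource was encoded.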
1.2.1. How to cite these guidelines
Full citation: Toma Tasovac, Laurent Romary, Piotr Banski, Jack Bowers, Jesse de Does, Katrien Depuydt, Tomaž Erjavec, Alexander Geyken, Axel Herold, Vera Hildenbrandt, Mohamed Khemakhem, Boris Lehečka, Snežana Petrović, Ana Salgado and Andreas Witt. 2018. TEI Lex-0: A baseline encoding for lexicographic data. Version 0.9.5-dev. DARIAH Working Group on Lexical Resources. https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html.
Short citation: Toma Tasovac, Laurent Romary et al. 2018. TEI Lex-0: A baseline encoding for lexicographic data. Version 0.9.5-dev. DARIAH Working Group on Lexical Resources. https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html.
1.2.2. Revision history
Changes to the TEI Lex-0 specification up to version 0.8.6 were included in comments inside the ODD file itself. Starting with version 0.9.0, the changes are summarized in the list below for easier reference.
- docs: Added documentation on encoding condensed forms à la "leleti (sě)".
- spec: Added model.languageProfile to better structure <language> as per #245.
- spec: Added <ruby> annotation support as per #225
- spec: Added <measure> (to be used, for instance, within <extent> in <fileDesc>) as per #257.
- xproc: Added a temporary step to fix xml:base and xml:lang issues in xincluded examples as per #256
- spec: Deprecated gram[@type="government"] in favor of gram[@type="government"] as per #254
- spec: Refactored model classes to fix XSD UPA violations as per #223.
- docs: Minor corrections in the documentation
- xproc: fix documentation build on macOS and Windows in oXygen XML Editor
- spec: added "degree" as <gram> type value
- docs: fixed some typographical errors in the documentation
- spec: <catDesc> must contain a <term>
- spec: switch to using the external TEI add-on in oXygen when generating schema and documentation
- spec: fix the mismatch in <usg> types between the specification and documentation (use "temporal" instead of "time")
- spec: require <listBibl> in <sourceDesc> with three suggested type values: "dictionaries", "corpora" and "literature"
- xproc: switch to using oXygen's TEI framework when generating schema and documentation
- spec: allow <list> and <item> because lists feature prominently in dictionary front matter
- spec: introduce model.lexicalInter (based on model.inter), model.lexicalPhrase (based on model.phrase) and macro.lexicalParaContent (based on macro.paraContent) to make it easier to simplify the content model of various dictionary elements
- spec: remove model.listLike from model.lexicalInter
- html: link version number in the menu to revision history
- spec: allow <abbr> and <expan> so that they can be used in lists of abbreviations in dictionary front matter
- spec: introduced "valency" as a suggested value in gram[@type="valency"]
- spec: introduced gram[@type="government"] and clarified the difference from gram[@type="colloc"]. See sections on Typology of gram and Collocates
- spec: made @type mandatory on <TEI>
- spec: add <principal> and <affiliation> for more robust metadata in the <teiHeader>
- html: fix namespace issues in html output
- docs: add new examples to the Header section
- docs: add section on hierarchical usage labels
- spec: allow <taxonomy>, <category> and <catDesc> in <classDecl>
- docs: move the specification to a different webpage for quicker loading
- docs: add section on TEI Header
- docs: correction of various misspellings
- spec: add <monogr> (needed for <biblStruct>)
- spec: add <forename> and <surname> for more fine-grained bibliographic information
- spec: add <editorialDecl>
- spec: add <email> to make possible contact information in the header
- spec: require <availability> in <publicationStmt> to provide <licence>
- spec: make <sourceDesc> optional
- spec: allow only <biblStruct> in <sourceDesc>
- spec: make model.publicationStmtPart.agency unbound to allow both <publisher> and <authority> in <publicationStmt>
- spec: add role to <language> with suggested values on <authority>: funder, sponsor, rightsHolder
- spec: require <language>, <langUsage> and <profileDesc>
- spec: add role to <language> with a closed list of values: objectLanguage, workingLanguage, sourceLanguage, targetLanguage
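
Several entries in this list describe constraints that can be checked mechanically. As a minimal sketch, the following takes the closed list of role values on <language> named in the last entry and reports any role outside that list. The header fragment, the value "metaLanguage" and the function name invalid_language_roles are made up for illustration; later revisions of the specification may have changed the constraint, and the schema itself remains the authoritative check.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Closed list of role values as given in the revision history entry.
ALLOWED_ROLES = {"objectLanguage", "workingLanguage", "sourceLanguage", "targetLanguage"}

# Hypothetical header fragment; "metaLanguage" is deliberately not in the list.
HEADER_FRAGMENT = """
<profileDesc xmlns="http://www.tei-c.org/ns/1.0">
  <langUsage>
    <language ident="sr" role="objectLanguage">Serbian</language>
    <language ident="en" role="metaLanguage">English</language>
  </langUsage>
</profileDesc>
""".strip()


def invalid_language_roles(profile_desc):
    """Return (ident, role) pairs whose role is present but not in the closed list."""
    problems = []
    for lang in profile_desc.iter(f"{{{TEI_NS}}}language"):
        role = lang.get("role")
        # A missing role is not flagged: the entry does not say it is required.
        if role is not None and role not in ALLOWED_ROLES:
            problems.append((lang.get("ident"), role))
    return problems


problems = invalid_language_roles(ET.fromstring(HEADER_FRAGMENT))
print(problems)
```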


