1. Introduction

TEI Lex-0 is both a technical specification and a set of community-based recommendations for encoding machine-readable dictionaries. It is rooted in the Guidelines of the Text Encoding Initiative (TEI) and delivered as a customization of the TEI schema.

Following the spirit of TEI Analytics, developed in the context of the MONK project (Zillig 2009), TEI Lex-0 aims at establishing a baseline encoding and a target format to facilitate the interoperability of heterogeneously encoded lexical resources. This is important both in the context of building lexical infrastructures as such (Ermolaev and Tasovac 2012) and in the context of developing generic TEI-aware tools such as dictionary viewers and profilers.

For the latest changes, see our revision history.

1.1. The community

Preliminary work for the establishment of TEI Lex-0 started in the Working Group "Retrodigitised Dictionaries" led by Toma Tasovac and Vera Hildenbrandt as part of the COST Action European Network of e-Lexicography (ENeL). Upon the completion of the COST Action, the work on TEI Lex-0 was taken up by the DARIAH Working Group "Lexical Resources". Currently, the work on TEI Lex-0 is also supported by the H2020-funded European Lexicographic Infrastructure (ELEXIS).

1.1.1. DARIAH Working Group

The DARIAH Working Group on Lexical Resources is a self-organized scholarly community working under the auspices of the pan-European Digital Research Infrastructure for Arts and Humanities (DARIAH-EU). The goals of the WG are:

to explore, assess and recommend standard tools and methods for the creation, application and dissemination of born-digital and retro-digitized lexical resources (dictionaries, lexicons, thesauri, word lists etc.) as well as other, similar kinds of structured data (gazetteers, almanacs, encyclopaedias etc.); and
to foster, develop and publicize digitally-enabled lexicographic research from a cross-disciplinary and transnational perspective.

The WG focuses on the application and explication of existing standards, both onomasiological (TMF, TBX and SKOS) and semasiological (LMF, TEI, and Ontolex); draws upon the expertise of various DARIAH partners who are active in this field; and collaborates with relevant external projects and associations, such as the European Lexicographic Infrastructure (ELEXIS) and CLARIN, in order to ensure the widest possible reach of the Working Group’s results.

At the same time, the WG pursues a strong research-driven agenda on the diversity of European lexicographic heritage. In addition to investigating pan-European vocabularies and multiple dimensions of lexical borrowing, the working group evaluates current practices and formulates guidelines on data enrichment and mutual linking of existing electronic dictionaries in view of their common European heritage.

WG Chairs

Laurent Romary is Directeur de Recherche at Inria, France (team ALMAnaCH). He received a PhD degree in computational linguistics in 1989 and his Habilitation in 1999. He carries out research on the modelling of semi-structured documents, with a specific emphasis on texts and linguistic resources. He has been active in standardisation activities with ISO, as chair of committee ISO/TC 37/SC 4 (2002-2014) and chair of ISO/TC 37 (2016-), and with the Text Encoding Initiative, as member (2001-2011) and chair (2008-2011) of its Technical Council. He also has a long-standing involvement in open science activities.

Toma Tasovac is Director of the Belgrade Center for Digital Humanities (BCDH) and DARIAH-EU. He was educated at Harvard University, Princeton University and Trinity College Dublin. His areas of interest include lexicography, data modeling, TEI, digital editions and research infrastructures. He previously served as the National Coordinator of DARIAH-RS and Chair of the National Coordinators' Committee at DARIAH-EU. Under Toma's leadership, BCDH has received funding from various national and international granting bodies, including Erasmus Plus and Horizon 2020.

DigiLex Blog

The working group runs a blog called DigiLex: Legacy Dictionaries Reloaded as a platform for sharing tips, raising questions and discussing methods for the creation of lexical resources.

1.1.2. ELEXIS

ELEXIS is an H2020-funded project which proposes to integrate, extend and harmonise national and regional efforts in the field of lexicography, both modern and historical, with the goal of creating a sustainable infrastructure which will (1) enable efficient access to high-quality lexical data in the digital age, and (2) bridge the gap between more advanced and lesser-resourced scholarly communities working on lexicographic resources.

1.1.3. Contributors

Piotr Banski
Jack Bowers
Jesse de Does
Katrien Depuydt
Tomaž Erjavec
Alexander Geyken
Axel Herold
Vera Hildenbrandt
Mohamed Khemakhem
Boris Lehečka
Snežana Petrović
Laurent Romary
Ana Salgado
Toma Tasovac
Andreas Witt

1.1.4. The Rahtz Prize

In recognition of its work on TEI Lex-0, the DARIAH WG Lexical Resources was awarded the 2020 Rahtz Prize for TEI Ingenuity.

Members of the DARIAH Working Group Lexical Resources have made a valuable contribution to the Dictionaries Chapter of the TEI Guidelines. Their efforts and their expertise have been formidable and highly appreciated by the TEI Community for many years. — Martina Scholger, Chair of the TEI Technical Council

1.1.5. Meetings

The Working Group has organized a number of working meetings dedicated to the development of TEI Lex-0. These include:

Toward Best Practice Guidelines for Encoding Legacy Dictionaries: An ENeL-DARIAH-PARTHENOS Expert Workshop. Preußische Staatsbibliothek, Berlin (17-19 November 2016).
Overview of Retrodigitized Dictionaries and Best-Practice Guidelines For Encoding Legacy Dictionaries. ENeL Annual Meeting, Budapest (24 February 2017).
TEI Lex-0 @DARIAH WG "Lexical Resources". Harnack Haus, Freie Universität Berlin (27 April 2017).
TEI Lex-0 @DARIAH WG "Lexical Resources". Austrian Center for Digital Humanities, Austrian Academy of Sciences, Vienna (26 June 2017).
TEI Lex-0: From Best-Practice Guidelines to a TEI Schema. DARIAH-EU Coordination Office, Berlin (2-3 May 2018). Funded by DARIAH-EU's Working Groups Funding Scheme and ELEXIS.
TEI Lex-0 and Beyond: A Workshop. University of Ljubljana (16 July 2018). Funded by DARIAH-EU's Working Group Funding Scheme and ELEXIS.
TEI Lex-0 Meeting. DARIAH-EU Coordination Office, Berlin (30 January 2019).
Joint TEI Lex-0 / Ontolex-Lemon Meeting. Collocated with eLex 2019. Sintra, Portugal (4 October 2019). Funded by ELEXIS.
Toward a TEI Lex-0 Publisher: A Workshop, DARIAH-EU Coordination Office, Berlin (16-17 December 2019). Funded by the Belgrade Center for Digital Humanities.

1.1.6. Training measures

TEI Lex-0 and best practices in lexical data modeling have been introduced to a large number of young scholars at various training events, including:

Lexical Data Masterclass 2017. Co-organized by DARIAH, the Berlin Brandenburg Academy of Sciences (BBAW), Inria and the Belgrade Center for Digital Humanities, with the support of the German Ministry of Education and Research (BMBF), CLARIN and DARIAH-DE. For an overview, check out this blog post.
Lexical Data Masterclass 2018. Co-organized by DARIAH, the Berlin Brandenburg Academy of Sciences (BBAW), Inria and the Belgrade Center for Digital Humanities, with the support of the German Ministry of Education and Research (BMBF), French Ministry for Higher Education, Research and Innovation (MESRI), ELEXIS, CLARIN and DARIAH-DE. For an overview, check out From Àbèsàbèsì to XPath on DigiLex.
From Print to Screen: The Theory and Practice of Digitizing Dictionaries. Lisbon Summer School in Linguistics (2-6 July 2018).
Encoding Dictionaries with TEI: A Masterclass. Lisbon Summer School in Linguistics (1-5 July 2019).
DH Training Workshop: Digital Methods for Linguistic Investigation (13-15 November 2019). Organized by the Seminar für Semitistik und Arabistik, Freie Universität Berlin, with the support of the Alexander von Humboldt Foundation and Syncro Soft.

The European Digital Humanities Masterclass 2020 had to be postponed due to the COVID-19 pandemic.

A picture is worth a thousand words

Figure 1: TEI Lex-0 would not be possible without joint community effort

1.2. The Guidelines

To what extent can we achieve consistent encoding within a given community of practice by following the TEI Guidelines? The topic is of particular importance for lexical data if we think of the potential wealth of content we could gain from pooling together the information available in the variety of highly structured, historical and contemporary lexical resources. The encoding possibilities offered by the Dictionaries Chapter in the Guidelines are too numerous and too flexible to guarantee sufficient interoperability and a coherent model for searching, visualising or enriching multiple lexical resources.

TEI Lex-0 should not be thought of as a replacement for the Dictionaries Chapter in the TEI Guidelines or as the format that must necessarily be used for editing or managing individual resources, especially in those projects and/or institutions that already have established workflows based on their own flavors of TEI. TEI Lex-0 should be primarily seen as a format that existing TEI dictionaries can be unequivocally transformed to in order to be queried, visualised, or mined in a uniform way. At the same time, however, there is no reason why TEI Lex-0 could not or should not be used as a best-practice example in educational settings or as a foundation of new TEI-based projects. This is especially true considering the fact that TEI Lex-0 aims to stay as aligned as possible with the TEI subset developed in conjunction with the revision of the ISO LMF (Lexical Markup Framework) standard (cf. Romary 2015).

1.2.1. How to cite these guidelines

Full citation

Toma Tasovac, Laurent Romary, Piotr Banski, Jack Bowers, Jesse de Does, Katrien Depuydt, Tomaž Erjavec, Alexander Geyken, Axel Herold, Vera Hildenbrandt, Mohamed Khemakhem, Boris Lehečka, Snežana Petrović, Ana Salgado and Andreas Witt. 2018. TEI Lex-0: A baseline encoding for lexicographic data. Version 0.9.5-dev. DARIAH Working Group on Lexical Resources. https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html.

Short citation

Toma Tasovac, Laurent Romary et al. 2018. TEI Lex-0: A baseline encoding for lexicographic data. Version 0.9.5-dev. DARIAH Working Group on Lexical Resources. https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html.

1.2.2. Revision history

Changes to the TEI Lex-0 specification up to version 0.8.6 were recorded in comments inside the ODD file itself. Starting with version 0.9.0, a summary of the changes is listed here for easier reference.

Version: 0.9.5 (2024-11-27)

docs: Added documentation on encoding condensed forms à la "leleti (sě)".
spec: Added model.languageProfile to better structure <language> as per #245.
spec: Added <ruby> annotation support as per #225.
spec: Added <measure> (to be used, for instance, within <extent> in <fileDesc>) as per #257.
xproc: Added a temporary step to fix xml:base and xml:lang issues in xincluded examples as per #256.
spec: Deprecated gram[@type="government"] in favor of gram[@type="government"] as per #254.
spec: Refactored model classes to fix XSD UPA violations as per #223.
docs: Minor corrections in the documentation.

Version: 0.9.4 (2024-05-12)

xproc: fix documentation build on macOS and Windows in oXygen XML Editor
spec: added degree as <gram> type value
docs: fixed some typographical errors in the documentation

Version: 0.9.3 (2024-02-12)

spec: <catDesc> must contain a <term>
spec: switch to using the external TEI add-on in oXygen when generating schema and documentation
spec: fix the mismatch in <usg> types between the specification and documentation (use temporal instead of time)
spec: require <listBibl> in <sourceDesc> with three suggested type values: dictionaries, corpora and literature
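
The two structural requirements listed for this version (a <term> inside every <catDesc>, and a <listBibl> inside <sourceDesc>) belong to the specification and are best checked by validating against the TEI Lex-0 schema. Purely as an illustration, the following minimal Python sketch tests them on a heavily abbreviated header fragment. The function name, the sample fragment and its xml:id values are hypothetical, and the sketch assumes that "contain" means a direct child element.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; everything else below is illustrative.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily abbreviated header fragment (not schema-valid as such).
SAMPLE_HEADER = """
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <sourceDesc>
      <listBibl type="dictionaries"/>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>
    <classDecl>
      <taxonomy>
        <category xml:id="usg.temporal">
          <catDesc><term>temporal</term></catDesc>
        </category>
        <category xml:id="usg.untitled">
          <catDesc>a description without a term</catDesc>
        </category>
      </taxonomy>
    </classDecl>
  </encodingDesc>
</teiHeader>
"""


def find_header_problems(xml_text):
    """Return a list of messages for the two requirements sketched above."""
    root = ET.fromstring(xml_text)
    problems = []
    # Assumption: <term> is expected as a direct child of <catDesc>.
    for cat_desc in root.iterfind(".//tei:catDesc", TEI_NS):
        if cat_desc.find("tei:term", TEI_NS) is None:
            problems.append("catDesc without term")
    # Assumption: <listBibl> is expected as a direct child of <sourceDesc>.
    for source_desc in root.iterfind(".//tei:sourceDesc", TEI_NS):
        if source_desc.find("tei:listBibl", TEI_NS) is None:
            problems.append("sourceDesc without listBibl")
    return problems


if __name__ == "__main__":
    for message in find_header_problems(SAMPLE_HEADER):
        print(message)
```

A check of this kind is no substitute for schema validation; it only makes the two changelog entries concrete.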

Version: 0.9.2 (2023-04-22)

xproc: switch to using oXygen's TEI framework when generating schema and documentation
spec: allow <list> and <item> because lists feature prominently in dictionary front matter
spec: introduce model.lexicalInter (based on model.inter), model.lexicalPhrase (based on model.phrase) and macro.lexicalParaContent (based on macro.paraContent) to make it easier to simplify the content model of various dictionary elements
spec: remove model.listLike from model.lexicalInter
html: link version number in the menu to revision history
spec: allow <abbr> and <expan> so that they can be used in lists of abbreviations in dictionary front matter
spec: introduced valency as a suggested value in gram[@type="valency"]
spec: introduced gram[@type="government"] and clarified the difference from gram[@type="colloc"]. See sections on Typology of gram and Collocates
spec: made @type mandatory on <TEI>
spec: add <principal> and <affiliation> for more robust metadata in the <teiHeader>

Version: 0.9.1 (2021-03-24)

html: fix namespace issues in html output
docs: add new examples to the Header section
docs: add section on hierarchical usage labels
spec: allow <taxonomy>, <category> and <catDesc> in <classDecl>
docs: move the specification to a different webpage for quicker loading

Version: 0.9.0 (2021-09-26)

docs: add section on TEI Header
docs: correction of various misspellings
spec: add <monogr> (needed for <biblStruct>)
spec: add <forename> and <surname> for more fine-grained bibliographic information
spec: add <editorialDecl>
spec: add <email> to make possible contact information in the header
spec: require <availability> in <publicationStmt> to provide <licence>
spec: make <sourceDesc> optional
spec: allow only <biblStruct> in <sourceDesc>
spec: make model.publicationStmtPart.agency unbound to allow both <publisher> and <authority> in <publicationStmt>
spec: add role to <authority> with suggested values: funder, sponsor, rightsHolder
spec: require <language>, <langUsage> and <profileDesc>
spec: add role to <language> with a closed list of values: objectLanguage, workingLanguage, sourceLanguage, targetLanguage
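
To make the closed value list for role on <language> more tangible, here is a minimal Python sketch that reports <language> elements whose role falls outside the four values named in the last entry. It is an illustration under stated assumptions, not part of the specification: the function name and the sample fragment are hypothetical, role is assumed to hold a single value, and a missing role is not flagged because the entry above does not say whether the attribute is mandatory.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; everything else below is illustrative.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# The closed list of values named in the 0.9.0 changelog entry.
ALLOWED_LANGUAGE_ROLES = {
    "objectLanguage",
    "workingLanguage",
    "sourceLanguage",
    "targetLanguage",
}

# Hypothetical fragment; the third role value is deliberately invalid.
SAMPLE_PROFILE = """
<profileDesc xmlns="http://www.tei-c.org/ns/1.0">
  <langUsage>
    <language ident="sr" role="objectLanguage">Serbian</language>
    <language ident="en" role="workingLanguage">English</language>
    <language ident="de" role="metaLanguage">German</language>
  </langUsage>
</profileDesc>
"""


def invalid_language_roles(xml_text):
    """Return (ident, role) pairs for roles outside the closed list."""
    root = ET.fromstring(xml_text)
    invalid = []
    for language in root.iterfind(".//tei:langUsage/tei:language", TEI_NS):
        role = language.get("role")
        # Assumption: a missing role is left to schema validation.
        if role is not None and role not in ALLOWED_LANGUAGE_ROLES:
            invalid.append((language.get("ident"), role))
    return invalid


if __name__ == "__main__":
    print(invalid_language_roles(SAMPLE_PROFILE))
```

Later versions restructure <language> (see the 0.9.5 entry on model.languageProfile), so the sketch should be read against the 0.9.0 entry only.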