
TEI Lex-0

— A baseline encoding for lexicographic data

1. Introduction

TEI Lex-0 is both a technical specification and a set of community-based recommendations for encoding machine-readable dictionaries. It is rooted in the Guidelines of the Text Encoding Initiative (TEI) and delivered as a customization of the TEI schema.

Following the spirit of TEI Analytics, developed in the context of the MONK project (Zillig 2009), TEI Lex-0 aims at establishing a baseline encoding and a target format to facilitate the interoperability of heterogeneously encoded lexical resources. This is important both in the context of building lexical infrastructures as such (Ermolaev and Tasovac 2012) and in the context of developing generic TEI-aware tools such as dictionary viewers and profilers.

For the latest changes, see our revision history.

1.1. The community

Preliminary work for the establishment of TEI Lex-0 started in the Working Group "Retrodigitised Dictionaries" led by Toma Tasovac and Vera Hildenbrandt as part of the COST Action European Network of e-Lexicography (ENeL). Upon the completion of the COST Action, the work on TEI Lex-0 was taken up by the DARIAH Working Group "Lexical Resources". Currently, the work on TEI Lex-0 is also supported by the H2020-funded European Lexicographic Infrastructure (ELEXIS).

1.1.1. DARIAH Working Group

The DARIAH Working Group on Lexical Resources is a self-organized scholarly community working under the auspices of the pan-European Digital Research Infrastructure for Arts and Humanities (DARIAH-EU). The goals of the WG are:

  • to explore, assess and recommend standard tools and methods for the creation, application and dissemination of born-digital and retro-digitized lexical resources (dictionaries, lexicons, thesauri, word lists etc.) as well as other, similar kinds of structured data (gazetteers, almanacs, encyclopaedias etc.); and
  • to foster, develop and publicize digitally-enabled lexicographic research from a cross-disciplinary and transnational perspective.

The WG focuses on the application and explication of existing standards, both onomasiological (TMF, TBX and SKOS) and semasiological (LMF, TEI, and Ontolex); draws upon the expertise of various DARIAH partners who are active in this field; and collaborates with relevant external projects and associations, such as the European Lexicographic Infrastructure (ELEXIS) and CLARIN, in order to ensure the widest possible reach of the Working Group’s results.

At the same time, the WG pursues a strong research-driven agenda on the diversity of European lexicographic heritage. In addition to investigating pan-European vocabularies and multiple dimensions of lexical borrowing, the working group evaluates current practices and formulates guidelines on data enrichment and mutual linking of existing electronic dictionaries in view of their common European heritage.

WG Chairs

Laurent Romary is Directeur de Recherche at Inria, France (team ALMAnaCH). He received a PhD degree in computational linguistics in 1989 and his Habilitation in 1999. He carries out research on the modelling of semi-structured documents, with a specific emphasis on texts and linguistic resources. He has been active in standardisation activities with ISO, as chair of committee ISO/TC 37/SC 4 (2002-2014) and chair of ISO/TC 37 (2016-), and with the Text Encoding Initiative, as member (2001-2011) and chair (2008-2011) of its Technical Council. He also has a long-standing involvement in open science activities.

Toma Tasovac is Director of the Belgrade Center for Digital Humanities (BCDH) and DARIAH-EU. He was educated at Harvard University, Princeton University and Trinity College Dublin. His areas of interest include lexicography, data modeling, TEI, digital editions and research infrastructures. He previously served as the National Coordinator of DARIAH-RS and Chair of the National Coordinators' Committee at DARIAH-EU. Under Toma's leadership, BCDH has received funding from various national and international granting bodies, including Erasmus Plus and Horizon 2020.

DigiLex Blog

The working group runs a blog called DigiLex: Legacy Dictionaries Reloaded as a platform for sharing tips, raising questions and discussing methods for the creation of lexical resources.

1.1.2. ELEXIS

ELEXIS is an H2020-funded project which proposes to integrate, extend and harmonise national and regional efforts in the field of lexicography, both modern and historical, with the goal of creating a sustainable infrastructure which will (1) enable efficient access to high-quality lexical data in the digital age, and (2) bridge the gap between more advanced and lesser-resourced scholarly communities working on lexicographic resources.

1.1.3. Contributors

  • Piotr Banski
  • Jack Bowers
  • Jesse de Does
  • Katrien Depuydt
  • Tomaž Erjavec
  • Alexander Geyken
  • Axel Herold
  • Vera Hildenbrandt
  • Mohamed Khemakhem
  • Boris Lehečka
  • Snežana Petrović
  • Laurent Romary
  • Ana Salgado
  • Toma Tasovac
  • Andreas Witt

1.1.4. The Rahtz Prize

In recognition of its work on TEI Lex-0, the DARIAH WG Lexical Resources was awarded the 2020 Rahtz Prize for TEI Ingenuity.

Members of the DARIAH Working Group Lexical Resources have made a valuable contribution to the Dictionaries Chapter of the TEI Guidelines. Their efforts and their expertise have been formidable and highly appreciated by the TEI Community for many years. — Martina Scholger, Chair of the TEI Technical Council

1.1.5. Meetings

The Working Group has organized a number of working meetings dedicated to the development of TEI Lex-0. These include:

  • Toward Best Practice Guidelines for Encoding Legacy Dictionaries: An ENeL-DARIAH-PARTHENOS Expert Workshop. Preußische Staatsbibliothek, Berlin (17-19 November 2016).
  • Overview of Retrodigitized Dictionaries and Best-Practice Guidelines For Encoding Legacy Dictionaries. ENeL Annual Meeting, Budapest (24 February 2017).
  • TEI Lex-0 @DARIAH WG "Lexical Resources". Harnack Haus, Freie Universität Berlin (27 April 2017).
  • TEI Lex-0 @DARIAH WG "Lexical Resources". Austrian Center for Digital Humanities, Austrian Academy of Sciences, Vienna (26 June 2017).
  • TEI Lex-0: From Best-Practice Guidelines to a TEI Schema. DARIAH-EU Coordination Office, Berlin (2-3 May 2018). Funded by DARIAH-EU's Working Groups Funding Scheme and ELEXIS.
  • TEI Lex-0 and Beyond: A Workshop. University of Ljubljana (16 July 2018). Funded by DARIAH-EU's Working Group Funding Scheme and ELEXIS.
  • TEI Lex-0 Meeting. DARIAH-EU Coordination Office, Berlin (30 January 2019).
  • Joint TEI Lex-0 / Ontolex-Lemon Meeting. Collocated with eLex 2019. Sintra, Portugal (4 October 2019). Funded by ELEXIS.
  • Toward a TEI Lex-0 Publisher: A Workshop, DARIAH-EU Coordination Office, Berlin (16-17 December 2019). Funded by the Belgrade Center for Digital Humanities.

1.1.6. Training measures

TEI Lex-0 and best practices in lexical data modeling have been introduced to a large number of young scholars at various training events, including:

The European Digital Humanities Masterclass 2020 had to be postponed due to the COVID-19 pandemic.


1.2. The Guidelines

To what extent can we achieve consistent encoding within a given community of practice by following the TEI Guidelines? The topic is of particular importance for lexical data if we think of the potential wealth of content we could gain from pooling together the information available in the variety of highly structured, historical and contemporary lexical resources. The encoding possibilities offered by the Dictionaries Chapter in the Guidelines are too numerous and too flexible to guarantee sufficient interoperability and a coherent model for searching, visualising or enriching multiple lexical resources.

TEI Lex-0 should not be thought of as a replacement for the Dictionaries Chapter in the TEI Guidelines or as the format that must necessarily be used for editing or managing individual resources, especially in those projects and/or institutions that already have established workflows based on their own flavors of TEI. TEI Lex-0 should be seen primarily as a format that existing TEI dictionaries can be unequivocally transformed to in order to be queried, visualised, or mined in a uniform way. At the same time, however, there is no reason why TEI Lex-0 could not or should not be used as a best-practice example in educational settings or as a foundation of new TEI-based projects. This is especially true considering that TEI Lex-0 aims to stay as aligned as possible with the TEI subset developed in conjunction with the revision of the ISO LMF (Lexical Markup Framework) standard (cf. Romary 2015).
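To make the idea of uniform querying concrete, here is a minimal sketch in Python using only the standard library. It assumes two entries that have already been converted from different source dictionaries into a common shape. The sample entries, their identifiers, and the helper name `list_lemmas` are hypothetical illustrations, and the markup is a simplified assumption rather than a quotation from the specification.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
NS = {"tei": TEI_NS}

# Hypothetical sample: two entries imagined as coming from different
# dictionaries after conversion to one shared baseline encoding.
SAMPLE = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <entry xml:id="dictA.cat" xml:lang="en">
    <form type="lemma"><orth>cat</orth></form>
    <gramGrp><gram type="pos">noun</gram></gramGrp>
    <sense xml:id="dictA.cat.1"><def>a small domesticated feline</def></sense>
  </entry>
  <entry xml:id="dictB.chat" xml:lang="fr">
    <form type="lemma"><orth>chat</orth></form>
    <gramGrp><gram type="pos">nom</gram></gramGrp>
    <sense xml:id="dictB.chat.1"><def>petit animal domestique</def></sense>
  </entry>
</body>
"""


def list_lemmas(xml_text):
    """Return id, language, lemma and part of speech for every entry.

    Because all entries share one encoding, a single pair of path
    expressions works regardless of which dictionary an entry came from.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter(f"{{{TEI_NS}}}entry"):
        orth = entry.find("tei:form[@type='lemma']/tei:orth", NS)
        pos = entry.find("tei:gramGrp/tei:gram[@type='pos']", NS)
        rows.append({
            "id": entry.get(f"{{{XML_NS}}}id"),
            "lang": entry.get(f"{{{XML_NS}}}lang"),
            "lemma": orth.text if orth is not None else None,
            "pos": pos.text if pos is not None else None,
        })
    return rows


if __name__ == "__main__":
    for row in list_lemmas(SAMPLE):
        print(row)
```

The point of the sketch is the single query path: without a shared baseline, each source dictionary would need its own extraction logic.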

1.2.1. How to cite these guidelines

Full citation

Toma Tasovac, Laurent Romary, Piotr Banski, Jack Bowers, Jesse de Does, Katrien Depuydt, Tomaž Erjavec, Alexander Geyken, Axel Herold, Vera Hildenbrandt, Mohamed Khemakhem, Boris Lehečka, Snežana Petrović, Ana Salgado and Andreas Witt. 2018. TEI Lex-0: A baseline encoding for lexicographic data. Version 0.9.5-dev. DARIAH Working Group on Lexical Resources. https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html.

Short citation

Toma Tasovac, Laurent Romary et al. 2018. TEI Lex-0: A baseline encoding for lexicographic data. Version 0.9.5-dev. DARIAH Working Group on Lexical Resources. https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html.

1.2.2. Revision history

Changes to the TEI Lex-0 specification up to version 0.8.6 were recorded in comments inside the ODD file itself. Starting with version 0.9.0, a summary of the changes is given in the list below for easier reference.

Version: 0.9.5 (2024-11-27)
  • docs: Added documentation on encoding condensed forms à la "leleti (sě)".
  • spec: Added model.languageProfile to better structure <language> as per #245.
  • spec: Added <ruby> annotation support as per #225.
  • spec: Added <measure> (to be used, for instance, within <extent> in <fileDesc>) as per #257.
  • xproc: Added a temporary step to fix xml:base and xml:lang issues in xincluded examples as per #256.
  • spec: Deprecated gram[@type="government"] in favor of gram[@type="government"] as per #254.
  • spec: Refactored model classes to fix XSD UPA violations as per #223.
  • docs: Minor corrections in the documentation.
Version: 0.9.4 (2024-05-12)
  • xproc: fix documentation build on macOS and Windows in oXygen XML Editor
  • spec: added degree as <gram> type value
  • docs: fixed some typographical errors in the documentation
Version: 0.9.3 (2024-02-12)
  • spec: <catDesc> must contain a <term>
  • spec: switch to using the external TEI add-on in oXygen when generating schema and documentation
  • spec: fix the mismatch in <usg> types between the specification and documentation (use temporal instead of time)
  • spec: require <listBibl> in <sourceDesc> with three suggested type values: dictionaries, corpora and literature
Version: 0.9.2 (2023-04-22)
  • xproc: switch to using oXygen's TEI framework when generating schema and documentation
  • spec: allow <list> and <item> because lists feature prominently in dictionary front matter
  • spec: introduce model.lexicalInter (based on model.inter), model.lexicalPhrase (based on model.phrase) and macro.lexicalParaContent (based on macro.paraContent) to make it easier to simplify the content model of various dictionary elements
  • spec: remove model.listLike from model.lexicalInter
  • html: link version number in the menu to revision history
  • spec: allow <abbr> and <expan> so that they can be used in lists of abbreviations in dictionary front matter
  • spec: introduced valency as a suggested value in gram[@type="valency"]
  • spec: introduced gram[@type="government"] and clarified the difference from gram[@type="colloc"]. See sections on Typology of gram and Collocates
  • spec: made @type mandatory on <TEI>
  • spec: add <principal> and <affiliation> for more robust metadata in the <teiHeader>
Version: 0.9.1 (2021-03-24)
Version: 0.9.0 (2021-09-26)