Note: this ticket is written in English because its content comes from an email written in that language, whose aim was to find reusable open-source components.


'a)' entries correspond to the first development target, 'b)' to the next one, etc.

List of functionalities and specifications:


  • a) import a dictionary from TSV format (to create a dictionary)
  • import a dictionary from XML format (to create a dictionary)
  • import dictionary data from XML format (to update a dictionary)
  • import data from an OCR dictionary (to update a dictionary)
  • a) import data from a TXM corpus lexicon or index into a dictionary (to create or update a dictionary)
  • import user data from an installed local LibreOffice dictionary (to update a dictionary)
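
As an illustration of the TSV import item, here is a minimal Java sketch, not a definitive implementation. It assumes the TSV file starts with a header row naming the columns; the class name `TsvDictionaryImport` and the column names `form`, `pos` and `lemma` are hypothetical and do not come from the ticket.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: reads a TSV dictionary (header row assumed) into a list of entries. */
final class TsvDictionaryImport {

    /** Each entry maps a column name (taken from the header row) to the cell value. */
    static List<Map<String, String>> read(Reader source) {
        List<String> lines = new BufferedReader(source).lines().collect(Collectors.toList());
        List<Map<String, String>> entries = new ArrayList<>();
        if (lines.isEmpty()) {
            return entries; // empty input: no columns, no entries
        }
        // Limit -1 keeps trailing empty cells instead of dropping them.
        String[] columns = lines.get(0).split("\t", -1);
        for (String line : lines.subList(1, lines.size())) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split("\t", -1);
            Map<String, String> entry = new LinkedHashMap<>();
            for (int i = 0; i < columns.length; i++) {
                // Cells missing at the end of a short line are stored as empty strings.
                entry.put(columns[i], i < cells.length ? cells[i] : "");
            }
            entries.add(entry);
        }
        return entries;
    }

    public static void main(String[] args) {
        String tsv = "form\tpos\tlemma\nchevaux\tNOM\tcheval\nva\tVER\taller\n";
        List<Map<String, String>> entries = read(new StringReader(tsv));
        System.out.println(entries.size() + " entries, first lemma: " + entries.get(0).get("lemma"));
    }
}
```

Keeping each entry as a column-name-to-value map means the importer does not need to know in advance which word properties a given dictionary provides.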


  • export a dictionary into an XML format
  • a) export a dictionary into a TSV format
  • c) export or transfer dictionary data to the training component of a lemmatizer (tokenization + POS tagging + lemmatization)
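
The TSV export item could be sketched as follows in Java. This is only an illustration under assumptions: the class name `TsvDictionaryExport`, the in-memory representation (one map per entry) and the choice to replace tabs and line breaks inside values by spaces are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: serializes dictionary entries as TSV with a header row. */
final class TsvDictionaryExport {

    /** Writes the header row, then one line per entry, with cells in the given column order. */
    static String toTsv(List<String> columns, List<Map<String, String>> entries) {
        StringBuilder out = new StringBuilder();
        out.append(String.join("\t", columns)).append('\n');
        for (Map<String, String> entry : entries) {
            List<String> cells = new ArrayList<>();
            for (String column : columns) {
                // A property missing from an entry is written as an empty cell.
                cells.add(clean(entry.getOrDefault(column, "")));
            }
            out.append(String.join("\t", cells)).append('\n');
        }
        return out.toString();
    }

    /** Tabs and line breaks inside a value would corrupt the table, so they become spaces. */
    private static String clean(String value) {
        return value.replaceAll("[\\t\\r\\n]", " ");
    }

    public static void main(String[] args) {
        Map<String, String> entry = new LinkedHashMap<>();
        entry.put("form", "chevaux");
        entry.put("pos", "NOM");
        entry.put("lemma", "cheval");
        List<Map<String, String>> entries = new ArrayList<>();
        entries.add(entry);
        System.out.print(toTsv(Arrays.asList("form", "pos", "lemma"), entries));
    }
}
```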

User Interface:

  • b) connect dictionary data to tokenizing components
  • b) provide an editing UI for dictionary entries creation or update (browse, search, view, edit, copy, paste, delete)
  • a) provide an efficient underlying software component for persistence and search/browse (for example, a relational database backend)
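
To make the search/browse requirement concrete, here is a small Java sketch of a sorted in-memory index. It is only a stand-in for the persistent backend mentioned above (it does not persist anything), and the class and method names (`DictionaryIndex`, `add`, `lookup`, `browse`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory index: word form -> list of analyses, kept sorted by form. */
final class DictionaryIndex {

    private final NavigableMap<String, List<String>> byForm = new TreeMap<>();

    /** Registers one analysis (for example "NOM cheval") for a word form. */
    void add(String form, String analysis) {
        byForm.computeIfAbsent(form, k -> new ArrayList<>()).add(analysis);
    }

    /** Exact search: all analyses recorded for this form, or an empty list. */
    List<String> lookup(String form) {
        return byForm.getOrDefault(form, Collections.emptyList());
    }

    /** Browse: all forms starting with the prefix, in sorted order. */
    List<String> browse(String prefix) {
        // Keys with this prefix lie in [prefix, prefix + U+FFFF),
        // assuming no form contains the character U+FFFF itself.
        return new ArrayList<>(
                byForm.subMap(prefix, true, prefix + Character.MAX_VALUE, false).keySet());
    }

    public static void main(String[] args) {
        DictionaryIndex index = new DictionaryIndex();
        index.add("cheval", "NOM cheval");
        index.add("chevaux", "NOM cheval");
        index.add("chat", "NOM chat");
        System.out.println(index.browse("chev"));
        System.out.println(index.lookup("chat"));
    }
}
```

In a relational backend the same two operations would roughly correspond to an indexed equality lookup and an indexed prefix query on the word-form column.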


  • merge dictionaries
  • a) intersect dictionaries
  • etc.
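
The merge and intersect operations could look like the Java sketch below. The semantics are assumptions made for the example, not decisions recorded in this ticket: a dictionary is modelled as a map from word form to a set of analyses, merging takes the union of analyses per form, and intersecting keeps only the analyses present in both dictionaries. The class name `DictionaryOps` is hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical set operations on dictionaries modelled as form -> set of analyses. */
final class DictionaryOps {

    /** Union: every form of either dictionary, with all analyses from both. */
    static Map<String, Set<String>> merge(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        Map<String, Set<String>> result = new TreeMap<>();
        addAll(result, a);
        addAll(result, b);
        return result;
    }

    /** Intersection: forms present in both dictionaries, restricted to shared analyses. */
    static Map<String, Set<String>> intersect(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : a.entrySet()) {
            Set<String> other = b.get(e.getKey());
            if (other == null) {
                continue; // form absent from the second dictionary
            }
            Set<String> shared = new TreeSet<>(e.getValue());
            shared.retainAll(other);
            if (!shared.isEmpty()) {
                result.put(e.getKey(), shared);
            }
        }
        return result;
    }

    /** Copies every analysis of source into target without modifying source. */
    private static void addAll(Map<String, Set<String>> target, Map<String, Set<String>> source) {
        for (Map.Entry<String, Set<String>> e : source.entrySet()) {
            target.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, Set<String>> a = new TreeMap<>();
        a.put("porte", new TreeSet<>(Arrays.asList("NOM porte", "VER porter")));
        a.put("va", new TreeSet<>(Arrays.asList("VER aller")));
        Map<String, Set<String>> b = new TreeMap<>();
        b.put("porte", new TreeSet<>(Arrays.asList("NOM porte")));
        b.put("chevaux", new TreeSet<>(Arrays.asList("NOM cheval")));
        System.out.println("merged:    " + merge(a, b));
        System.out.println("intersect: " + intersect(a, b));
    }
}
```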


  • start to provide a dictionary software component very soon
  • start with the trivial tabular model <one word per line, one word property per column>, because the driving need is to develop open-source lemmatizers; the TreeTagger architecture is the reference model (tabular lexicon + tagged gold-standard corpus for training → language model used to lemmatize)
  • will probably need to add bilingual support later (TXM imports TMX corpora)
  • will probably need to add compound-word support later (Unitex is the reference model: simple-word lexicon + compound-word lexicon + local grammars)
  • will probably need to extend the tabular model into something richer later (several tables, hierarchical, network, etc.)
  • etc.

We should look for software components that implement this kind of data and these operations, either to delegate dictionary management to them or to use, adapt or extend them.
The preferred components would be:

  • under open-source license
  • Java-based (if the best component is C++-based, a connector could do: tightly coupled through Java, or loosely coupled through SQL, etc.)
  • aware of standard external representations (serialized, for example, in XML)
  • bonus: able to host online dictionaries (typically with RESTful API access and access control)

Suggestions by PB:

  • FLEx “integrates dictionary development with basic corpus analysis”
  • KorAP (Next Generation Corpus Analysis Platform)


