This is an old revision of the document!


Note: this ticket is written in English because its content comes from an email written in that language, sent to look for reusable open-source components.


'a)' entries correspond to the first development target, 'b)' to the next one, etc.

List of functionalities and specifications:


  • a) import a dictionary from TSV format (to create a dictionary)
  • import a dictionary from XML format (to create a dictionary)
  • import dictionary data from XML format (to update a dictionary)
  • import data from an OCR dictionary (to update a dictionary)
  • a) import data from a TXM corpus lexicon or index into a dictionary (to create or update a dictionary)
  • import user data from an installed local LibreOffice dictionary (to update a dictionary)
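
As an illustration of the TSV import item, here is a minimal sketch in Java. It assumes a header row that names the columns and one entry per following line; the class name `TsvDictionaryReader` and the column names `form`, `pos` and `lemma` are hypothetical and not part of the specification.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for a tab-separated dictionary with a header row. */
class TsvDictionaryReader {

    /** Parses TSV text into one map per entry (column name to cell value). */
    static List<Map<String, String>> parse(String tsv) {
        List<Map<String, String>> entries = new ArrayList<>();
        String[] lines = tsv.split("\r?\n");
        if (lines.length == 0 || lines[0].isEmpty()) {
            return entries; // no header row: empty dictionary
        }
        String[] columns = lines[0].split("\t", -1);
        for (int n = 1; n < lines.length; n++) {
            if (lines[n].isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = lines[n].split("\t", -1);
            Map<String, String> entry = new LinkedHashMap<>();
            for (int i = 0; i < columns.length; i++) {
                // Assumption: missing trailing cells are stored as empty strings.
                entry.put(columns[i], i < cells.length ? cells[i] : "");
            }
            entries.add(entry);
        }
        return entries;
    }

    public static void main(String[] args) {
        String tsv = "form\tpos\tlemma\nchats\tNOM\tchat\nmange\tVER\tmanger\n";
        List<Map<String, String>> entries = parse(tsv);
        // prints: 2 entries, first lemma: chat
        System.out.println(entries.size() + " entries, first lemma: " + entries.get(0).get("lemma"));
    }
}
```

Keeping each entry as a map from column name to value means the same reader can serve both dictionary creation and update, whatever columns a given file provides.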


  • export a dictionary into an XML format
  • a) export a dictionary into a TSV format
  • c) export or transfer dictionary data to a lemmatizer (tokenization + pos tagging + lemmatization) learning component
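
The TSV export item could look like the sketch below. It is only an illustration: the class name `TsvDictionaryWriter` is hypothetical, and replacing tabs and line breaks inside cells with spaces is an assumption made here because plain TSV has no agreed escaping mechanism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical writer producing a header row followed by one line per entry. */
class TsvDictionaryWriter {

    /** Serializes entries using the given column order; absent values become empty cells. */
    static String write(List<String> columns, List<Map<String, String>> entries) {
        StringBuilder out = new StringBuilder();
        out.append(String.join("\t", columns)).append("\n");
        for (Map<String, String> entry : entries) {
            List<String> cells = new ArrayList<>();
            for (String column : columns) {
                cells.add(sanitize(entry.getOrDefault(column, "")));
            }
            out.append(String.join("\t", cells)).append("\n");
        }
        return out.toString();
    }

    /** Assumption: tabs and line breaks inside a cell are replaced by a single space. */
    private static String sanitize(String cell) {
        return cell.replaceAll("[\t\r\n]", " ");
    }
}
```

Passing the column list explicitly keeps the column order stable across exports, which makes exported files easier to compare.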

User Interface:

  • b) connect dictionary data to tokenizing components
  • b) provide an editing UI for dictionary entries creation or update (browse, search, view, edit, copy, paste, delete)
  • a) provide an efficient underlying software component for persistence and search/browse (for example a relational database backend)
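
To make the search/browse requirement concrete, here is a small in-memory stand-in written in Java. It is not a relational backend and offers no persistence; it only sketches the two access paths such a component would need (exact lookup and prefix browsing over a sorted index). The class name `InMemoryLexiconIndex` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical in-memory index: word form to the set of its lemmas, kept sorted by form. */
class InMemoryLexiconIndex {

    private final TreeMap<String, Set<String>> lemmasByForm = new TreeMap<>();

    /** Registers a lemma for a word form. */
    void add(String form, String lemma) {
        lemmasByForm.computeIfAbsent(form, k -> new TreeSet<>()).add(lemma);
    }

    /** Exact lookup; returns an empty set for an unknown form. */
    Set<String> lookup(String form) {
        Set<String> lemmas = lemmasByForm.get(form);
        return lemmas == null ? new TreeSet<String>() : lemmas;
    }

    /** Returns, in sorted order, all forms starting with the given prefix. */
    List<String> prefixSearch(String prefix) {
        // Every key with this prefix sorts between prefix and prefix + U+FFFF.
        return new ArrayList<>(lemmasByForm.subMap(prefix, prefix + Character.MAX_VALUE).keySet());
    }
}
```

With a relational backend, the same two operations would typically map onto an indexed column queried by equality and by a prefix pattern.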


  • merge dictionaries
  • a) intersect dictionaries
  • etc.
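
The merge and intersect operations can be sketched as follows in Java. The semantics are assumptions, since the list above does not define them: a dictionary is modelled as a map from word form to a set of analyses, merging takes the union of forms and of their analyses, and intersecting keeps the forms of the first dictionary that also occur in the second, with the first dictionary's analyses. The class name `DictionaryOps` is hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical set-like operations on dictionaries modelled as form -> analyses. */
class DictionaryOps {

    /** Union of both dictionaries; analyses of a shared form are combined. */
    static Map<String, Set<String>> merge(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        Map<String, Set<String>> result = new TreeMap<>();
        addAll(result, a);
        addAll(result, b);
        return result;
    }

    /** Forms of a that also occur in b, keeping the analyses found in a (an assumption). */
    static Map<String, Set<String>> intersect(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : a.entrySet()) {
            if (b.containsKey(e.getKey())) {
                result.put(e.getKey(), new TreeSet<>(e.getValue()));
            }
        }
        return result;
    }

    /** Copies every form and analysis of source into target without modifying source. */
    private static void addAll(Map<String, Set<String>> target, Map<String, Set<String>> source) {
        for (Map.Entry<String, Set<String>> e : source.entrySet()) {
            target.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
        }
    }
}
```

Under these assumptions, intersecting a reference dictionary with a lexicon extracted from a corpus would restrict the dictionary to the forms attested in that corpus.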


  • start providing a dictionary software component very soon
  • start with the trivial tabular model <one word per line, one word property per column>, because the push comes from the need to develop open-source lemmatizers; the TreeTagger architecture is the reference model (tabular lexicon + tagged gold-standard corpus for training → language model used to lemmatize)
  • probably need to develop bilingual support later (TXM imports TMX corpora)
  • probably need to develop compound-word support later (Unitex is the reference model: simple-word lexicon + compound-word lexicon + local grammars)
  • probably need to extend the tabular model into something richer later (several tables, hierarchical, network, etc.)
  • etc.
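
The tabular model mentioned above can be illustrated with a short Java sketch that groups rows of the form <form, tag, lemma> into one line per word form, followed by its tag-lemma pairs. This layout resembles the full-form lexicon used to train TreeTagger, but the exact format should be checked against the TreeTagger documentation; the class name `LexiconLines` and the three-column row layout are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical conversion from tabular rows to one lexicon line per word form. */
class LexiconLines {

    /**
     * Each row is assumed to be {form, tag, lemma}. The result has one line per form,
     * sorted by form: the form, then tab-separated "tag lemma" pairs in input order.
     */
    static List<String> toLexiconLines(List<String[]> rows) {
        Map<String, Set<String>> pairsByForm = new TreeMap<>();
        for (String[] row : rows) {
            pairsByForm.computeIfAbsent(row[0], k -> new LinkedHashSet<>())
                       .add(row[1] + " " + row[2]);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : pairsByForm.entrySet()) {
            lines.add(e.getKey() + "\t" + String.join("\t", e.getValue()));
        }
        return lines;
    }
}
```

Duplicate rows collapse into a single tag-lemma pair, so the same table can be re-exported after merges without producing repeated analyses.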

Platform status

Progress on developing the solution


State of the art

We should look for software components that implement such data and operations, to which we could delegate dictionary management, or which we could reuse, adapt or extend.
The preferred components would be:

  • under open-source license
  • Java-based (if the best component is C++-based, a connector would do: either tight, in Java, or loose, via SQL etc.)
  • aware of standard external representations (serialized, for example in XML)
  • bonus: able to host online dictionaries (typically with RESTful API access and access control)

Suggestions by PB:

  • FLEx “integrates dictionary development with basic corpus analysis”
  • KorAP (Next Generation Corpus Analysis Platform)


Final version





Test protocol



Current status

Who When What

public/specs_dictionnaire.1447347826.txt.gz · Last modified: 2015/11/12 18:03 by slh@ens-lyon.fr