XML-TXM Format Specification V2

Every corpus managed by TXM is represented by files encoded in the XML-TEI TXM format (XML-TXM for short), with one file for each text of the corpus.

This format is specialized in the following textual elements:

  • texts :
    • texts and their properties1) = textual units
    • texts and the various uses of their components (ignored, indexed, edited, etc.) = textual planes
    • texts and their rendering for reading = text editions
    • texts alignment = aligned corpora
  • words :
    • words and their properties2) = lexical units

This format is used by some externalized annotation formats. See for example:

Some annotations are also managed in their own foreign format. See for example:

  • Spécifications du module d'import XML-TS générique, for syntactic annotations.

Lexical Units

Lexical units are encoded inline in text word order flow.

They always hold their surface form (their alphanumeric graphical representation).

They can also hold additional annotations through inline or offline (standoff) encoding depending on the processing step involved in the corpus source import process.

Any inline word annotation encoding can always be transformed to an offline word annotation encoding and vice-versa.

This strategy is similar to the ANC corpus word annotation management strategy, and is implemented through standard TXM routines.

Typically, an import process first tokenizes the corpus source texts into inline words without any annotations, then associates pos and lemma annotations with words through offline encoding using TreeTagger, and later transfers the pos and lemma annotations into inline encoding for further import processing steps.

tei:w element

This element encodes all lexical units of a textual unit.

  • identifier: xml:id attribute
    • identifier syntax: w_XXX_xxxxxx IMPLEMENTED
      • XXX: (String) text id
      • xxxxxx: (Integer) word number in text
    • the xml:id value is unique inside a corpus
    • the xml:id attribute never changes IMPLEMENTED
    • to add a token, use one of the following:
      • Solution A:
        • id = previous_id + '_1', '_2', etc.
        • if the word is inserted at the very beginning, use “w_XXX_0”
        • this patch works with TXM 0.7.8 and above for the back-to-text (edition from concordance) command. The back-to-text command will not work on TXM portal 0.6.2 for words with patched ids
      • Solution B:
        • id = “w_XXX_” + (textLastWordNum + 1)
        • increment textLastWordNum
  • attribute n: the number of the word (may change)
  • attribute xml:lang (ISO language code, optional?)
    • the xml:lang attribute is only used if the language of the token differs from its context; otherwise it is inherited from ancestor elements
  • may contain multiple txm:form elements
  • may have tei:note child elements
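
The identifier rules above can be sketched in Python. This is only an illustration of the w_XXX_xxxxxx syntax and of Solutions A and B; the function names are hypothetical and are not part of TXM.

```python
from typing import Tuple


def word_id(text_id: str, number: int) -> str:
    """Build a word identifier following the w_<text id>_<word number> syntax."""
    return f"w_{text_id}_{number}"


def patched_id(previous_id: str, rank: int = 1) -> str:
    """Solution A: derive the id of an inserted token from the preceding word id."""
    return f"{previous_id}_{rank}"


def appended_id(text_id: str, text_last_word_num: int) -> Tuple[str, int]:
    """Solution B: use the next free number and return the incremented counter."""
    new_num = text_last_word_num + 1
    return word_id(text_id, new_num), new_num
```

Solution A keeps the inserted token next to its neighbour in the id itself, while Solution B only guarantees uniqueness and relies on the textLastWordNum counter being stored with the text.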

txm:form element

current implementation

This element encodes all surface forms of lexical units.

  • contains one of the forms of a word IMPLEMENTED

current implementation: specific to the XTZ import module

Needs specific XSL processing:

  • if a word has multiple presentation forms (or edition facets), only one of them is recorded in <txm:form>, the others become <txm:ana>
    • special characters are used to encode additional markup (position of line break, letters added to abbreviation expansion, lettrines, etc.)
    • different edition facets are built with XSLT scripts
  • line-, column- or page-breaks may occur within <txm:form>

Code example (Queste del saint Graal) :

<w id="w_qgraal_cm_2643">
  <txm:form>men<pb xml:id="page_161v"/><cb xml:id="col_161c"/><lb n="1"/>joient</txm:form>
  <txm:ana resp="none" type="#dipl">men/||ioient</txm:ana>
  <txm:ana resp="none" type="#facs">men/||ioient</txm:ana>
  <txm:ana resp="none" type="#pos">VERcjg</txm:ana>
</w>

planned

  • if there are multiple txm:form elements, the 'type' attribute discriminates between them. Values: default, …

tei:w element annotations

w element annotations can be encoded inline or offline.

Inline annotations

  • a tei:w element may contain multiple annotations txm:ana elements IMPLEMENTED
    • the number and order of txm:form and txm:ana may vary in different tokens
    • the import modules must receive a list of form and ana to import in the form ( (type1+resp1 name1 mandatory|optional) (type2+resp2 name2 mandatory|optional) …) [import parameter - use txm:ana@type + txm:ana@resp]

txm:ana element

This element encodes all inline annotations of a lexical unit.

  • it contains an annotation of a word including msd (pos), lemma… IMPLEMENTED
  • the content of the element encodes the value of the annotation
  • there is no limit on the number of txm:ana elements in a w element
  • the attribute @type encodes the type of annotation IMPLEMENTED
    • the type makes it possible to identify repetition (not ambiguity) and provides part of the property name
  • attribute @key: available (TEI semantics and constraints: reference ID to an external repository)
  • attribute @resp: reference to the resp id in the teiHeader (link to the “responsibility” of the encoding, typically the NLP tool and its parameters used to build the annotation value) IMPLEMENTED

Inline annotated words example3):

<p part="N">
    <w id="w_tdm80j_429">
        <txm:form>Phileas</txm:form>
        <txm:ana resp="none" type="#n">429</txm:ana>
        <txm:ana resp="#txm" type="#frpos">NOM</txm:ana>
        <txm:ana resp="#txm" type="#frlemma">Phileas</txm:ana>
    </w>
    <w id="w_tdm80j_430">
        <txm:form>Fogg</txm:form>
        <txm:ana resp="none" type="#n">430</txm:ana>
        <txm:ana resp="#txm" type="#frpos">NAM</txm:ana>
        <txm:ana resp="#txm" type="#frlemma">Fogg</txm:ana>
    </w>
    <w id="w_tdm80j_431">
        <txm:form>était</txm:form>
        <txm:ana resp="none" type="#n">431</txm:ana>
        <txm:ana resp="#txm" type="#frpos">VER:impf</txm:ana>
        <txm:ana resp="#txm" type="#frlemma">être</txm:ana>
    </w>
    <w id="w_tdm80j_432">
        <txm:form>membre</txm:form>
        <txm:ana resp="none" type="#n">432</txm:ana>
        <txm:ana resp="#txm" type="#frpos">NOM</txm:ana>
        <txm:ana resp="#txm" type="#frlemma">membre</txm:ana>
    </w>
    ...
</p>
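
As a rough illustration of how the inline encoding can be read back, the following Python sketch extracts the form and the annotations of each word with the standard library. The namespace URI bound to the txm: prefix and the absence of a namespace on the w element are assumptions made for this sample; real files may differ.

```python
import xml.etree.ElementTree as ET

TXM_NS = "http://textometrie.org/1.0"  # assumed namespace URI for the txm: prefix

SAMPLE = f"""<p xmlns:txm="{TXM_NS}">
  <w id="w_tdm80j_429">
    <txm:form>Phileas</txm:form>
    <txm:ana resp="#txm" type="#frpos">NOM</txm:ana>
    <txm:ana resp="#txm" type="#frlemma">Phileas</txm:ana>
  </w>
  <w id="w_tdm80j_431">
    <txm:form>était</txm:form>
    <txm:ana resp="#txm" type="#frpos">VER:impf</txm:ana>
    <txm:ana resp="#txm" type="#frlemma">être</txm:ana>
  </w>
</p>"""


def read_words(xml_text):
    """Return one dict per w element: id, surface form, annotations keyed by type."""
    root = ET.fromstring(xml_text)
    words = []
    for w in root.iter("w"):  # assumes w carries no namespace in this sample
        form = w.find(f"{{{TXM_NS}}}form")
        annotations = {
            ana.get("type").lstrip("#"): (ana.text or "")
            for ana in w.findall(f"{{{TXM_NS}}}ana")
        }
        words.append({
            "id": w.get("id"),
            # itertext() skips milestone elements (pb, cb, lb) inside the form
            "form": "".join(form.itertext()) if form is not None else "",
            "ana": annotations,
        })
    return words


words = read_words(SAMPLE)
```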

Offline annotations (standoff)

Annotations are encoded:

  • in an independent XML-TEI file that declares, in a header, the annotation types and the tool used to build them IMPLEMENTED;
  • at the bottom of the XML-TEI TXM file NOT IMPLEMENTED.

Pointer prefix declaration

The relation between standoff annotations and the text file is implicit. If an explicit relation is needed:

a) add TEI prefix declarations in the header:

<encodingDesc>
    <listPrefixDef>
        <prefixDef ident="text" matchPattern="([a-z0-9_]+)" replacementPattern="../txm/eneas.xml#$1"/>
        <prefixDef ident="frpos" matchPattern="([A-Za-z:]+)" replacementPattern="../../../tagsets/treetagger_frpos.xml#$1"/>
    </listPrefixDef>
</encodingDesc>

b) use the prefix in the text body:

   ...
   <link target="text:w429 frpos:NOM" />
   ...
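
To make the prefix mechanism concrete, here is a minimal Python sketch that expands prefixed pointers into full URIs. The prefix table is hypothetical: it mirrors the ident, matchPattern and replacementPattern attributes of prefixDef, with patterns chosen so that word ids containing digits and upper-case tags are accepted.

```python
import re

# Hypothetical prefix table: ident -> (matchPattern, replacementPattern).
PREFIX_DEFS = {
    "text": (r"([a-z0-9_]+)", "../txm/eneas.xml#$1"),
    "frpos": (r"([A-Za-z:]+)", "../../../tagsets/treetagger_frpos.xml#$1"),
}


def resolve_pointer(pointer):
    """Expand 'prefix:value' with the matching definition; leave other pointers as-is."""
    prefix, sep, local = pointer.partition(":")
    if not sep or prefix not in PREFIX_DEFS:
        return pointer
    pattern, replacement = PREFIX_DEFS[prefix]
    match = re.fullmatch(pattern, local)
    if match is None:
        raise ValueError(f"{pointer!r} does not match the pattern of prefix {prefix!r}")
    # The replacement pattern refers to captured groups as $1, $2, ...
    return re.sub(r"\$(\d)", lambda m: match.group(int(m.group(1))), replacement)


def resolve_target(target):
    """Resolve every whitespace-separated pointer of a link/@target value."""
    return [resolve_pointer(p) for p in target.split()]
```

For instance, resolve_target("text:w429 frpos:NOM") yields one URI into the text file and one into the tagset file.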

Offline annotations example4):

<TEI>
  ...
    <text type="standoff">
        <body>
            <div>
                <linkGrp type="frpos">
          ...
                    <link target="#w429 #NOM" />
                    <link target="#w430 #NAM" />
                    <link target="#w431 #VER:impf" />
                    <link target="#w432 #NOM" />
                    <link target="#w433 #PRP:det" />
                    <link target="#w434 #NOM" />
                    <link target="#w435 #PUN" />
                    <link target="#w436 #KON" />
                    <link target="#w437 #ADV" />
                    <link target="#w438 #ADV" />
                    <link target="#w439 #SENT" />
          ...
                </linkGrp>
                <linkGrp type="frlemma">
          ...
                    <link target="#w429 #Phileas" />
                    <link target="#w430 #Fogg" />
                    <link target="#w431 #être" />
                    <link target="#w432 #membre" />
                    <link target="#w433 #du" />
                    <link target="#w434 #Reform-Club" />
                    <link target="#w435 #," />
                    <link target="#w436 #et" />
                    <link target="#w437 #voilà" />
                    <link target="#w438 #tout" />
                    <link target="#w439 #." />
          ...
                </linkGrp>
            </div>
        </body>
    </text>
</TEI>
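
A standoff file of this shape can be turned into a per-word annotation table, which is the information needed to transfer the annotations back into inline txm:ana elements. The Python sketch below is an illustration only and does not reproduce the TXM routines; it assumes un-namespaced elements, as in the sample above.

```python
import xml.etree.ElementTree as ET

STANDOFF = """<TEI>
  <text type="standoff"><body><div>
    <linkGrp type="frpos">
      <link target="#w429 #NOM"/>
      <link target="#w431 #VER:impf"/>
    </linkGrp>
    <linkGrp type="frlemma">
      <link target="#w429 #Phileas"/>
      <link target="#w431 #être"/>
    </linkGrp>
  </div></body></text>
</TEI>"""


def strip_hash(ref):
    """Remove the single leading '#' of a local pointer, if any."""
    return ref[1:] if ref.startswith("#") else ref


def read_standoff(xml_text):
    """Return {word id: {annotation type: value}} from linkGrp/link elements."""
    annotations = {}
    for group in ET.fromstring(xml_text).iter("linkGrp"):
        ana_type = group.get("type")
        for link in group.iter("link"):
            # The first pointer designates the word, the second the value.
            word_ref, value_ref = link.get("target").split(None, 1)
            annotations.setdefault(strip_hash(word_ref), {})[ana_type] = strip_hash(value_ref)
    return annotations


annotations = read_standoff(STANDOFF)
```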

Textual Units

Each textual unit is stored in an XML file named textid.xml.

tei:TEI (tei:teiHeader + tei:text)

The tei:TEI element encodes all textual units.

  • The tei:TEI element is mandatory
  • Each text is identified by an xml:id attribute (identifier)
    • The xml:id value is unique inside a corpus
    • attribute xml:id will never change
    • The tei:teiCorpus txm:corpusLastTextNum attribute encodes the last generated text id number (for automatically generated text ids)
    • if a text has no id, generate one with “text” + an integer starting from 1
    • to add a text, use id = corpusLastTextNum + 1
      • update the word ids accordingly
  • if two texts have the same id, add an integer suffix, starting from 1, to one of them
  • @type = “standoff” if the file contains standoff annotations
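
The text id rules above could be applied as in the following Python sketch. It is an illustration under assumptions: the function name is hypothetical, and the "_" separator before the disambiguating suffix is a choice made here, since the rule does not specify one.

```python
def assign_text_ids(declared_ids):
    """Return (ids, last_text_num); None in the input means the text has no id."""
    used = set()
    last_text_num = 0  # plays the role of txm:corpusLastTextNum
    result = []
    for declared in declared_ids:
        if declared is None:
            # No id: generate "text" + integer, starting from 1.
            last_text_num += 1
            candidate = f"text{last_text_num}"
            while candidate in used:
                last_text_num += 1
                candidate = f"text{last_text_num}"
        else:
            # Duplicate id: add an integer suffix starting from 1
            # (the "_" separator is an assumption).
            candidate = declared
            suffix = 0
            while candidate in used:
                suffix += 1
                candidate = f"{declared}_{suffix}"
        used.add(candidate)
        result.append(candidate)
    return result, last_text_num
```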

tei:teiHeader

TEI header processing is not currently implemented in the XML-TEI TXM import module.

This element encodes all the information needed for text processing.

  • tei:teiHeader is mandatory
  • technical metadata are encoded in txm:metadata milestone sub-elements of the txm:applicationDesc sub-element (sibling of tei:encodingDesc, tei:fileDesc, tei:profileDesc, tei:revisionDesc)
    • the version number of the TXM which has produced the corpus is encoded in '<txm:metadata name="version" value="0.6"/>'
    • the text 'textLastWordNum' is encoded in '<txm:metadata name="textLastWordNum" value="1234"/>'
    • the text main language is encoded in '<txm:metadata name="lang" value="en"/>'
    • external metadata, imported from CSV, are encoded in one '<txm:metadata/>' element per metadata item
      • the 'name' attribute comes from the first line of the CSV file
      • the 'value' attribute comes from the column cell of the line describing the text
  • with lexical annotations :
    • the tei:fileDesc contains a tei:titleStmt with one tei:respStmt per annotation tool
    • the tei:encodingDesc contains a tei:appInfo/txm:application per annotation tool
  • the tei:encodingDesc contains a tei:classDecl/tei:taxonomy per annotation type
    • the name of the annotation for CQL is encoded in tei:classDecl/tei:taxonomy@xml:id

See an example of teiHeader of a standoff annotation file.

Intermediate text structures

  • Any xml element (TEI or not) with text content is indexed and can be used in queries and subcorpus construction
  • <tei:s> is used for sentences and can optionally be added automatically during the import process

Textual Planes and Out-of-text elements

Text Editions

Corpus

tei:teiCorpus

This element encodes sets of texts:

  • contains one or several tei:TEI

tei:teiHeader

Contains:

  • The txm:corpusLastTextNum attribute or element encodes the last generated text id number (for automatically generated text id)

Aligned Corpora

tei:teiCorpus

This element encodes sets of texts and parallel corpora.

  • if a corpus has no id, generate one with “corpus” + integer from 1
  • if 2 corpora have the same id, add an integer suffix to one from 1
  • contains one or several tei:TEI
  • The txm:corpusLastTextNum attribute encodes the last generated text id number (for automatically generated text id)
  • if an 'align.xml' file is present in the same directory
    • it must be a tei:TEI element containing a tei:text with a tei:linkGrp element whose tei:link children encode the relation between 'corpus1' and 'corpus2', 'corpus1' and 'corpus3', etc. in the following way:
      ...
      <tei:linkGrp type="align">
       <tei:link target="#corpus1 #corpus2" txm:alignElement="div" txm:alignLevel="2"/>
       <tei:link target="#corpus1 #corpus3" txm:alignElement="p"/>
       <tei:link target="#corpus2 #corpus3" txm:alignElement="s"/>
      </tei:linkGrp>
      ...
  • aligned elements must have a txm:alignId attribute whose value is unique within a tei:teiCorpus and shared between one or several pairs of corpora.
  • recursive use of an element can be encoded with a txm:alignLevel attribute. Example: txm:alignLevel="2" if the div element is contained in another div.
  • corpora must be strictly aligned (the behaviour of TXM in other cases is not defined) TEST REQUIRED

tei:linkGrp

  • @type value is “align”
  • contains one link element per alignment
  • @target contains a list of corpus names
  • @txm:alignElement the aligned element name
  • @txm:alignLevel the depth of the aligned element
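
The attributes listed above can be read with a few lines of Python. The sketch below is illustrative only; the TEI namespace is the standard one, while the URI bound to the txm: prefix is an assumption.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TXM_NS = "http://textometrie.org/1.0"  # assumed namespace URI for the txm: prefix

ALIGN = f"""<tei:linkGrp xmlns:tei="{TEI_NS}" xmlns:txm="{TXM_NS}" type="align">
  <tei:link target="#corpus1 #corpus2" txm:alignElement="div" txm:alignLevel="2"/>
  <tei:link target="#corpus1 #corpus3" txm:alignElement="p"/>
</tei:linkGrp>"""


def read_alignments(xml_text):
    """Return one dict per link: aligned corpora, element name, optional depth."""
    group = ET.fromstring(xml_text)
    if group.get("type") != "align":
        return []
    alignments = []
    for link in group.findall(f"{{{TEI_NS}}}link"):
        level = link.get(f"{{{TXM_NS}}}alignLevel")
        alignments.append({
            "corpora": [ref.lstrip("#") for ref in link.get("target").split()],
            "element": link.get(f"{{{TXM_NS}}}alignElement"),
            "level": int(level) if level is not None else None,
        })
    return alignments


alignments = read_alignments(ALIGN)
```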

Foreign Annotations

The TXM platform combines annotations from different domains:

  • a) XML-TEI mark-up (source encoding)
  • b) annotations generated by NLP tools (e.g. TreeTagger)
  • c) interactive annotations edited from within TXM tools
  • d) other linguistic annotations (e.g. TIGERSearch syntactic annotations)

NLP tools

Some annotations of type b) are represented in the XML-TXM format through inline or offline extended TEI markup (for example pos and lemma).

Some others are represented in their own format:

  • tree-bank syntactic annotations: See import_tiger to read about how a corpus is imported with that type of annotations. See ajout_moteur_resolution_annotation to read about a typical combined management of the TEI representation and of the tree-bank representation.

Editing Annotations Within TXM

Some annotations of type c) are represented in the XML-TXM format through inline or offline extended TEI markup (for example non SyMoGIH based, simple or advanced annotations).

Some others are represented in their own format:

  • SyMoGIH historical semantics ontology annotations: See annotation SyMoGIH to read about how SyMoGIH entity identifiers are associated with spans of text through the concordance tool. In this case the ontology and the entities are accessed remotely (through an RDBMS connector). The annotations can nevertheless be exported in XML-TEI format (see Export des annotations SyMoGIH).
  • Analec coreference chain annotations: See annotation Analec to read about how Analec annotation structure entities are associated with spans of text through the text edition reading tool. In this case the ontologies and the entities are embedded inside TXM.

Other Versions of the XML-TXM format

TODO V3

  • remove '#' from txm:ana/@type

Other points we need to see when we have time.

  • w
    • [AL] as stated in current ODD, may also have @ref (as member of att.canonical), @subtype & @cert (as member of att.responsibility). If we remove these attributes, we lose potential TEI compatibility
  • form
    • [AL] as stated in current ODD, may also have @ref (as member of att.canonical), @subtype. If we remove these attributes, we lose potential TEI compatibility
  • questions
    • how to manage internal elements in the w element? (supplied, choice, corr, sic …?)
      • [AL] they should be allowed within txm:form, but maybe with restrictions (e.g. create multiple txm:form instead of choice)
      • these elements should probably be ignored in CQP indexes but used in building editions
    • allow multiple text elements?
      • [AL] within a corpus ???

V1

Until TXM 0.5 : XML-TXM V1 format specifications are not public.

V2

From TXM 0.6 to …

Standardization

The XML-TXM format is defined as an extension to the TEI standard schema through ODD specifications. The ODD specifications sources are hosted on Sourceforge : https://sourceforge.net/p/txm/code/HEAD/tree/trunk/doc/tei-txm

Solution

TXM 0.7.9 -> TXM 0.8.1

Word identifier management

During an import, several cases arise:

  • the tokenizer is enabled:
    • homogeneous - TXM builds all the word identifiers: uniqueness is then guaranteed. This is the case for sources that are not tokenized, or tokenized but without identifiers
    • mixed - TXM builds some of the word identifiers: uniqueness is not guaranteed. This is the case for partially tokenized sources with identifiers
  • the tokenizer is not enabled
    • only the tokenized words will be indexed AND the identifiers must be present and must conform to the rules defined by the XML-TXM format

Text identifier management

The indexing of metadata by CQP relies on the properties of the 'text' structures.

Text metadata indexing

The indexing of metadata by CQP relies on the properties of the 'text' structures.

The properties of the 'text' structures come from the attributes of the 'text' element of the XML-TXM format.

The attributes of the 'text' element of the XML-TXM format come from the metadata.csv file or from the attributes of the 'text' element in the sources.

backToText, NextPage, URSUnitAnnotate tools

These tools currently rely on the following structure of the word 'id' property:

  • 'w_<text id>_<word number>'
  • the 'w_' prefix is used by URSUnitAnnotate to filter the word SPANs in edition pages
  • the integer numeric order of the <word number> values is used by backToText and NextPage to locate edition pages

In an import without the tokenizer, with (de facto) external IDs that do not follow this structure, these tools will not work.
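
The 'w_<text id>_<word number>' structure can be checked and decomposed with a regular expression, as in this hypothetical Python sketch. It shows why the word number must be compared as an integer, and it does not handle patched ids carrying an extra '_1' suffix.

```python
import re

# Greedy text id, so that ids such as w_qgraal_cm_2643 keep "qgraal_cm" as text id.
WORD_ID = re.compile(r"^w_(?P<text>.+)_(?P<num>\d+)$")


def parse_word_id(word_id):
    """Return (text id, word number), or None if the id does not follow the structure."""
    match = WORD_ID.match(word_id)
    if match is None:
        return None
    return match.group("text"), int(match.group("num"))


def sort_by_word_number(word_ids):
    """Order well-formed ids by integer word number, not by string comparison."""
    return sorted(word_ids, key=lambda word_id: parse_word_id(word_id)[1])
```

An external id such as "tok12" yields None, which is the situation in which the tools above cannot locate edition pages.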

Edition, Concordance/reference, etc. tools

These tools use text@id if present, otherwise the file name without its extension (textid.xml).
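
This fallback can be expressed in one line of Python. The function name and the dictionary standing for the attributes of the 'text' element are hypothetical.

```python
from pathlib import PurePath


def effective_text_id(text_attributes, file_name):
    """Use text@id when present, otherwise the file name without its extension."""
    return text_attributes.get("id") or PurePath(file_name).stem
```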

TXM X.Y.Z

In an import without the tokenizer, with external IDs that do not follow this structure, these tools will not work:

  • solution 1: reject, with a diagnostic, any source whose <w> elements do not follow this structure
  • solution 2: build well-formed word ids, at import or at load time, and move the external ids to another property (e.g. 'txmBackupID')
  • solution 3: use a specific ID property for this processing (e.g. 'txmid')
  • solution 4: move the need for an order relation between words, and for access to the text identifier, somewhere other than the word ids
    • 4.1: split the information into several TXM-specific properties (@number…)
    • 4.2: store the information in specific indexes (e.g. an index dedicated to word-page-edition-text-corpus relations → evolve import.xml into something else)
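
As an illustration of solution 2 only, the Python sketch below rebuilds well-formed word ids in text order and keeps any differing external id in a 'txmBackupID' property. Words are represented as plain dictionaries, which is an assumption of this sketch and not the TXM data model.

```python
def forge_word_ids(text_id, words):
    """Return copies of the words with ids of the form w_<text id>_<word number>."""
    forged = []
    for number, word in enumerate(words, start=1):
        patched = dict(word)  # do not modify the caller's data
        new_id = f"w_{text_id}_{number}"
        external = word.get("id")
        if external is not None and external != new_id:
            patched["txmBackupID"] = external  # keep the external id
        patched["id"] = new_id
        forged.append(patched)
    return forged
```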
1)
for example coming from a metadata.csv file
2)
for example 'pos' and 'lemma' information coming from TreeTagger
3)
First words of the sentence “Phileas Fogg était membre du Reform-Club, et voilà tout.” from the TDM80J TXM demo corpus, French version of 'Around the World in Eighty Days' (Le tour du monde en quatre-vingts jours, 1873, Jules Verne)
4)
First sentence of the TDM80J TXM demo corpus: “Phileas Fogg était membre du Reform-Club, et voilà tout.”
public/xml_tei_txm.txt · Last modified: 23/06/2021 11:03 by matthieu.decorde@ens-lyon.fr