List of links:
Every corpus managed by TXM is represented by files encoded in the XML-TEI TXM format (XML-TXM for short), with one file for each text of the corpus.
This format is specialized in the following textual elements:
This format is used by some externalized annotation formats. See for example:
Some annotations are also managed in their own foreign format. See for example:
Lexical units are encoded inline, following the word order of the text.
They always hold their surface form (their alphanumeric graphical representation).
They can also carry additional annotations, encoded inline or offline (standoff), depending on the processing step reached in the corpus source import process.
Any inline word annotation encoding can always be transformed to an offline word annotation encoding and vice-versa.
This strategy is similar to the ANC corpus word annotation management strategy, and is implemented through standard TXM routines.
Typically, an import process first tokenizes the corpus source texts into inline words without any annotations, then uses TreeTagger to associate pos and lemma annotations with the words through offline encoding, and later transfers the pos and lemma annotations into inline encoding for the subsequent import processing steps.
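The first step of the process described above (tokenizing a source text into inline words without annotations) can be sketched as follows. This is only an illustration under stated assumptions: the regular expression is a naive stand-in for TXM's tokenizer, the txm namespace URI is assumed, and the function name and the word id layout (modeled on the ids visible in the examples of this page) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TXM_NS = "http://textometrie.org/1.0"  # assumption: URI bound to the txm: prefix


def tokenize_to_inline_words(text, text_id):
    """Build a <p> holding one bare <w> per token, in text order.

    Hypothetical helper: the tokenizer is a naive regular expression and the
    id layout "w_<text_id>_<n>" only imitates the ids shown in the examples.
    """
    p = ET.Element(f"{{{TEI_NS}}}p")
    # Words (optionally hyphenated), or single punctuation marks.
    tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)
    for n, token in enumerate(tokens, start=1):
        w = ET.SubElement(p, f"{{{TEI_NS}}}w", {"id": f"w_{text_id}_{n}"})
        form = ET.SubElement(w, f"{{{TXM_NS}}}form")
        form.text = token  # surface form only, no annotations yet
    return p


paragraph = tokenize_to_inline_words(
    "Phileas Fogg était membre du Reform-Club, et voilà tout.", "tdm80j"
)
print([w[0].text for w in paragraph])
```

Each w element produced this way holds only its txm:form child; pos and lemma annotations would be attached in later steps.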
This element encodes all lexical units of a textual unit.
This element encodes all surface forms of lexical units.
Needs specific XSL processing:
Code example (Queste del saint Graal):
<w id="w_qgraal_cm_2643">
  <txm:form>men<pb xml:id="page_161v"/><cb xml:id="col_161c"/><lb n="1"/>joient</txm:form>
  <txm:ana resp="none" type="#dipl">men/||ioient</txm:ana>
  <txm:ana resp="none" type="#facs">men/||ioient</txm:ana>
  <txm:ana resp="none" type="#pos">VERcjg</txm:ana>
</w>
w element annotations can be encoded inline or offline.
This element encodes all inline annotations of a lexical unit.
Inline annotated words example:
<p part="N">
  <w id="w_tdm80j_429">
    <txm:form>Phileas</txm:form>
    <txm:ana resp="none" type="#n">429</txm:ana>
    <txm:ana resp="#txm" type="#frpos">NOM</txm:ana>
    <txm:ana resp="#txm" type="#frlemma">Phileas</txm:ana>
  </w>
  <w id="w_tdm80j_430">
    <txm:form>Fogg</txm:form>
    <txm:ana resp="none" type="#n">430</txm:ana>
    <txm:ana resp="#txm" type="#frpos">NAM</txm:ana>
    <txm:ana resp="#txm" type="#frlemma">Fogg</txm:ana>
  </w>
  <w id="w_tdm80j_431">
    <txm:form>était</txm:form>
    <txm:ana resp="none" type="#n">431</txm:ana>
    <txm:ana resp="#txm" type="#frpos">VER:impf</txm:ana>
    <txm:ana resp="#txm" type="#frlemma">être</txm:ana>
  </w>
  <w id="w_tdm80j_432">
    <txm:form>membre</txm:form>
    <txm:ana resp="none" type="#n">432</txm:ana>
    <txm:ana resp="#txm" type="#frpos">NOM</txm:ana>
    <txm:ana resp="#txm" type="#frlemma">membre</txm:ana>
  </w>
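Since any inline word annotation can be transformed to an offline one, the inline to offline direction can be sketched as below. This is not the TXM routine itself: the function names are hypothetical, the txm namespace URI is an assumption, the TEI namespace is omitted for brevity, and the sample is a shortened copy of the inline example.

```python
import xml.etree.ElementTree as ET

TXM_NS = "http://textometrie.org/1.0"  # assumption: URI bound to the txm: prefix

# Shortened sample of inline annotated words (TEI namespace omitted).
INLINE = f"""<p xmlns:txm="{TXM_NS}">
  <w id="w_tdm80j_429">
    <txm:form>Phileas</txm:form>
    <txm:ana resp="#txm" type="#frpos">NOM</txm:ana>
    <txm:ana resp="#txm" type="#frlemma">Phileas</txm:ana>
  </w>
  <w id="w_tdm80j_430">
    <txm:form>Fogg</txm:form>
    <txm:ana resp="#txm" type="#frpos">NAM</txm:ana>
    <txm:ana resp="#txm" type="#frlemma">Fogg</txm:ana>
  </w>
</p>"""


def collect_annotations(p):
    """Group txm:ana values by annotation type: {type: [(word id, value), ...]}."""
    groups = {}
    for w in p.iter("w"):
        for ana in w.findall(f"{{{TXM_NS}}}ana"):
            ana_type = ana.get("type").lstrip("#")
            groups.setdefault(ana_type, []).append((w.get("id"), ana.text))
    return groups


def build_standoff(groups):
    """Emit one linkGrp per annotation type, one link per annotated word."""
    div = ET.Element("div")
    for ana_type, pairs in groups.items():
        grp = ET.SubElement(div, "linkGrp", {"type": ana_type})
        for word_id, value in pairs:
            ET.SubElement(grp, "link", {"target": f"#{word_id} #{value}"})
    return div


groups = collect_annotations(ET.fromstring(INLINE))
standoff = build_standoff(groups)
print(ET.tostring(standoff, encoding="unicode"))
```

The sketch leaves the inline txm:ana elements in place; a real conversion step would also decide whether to remove them from the text file.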
Annotations are encoded:
Pointer prefix declaration
The relation between standoff annotations and the text file is implicit. If an explicit relation is needed:
a) add TEI prefix declarations in the header:
<encodingDesc>
  <listPrefixDef>
    <prefixDef ident="text" matchPattern="([a-z]+)" replacementPattern="../txm/eneas.xml#$1"/>
    <prefixDef ident="frpos" matchPattern="([a-z]+)" replacementPattern="../../../tagsets/treetagger_frpos.xml#$1"/>
  </listPrefixDef>
</encodingDesc>
b) use the prefix in the text body:
...
<link target="text:w429 frpos:NOM" />
...
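How a prefixed pointer such as text:w429 resolves to a full reference can be sketched as follows. The sketch assumes the usual TEI reading of prefixDef (the part after the prefix is matched against matchPattern, and $1 in replacementPattern stands for the first captured group). The function names are hypothetical, and the match patterns used here are deliberately wider than the ([a-z]+) of the declaration example so that digits, uppercase letters and colons are covered; that widening is an assumption, not part of the format.

```python
import re

# Hypothetical prefix table: ident -> (matchPattern, replacementPattern).
PREFIX_DEFS = {
    "text": (r"([a-z]+[0-9]+)", "../txm/eneas.xml#$1"),
    "frpos": (r"([A-Za-z:]+)", "../../../tagsets/treetagger_frpos.xml#$1"),
}


def expand_pointer(pointer, prefix_defs):
    """Expand one 'prefix:value' pointer; return it unchanged if nothing applies."""
    prefix, sep, local = pointer.partition(":")
    if not sep or prefix not in prefix_defs:
        return pointer
    match_pattern, replacement = prefix_defs[prefix]
    match = re.fullmatch(match_pattern, local)
    if match is None:
        return pointer
    # Translate the $N group syntax into Python's \g<N> syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return match.expand(template)


def expand_target(target, prefix_defs):
    """Expand every whitespace-separated pointer of a link/@target value."""
    return [expand_pointer(p, prefix_defs) for p in target.split()]


print(expand_target("text:w429 frpos:NOM", PREFIX_DEFS))
```

Pointers whose prefix is not declared, or whose value does not match the pattern, are returned untouched, so ordinary "#id" references keep working alongside prefixed ones.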
Offline annotations example:
<TEI>
  ...
  <text type="standoff">
    <body>
      <div>
        <linkGrp type="frpos">
          ...
          <link target="#w429 #NOM" />
          <link target="#w430 #NAM" />
          <link target="#w431 #VER:impf" />
          <link target="#w432 #NOM" />
          <link target="#w433 #PRP:det" />
          <link target="#w434 #NOM" />
          <link target="#w435 #PUN" />
          <link target="#w436 #KON" />
          <link target="#w437 #ADV" />
          <link target="#w438 #ADV" />
          <link target="#w439 #SENT" />
          ...
        </linkGrp>
        <linkGrp type="frlemma">
          ...
          <link target="#w429 #Phileas" />
          <link target="#w430 #Fogg" />
          <link target="#w431 #être" />
          <link target="#w432 #membre" />
          <link target="#w433 #du" />
          <link target="#w434 #Reform-Club" />
          <link target="#w435 #," />
          <link target="#w436 #et" />
          <link target="#w437 #voilà" />
          <link target="#w438 #tout" />
          <link target="#w439 #." />
          ...
        </linkGrp>
      </div>
    </body>
  </text>
</TEI>
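The offline to inline direction (transferring standoff pos and lemma values into the words) can be sketched as below. It is a minimal illustration, not the TXM routine: the function names are hypothetical, the txm namespace URI is assumed, the TEI namespace is omitted, and the sample words use the short ids of the standoff example.

```python
import xml.etree.ElementTree as ET

TXM_NS = "http://textometrie.org/1.0"  # assumption: URI bound to the txm: prefix
ET.register_namespace("txm", TXM_NS)

# Hypothetical sample data, shortened from the standoff example.
STANDOFF = """<TEI><text type="standoff"><body><div>
  <linkGrp type="frpos">
    <link target="#w429 #NOM" /><link target="#w431 #VER:impf" />
  </linkGrp>
  <linkGrp type="frlemma">
    <link target="#w429 #Phileas" /><link target="#w431 #être" />
  </linkGrp>
</div></body></text></TEI>"""

WORDS = f"""<p xmlns:txm="{TXM_NS}">
  <w id="w429"><txm:form>Phileas</txm:form></w>
  <w id="w431"><txm:form>était</txm:form></w>
</p>"""


def strip_hash(ref):
    """Drop a single leading '#', keeping any later '#' characters."""
    return ref[1:] if ref.startswith("#") else ref


def read_standoff(root):
    """Return {annotation type: {word id: value}} from linkGrp/link elements."""
    annotations = {}
    for grp in root.iter("linkGrp"):
        values = annotations.setdefault(grp.get("type"), {})
        for link in grp.iter("link"):
            # First pointer is the word, the rest of the value is the annotation.
            word_ref, value_ref = link.get("target").split(" ", 1)
            values[strip_hash(word_ref)] = strip_hash(value_ref)
    return annotations


def inject_inline(p, annotations, resp="#txm"):
    """Append one txm:ana per known annotation type to each matching <w>."""
    for w in p.iter("w"):
        for ana_type, values in annotations.items():
            value = values.get(w.get("id"))
            if value is not None:
                ana = ET.SubElement(
                    w, f"{{{TXM_NS}}}ana", {"resp": resp, "type": f"#{ana_type}"}
                )
                ana.text = value
    return p


annotations = read_standoff(ET.fromstring(STANDOFF))
paragraph = inject_inline(ET.fromstring(WORDS), annotations)
print(ET.tostring(paragraph, encoding="unicode"))
```

Splitting the target on the first space only keeps values such as "VER:impf" or "," intact; words without a matching link are left unannotated.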
Each textual unit is stored in an XML file named textid.xml.
The tei:TEI element encodes all textual units.
This element encodes all the information needed for text processing.
See an example of teiHeader of a standoff annotation file.
This element encodes sets of texts:
Contains:
This element encodes sets of texts and parallel corpora.
...
<tei:linkGroup type="align">
  <tei:link target="#corpus1 #corpus2" txm:alignElement="div" txm:alignLevel="2"/>
  <tei:link target="#corpus1 #corpus3" txm:alignElement="p"/>
  <tei:link target="#corpus2 #corpus3" txm:alignElement="s"/>
</tei:linkGroup>
...
The TXM platform combines annotations from different domains:
Some annotations of type b) are represented in the XML-TXM format through inline or offline extended TEI markup (for example pos and lemma).
Some others are represented in their own format:
Some annotations of type c) are represented in the XML-TXM format through inline or offline extended TEI markup (for example simple or advanced annotations that are not SyMoGIH-based).
Some others are represented in their own format:
Other points to examine when time permits.
Up to TXM 0.5: the XML-TXM V1 format specifications are not public.
From TXM 0.6 to …
The XML-TXM format is defined as an extension of the TEI standard schema through ODD specifications. The ODD specification sources are hosted on SourceForge: https://sourceforge.net/p/txm/code/HEAD/tree/trunk/doc/tei-txm
During an import, several cases arise:
Metadata indexing by CQP relies on the properties of the 'text' structures.
The properties of the 'text' structures come from the attributes of the 'text' element of the XML-TXM format.
The attributes of the 'text' element of the XML-TXM format come from the metadata.csv file or from the attributes of the 'text' element in the sources.
Currently, these tools rely on the following structure of the word 'id' property in order to work:
During a non-tokenizing import with external IDs that (de facto) do not follow this structure, these tools will not work.
These tools use text@id if present, otherwise the file name without its extension (textid.xml).
During a non-tokenizing import with external IDs that do not follow this structure, these tools will not work:
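The way 'text' attributes can be assembled from the two sources named above can be sketched as follows. Several points are assumptions rather than documented behaviour: that metadata.csv is comma-separated with a column named id identifying each text, that metadata.csv values override source attributes on conflict, and the function names and sample values, which are purely hypothetical.

```python
import csv
import io


def load_text_metadata(csv_file):
    """Return {text id: {property: value}} from a table with an 'id' column.

    Assumption: the first line holds column names and one column is 'id'.
    """
    reader = csv.DictReader(csv_file)
    return {
        row["id"]: {key: value for key, value in row.items() if key != "id"}
        for row in reader
    }


def text_properties(text_id, source_attributes, metadata):
    """Merge attributes of the source 'text' element with metadata.csv values.

    Assumption: on conflict, the metadata.csv value wins.
    """
    properties = dict(source_attributes)
    properties.update(metadata.get(text_id, {}))
    properties["id"] = text_id
    return properties


# Hypothetical sample table and source attributes.
sample_csv = io.StringIO("id,author,year\ntext01,Jane Doe,1900\n")
metadata = load_text_metadata(sample_csv)
print(text_properties("text01", {"lang": "fr", "year": "1899"}, metadata))
```

The resulting dictionary corresponds to the attributes that would be written on the 'text' element, and therefore to the 'text' structure properties indexed by CQP.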
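The fallback rule for the text identifier can be sketched as below. The function names are hypothetical, and the sketch assumes the identifier is carried by an unqualified id attribute on the 'text' element, as in the examples of this page.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def resolve_text_id(xml_path, root):
    """Return text@id if present, otherwise the file name without extension."""
    for element in root.iter():
        if local_name(element.tag) == "text":
            text_id = element.get("id")
            if text_id:
                return text_id
            break  # a 'text' element exists but carries no id
    return Path(xml_path).stem


with_id = ET.fromstring('<TEI><text id="eneas"/></TEI>')
without_id = ET.fromstring("<TEI><text/></TEI>")
print(resolve_text_id("txm/tdm80j.xml", with_id))
print(resolve_text_id("txm/tdm80j.xml", without_id))
```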