Text Grid Dramen Corpus Import Tutorial

Requirements

To follow this tutorial you will need:

1. Prepare files

We will use the TXM “XML/w + CSV” import module, which requires each text of the corpus to be stored in a separate file. The first step is therefore to split the teiCorpus into one file per text.

  1. Apply the txm-filter-teicorpustextgrid-xmlw.xsl stylesheet to the Dramen.xml file
    • you may use the TXM “ExecXSLMacro” for that purpose
    • it will create an “out” subfolder and produce 8 separate TEI files, each named by concatenating the following metadata:
      • creationDate
      • author
      • title
    • it will also add the attributes @author, @title and @creationDate to the <text> element, which makes them available for corpus exploitation
    • it will delete the teiHeader, which does not belong to the source text
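
Once the stylesheet has run, the result can be checked with a short script. The following is a minimal sketch in Python, not part of TXM: it assumes the files in the “out” subfolder have an .xml extension, and the function name check_split_output is hypothetical. It lists, for each file, which of the three attributes are missing from the first <text> element.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Attributes the tutorial says the stylesheet adds to <text>.
EXPECTED_ATTRIBUTES = ("author", "title", "creationDate")


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def check_split_output(out_dir):
    """Return {filename: [missing attribute names]} for each .xml file in out_dir.

    Hypothetical helper: the .xml extension is an assumption.
    """
    report = {}
    for path in sorted(Path(out_dir).glob("*.xml")):
        root = ET.parse(str(path)).getroot()
        # Find the first <text> element, whatever its namespace.
        text = next(
            (el for el in root.iter() if local_name(el.tag) == "text"), None
        )
        if text is None:
            report[path.name] = list(EXPECTED_ATTRIBUTES)
        else:
            report[path.name] = [
                name for name in EXPECTED_ATTRIBUTES if name not in text.attrib
            ]
    return report
```

Calling check_split_output("out") should return 8 entries for the Dramen corpus, each with an empty list if all three attributes were added.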

2. Primary tokenization

  1. Copy the source files produced by the XSL transformation to a folder, e.g. “dramentemp”
  2. Open TXM and use the File / Import / XML/w + CSV menu
  3. Select the source directory (where you placed the files)
  4. Start the import (“Play” button)
  5. You may stop the import as soon as “Tokenization complete” is displayed on the console
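
Step 1 can also be scripted rather than done by hand. This is a minimal sketch using only the Python standard library; the function name copy_xml_files and the .xml extension are assumptions, and the folder paths are whatever you chose on your machine.

```python
from pathlib import Path
import shutil


def copy_xml_files(source_dir, target_dir):
    """Copy every .xml file from source_dir into target_dir.

    Hypothetical helper: creates target_dir if needed and returns
    the names of the copied files in sorted order.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(source_dir).glob("*.xml")):
        shutil.copy2(path, target / path.name)
        copied.append(path.name)
    return copied
```

For example, copy_xml_files("out", "dramentemp") copies the split files into the temporary source folder, which is then selected in step 3.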

3. Adjust word properties and perform final import

  1. Get the tokenized files from the $TXMHOME/corpora/dramentemp/tokenized folder and place them in a new source folder, e.g. "dramen" (by default the folder name will become the name of the corpus)
  2. Select the new source directory in the TXM "Import parameters of XML/w + CSV" form
  3. Specify the main language "de" and check the "Annotate the corpus" box if you want to annotate the files
    • see http://txm.sourceforge.net/installtreetagger_en.html for instructions on how to use TreeTagger with TXM
  4. Select the txm-filter-teitextgrid-xmlw-posttok.xsl stylesheet in the "Front XSL" field. This stylesheet will:
    • add @ref to every word (<w>) for default concordance references (filename + page number)
    • transform <speaker> elements into a @who attribute of the <sp> element (to allow comparing speakers)
    • raise initial <pb> tags as high as possible in the XML structure
  5. Run the import process to the end
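
To make the <speaker> to @who transformation more concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the stylesheet's actual logic: the function name add_who_attributes, the whitespace and punctuation clean-up, and the choice to leave the <speaker> element in place are all assumptions.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def add_who_attributes(root):
    """Set @who on every <sp> from the text of its first <speaker> child.

    Hypothetical helper; returns the number of <sp> elements updated.
    """
    updated = 0
    for sp in root.iter():
        if local_name(sp.tag) != "sp":
            continue
        speaker = next(
            (child for child in sp if local_name(child.tag) == "speaker"), None
        )
        if speaker is None:
            continue
        # Join the text of all descendants (e.g. <w> tokens), collapse
        # whitespace, and drop trailing spaces, periods and colons.
        text = "".join(speaker.itertext())
        name = " ".join(text.split()).rstrip(" .:")
        if name:
            sp.set("who", name)
            updated += 1
    return updated
```

With a @who value on each <sp>, speeches can be grouped or compared by speaker, which is the purpose stated in the list above.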

4. Customize edition pages

  1. Get the XML-TXM files from the $TXMHOME/corpora/dramen/txm/DRAMEN folder
  2. Run the txm-edition-xmltxm-textgrid.xsl stylesheet on all XML-TXM files
    • all original xml-tei elements will be transformed into <div>, <p> or <span> HTML elements with a @class created by concatenation of the original TEI element name and, if available, its @type and @subtype
  3. Save or rename the standard output files with an .html extension
  4. Run the txm-edition-page-split.xsl stylesheet on every .html file with the parameter cssname=txm-textgrid
    • the results will be written to the “default” subfolder
  5. Create a “css” subfolder in the “default” directory and copy the tei.css and txm-textgrid.css files there
  6. Replace the original “default” folder in $TXMHOME/corpora/dramen/HTML/DRAMEN with the one you have just generated
  7. Your TXM corpus is ready!
    • you can customize the style of the TXM edition by editing the txm-textgrid.css file
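
Steps 5 and 6 are plain file operations and can be scripted. The sketch below uses only the Python standard library; the function name install_edition and its parameters are hypothetical, and the paths must be adapted to your own TXM home folder. Note that it deletes the existing “default” folder of the corpus, so keep a copy if you may need the original edition.

```python
from pathlib import Path
import shutil

# CSS files named in step 5.
CSS_FILES = ("tei.css", "txm-textgrid.css")


def install_edition(generated_default, css_source_dir, corpus_html_dir):
    """Add the CSS files to the generated "default" folder, then install it.

    Hypothetical helper:
      generated_default: the "default" folder written by the page-split step
      css_source_dir:    folder holding tei.css and txm-textgrid.css
      corpus_html_dir:   e.g. <TXM home>/corpora/dramen/HTML/DRAMEN
    Returns the path of the installed "default" folder.
    """
    generated = Path(generated_default)
    css_dir = generated / "css"
    css_dir.mkdir(exist_ok=True)
    for name in CSS_FILES:
        shutil.copy2(Path(css_source_dir) / name, css_dir / name)

    target = Path(corpus_html_dir) / "default"
    if target.exists():
        # Destructive: removes the original edition pages.
        shutil.rmtree(target)
    shutil.copytree(generated, target)
    return target
```

After this, edits to txm-textgrid.css inside the installed “default/css” folder change the look of the edition without regenerating the pages.
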
public/tutorial_textgriddramen.txt · Last modified: 2016/06/23 14:01 by slh@ens-lyon.fr