PLATO corpus: demonstration of Perseus Greek & Treebank texts (AGDT 2) in TXM

Project presentation

  • Goals:
    • demonstrate that texts available from the Perseus project can be used in TXM
    • TEI-compliant import
    • compatibility of TXM with the Greek language
    • show that TXM can work on the POS annotation provided by the Treebank (TreeTagger is not the only way to get tagged texts into TXM)
  • Corpus:
    • Plato's Euthyphro from AGDT 2: tlg0059.tlg001.perseus-grc1.tb.xml
  • Available resources (approximate list):
    • txm-filter-perseustreebank-xmlw.xsl


Create a directory (e.g. “plato”) and put the XML text file(s) downloaded from Perseus AGDT in it.
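
Before running the import, it can help to check that every file in the directory is well-formed and contains the expected elements. The following is a minimal sketch in Python, not part of TXM: the function name summarize_treebank_dir is hypothetical, and it assumes that the files use the un-namespaced <sentence> and <word> elements that the stylesheet below matches.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def summarize_treebank_dir(directory):
    """Return {file name: (sentence count, word count)} for each .xml file.

    ET.parse raises xml.etree.ElementTree.ParseError if a file is not
    well-formed, so a clean run also serves as a basic sanity check.
    """
    summary = {}
    for path in sorted(Path(directory).glob("*.xml")):
        root = ET.parse(str(path)).getroot()
        sentences = root.findall(".//sentence")
        words = root.findall(".//sentence/word")
        summary[path.name] = (len(sentences), len(words))
    return summary


if __name__ == "__main__":
    # "plato" is the example directory name used in this page.
    for name, (n_sent, n_words) in summarize_treebank_dir("plato").items():
        print(f"{name}: {n_sent} sentences, {n_words} words")
```

A file that reports zero sentences probably does not follow the treebank layout and would produce an empty corpus.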

Then run the TXM command File>Import>XML/w + CSV with the following settings:

1. Source directory: “plato” (in our example).

2. Import parameters:

  • Main Language: untick “Annotate the corpus” (i.e. do not run TreeTagger)
  • Lexical Segmentation: no change (default settings)
  • Front XSL: select your local copy of txm-filter-perseustreebank-xmlw.xsl
  • Editions: default settings (Build edition, Words per page = 500, Page break tag = pb)
  • Display font: default setting (Font name = <default>)
  • Commands: default setting (Concordance context structure limits = text)

3. Click on “Start corpus import” (at the top of the page).


We made two changes to the stylesheet:

  • a correction: the Perseus @id attribute of each word is renamed to perseus-id on the <w> elements, for compatibility with TXM
  • an improvement: an <lb/> element is added after each sentence for better rendering in the HTML edition.
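
The effect of these two changes can be illustrated outside TXM. The sketch below is written in Python and only imitates what the stylesheet does to one sentence; it is not the stylesheet itself. The function name convert_sentence, the output element name "s" and the sample attribute values are assumptions made for the illustration, and the annotator attribute that the stylesheet also writes is left out.

```python
import xml.etree.ElementTree as ET


def convert_sentence(sentence):
    """Build the output for one Perseus <sentence>: a sentence element, then <lb/>.

    The output element name "s" is an assumption made for this sketch.
    """
    s = ET.Element("s", dict(sentence.attrib))
    for word in sentence.findall("word"):
        attrs = {}
        for name, value in word.attrib.items():
            if name == "form":
                continue  # the form becomes the text content of <w>
            # Change 1: the Perseus id is kept under the name perseus-id
            attrs["perseus-id" if name == "id" else name] = value
        w = ET.SubElement(s, "w", attrs)
        w.text = word.get("form")
    # Change 2: a line break element follows each sentence
    return [s, ET.Element("lb")]


# Hypothetical input: the attribute values are illustrative,
# not copied from the treebank file.
sample = ET.fromstring(
    '<sentence id="1">'
    '<word id="1" form="τί" lemma="τίς" postag="p-s---nn-"/>'
    '<word id="2" form="νεώτερον" lemma="νέος" postag="a-s---nnc"/>'
    '</sentence>'
)
s, lb = convert_sentence(sample)
converted = ET.tostring(s, encoding="unicode") + ET.tostring(lb, encoding="unicode")
```

In the converted string each word appears as a <w> element whose text is the former form attribute, with perseus-id in place of id, while lemma and postag are carried over unchanged so that TXM can expose them as word properties.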

XSL Perseus stylesheet used for this import


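<?xml version="1.0"?>
<!-- Reconstructed where this page lost lines: the namespace declarations (the treebank URI is an assumption), -->
<!-- the <s> and <w> wrappers, the <xsl:copy> rules and all closing tags. Check against the original file. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:edate="http://exslt.org/dates-and-times" xmlns:xd="http://www.pnp-software.com/XSLTdoc"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:treebank="http://nlp.perseus.tufts.edu/syntax/treebank/1.5"
  exclude-result-prefixes="edate xd xsi treebank" version="2.0">
  <xd:doc type="stylesheet">
      A stylesheet to prepare PERSEUS Treebank XML texts to TXM XML/w import.
      This stylesheet is free software; you can redistribute it and/or
      modify it under the terms of the GNU Lesser General Public
      License as published by the Free Software Foundation; either
      version 3 of the License, or (at your option) any later version.
      This stylesheet is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
      Lesser General Public License for more details.
      You should have received a copy of GNU Lesser Public License with
      this stylesheet. If not, see
    <xd:author>Alexei Lavrentiev</xd:author>
    <xd:copyright>2012, CNRS / ICAR (ICAR3 LinCoBaTO)</xd:copyright>
  </xd:doc>
  <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no"/>
  <!-- Default rules: copy elements, attributes and comments; drop processing instructions -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|processing-instruction()|comment()|text()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="@*|comment()">
    <xsl:copy/>
  </xsl:template>
  <xsl:template match="processing-instruction()"/>
  <xsl:template match="text()"><xsl:value-of select="."/></xsl:template>
  <!-- The treebank root becomes a <text> element carrying the metadata as attributes -->
  <xsl:template match="treebank">
    <text type="treebank" version="{@version}" date="{normalize-space(child::date[1])}" annotator-short="{normalize-space(child::annotator[1]/short)}" annotator-name="{normalize-space(child::annotator[1]/name)}" annotator-address="{normalize-space(child::annotator[1]/address)}">
      <xsl:apply-templates select="descendant::sentence"/>
    </text>
  </xsl:template>
  <xsl:template match="annotator"/>
  <!-- Each sentence is followed by <lb/> for better rendering in the HTML edition -->
  <xsl:template match="sentence">
    <s>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="annotator"><xsl:value-of select="child::annotator"/></xsl:attribute>
      <xsl:apply-templates select="*"/>
    </s>
    <lb/>
  </xsl:template>
  <!-- Each word becomes <w>: its attributes are kept and @form becomes the text content -->
  <xsl:template match="word">
    <w>
      <xsl:apply-templates select="@*[not(name()='form')]"/>
      <xsl:value-of select="@form"/>
    </w>
  </xsl:template>
  <!-- The Perseus @id is renamed to perseus-id for compatibility with TXM -->
  <xsl:template match="word/@id">
    <xsl:attribute name="perseus-id"><xsl:value-of select="."/></xsl:attribute>
  </xsl:template>
</xsl:stylesheet>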

