This page is dedicated to projects using TXM on texts taken from the Perseus Digital Library:

Please bear in mind that this is a public page.

Anybody who has subscribed to the txm-users mailing list can edit this page.

CICERO corpus : demonstration of Perseus Latin texts in TXM

Project presentation

  • goal :
    • demonstrating that one can work on texts available from the Perseus project in TXM
    • TEI compliant import
    • if possible, nice editions (could be shown through another corpus)
  • Available resources (approximate list)
    • txm-filter-perseus-tei-xtz.xsl
      • p4 to p5 conversion
      • management of numbered div : div1, div2
      • handling of nested <text> : inside a <group>, nested <text> elements are renamed <subtext>
        • teiheader-to-metadata.xsl (?) : gets information from the teiHeader and adds it as attributes on the <text> element.
    • a useful macro : text2metadata (to be checked) : generates a metadata.csv from the XML-TXM files of a corpus

Specifications

Conversion from TEI P4 to TEI P5 (Sebastian Rahtz's stylesheet).

Metadata : from <teiHeader><fileDesc><titleStmt>, get

  • first <title> content,
  • first <author> content,
  • first <editor> content.
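
As an illustration of this metadata rule outside XSLT, here is a minimal Python sketch using only the standard library. The function name header_metadata and the sample document (with placeholder values) are assumptions made for this example; in the actual import the extraction is done by the front stylesheet shown further down.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def header_metadata(tei_xml: str) -> dict:
    """Return the first title, author and editor found in <titleStmt>."""
    root = ET.fromstring(tei_xml)
    title_stmt = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt", TEI_NS)
    metadata = {}
    for field in ("title", "author", "editor"):
        node = None
        if title_stmt is not None:
            # find() returns the first matching child, or None if there is none
            node = title_stmt.find(f"tei:{field}", TEI_NS)
        text = "".join(node.itertext()) if node is not None else ""
        # Whitespace is collapsed here; this is a choice of the sketch
        metadata[field] = " ".join(text.split())
    return metadata


# Hypothetical sample with placeholder values (not a real Perseus header)
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Sample title</title>
    <title type="sub">Second title, ignored</title>
    <author>Sample Author</author>
    <editor>Sample Editor</editor>
  </titleStmt></fileDesc></teiHeader>
  <text><body><p>Sample text</p></body></text>
</TEI>"""

print(header_metadata(SAMPLE))
```

A missing field yields an empty string in this sketch, which mirrors what an XSLT value-of on an empty selection would produce.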

Manage XML-TEI features which wouldn't work with CQP :

  • div1, div2 → div
  • <text><group><text> → <text><group><textgroupitem> (or other better tag name)
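
The two renaming rules above can be sketched in Python as follows. This is only an illustration under assumptions: the function name flatten_for_cqp is hypothetical, the element name subtext follows the front stylesheet shown further down, and the sketch keeps renamed elements in the TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
# Tags of the numbered divisions div1 ... div7, in ElementTree notation
NUMBERED_DIVS = {f"{{{TEI}}}div{level}" for level in range(1, 8)}


def flatten_for_cqp(root):
    """Rename numbered divs to div, and text nested in group to subtext."""
    for element in root.iter():
        if element.tag in NUMBERED_DIVS:
            element.tag = f"{{{TEI}}}div"
    for group in root.iter(f"{{{TEI}}}group"):
        for child in group:
            # Only direct <text> children of <group> are renamed
            if child.tag == f"{{{TEI}}}text":
                child.tag = f"{{{TEI}}}subtext"
    return root


# Hypothetical sample document
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><group><text><body>'
    '<div1 n="1"><div2 n="2"><p>a</p></div2></div1>'
    "</body></text></group></text></TEI>"
)

document = flatten_for_cqp(ET.fromstring(SAMPLE))
print([element.tag.split("}")[1] for element in document.iter()])
```

Only the element names change; attributes such as @n and @type are left on the renamed elements, so the division level can still be recovered if the source files record it there.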

Distribute <milestone> attributes' information on word tokens (when available).
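
A minimal Python sketch of this distribution step: the document is walked in order, the latest chapter and section milestones are remembered, and each word receives a reference built from them. The function name add_refs and the sample are assumptions for this example; the ", c. " and ", s. " separators follow the post-tokenization stylesheet shown further down, which also adds page and paragraph numbers.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
W = f"{{{TEI}}}w"
MILESTONE = f"{{{TEI}}}milestone"


def add_refs(root, text_id):
    """Set a ref attribute on every <w> from the preceding milestones."""
    current = {}  # unit name -> @n of the latest milestone (None if absent)
    for element in root.iter():  # iter() visits elements in document order
        if element.tag == MILESTONE and element.get("unit"):
            current[element.get("unit")] = element.get("n")
        elif element.tag == W:
            ref = text_id
            if current.get("chapter"):
                ref += ", c. " + current["chapter"]
            if current.get("section"):
                ref += ", s. " + current["section"]
            element.set("ref", ref)
    return root


# Hypothetical sample with placeholder words
SAMPLE = (
    '<text xmlns="http://www.tei-c.org/ns/1.0"><p>'
    '<milestone unit="chapter" n="1"/><milestone unit="section" n="1"/>'
    "<w>a</w><w>b</w>"
    '<milestone unit="section" n="2"/><w>c</w>'
    "</p></text>"
)

tree = add_refs(ET.fromstring(SAMPLE), "sample")
for word in tree.iter(W):
    print(word.text, "->", word.get("ref"))
```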

Get the page number when available and put it in an @n attribute on the <pb> element so that TXM can use it to number pages in the HTML edition.
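
The fallback order used for the page number can be sketched in Python as below. The function name page_number is hypothetical; the order (existing @n, then an identifier with a leading "p." removed, then "[s.n.]") follows the tei:pb template of the front stylesheet shown further down.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def page_number(pb) -> str:
    """Return the value to store in @n for a <pb> element."""
    if pb.get("n") is not None:
        return pb.get("n")
    # Accept a plain id (TEI P4) as well as xml:id (TEI P5)
    for key in ("id", XML_ID):
        if pb.get(key) is not None:
            return re.sub(r"^p\.", "", pb.get(key))
    return "[s.n.]"


# Hypothetical <pb> elements covering the four cases
for sample in ('<pb n="12"/>', '<pb id="p.34"/>', '<pb xml:id="p.5"/>', "<pb/>"):
    print(sample, "->", page_number(ET.fromstring(sample)))
```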

Render foreign words (tagged with <foreign> element) and titles (<title> elements content) as italics.

Solution

Make a directory (e.g. “cicero”).

This directory includes :

  • a copy of every XML file of Cicero's Latin texts downloaded from the Perseus DL.
  • a directory named “xsl”, which includes :
    • a directory named “2-front”, which includes :
      • p4top5.xsl
      • txm-front-teiperseus-xtz.xsl
    • a directory named “3-posttok”, which includes :
      • txm-posttok-addRef-perseus.xsl
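
For readers who prefer to script this layout, here is a small Python sketch that builds it. The function name prepare_source_directory and the "downloads" folder are assumptions of the example; the demonstration uses empty placeholder files in a temporary directory, not real Perseus texts.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_source_directory(corpus_dir, xml_files, front_xsl, posttok_xsl):
    """Copy texts and stylesheets into the layout described above."""
    corpus_dir = Path(corpus_dir)
    front_dir = corpus_dir / "xsl" / "2-front"
    posttok_dir = corpus_dir / "xsl" / "3-posttok"
    for directory in (front_dir, posttok_dir):
        directory.mkdir(parents=True, exist_ok=True)
    for sources, destination in (
        (xml_files, corpus_dir),
        (front_xsl, front_dir),
        (posttok_xsl, posttok_dir),
    ):
        for source in sources:
            shutil.copy(source, destination)
    return corpus_dir


# Demonstration with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp) / "downloads"
    downloads.mkdir()
    names = ["sample.xml", "p4top5.xsl", "txm-front-teiperseus-xtz.xsl",
             "txm-posttok-addRef-perseus.xsl"]
    for name in names:
        (downloads / name).write_text("<placeholder/>", encoding="utf-8")
    corpus = prepare_source_directory(
        Path(tmp) / "cicero",
        [downloads / "sample.xml"],
        [downloads / "p4top5.xsl", downloads / "txm-front-teiperseus-xtz.xsl"],
        [downloads / "txm-posttok-addRef-perseus.xsl"],
    )
    for path in sorted(corpus.rglob("*")):
        print(path.relative_to(corpus))
```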

Then run the TXM command File>Import>XML-XTZ + CSV with the following settings :

1. Source directory is “cicero” (in our example).

2. Import parameters :

  • Main Language : la (to use TreeTagger with the Latin parameter file, if TreeTagger has been set up and associated with TXM)
  • Lexical Segmentation : no change - Default settings
  • Editions : Build edition, Words per page = 750, Page break tag = pb
  • Display font : default setting (Font name = <default>)
  • Commands : Concordance context structure limits = text
  • Textual planes :
    • Outside-text = teiHeader,front,back
    • Outside-text to edit = bibl
    • Note elements = note
    • Milestone elements = [nothing, leave blank]
    • Options : default (= remove temporary directories)

3. Click on “Start corpus import” (above - beginning of the page)

Another import can be run with an added metadata.csv file, in order to get more metadata than the fields automatically extracted from the teiHeader (title, first author, first editor).
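
A metadata.csv file can be written by hand or generated. The Python sketch below produces one from a dictionary of values. It rests on assumptions that should be checked against the TXM manual: a first column named id holding each source file name without its extension, a comma as column separator and double quotes as text delimiter. The function name metadata_csv_text, the field names and the values are placeholders.

```python
import csv
import io


def metadata_csv_text(rows, fields):
    """Return the content of a metadata.csv file as a string.

    rows maps a text id to a dict of metadata values;
    fields lists the metadata columns to write after the id column.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["id", *fields])
    for text_id, values in sorted(rows.items()):
        # Missing values are written as empty cells
        writer.writerow([text_id, *(values.get(field, "") for field in fields)])
    return buffer.getvalue()


# Hypothetical ids and values, for illustration only
ROWS = {
    "text_b": {"genre": "letter"},
    "text_a": {"genre": "speech", "period": "early"},
}
print(metadata_csv_text(ROWS, ["genre", "period"]))
```

The returned string would then be saved as metadata.csv (UTF-8) next to the XML files before starting the import.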

Feedback

Some features of the XML-XTZ import have not been implemented yet; in particular, the @rend attribute does not seem to be used to interpret <emph> and <hi> elements. So, in the front XSL (import step #2), we changed some <hi> elements into <emph> in the cases where we wanted italics in the HTML edition.

<note> content loses all its markup. This is a real drawback, as tagged foreign words and italics are very often used in notes.

XSL Perseus stylesheets used for this import

txm-front-teiperseus-xtz.xsl

<?xml version="1.0"?>
<xsl:stylesheet
  xmlns:xd="http://www.pnp-software.com/XSLTdoc"
  xmlns:edate="http://exslt.org/dates-and-times"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:tei="http://www.tei-c.org/ns/1.0"
  exclude-result-prefixes="tei edate xd" version="2.0">
  
  <xd:doc type="stylesheet">
    <xd:short>
      A stylesheet to prepare PERSEUS XML-TEI texts for TXM import.
    </xd:short>
    <xd:detail>
      This stylesheet is free software; you can redistribute it and/or
      modify it under the terms of the GNU Lesser General Public
      License as published by the Free Software Foundation; either
      version 3 of the License, or (at your option) any later version.
      
      This stylesheet is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
      Lesser General Public License for more details.
      
      You should have received a copy of GNU Lesser Public License with
      this stylesheet. If not, see http://www.gnu.org/licenses/lgpl.html
    </xd:detail>
    <xd:author>Alexei Lavrentiev alexei.lavrentev@ens-lyon.fr</xd:author>
    <xd:copyright>2017, CNRS / IHRIM (Groupe CACTUS)</xd:copyright>
  </xd:doc>
  

  <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no"/>
  
  <xsl:template match="node()|@*">
    <!-- Copy the current node -->
    <xsl:copy>
      <!-- Including any attributes it has and any child nodes -->
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  
<!-- This template should be commented out if a metadata file providing the same information is used : -->
  <xsl:template match="/tei:TEI/tei:text">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:attribute name="author"><xsl:value-of select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author[1]"/></xsl:attribute>
      <xsl:attribute name="title"><xsl:value-of select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title[1]"/></xsl:attribute>
      <xsl:attribute name="editor"><xsl:value-of select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:editor[1]"/></xsl:attribute>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

<xsl:template match="tei:group/tei:text">
  <xsl:element name="subtext">
    <xsl:apply-templates select="@*|node()"/>
  </xsl:element>
</xsl:template>
  
  <xsl:template match="tei:pb">
    <xsl:copy>
      <xsl:attribute name="n">
        <xsl:choose>
          <xsl:when test="@n"><xsl:value-of select="@n"/></xsl:when>
          <xsl:when test="@*:id">
            <xsl:value-of select="replace(@*:id,'^p\.','')"/>
          </xsl:when>
          <xsl:otherwise><xsl:text>[s.n.]</xsl:text></xsl:otherwise>
        </xsl:choose>
      </xsl:attribute>
    </xsl:copy>
  </xsl:template>

<xsl:template match="tei:div1|tei:div2|tei:div3|tei:div4|tei:div5|tei:div6|tei:div7">
  <xsl:element name="div" namespace="http://www.tei-c.org/ns/1.0">
    <xsl:apply-templates select="@*|node()"/>
  </xsl:element>
</xsl:template>

<xsl:template match="tei:choice">
  <xsl:apply-templates select="tei:expan|tei:corr|tei:reg"/>
</xsl:template>

<xsl:template match="tei:choice/tei:expan">
  <w xmlns="http://www.tei-c.org/ns/1.0">
    <xsl:attribute name="abbr"><xsl:value-of select="normalize-space(parent::tei:choice/tei:abbr)"/></xsl:attribute>
    <xsl:apply-templates select="@*|node()"/>
  </w>
</xsl:template>
  
  <xsl:template match="tei:choice/tei:corr">
    <xsl:copy>
      <xsl:attribute name="sic"><xsl:value-of select="normalize-space(parent::tei:choice/tei:sic)"/></xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  
  <xsl:template match="tei:choice/tei:reg">
    <xsl:copy>
      <xsl:attribute name="orig"><xsl:value-of select="normalize-space(parent::tei:choice/tei:orig)"/></xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

<!-- Temporary patch for TXM indexing quote elements in notes -->

  <xsl:template match="tei:note//tei:quote">
    <quote-note>
      <xsl:apply-templates select="@*|node()"/>
    </quote-note>
  </xsl:template>

<!-- 
(i) adding an <emph> element in order to point out some elements' content (e.g. foreign, title) in TXM edition ;
(ii) adding a <w> element to prevent tokenisation from analysing some content (e.g. foreign) 
-->

<xsl:template match="tei:foreign[not(ancestor::tei:note)]">
<emph rend="italic" xmlns="http://www.tei-c.org/ns/1.0">
  <xsl:copy>
    <w xmlns="http://www.tei-c.org/ns/1.0">  
    <xsl:apply-templates select="@*|node()"/>
    </w>  
  </xsl:copy>
</emph>
</xsl:template>

<xsl:template match="tei:title">
<emph rend="italic" xmlns="http://www.tei-c.org/ns/1.0">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</emph>
</xsl:template>

<!-- Temporary patch to get the correct rendering for <hi @rend="italic"> content in TXM editions : must use <emph> instead of <hi> -->

<xsl:template match="tei:hi[matches(@rend,'italic')]" priority="1">
  <xsl:element name="emph" namespace="http://www.tei-c.org/ns/1.0">
    <xsl:apply-templates select="@*|node()"/>
  </xsl:element>
</xsl:template>

</xsl:stylesheet>

txm-posttok-addRef-perseus.xsl

<?xml version="1.0"?>
<xsl:stylesheet xmlns:edate="http://exslt.org/dates-and-times"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:tei="http://www.tei-c.org/ns/1.0"
  xmlns:txm="http://textometrie.org/ns/1.0"
  exclude-result-prefixes="tei edate" xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="2.0">

  <!--
This software is dual-licensed:

1. Distributed under a Creative Commons Attribution-ShareAlike 3.0
Unported License http://creativecommons.org/licenses/by-sa/3.0/ 

2. http://www.opensource.org/licenses/BSD-2-Clause
		
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.

This software is provided by the copyright holders and contributors
"as is" and any express or implied warranties, including, but not
limited to, the implied warranties of merchantability and fitness for
a particular purpose are disclaimed. In no event shall the copyright
holder or contributors be liable for any direct, indirect, incidental,
special, exemplary, or consequential damages (including, but not
limited to, procurement of substitute goods or services; loss of use,
data, or profits; or business interruption) however caused and on any
theory of liability, whether in contract, strict liability, or tort
(including negligence or otherwise) arising in any way out of the use
of this software, even if advised of the possibility of such damage.

     
This stylesheet adds a ref attribute to w elements that will be used for
references in TXM concordances. Can be used with TXM XTZ import module.

Written by Alexei Lavrentiev, UMR 5317 IHRIM, 2017
  -->


  <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no"/> 
  
  
  <!-- General patterns: all elements, attributes, comments and processing instructions are copied -->
  
  <xsl:template match="*">      
        <xsl:copy>
          <xsl:apply-templates select="*|@*|processing-instruction()|comment()|text()"/>
        </xsl:copy>    
  </xsl:template>
  
  <xsl:template match="*" mode="position"><xsl:value-of select="count(preceding-sibling::*)"/></xsl:template>

  <xsl:template match="@*|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>
  
  <xsl:variable name="filename">
    <xsl:analyze-string select="document-uri(.)" regex="^(.*)/([^/]+)\.xml$">
      <xsl:matching-substring>
        <xsl:value-of select="regex-group(2)"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:variable>
  
  
  <xsl:template match="tei:w">
    <xsl:variable name="ref">
      <xsl:choose>
        <xsl:when test="ancestor::tei:text/@*:id">
          <xsl:value-of select="ancestor::tei:text[1]/@*:id[1]"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$filename"/>
        </xsl:otherwise>
      </xsl:choose>
      <!-- Perseus addition -->
      <xsl:if test="preceding::tei:milestone[@unit='chapter'][1][@n]">
        <xsl:text>, c. </xsl:text>
        <xsl:value-of select="preceding::tei:milestone[@unit='chapter'][1]/@n"/>
      </xsl:if>
      <xsl:if test="preceding::tei:milestone[@unit='section'][1][@n]">
        <xsl:text>, s. </xsl:text>
        <xsl:value-of select="preceding::tei:milestone[@unit='section'][1]/@n"/>
      </xsl:if>
      <!-- end of Perseus addition -->
      
      <xsl:if test="preceding::tei:pb[1]/@n">
        <xsl:text>, p. </xsl:text>
        <xsl:value-of select="preceding::tei:pb[1]/@n"/>
      </xsl:if>
      <xsl:if test="ancestor::tei:p[@n]">
        <xsl:text>, § </xsl:text>
        <xsl:value-of select="ancestor::tei:p/@n"/>
      </xsl:if>
      <!--<xsl:if test="preceding::tei:lb[1]/@n">
        <xsl:text>, l. </xsl:text>
        <xsl:value-of select="preceding::tei:lb[1]/@n"/>
      </xsl:if>-->
    </xsl:variable>
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="ref"><xsl:value-of select="$ref"/></xsl:attribute>
      <xsl:apply-templates select="*|processing-instruction()|comment()|text()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

PLATO corpus : demonstration of Perseus Greek & Treebank texts (AGDT 2) in TXM

Project presentation

  • goal :
    • demonstrating that one can work on texts available from the Perseus project in TXM
    • TEI compliant import
    • compatibility of TXM with the Greek language
    • showing that TXM can work on the POS annotation provided by the Treebank (TreeTagger is not the only way to get tagged texts in TXM).
  • corpus
    • Plato's text Euthyphro from AGDT 2: tlg0059.tlg001.perseus-grc1.tb.xml
  • Available resources (approximate list)
    • txm-filter-perseustreebank-xmlw.xsl

Solution

Make a directory (e.g. “plato”), and put the XML text file(s) downloaded from Perseus AGDT inside it.

Then run the TXM command File>Import>XML/w + CSV with the following settings :

1. Source directory is “plato” (in our example).

2. Import parameters :

  • Main Language : untick “Annotate the corpus” (i.e. do not use TreeTagger)
  • Lexical Segmentation : no change - Default settings
  • Front XSL : indicate the copy of txm-filter-perseustreebank-xmlw.xsl in your file system
  • Editions : default setting (Build edition, Words per page = 500, Page break tag = pb)
  • Display font : default setting (Font name = <default>)
  • Commands : default setting (Concordance context structure limits = text)

3. Click on “Start corpus import” (above - beginning of the page)

Feedback

We made 2 changes in the stylesheet :

  • a correction : rename Perseus @id attribute on <w> words for compatibility with TXM
  • an improvement : add <lb/> elements after each sentence for better rendering in HTML Edition.
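
To make the effect of the stylesheet concrete, here is a minimal Python sketch of the same word-level conversion: each treebank word becomes a <w> whose text is the @form value, its @id is renamed perseus-id, and an <lb/> follows each sentence. The function name treebank_to_xmlw and the sample (with placeholder values, not real AGDT data) are assumptions; the sketch leaves out the annotator and date attributes that the stylesheet also copies.

```python
import xml.etree.ElementTree as ET


def treebank_to_xmlw(treebank_xml: str):
    """Convert a treebank document into a <text> of <sentence>/<w> elements."""
    source = ET.fromstring(treebank_xml)
    text = ET.Element("text", {"type": "treebank"})
    for sentence in source.iter("sentence"):
        out_sentence = ET.SubElement(text, "sentence", dict(sentence.attrib))
        for word in sentence.iter("word"):
            attributes = {}
            for name, value in word.attrib.items():
                if name == "form":
                    continue  # the form becomes the text of <w>
                # Rename id so that it does not clash with TXM word ids
                attributes["perseus-id" if name == "id" else name] = value
            w = ET.SubElement(out_sentence, "w", attributes)
            w.text = word.get("form", "")
        # Line break after each sentence, for the HTML edition
        ET.SubElement(text, "lb")
    return text


# Hypothetical sample with placeholder values
SAMPLE = (
    '<treebank version="1.5"><sentence id="1">'
    '<word id="1" form="alpha" lemma="alpha" postag="n-s---mn-"/>'
    '<word id="2" form="beta" lemma="beta" postag="v3spia---"/>'
    "</sentence></treebank>"
)

print(ET.tostring(treebank_to_xmlw(SAMPLE), encoding="unicode"))
```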

XSL Perseus stylesheet used for this import

txm-filter-perseustreebank-xmlw.xsl

<?xml version="1.0"?>
<xsl:stylesheet
  xmlns:xd="http://www.pnp-software.com/XSLTdoc"
  xmlns:edate="http://exslt.org/dates-and-times"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xmlns:treebank="http://nlp.perseus.tufts.edu/syntax/treebank/1.5"
  exclude-result-prefixes="edate xd xsi treebank" version="2.0">
  
  
  <xd:doc type="stylesheet">
    <xd:short>
      A stylesheet to prepare PERSEUS Treebank XML texts for TXM XML/w import.
    </xd:short>
    <xd:detail>
      This stylesheet is free software; you can redistribute it and/or
      modify it under the terms of the GNU Lesser General Public
      License as published by the Free Software Foundation; either
      version 3 of the License, or (at your option) any later version.
      
      This stylesheet is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
      Lesser General Public License for more details.
      
      You should have received a copy of GNU Lesser Public License with
      this stylesheet. If not, see http://www.gnu.org/licenses/lgpl.html
    </xd:detail>
    <xd:author>Alexei Lavrentiev alexei.lavrentev@ens-lyon.fr</xd:author>
    <xd:copyright>2012, CNRS / ICAR (ICAR3 LinCoBaTO)</xd:copyright>
  </xd:doc>
  

  <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no"/>
  
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|processing-instruction()|comment()|text()"/>	
    </xsl:copy>
  </xsl:template>
  
  <xsl:template match="@*|comment()">
    <xsl:copy/>
  </xsl:template>
  
  <xsl:template match="processing-instruction()"/>
  
  <xsl:template match="text()"><xsl:value-of select="."/></xsl:template>
  
<xsl:template match="treebank">
  <text type="treebank" version="{@version}" date="{normalize-space(child::date[1])}" annotator-short="{normalize-space(child::annotator[1]/short)}" annotator-name="{normalize-space(child::annotator[1]/name)}" annotator-address="{normalize-space(child::annotator[1]/address)}">
    <xsl:apply-templates select="descendant::sentence"/>
  </text>
</xsl:template>

<xsl:template match="annotator"/>
  
<xsl:template match="sentence">
  <xsl:copy>
    <xsl:apply-templates select="@*"/>
    <xsl:attribute name="annotator"><xsl:value-of select="child::annotator"/></xsl:attribute>
    <xsl:apply-templates/>
  </xsl:copy>
  <lb/>
</xsl:template>
  
  <xsl:template match="word">
    <w>
      <xsl:apply-templates select="@*[not(name()='form')]"/>
      <xsl:value-of select="@form"/>
    </w>
  </xsl:template>

<xsl:template match="word/@id">
  <!-- Rename the Perseus @id for compatibility with TXM -->
  <xsl:attribute name="perseus-id"><xsl:value-of select="."/></xsl:attribute>
</xsl:template>
</xsl:stylesheet>

PLAUTELAT & PLAUTEEN TXM demo

Goal

  • Context: University of Leipzig eHumanities Seminar, 2012-12-05
  • the goal was to demo TXM on Plautus' plays in Latin and their English translations from Perseus

Corpus

Corpus of Plautus' plays in Latin and their English translations from Perseus.

Import parameters (updated from XML/w to XTZ):

  • 2-front :
    • txm-filter-teiperseus-xmlw.xsl
    • txm-filter-teip5-xmlw-preserve.xsl
  • lat.par TreeTagger model


public/perseus.txt · Last modified: 2017/05/12 08:33 by benedicte.pincemin@ens-lyon.fr