CICERO corpus: demonstration of Perseus Latin texts in TXM

Project presentation

  • Goals:
    • demonstrate that texts available from the Perseus project can be worked on in TXM
    • TEI-compliant import
    • if possible, nice editions (could be shown through another corpus)
  • Available resources (approximate list):
    • p4top5.xsl
      • TEI P4 to P5 conversion
    • txm-filter-perseus-tei-xtz.xsl
      • management of numbered divs: div1, div2
      • management of nested <text>: when a <group> is present, it contains <subtext> elements instead of <text>
    • teiheader-to-metadata.xsl: gets information from the teiHeader and adds it as attributes on the <text> element.
    • a useful macro, text2metadata: generates a metadata.csv file from the XML-TXM files of a corpus. Must be used before starting the import process.


Conversion from TEI P4 to TEI P5 (Sebastian Rahtz's stylesheet).

Metadata: from <teiHeader><fileDesc><titleStmt>, get:

  • first <title> content,
  • first <author> content,
  • first <editor> content.
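
The extraction above can be sketched in a few lines of Python. This is only an illustration, not the stylesheet used for the import: the function name title_stmt_metadata and the sample header are made up, and the TEI P5 namespace is assumed.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def title_stmt_metadata(xml_text):
    """Return the first title, author and editor of the titleStmt as a dict."""
    root = ET.fromstring(xml_text)
    stmt = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt", TEI_NS)
    metadata = {}
    for field in ("title", "author", "editor"):
        # find() returns the first matching child, or None when there is none
        node = stmt.find("tei:" + field, TEI_NS) if stmt is not None else None
        text = "".join(node.itertext()) if node is not None else ""
        metadata[field] = " ".join(text.split())  # normalize whitespace
    return metadata


# Made-up header, for illustration only
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Sample title</title>
    <title type="sub">Sample subtitle</title>
    <author>Sample Author</author>
    <editor>First Editor</editor>
    <editor>Second Editor</editor>
  </titleStmt></fileDesc></teiHeader>
  <text><body><p>Sample text.</p></body></text>
</TEI>"""

metadata = title_stmt_metadata(SAMPLE)
print(metadata)
```

Only the first <title>, <author> and <editor> are kept, so the subtitle and the second editor of the sample are ignored.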

Handle XML-TEI features that would not work with CQP:

  • div1, div2 → div
  • <text><group><text> → <text><group><textgroupitem> (or another, better tag name)
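
As a rough sketch of these two renamings (in Python rather than XSLT, and not the code used for the import), one could walk the tree and change the tag names in place. The function name simplify_structure and the sample document are made up; the inner element is called subtext, the name used in the front stylesheet listed on this page.

```python
import re
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
NUMBERED_DIV = re.compile(r"div[1-7]")


def simplify_structure(root):
    """Rename numbered divs to <div> and <text> children of <group> to <subtext>."""
    for group in root.iter(TEI + "group"):
        for child in group:
            if child.tag == TEI + "text":
                child.tag = "subtext"  # no namespace, as in the front stylesheet
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith(TEI):
            if NUMBERED_DIV.fullmatch(el.tag[len(TEI):]):
                el.tag = TEI + "div"
    return root


# Made-up document, for illustration only
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><group><text><body>'
    '<div1 type="book" n="1"><div2 type="chapter" n="1"><p>Sample.</p></div2></div1>'
    '</body></text></group></text></TEI>'
)

root = simplify_structure(ET.fromstring(SAMPLE))
tags = [el.tag.replace(TEI, "") for el in root.iter()]
print(tags)
```

Attributes such as @type and @n are left untouched, so the original division level can still be recovered from them.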

Propagate the information carried by <milestone> attributes to the word tokens (when available).
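
A minimal sketch of this propagation, in Python and under simplifying assumptions: the text is already tokenized into <w> elements, the milestones use unit="chapter" and unit="section", and the reference format (", c. " and ", s. ") follows the post-tokenization stylesheet listed on this page. The names add_refs and sample_text are made up.

```python
import xml.etree.ElementTree as ET


def add_refs(root, text_id):
    """Give each <w> a ref attribute built from the latest chapter/section milestones."""
    latest = {}
    for el in root.iter():  # iter() walks the tree in document order
        if not isinstance(el.tag, str):
            continue
        name = el.tag.rsplit("}", 1)[-1]  # local name, with or without namespace
        if name == "milestone":
            latest[el.get("unit")] = el.get("n")
        elif name == "w":
            ref = text_id
            for unit, label in (("chapter", "c. "), ("section", "s. ")):
                if latest.get(unit) is not None:
                    ref += ", " + label + latest[unit]
            el.set("ref", ref)
    return root


# Made-up tokenized fragment, for illustration only
SAMPLE = (
    '<text><body><p>'
    '<milestone unit="chapter" n="1"/><milestone unit="section" n="1"/>'
    '<w>Quo</w><w>usque</w>'
    '<milestone unit="section" n="2"/><w>tandem</w>'
    '</p></body></text>'
)

root = add_refs(ET.fromstring(SAMPLE), "sample_text")
refs = [w.get("ref") for w in root.iter("w")]
print(refs)
```

A single pass in document order is enough here because milestones are empty elements: the latest one seen before a word is the one that applies to it.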

Get the page number when available and put it in an @n attribute on the <pb> element, so that TXM can use it to number pages in the HTML edition.
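
The rule can be sketched as follows in Python. It mirrors what the front stylesheet listed on this page appears to do (keep @n when present, otherwise derive it from an id by stripping a leading "p."); the function name number_pages and the sample are made up.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def number_pages(root):
    """Set @n on every <pb> that lacks it, using its id without the 'p.' prefix."""
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.rsplit("}", 1)[-1] == "pb":
            if el.get("n") is None:
                page_id = el.get("id") or el.get(XML_ID)
                if page_id is not None:
                    el.set("n", re.sub(r"^p\.", "", page_id))
    return root


# Made-up fragment, for illustration only
SAMPLE = '<body><pb n="12"/><pb id="p.13"/><pb xml:id="p.14"/><pb/></body>'

root = number_pages(ET.fromstring(SAMPLE))
page_numbers = [pb.get("n") for pb in root.iter("pb")]
print(page_numbers)
```

A <pb> with neither @n nor an id is left without a page number in this sketch.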

Render foreign words (tagged with the <foreign> element) and titles (content of <title> elements) in italics.


Create a directory (e.g. “cicero”).

This directory contains:

  • a copy of every XML file for the Latin texts of Cicero downloaded from the Perseus Digital Library.
  • a directory named “xsl”, which contains:
    • a subdirectory named “2-front”, which contains:
      • p4top5.xsl
      • txm-front-teiperseus-xtz.xsl
    • a subdirectory named “3-posttok”, which contains:
      • txm-posttok-addRef-perseus.xsl
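
For those who prefer to script this step, the layout above could be built with a small Python helper such as the following sketch. The function name prepare_source_dir, the downloads folder and sample_text.xml are made up; the demonstration uses empty placeholder files in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_source_dir(corpus_dir, texts, front_xsl, posttok_xsl):
    """Copy the texts and stylesheets into the layout described above."""
    corpus = Path(corpus_dir)
    front_dir = corpus / "xsl" / "2-front"
    posttok_dir = corpus / "xsl" / "3-posttok"
    for directory in (front_dir, posttok_dir):
        directory.mkdir(parents=True, exist_ok=True)
    for path in texts:
        shutil.copy(path, corpus)
    for path in front_xsl:
        shutil.copy(path, front_dir)
    for path in posttok_xsl:
        shutil.copy(path, posttok_dir)
    return corpus


# Demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp) / "downloads"
    downloads.mkdir()
    for name in ("sample_text.xml", "p4top5.xsl",
                 "txm-front-teiperseus-xtz.xsl", "txm-posttok-addRef-perseus.xsl"):
        (downloads / name).touch()
    corpus = prepare_source_dir(
        Path(tmp) / "cicero",
        texts=[downloads / "sample_text.xml"],
        front_xsl=[downloads / "p4top5.xsl", downloads / "txm-front-teiperseus-xtz.xsl"],
        posttok_xsl=[downloads / "txm-posttok-addRef-perseus.xsl"],
    )
    layout = sorted(p.relative_to(corpus).as_posix() for p in corpus.rglob("*"))

print("\n".join(layout))
```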

Then run the TXM command File > Import > XML-XTZ + CSV with the following settings:

1. Source directory is “cicero” (in our example).

2. Import parameters:

  • Main Language: la (to use TreeTagger with the Latin parameter file, if TreeTagger has been set up and associated with TXM)
  • Lexical Segmentation: no change (default settings)
  • Editions: Build edition, Words per page = 750, Page break tag = pb
  • Display font: default setting (Font name = <default>)
  • Commands: Concordance context structure limits = text
  • Textual planes:
    • Outside-text = teiHeader,front,back
    • Outside-text to edit = bibl
    • Note elements = note
    • Milestone elements = [nothing, leave blank]
    • Options: default (= remove temporary directories)

3. Click on “Start corpus import” (above - beginning of the page)

A second import can be run with a metadata.csv file added, in order to get more metadata than the three fields automatically extracted from the teiHeader (title, first author, first editor).
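
A metadata.csv file can be written by hand or generated. The Python sketch below shows one way to generate it. It assumes, and this should be checked against the TXM documentation for your version, that the first column is named id and holds each source file name without its extension, and that fields are comma-separated and double-quoted. The extra columns (genre, date) and their values are made up.

```python
import csv
import io


def metadata_csv(rows, extra_columns):
    """Build the text of a metadata.csv file: one line per text, 'id' column first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["id"] + list(extra_columns))
    for text_id, values in rows.items():
        writer.writerow([text_id] + [values.get(col, "") for col in extra_columns])
    return buffer.getvalue()


# Made-up identifiers and values, for illustration only
rows = {
    "sample_text_1": {"genre": "speech", "date": "unknown"},
    "sample_text_2": {"genre": "letter"},
}

csv_text = metadata_csv(rows, ["genre", "date"])
print(csv_text)
```

The resulting text would then be saved as metadata.csv, in UTF-8, before running the import again.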


Some features of the XML-XTZ import have not been implemented yet. In particular, the @rend attribute does not seem to be used to interpret <emph> and <hi> elements. So, in the front XSL (import step #2), we changed some <hi> elements into <emph> in the cases where we wanted italics in the HTML edition.
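
The <hi> to <emph> workaround amounts to a simple renaming, sketched here in Python for illustration (the actual patch is the last template of the front stylesheet listed on this page). The function name italic_hi_to_emph and the sample are made up.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def italic_hi_to_emph(root):
    """Rename <hi> elements whose @rend mentions 'italic' to <emph>; return the count."""
    renamed = 0
    for el in root.iter(TEI + "hi"):
        if "italic" in el.get("rend", ""):
            el.tag = TEI + "emph"
            renamed += 1
    return renamed


# Made-up paragraph, for illustration only
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    '<hi rend="italic">a</hi> <hi rend="bold">b</hi> <hi>c</hi></p>'
)

root = ET.fromstring(SAMPLE)
renamed = italic_hi_to_emph(root)
tags = [child.tag.replace(TEI, "") for child in root]
print(renamed, tags)
```

The @rend attribute is kept on the renamed element, so no information is lost.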

<note> content loses all its markup. This is a real drawback, as tagged foreign words and italics are very often used in notes.

>>> Back to TXM Perseus Projects main page

XSL Perseus stylesheets used for this import


<?xml version="1.0"?>
<!-- Assumption: the xd prefix is bound to the XSLTdoc namespace. -->
<xsl:stylesheet xmlns:edate="http://exslt.org/dates-and-times"
  xmlns:xd="http://www.pnp-software.com/XSLTdoc"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:tei="http://www.tei-c.org/ns/1.0"
  exclude-result-prefixes="tei edate xd" version="2.0">
  <xd:doc type="stylesheet">
      A stylesheet to prepare PERSEUS XML-TEI texts for TXM import.
      This stylesheet is free software; you can redistribute it and/or
      modify it under the terms of the GNU Lesser General Public
      License as published by the Free Software Foundation; either
      version 3 of the License, or (at your option) any later version.
      This stylesheet is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
      Lesser General Public License for more details.
      You should have received a copy of GNU Lesser Public License with
      this stylesheet. If not, see http://www.gnu.org/licenses/lgpl.html
    <xd:author>Alexei Lavrentiev alexei.lavrentev@ens-lyon.fr</xd:author>
    <xd:copyright>2017, CNRS / IHRIM (Groupe CACTUS)</xd:copyright>
  </xd:doc>
  <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no"/>
  <xsl:template match="node()|@*">
    <!-- Copy the current node -->
    <xsl:copy>
      <!-- Including any attributes it has and any child nodes -->
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- This template should be commented out if a metadata file with the same information is used: -->
  <xsl:template match="/tei:TEI/tei:text">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:attribute name="author"><xsl:value-of select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author[1]"/></xsl:attribute>
      <xsl:attribute name="title"><xsl:value-of select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title[1]"/></xsl:attribute>
      <xsl:attribute name="editor"><xsl:value-of select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:editor[1]"/></xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="tei:group/tei:text">
    <xsl:element name="subtext">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
  <xsl:template match="tei:pb">
    <xsl:copy>
      <xsl:attribute name="n">
        <xsl:choose>
          <xsl:when test="@n"><xsl:value-of select="@n"/></xsl:when>
          <xsl:when test="@*:id">
            <xsl:value-of select="replace(@*:id,'^p\.','')"/>
          </xsl:when>
        </xsl:choose>
      </xsl:attribute>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="tei:div1|tei:div2|tei:div3|tei:div4|tei:div5|tei:div6|tei:div7">
    <xsl:element name="div" namespace="http://www.tei-c.org/ns/1.0">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
  <xsl:template match="tei:choice">
    <xsl:apply-templates select="tei:expan|tei:corr|tei:reg"/>
  </xsl:template>
  <xsl:template match="tei:choice/tei:expan">
    <w xmlns="http://www.tei-c.org/ns/1.0">
      <xsl:attribute name="abbr"><xsl:value-of select="normalize-space(parent::tei:choice/tei:abbr)"/></xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </w>
  </xsl:template>
  <xsl:template match="tei:choice/tei:corr">
    <w xmlns="http://www.tei-c.org/ns/1.0">
      <xsl:attribute name="sic"><xsl:value-of select="normalize-space(parent::tei:choice/tei:sic)"/></xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </w>
  </xsl:template>
  <xsl:template match="tei:choice/tei:reg">
    <w xmlns="http://www.tei-c.org/ns/1.0">
      <xsl:attribute name="orig"><xsl:value-of select="normalize-space(parent::tei:choice/tei:orig)"/></xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </w>
  </xsl:template>
  <!-- Temporary patch for TXM indexing quote elements in notes -->
  <xsl:template match="tei:note//tei:quote">
    <!-- Assumption: the replacement wrapper is not recoverable from this page, so the content is passed through unwrapped -->
    <xsl:apply-templates select="node()"/>
  </xsl:template>
  <!-- (i) adding an <emph> element in order to point out some elements' content (e.g. foreign, title) in TXM edition ;
       (ii) adding a <w> element to prevent tokenisation from analysing some content (e.g. foreign) -->
  <xsl:template match="tei:foreign[not(ancestor::tei:note)]">
    <emph rend="italic" xmlns="http://www.tei-c.org/ns/1.0">
      <w xmlns="http://www.tei-c.org/ns/1.0">
        <xsl:apply-templates select="@*|node()"/>
      </w>
    </emph>
  </xsl:template>
  <xsl:template match="tei:title">
    <emph rend="italic" xmlns="http://www.tei-c.org/ns/1.0">
      <xsl:apply-templates select="@*|node()"/>
    </emph>
  </xsl:template>
  <!-- Temporary patch to get the correct rendering for <hi @rend="italic"> content in TXM editions : must use <emph> instead of <hi> -->
  <xsl:template match="tei:hi[matches(@rend,'italic')]" priority="1">
    <xsl:element name="emph" namespace="http://www.tei-c.org/ns/1.0">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>


<?xml version="1.0"?>
<xsl:stylesheet xmlns:edate="http://exslt.org/dates-and-times"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:tei="http://www.tei-c.org/ns/1.0"
  exclude-result-prefixes="tei edate" xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="2.0">
  <!--
This software is dual-licensed:
1. Distributed under a Creative Commons Attribution-ShareAlike 3.0
Unported License http://creativecommons.org/licenses/by-sa/3.0/
2. http://www.opensource.org/licenses/BSD-2-Clause
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
This software is provided by the copyright holders and contributors
"as is" and any express or implied warranties, including, but not
limited to, the implied warranties of merchantability and fitness for
a particular purpose are disclaimed. In no event shall the copyright
holder or contributors be liable for any direct, indirect, incidental,
special, exemplary, or consequential damages (including, but not
limited to, procurement of substitute goods or services; loss of use,
data, or profits; or business interruption) however caused and on any
theory of liability, whether in contract, strict liability, or tort
(including negligence or otherwise) arising in any way out of the use
of this software, even if advised of the possibility of such damage.
This stylesheet adds a ref attribute to w elements that will be used for
references in TXM concordances. Can be used with TXM XTZ import module.
Written by Alexei Lavrentiev, UMR 5317 IHRIM, 2017
  -->
  <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no"/>
  <!-- General patterns: all elements, attributes, comments and processing instructions are copied -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|processing-instruction()|comment()|text()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="*" mode="position"><xsl:value-of select="count(preceding-sibling::*)"/></xsl:template>
  <xsl:template match="@*|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>
  <xsl:variable name="filename">
    <xsl:analyze-string select="document-uri(.)" regex="^(.*)/([^/]+)\.xml$">
      <xsl:matching-substring>
        <xsl:value-of select="regex-group(2)"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:variable>
  <xsl:template match="tei:w">
    <xsl:variable name="ref">
      <xsl:choose>
        <xsl:when test="ancestor::tei:text/@*:id">
          <xsl:value-of select="ancestor::tei:text[1]/@*:id[1]"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$filename"/>
        </xsl:otherwise>
      </xsl:choose>
      <!-- Perseus addition -->
      <xsl:if test="preceding::tei:milestone[@unit='chapter'][1][@n]">
        <xsl:text>, c. </xsl:text>
        <xsl:value-of select="preceding::tei:milestone[@unit='chapter'][1]/@n"/>
      </xsl:if>
      <xsl:if test="preceding::tei:milestone[@unit='section'][1][@n]">
        <xsl:text>, s. </xsl:text>
        <xsl:value-of select="preceding::tei:milestone[@unit='section'][1]/@n"/>
      </xsl:if>
      <!-- end of Perseus addition -->
      <xsl:if test="preceding::tei:pb[1]/@n">
        <xsl:text>, p. </xsl:text>
        <xsl:value-of select="preceding::tei:pb[1]/@n"/>
      </xsl:if>
      <xsl:if test="ancestor::tei:p[@n]">
        <xsl:text>, § </xsl:text>
        <xsl:value-of select="ancestor::tei:p/@n"/>
      </xsl:if>
      <!--<xsl:if test="preceding::tei:lb[1]/@n">
        <xsl:text>, l. </xsl:text>
        <xsl:value-of select="preceding::tei:lb[1]/@n"/>
      </xsl:if>-->
    </xsl:variable>
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="ref"><xsl:value-of select="$ref"/></xsl:attribute>
      <xsl:apply-templates select="*|processing-instruction()|comment()|text()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>


public/perseus_201705_cicero.txt · Last modified: 2017/12/01 17:54 by benedicte.pincemin@ens-lyon.fr