Outils pour utilisateurs

Outils du site


public:perseus_201705_cicero

Différences

Ci-dessous, les différences entre deux révisions de la page.

Lien vers cette vue comparative

public:perseus_201705_cicero [2017/12/01 17:54] (Version actuelle)
benedicte.pincemin@ens-lyon.fr créée
Ligne 1: Ligne 1:
 +====== CICERO corpus : demontration of Perseus Latin texts in TXM ======
  
 +**[[public:​perseus|>>>​ Back to TXM Perseus Projects main page]]**
 +
 +===== Project presentation =====
 +
 +  * context : Heidelberg, May 2017 : [[http://​www.altphil.uni-freiburg.de/​texte-messen/​digital-classics-iii-2013-re-thinking-text-analysis]]
 +
 +  * goal :
 +    * demonstrating that one can work on texts available from Perseus project in TXM
 +    * TEI compliant import
 +    * if possible, nice editions (could be shown through another corpus)
 +
 +  * corpus
 +    * Cicero'​s texts, latin edition : a copy is here : [[https://​sharedocs.huma-num.fr/#/​948/​3789/​Projets/​Textom%C3%A9trie/​Corpus/​src/​perseus/​Cicero/​170502latin]]
 +      * we get all files ending with _lat, except cic.pet_lat.xml because it's a text from Q. Tullius Cicero instead of M. Tullius Cicero.
 +
 +  * Available ressources (approximate list)
 +    * p4top5.xsl
 +      * TEI P4 to P5 conversion
 +    * txm-filter-perseus-tei-xtz.xsl
 +      * management of numbered div: div1, div2
 +      * management of nested <​text>:​ when <​group>​ then includes <​subtext>​ instead of <​text>​
 +    * teiheader-to-metadata.xsl:​ gets information from teiHeader and adds them as attribute to <​text>​ element.
 +    * a useful macro : text2metadata:​ generates a metadata.csv from the XML-TXM files of a corpus. Must be used before starting import process.
 +
 +===== Specifications =====
 +
 +Conversion from TEI P4 to TEI P5 (Sebastian Ratz stylesheet).
 +
 +Metadata : from <​teiHeader><​fileDesc><​titleStmt>,​ get
 +  * first <​title>​ content,
 +  * first <​author>​ content,
 +  * first <​editor>​ content.
 +
 +Manage XML-TEI features which wouldn'​t work with CQP :
 +  * div1, div2 -> div
 +  * <​text><​group><​text>​ -> <​text><​group><​textgroupitem>​ (or other better tag name)
 +
 +Distribute <​milestone>​ attributes'​ information on word tokens (when available).
 +
 +Get page number when available, put it as an @n attibute on <pb> element so that TXM can use it to number pages in HTML Edition.
 +
 +Render foreign words (tagged with <​foreign>​ element) and titles (<​title>​ elements content) as italics.
 +
 +===== Solution =====
 +
 +Make a directory (e.g. "​cicero"​).
 +
 +This directory includes :
 +  * a copy of every XML file for latin texts of Cicero downloaded from Perseus DL.
 +  * a directory named "​xsl",​ which includes :
 +    * a subdirectory named "​2-front",​ which includes :
 +      * p4top5.xsl
 +      * txm-front-teiperseus-xtz.xsl
 +    * a subdirectory named "​3-posttok",​ which includes :
 +      * txm-posttok-addRef-perseus.xsl
 +
 +Then run the TXM command File>​Import>​XML-XTZ + CSV with the following settings :
 +
 +1. Source directory is "​cicero"​ (in our example).
 +
 +2. Import parameters :
 +  * Main Language : la (to use Treetagger with Latin parameter if TreeTagger has been setup and associated with TXM)
 +  * Lexical Segmentation : no change - Default settings
 +  * Editions : Build edition, Words per page = 750, Page break tag = pb
 +  * Display font : default setting (Font name = <​default>​)
 +  * Commands : Concordance context structure limits = text
 +  * Textual planes :
 +    * Outside-text = teiHeader,​front,​back
 +    * Outside-text to edit = bibl
 +    * Note elements = note
 +    * Milestone elements = [nothing, leave blank]
 +    * Options : default (= remove temporary directories)
 +
 +3. Click on "Start corpus import"​ (above - beginning of the page)
 +
 +
 +Another import can be done, adding a metadata.csv file in order to get more metadata than only the ones automatically extracted from teiHeader (title, first author, first editor).
 +
 +===== Feedback =====
 +
 +Some features of XML-XTZ import have not been implemented yet, especially @rend attribute seems is not used to interpret <​emph>​ and <hi> elements. So, through the front XSL (import step #2), we have changed some <hi> into <​emph>​ for cases for which we wanted italics in HTML edition.
 +
 +<​note>​ content looses all its markup, this is really a drawback as tagged foreign words and italics are very often use in notes.
 +
 +**[[public:​perseus|>>>​ Back to TXM Perseus Projects main page]]**
 +
 +===== XSL Perseus stylesheets used for this import =====
 +
 +==== txm-front-teiperseus-xtz.xsl ====
 +
 +<code XML>
 +<?xml version="​1.0"?>​
 +<​xsl:​stylesheet
 +  xmlns:​xd="​http://​www.pnp-software.com/​XSLTdoc"​
 +  xmlns:​edate="​http://​exslt.org/​dates-and-times"​
 +  xmlns:​xsl="​http://​www.w3.org/​1999/​XSL/​Transform"​ xmlns:​tei="​http://​www.tei-c.org/​ns/​1.0"​
 +  exclude-result-prefixes="​tei edate xd" version="​2.0">​
 +  ​
 +  <xd:doc type="​stylesheet">​
 +    <​xd:​short>​
 +      A stylesheet to prepare PERSEUS XML-TEI texts to TXM import.
 +    </​xd:​short>​
 +    <​xd:​detail>​
 +      This stylesheet is free software; you can redistribute it and/or
 +      modify it under the terms of the GNU Lesser General Public
 +      License as published by the Free Software Foundation; either
 +      version 3 of the License, or (at your option) any later version.
 +      ​
 +      This stylesheet is distributed in the hope that it will be useful,
 +      but WITHOUT ANY WARRANTY; without even the implied warranty of
 +      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. ​ See the GNU
 +      Lesser General Public License for more details.
 +      ​
 +      You should have received a copy of GNU Lesser Public License with
 +      this stylesheet. If not, see http://​www.gnu.org/​licenses/​lgpl.html
 +    </​xd:​detail>​
 +    <​xd:​author>​Alexei Lavrentiev alexei.lavrentev@ens-lyon.fr</​xd:​author>​
 +    <​xd:​copyright>​2017,​ CNRS / IHRIM (Groupe CACTUS)</​xd:​copyright>​
 +  </​xd:​doc>​
 +  ​
 +
 +  <​xsl:​output method="​xml"​ encoding="​utf-8"​ omit-xml-declaration="​no"/>​
 +  ​
 +  <​xsl:​template match="​node()|@*">​
 +    <!-- Copy the current node -->
 +    <​xsl:​copy>​
 +      <!-- Including any attributes it has and any child nodes -->
 +      <​xsl:​apply-templates select="​@*|node()"/>​
 +    </​xsl:​copy>​
 +  </​xsl:​template>​
 +  ​
 +<!-- This template had better be commented if one uses a metadata file with the same information : -->
 +  <​xsl:​template match="/​tei:​TEI/​tei:​text">​
 +    <​xsl:​copy>​
 +      <​xsl:​copy-of select="​@*"/>​
 +      <​xsl:​attribute name="​author"><​xsl:​value-of select="//​tei:​teiHeader/​tei:​fileDesc/​tei:​titleStmt/​tei:​author[1]"/></​xsl:​attribute>​
 +      <​xsl:​attribute name="​title"><​xsl:​value-of select="//​tei:​teiHeader/​tei:​fileDesc/​tei:​titleStmt/​tei:​title[1]"/></​xsl:​attribute>​
 +      <​xsl:​attribute name="​editor"><​xsl:​value-of select="//​tei:​teiHeader/​tei:​fileDesc/​tei:​titleStmt/​tei:​editor[1]"/></​xsl:​attribute>​
 +      <​xsl:​apply-templates/>​
 +    </​xsl:​copy>​
 +  </​xsl:​template>​
 +
 +<​xsl:​template match="​tei:​group/​tei:​text">​
 +  <​xsl:​element name="​subtext">​
 +    <​xsl:​apply-templates select="​@*|node()"/>​
 +  </​xsl:​element>​
 +</​xsl:​template>​
 +  ​
 +  <​xsl:​template match="​tei:​pb">​
 +    <​xsl:​copy>​
 +      <​xsl:​attribute name="​n">​
 +        <​xsl:​choose>​
 +          <​xsl:​when test="​@n"><​xsl:​value-of select="​@n"/></​xsl:​when>​
 +          <​xsl:​when test="​@*:​id">​
 +            <​xsl:​value-of select="​replace(@*:​id,'​^p\.',''​)"/>​
 +          </​xsl:​when>​
 +          <​xsl:​otherwise><​xsl:​text>​[s.n.]</​xsl:​text></​xsl:​otherwise>​
 +        </​xsl:​choose>​
 +      </​xsl:​attribute>​
 +    </​xsl:​copy>​
 +  </​xsl:​template>​
 +
 +<​xsl:​template match="​tei:​div1|tei:​div2|tei:​div3|tei:​div4|tei:​div5|tei:​div6|tei:​div7">​
 +  <​xsl:​element name="​div"​ namespace="​http://​www.tei-c.org/​ns/​1.0">​
 +    <​xsl:​apply-templates select="​@*|node()"/>​
 +  </​xsl:​element>​
 +</​xsl:​template>​
 +
 +<​xsl:​template match="​tei:​choice">​
 +  <​xsl:​apply-templates select="​tei:​expan|tei:​corr|tei:​reg"/>​
 +</​xsl:​template>​
 +
 +<​xsl:​template match="​tei:​choice/​tei:​expan">​
 +  <w xmlns="​http://​www.tei-c.org/​ns/​1.0">​
 +    <​xsl:​attribute name="​abbr"><​xsl:​value-of select="​normalize-space(parent::​tei:​choice/​tei:​abbr)"/></​xsl:​attribute>​
 +    <​xsl:​apply-templates select="​@*|node()"/>​
 +  </w>
 +</​xsl:​template>​
 +  ​
 +  <​xsl:​template match="​tei:​choice/​tei:​corr">​
 +    <​xsl:​copy>​
 +      <​xsl:​attribute name="​sic"><​xsl:​value-of select="​normalize-space(parent::​tei:​choice/​tei:​sic)"/></​xsl:​attribute>​
 +      <​xsl:​apply-templates select="​@*|node()"/>​
 +    </​xsl:​copy>​
 +  </​xsl:​template>​
 +  ​
 +  <​xsl:​template match="​tei:​choice/​tei:​reg">​
 +    <​xsl:​copy>​
 +      <​xsl:​attribute name="​orig"><​xsl:​value-of select="​normalize-space(parent::​tei:​choice/​tei:​orig)"/></​xsl:​attribute>​
 +      <​xsl:​apply-templates select="​@*|node()"/>​
 +    </​xsl:​copy>​
 +  </​xsl:​template>​
 +
 +<!-- Temporary patch for TXM indexing quote elements in notes -->
 +
 +  <​xsl:​template match="​tei:​note//​tei:​quote">​
 +    <​quote-note>​
 +      <​xsl:​apply-templates select="​@*|node()"/>​
 +    </​quote-note>​
 +  </​xsl:​template>​
 +
 +<​!-- ​
 +(i) adding an <​emph>​ element in order to point out some elements'​ content (e.g. foreign, title) in TXM edition ;
 +(ii) adding a <w> element to prevent tokenisation from analysing some content (e.g. foreign) ​
 +-->
 +
 +<​xsl:​template match="​tei:​foreign[not(ancestor::​tei:​note)]">​
 +<emph rend="​italic"​ xmlns="​http://​www.tei-c.org/​ns/​1.0">​
 +  <​xsl:​copy>​
 +    <w xmlns="​http://​www.tei-c.org/​ns/​1.0">  ​
 +    <​xsl:​apply-templates select="​@*|node()"/>​
 +    </​w>  ​
 +  </​xsl:​copy>​
 +</​emph>​
 +</​xsl:​template>​
 +
 +<​xsl:​template match="​tei:​title">​
 +<emph rend="​italic"​ xmlns="​http://​www.tei-c.org/​ns/​1.0">​
 +  <​xsl:​copy>​
 +    <​xsl:​apply-templates select="​@*|node()"/>​
 +  </​xsl:​copy>​
 +</​emph>​
 +</​xsl:​template>​
 +
 +<!-- Temporary patch to get the correct rendering for <hi @rend="​italic">​ content in TXM editions : must use <​emph>​ instead of <hi> -->
 +
 +<​xsl:​template match="​tei:​hi[matches(@rend,'​italic'​)]"​ priority="​1">​
 +  <​xsl:​element name="​emph"​ namespace="​http://​www.tei-c.org/​ns/​1.0">​
 +    <​xsl:​apply-templates select="​@*|node()"/>​
 +  </​xsl:​element>​
 +</​xsl:​template>​
 +
 +</​xsl:​stylesheet>​
 +</​code>​
 +
 +==== txm-posttok-addRef-perseus.xsl ====
 +
 +<code XML>
 +<?xml version="​1.0"?>​
 +<​xsl:​stylesheet xmlns:​edate="​http://​exslt.org/​dates-and-times"​
 +  xmlns:​xsl="​http://​www.w3.org/​1999/​XSL/​Transform"​ xmlns:​tei="​http://​www.tei-c.org/​ns/​1.0"​
 +  xmlns:​txm="​http://​textometrie.org/​ns/​1.0"​
 +  exclude-result-prefixes="​tei edate" xpath-default-namespace="​http://​www.tei-c.org/​ns/​1.0"​ version="​2.0">​
 +
 +  <!--
 +This software is dual-licensed:​
 +
 +1. Distributed under a Creative Commons Attribution-ShareAlike 3.0
 +Unported License http://​creativecommons.org/​licenses/​by-sa/​3.0/ ​
 +
 +2. http://​www.opensource.org/​licenses/​BSD-2-Clause
 +
 +All rights reserved.
 +
 +Redistribution and use in source and binary forms, with or without
 +modification,​ are permitted provided that the following conditions are
 +met:
 +
 +* Redistributions of source code must retain the above copyright
 +notice, this list of conditions and the following disclaimer.
 +
 +* Redistributions in binary form must reproduce the above copyright
 +notice, this list of conditions and the following disclaimer in the
 +documentation and/or other materials provided with the distribution.
 +
 +This software is provided by the copyright holders and contributors
 +"as is" and any express or implied warranties, including, but not
 +limited to, the implied warranties of merchantability and fitness for
 +a particular purpose are disclaimed. In no event shall the copyright
 +holder or contributors be liable for any direct, indirect, incidental,
 +special, exemplary, or consequential damages (including, but not
 +limited to, procurement of substitute goods or services; loss of use,
 +data, or profits; or business interruption) however caused and on any
 +theory of liability, whether in contract, strict liability, or tort
 +(including negligence or otherwise) arising in any way out of the use
 +of this software, even if advised of the possibility of such damage.
 +
 +     
 +This stylesheet adds a ref attribute to w elements that will be used for
 +references in TXM concordances. Can be used with TXM XTZ import module.
 +
 +Written by Alexei Lavrentiev, UMR 5317 IHRIM, 2017
 +  -->
 +
 +
 +  <​xsl:​output method="​xml"​ encoding="​utf-8"​ omit-xml-declaration="​no"/> ​
 +  ​
 +  ​
 +  <!-- General patterns: all elements, attributes, comments and processing instructions are copied -->
 +  ​
 +  <​xsl:​template match="​*"> ​     ​
 +        <​xsl:​copy>​
 +          <​xsl:​apply-templates select="​*|@*|processing-instruction()|comment()|text()"/>​
 +        </​xsl:​copy> ​   ​
 +  </​xsl:​template>​
 +  ​
 +  <​xsl:​template match="​*"​ mode="​position"><​xsl:​value-of select="​count(preceding-sibling::​*)"/></​xsl:​template>​
 +
 +  <​xsl:​template match="​@*|comment()|processing-instruction()">​
 +    <​xsl:​copy/>​
 +  </​xsl:​template>​
 +  ​
 +  <​xsl:​variable name="​filename">​
 +    <​xsl:​analyze-string select="​document-uri(.)"​ regex="​^(.*)/​([^/​]+)\.xml$">​
 +      <​xsl:​matching-substring>​
 +        <​xsl:​value-of select="​regex-group(2)"/>​
 +      </​xsl:​matching-substring>​
 +    </​xsl:​analyze-string>​
 +  </​xsl:​variable>​
 +  ​
 +  ​
 +  <​xsl:​template match="​tei:​w">​
 +    <​xsl:​variable name="​ref">​
 +      <​xsl:​choose>​
 +        <​xsl:​when test="​ancestor::​tei:​text/​@*:​id">​
 +          <​xsl:​value-of select="​ancestor::​tei:​text[1]/​@*:​id[1]"/>​
 +        </​xsl:​when>​
 +        <​xsl:​otherwise>​
 +          <​xsl:​value-of select="​$filename"/>​
 +        </​xsl:​otherwise>​
 +      </​xsl:​choose>​
 +      <!-- ajout Perseus -->
 +      <xsl:if test="​preceding::​tei:​milestone[@unit='​chapter'​][1][@n]">​
 +        <​xsl:​text>,​ c. </​xsl:​text>​
 +        <​xsl:​value-of select="​preceding::​tei:​milestone[@unit='​chapter'​][1]/​@n"/>​
 +      </​xsl:​if>​
 +      <xsl:if test="​preceding::​tei:​milestone[@unit='​section'​][1][@n]">​
 +        <​xsl:​text>,​ s. </​xsl:​text>​
 +        <​xsl:​value-of select="​preceding::​tei:​milestone[@unit='​section'​][1]/​@n"/>​
 +      </​xsl:​if>​
 +      <!-- fin ajout Perseus -->
 +      ​
 +      <xsl:if test="​preceding::​tei:​pb[1]/​@n">​
 +        <​xsl:​text>,​ p. </​xsl:​text>​
 +        <​xsl:​value-of select="​preceding::​tei:​pb[1]/​@n"/>​
 +      </​xsl:​if>​
 +      <xsl:if test="​ancestor::​tei:​p[@n]">​
 +        <​xsl:​text>,​ § </​xsl:​text>​
 +        <​xsl:​value-of select="​ancestor::​tei:​p/​@n"/>​
 +      </​xsl:​if>​
 +      <​!--<​xsl:​if test="​preceding::​tei:​lb[1]/​@n">​
 +        <​xsl:​text>,​ l. </​xsl:​text>​
 +        <​xsl:​value-of select="​preceding::​tei:​lb[1]/​@n"/>​
 +      </​xsl:​if>​-->​
 +    </​xsl:​variable>​
 +    <​xsl:​copy>​
 +      <​xsl:​apply-templates select="​@*"/>​
 +      <​xsl:​attribute name="​ref"><​xsl:​value-of select="​$ref"/></​xsl:​attribute>​
 +      <​xsl:​apply-templates select="​*|processing-instruction()|comment()|text()"/>​
 +    </​xsl:​copy>​
 +  </​xsl:​template>​
 +
 +</​xsl:​stylesheet>​
 +</​code>​
 +
 +**[[public:​perseus|>>>​ Back to TXM Perseus Projects main page]]**
public/perseus_201705_cicero.txt · Dernière modification: 2017/12/01 17:54 par benedicte.pincemin@ens-lyon.fr