public:perseus

Differences

The differences between two revisions of the page are shown below.

public:perseus [2017/05/02 18:32]
benedicte.pincemin@ens-lyon.fr
public:perseus [2017/12/01 17:53]
benedicte.pincemin@ens-lyon.fr
Line 1: Line 1:
-This page tracks projects that use TXM to analyse text corpora taken from Perseus:
 +This page is dedicated to projects using TXM on texts taken from the Perseus Digital Library:
   * [[http://www.perseus.tufts.edu/hopper|Perseus Digital Library]]
-    * XML version on Github
 +    * XML edition (Github)
   * [[https://perseusdl.github.io/treebank_data|The Ancient Greek and Latin Dependency Treebank]] (Github)
  
-Several useful XSL stylesheets are already available for each of these sources:
 +Please note that this is a public page.
-  * txm-filter-perseus-tei-xtz.xsl
-    * P4 to P5 conversion
-    * div1, div2
-    * group -> subtext
-      * teiheader-to-metadata.xsl (adds attributes to text from teiHeader data: first author, first title, first editor)
-  * stylesheet for references
-  * txm-filter-perseustreebank-xmlw.xsl
  
-There are useful macros:
 +Anyone subscribed to the txm-users mailing list can edit this page.
-  * text2metadata (to be checked): produces a metadata.csv from the XML-TXM files of a corpus (ask Matthieu)
  
-They will be available on SF.
 +====== Projects ======
  
-Since this page is currently public, please take care to maintain a reasonable level of anonymity. (Please contact us if this page needs to become more confidential.)
 +  * [[public:perseus_201707_plato|July 2017, 29 Greek texts from Plato.]] Context: paper submitted to [[https://chs.harvard.edu/CHS/article/display/1167?menuId=66|Classics@]].
 +  * [[public:perseus_201705_cicero|May 2017, 29 Latin texts from Cicero.]] Context: conference [[http://www.altphil.uni-freiburg.de/texte-messen/digital-classics-iii-2013-re-thinking-text-analysis|Digital Classics III – Re-thinking Text Analysis]], concluding conference on the project //Der digital turn in den Altertumswissenschaften: Wahrnehmung - Dokumentation - Reflexion//, Heidelberg, May 11–13, 2017.
 +  * [[public:perseus_agdt_201705_plato|May 2017, 1 Greek annotated text from Plato (AGDT2).]] Context: conference [[http://www.altphil.uni-freiburg.de/texte-messen/digital-classics-iii-2013-re-thinking-text-analysis|Digital Classics III – Re-thinking Text Analysis]], concluding conference on the project //Der digital turn in den Altertumswissenschaften: Wahrnehmung - Dokumentation - Reflexion//, Heidelberg, May 11–13, 2017.
 +  * [[public:perseus_201212_plautus|December 2012, 20 Latin plays from Plautus.]] Context: presentation at the [[http://www.dh.uni-leipzig.de/wo/e-humanities-seminar/|University of Leipzig eHumanities Seminar]] on December 5th, 2012.
  
-To edit this page, you only need to be subscribed to the 'txm-users' mailing list.
 +====== CICERO corpus: demonstration of Perseus Latin texts in TXM ======

-====== Perseus Latin demo corpus project ======
 +**[[public:perseus|>>> Back to TXM Perseus Projects main page]]**

-===== Project description =====
 +===== Project presentation =====
  
   * context: Heidelberg, May 2017: [[http://www.altphil.uni-freiburg.de/texte-messen/digital-classics-iii-2013-re-thinking-text-analysis]]

-  * objective:
 +  * goal:
     * demonstrating that texts available from the Perseus project can be worked on in TXM
     * TEI-compliant import
Line 37: Line 32:
       * we take all files ending with _lat, except cic.pet_lat.xml, because that text is by Q. Tullius Cicero rather than M. Tullius Cicero.
  
-===== Specifications =====
 +  * Available resources (approximate list)
 +    * p4top5.xsl
 +      * TEI P4 to P5 conversion
 +    * txm-filter-perseus-tei-xtz.xsl
 +      * management of numbered divs: div1, div2
 +      * management of nested <text>: inside a <group>, <text> is replaced with <subtext>
 +    * teiheader-to-metadata.xsl: gets information from teiHeader and adds it as attributes on the <text> element.
 +    * a useful macro, text2metadata: generates a metadata.csv from the XML-TXM files of a corpus. Must be used before starting the import process.
  
-bla bla bla.
 +===== Specifications =====
  
-===== Acceptance tests =====
 +Conversion from TEI P4 to TEI P5 (Sebastian Rahtz stylesheet).
  
-Coming soon.
 +Metadata: from <teiHeader><fileDesc><titleStmt>, get
 +  * first <title> content,
 +  * first <author> content,
 +  * first <editor> content.
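The extraction rule above can be illustrated outside XSLT. This is only a minimal sketch using Python's standard library, not the code used for the import: the function names and the sample document (including the editor name) are invented for the example, and unlike the stylesheet the sketch also normalizes whitespace.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def first_text(title_stmt, tag):
    """Return the whitespace-normalized text of the first <tag> child, or ''."""
    node = title_stmt.find(f"tei:{tag}", TEI_NS)
    if node is None:
        return ""
    return " ".join("".join(node.itertext()).split())

def header_metadata(tei_xml):
    """Read first title, author and editor from teiHeader/fileDesc/titleStmt."""
    root = ET.fromstring(tei_xml)
    title_stmt = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt", TEI_NS)
    if title_stmt is None:
        return {"title": "", "author": "", "editor": ""}
    return {tag: first_text(title_stmt, tag) for tag in ("title", "author", "editor")}

# Made-up sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>De Amicitia</title>
    <title type="sub">Machine readable text</title>
    <author>M. Tullius Cicero</author>
    <editor>A. N. Editor</editor>
  </titleStmt></fileDesc></teiHeader>
  <text><body><p>sample</p></body></text>
</TEI>"""

if __name__ == "__main__":
    print(header_metadata(SAMPLE))
```

Only the first occurrence of each element is kept, so a subtitle or a second editor is ignored, as in the specification.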
  
-===== Planning =====
 +Handle XML-TEI features that would not work with CQP:
 +  * div1, div2 -> div
 +  * <text><group><text> -> <text><group><textgroupitem> (or another, better tag name)
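As an illustration of these two renamings, here is a small Python sketch working on an in-memory tree. It is not the stylesheet used for the import: the function name and the sample are hypothetical, the nested element is called subtext as in the resource list above, and keeping it in the TEI namespace is a choice of this sketch.

```python
import re
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NUMBERED_DIV = re.compile(r"^\{%s\}div[1-7]$" % re.escape(TEI))

def flatten_for_cqp(root):
    """Rename div1..div7 to div, and <text> inside <group> to <subtext>, in place."""
    for parent in root.iter():
        for child in parent:
            if NUMBERED_DIV.match(child.tag):
                child.tag = f"{{{TEI}}}div"
            elif parent.tag == f"{{{TEI}}}group" and child.tag == f"{{{TEI}}}text":
                child.tag = f"{{{TEI}}}subtext"
    return root

# Made-up sample with one nested text and two numbered divisions.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><group><text><body>'
    '<div1 n="1"><div2 n="1"><p>a</p></div2></div1>'
    "</body></text></group></text></TEI>"
)

if __name__ == "__main__":
    tree = flatten_for_cqp(ET.fromstring(SAMPLE))
    print([el.tag.split("}")[1] for el in tree.iter()])
```

Attributes such as @n and @type are left untouched, so the division level can still be recovered from them after the renaming.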
  
-==== Step 1 ====
 +Distribute the information from <milestone> attributes onto word tokens (when available).
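One way to picture this step: walk the text in document order, remember the latest milestone seen for each unit, and write that state onto every word. The Python sketch below is an assumption-laden illustration (un-namespaced elements, a hypothetical add_refs function, a toy sample); the ", c. " and ", s. " labels follow the txm-posttok-addRef-perseus.xsl stylesheet reproduced on this page.

```python
import xml.etree.ElementTree as ET

def add_refs(root, text_id):
    """Give each <w> a ref attribute built from the latest chapter/section milestones."""
    current = {}  # unit -> @n of the most recent milestone of that unit
    for el in root.iter():  # iter() visits elements in document order
        if el.tag == "milestone":
            unit = el.get("unit")
            if "n" in el.attrib:
                current[unit] = el.get("n")
            else:
                current.pop(unit, None)  # nearest milestone has no number
        elif el.tag == "w":
            parts = [text_id]
            if "chapter" in current:
                parts.append("c. " + current["chapter"])
            if "section" in current:
                parts.append("s. " + current["section"])
            el.set("ref", ", ".join(parts))
    return root

# Toy sample: one chapter, two sections, three word tokens.
SAMPLE = (
    '<text><body><p><milestone unit="chapter" n="1"/><milestone unit="section" n="1"/>'
    '<w>Quo</w><w>usque</w><milestone unit="section" n="2"/><w>tandem</w></p></body></text>'
)

if __name__ == "__main__":
    for w in add_refs(ET.fromstring(SAMPLE), "sample").iter("w"):
        print(w.text, "->", w.get("ref"))
```

A single pass with a running state avoids looking backwards from every word, which is what the preceding:: axis does in the XSLT version.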
  
-==== Step 2 ====
 +Get the page number when available and put it in an @n attribute on the <pb> element, so that TXM can use it to number pages in the HTML edition.
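The page-number rule can be sketched as a small function. The fallback order (existing @n, then an identifier with a leading "p." removed, then the "[s.n.]" placeholder) follows the pb template of txm-front-teiperseus-xtz.xsl reproduced on this page; the function name and the dict input are assumptions of this sketch.

```python
import re

def page_number(pb_attributes):
    """Return the page label for a <pb>, given its attributes as a dict."""
    if "n" in pb_attributes:
        return pb_attributes["n"]
    if "id" in pb_attributes:
        # Drop a leading "p." from identifiers such as "p.45".
        return re.sub(r"^p\.", "", pb_attributes["id"])
    return "[s.n.]"  # no usable page number in the source

if __name__ == "__main__":
    for attrs in ({"n": "12"}, {"id": "p.45"}, {}):
        print(attrs, "->", page_number(attrs))
```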
  
-etc.
 +Render foreign words (tagged with the <foreign> element) and titles (content of <title> elements) in italics.
  
-----
--> [[:|Back to the project list]].
 +===== Solution =====
  
 +Make a directory (e.g. "cicero").
 +
 +This directory contains:
 +  * a copy of every XML file for the Latin texts of Cicero downloaded from Perseus DL.
 +  * a directory named "xsl", which contains:
 +    * a subdirectory named "2-front", which contains:
 +      * p4top5.xsl
 +      * txm-front-teiperseus-xtz.xsl
 +    * a subdirectory named "3-posttok", which contains:
 +      * txm-posttok-addRef-perseus.xsl
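The layout above can be assembled by hand; for repeated imports a short script may be convenient. The following Python sketch is only an illustration: the function names and the download_dir and xsl_dir parameters are hypothetical, while the directory names, stylesheet names and the exclusion of cic.pet_lat.xml come from this page.

```python
from pathlib import Path
import shutil

# Sub-directory and stylesheet names as listed on this page.
XSL_LAYOUT = {
    "2-front": ("p4top5.xsl", "txm-front-teiperseus-xtz.xsl"),
    "3-posttok": ("txm-posttok-addRef-perseus.xsl",),
}

def select_latin_files(download_dir):
    """Keep files ending in _lat.xml, except cic.pet_lat.xml (Q. Tullius Cicero)."""
    return sorted(
        p for p in Path(download_dir).glob("*_lat.xml")
        if p.name != "cic.pet_lat.xml"
    )

def build_source_dir(target, download_dir, xsl_dir):
    """Copy the XML texts and the stylesheets into the import source directory."""
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    for xml_file in select_latin_files(download_dir):
        shutil.copy(xml_file, target / xml_file.name)
    for step, names in XSL_LAYOUT.items():
        step_dir = target / "xsl" / step
        step_dir.mkdir(parents=True, exist_ok=True)
        for name in names:
            shutil.copy(Path(xsl_dir) / name, step_dir / name)
    return target
```

A call such as build_source_dir("cicero", "downloads", "stylesheets") would then produce the directory to select as the source directory of the import; both input paths are placeholders.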
 +
 +Then run the TXM command File>Import>XML-XTZ + CSV with the following settings:
 +
 +1. Source directory is "cicero" (in our example).
 +
 +2. Import parameters:
 +  * Main Language: la (to use TreeTagger with the Latin parameter file, if TreeTagger has been set up and associated with TXM)
 +  * Lexical Segmentation: no change - Default settings
 +  * Editions: Build edition, Words per page = 750, Page break tag = pb
 +  * Display font: default setting (Font name = <default>)
 +  * Commands: Concordance context structure limits = text
 +  * Textual planes:
 +    * Outside-text = teiHeader,front,back
 +    * Outside-text to edit = bibl
 +    * Note elements = note
 +    * Milestone elements = [nothing, leave blank]
 +    * Options: default (= remove temporary directories)
 +
 +3. Click on "Start corpus import" (at the top of the page)
 +
 +
 +A second import can be run with a metadata.csv file added to the source directory, in order to get more metadata than the fields automatically extracted from teiHeader (first title, first author, first editor).
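For illustration, a metadata.csv file could be produced with a few lines of Python. This sketch rests on assumptions that should be checked against the TXM manual: that the first column is named id and holds each source file name without its extension, and that values are comma-separated and double-quoted. The function name, the genre column and the sample values are invented.

```python
import csv
import io

def write_metadata(rows, out):
    """Write one CSV line per text: an 'id' column followed by the metadata columns."""
    columns = sorted({key for meta in rows.values() for key in meta})
    writer = csv.DictWriter(
        out, fieldnames=["id"] + columns, quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writeheader()
    for text_id, meta in sorted(rows.items()):
        writer.writerow({"id": text_id, **meta})  # missing values are left empty

if __name__ == "__main__":
    buffer = io.StringIO()
    write_metadata({"text_b": {}, "text_a": {"genre": "speech"}}, buffer)
    print(buffer.getvalue())
```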
 +
 +===== Feedback =====
 +
 +Some features of the XML-XTZ import have not been implemented yet; in particular, the @rend attribute does not seem to be used to interpret <emph> and <hi> elements. So, through the front XSL (import step #2), we have changed some <hi> elements into <emph> in the cases where we wanted italics in the HTML edition.
 +
 +<note> content loses all its markup. This is a real drawback, as tagged foreign words and italics are very often used in notes.
 +
 +**[[public:​perseus|>>>​ Back to TXM Perseus Projects main page]]**
 +
 +===== XSL Perseus stylesheets used for this import =====
 +
 +==== txm-front-teiperseus-xtz.xsl ====
 +
 +<code XML>
 +<?xml version="1.0"?>
 +<xsl:stylesheet
 +  xmlns:xd="http://www.pnp-software.com/XSLTdoc"
 +  xmlns:edate="http://exslt.org/dates-and-times"
 +  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:tei="http://www.tei-c.org/ns/1.0"
 +  exclude-result-prefixes="tei edate xd" version="2.0">
 +
 +  <xd:doc type="stylesheet">
 +    <xd:short>
 +      A stylesheet to prepare PERSEUS XML-TEI texts to TXM import.
 +    </xd:short>
 +    <xd:detail>
 +      This stylesheet is free software; you can redistribute it and/or
 +      modify it under the terms of the GNU Lesser General Public
 +      License as published by the Free Software Foundation; either
 +      version 3 of the License, or (at your option) any later version.
 +
 +      This stylesheet is distributed in the hope that it will be useful,
 +      but WITHOUT ANY WARRANTY; without even the implied warranty of
 +      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 +      Lesser General Public License for more details.
 +
 +      You should have received a copy of GNU Lesser Public License with
 +      this stylesheet. If not, see http://www.gnu.org/licenses/lgpl.html
 +    </xd:detail>
 +    <xd:author>Alexei Lavrentiev alexei.lavrentev@ens-lyon.fr</xd:author>
 +    <xd:copyright>2017, CNRS / IHRIM (Groupe CACTUS)</xd:copyright>
 +  </xd:doc>
 +
 +
 +  <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no"/>
 +
 +  <xsl:template match="node()|@*">
 +    <!-- Copy the current node -->
 +    <xsl:copy>
 +      <!-- Including any attributes it has and any child nodes -->
 +      <xsl:apply-templates select="@*|node()"/>
 +    </xsl:copy>
 +  </xsl:template>
 +
 +<!-- Comment out this template if a metadata file already provides the same information: -->
 +  <xsl:template match="/tei:TEI/tei:text">
 +    <xsl:copy>
 +      <xsl:copy-of select="@*"/>
 +      <xsl:attribute name="author"><xsl:value-of select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author[1]"/></xsl:attribute>
 +      <xsl:attribute name="title"><xsl:value-of select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title[1]"/></xsl:attribute>
 +      <xsl:attribute name="editor"><xsl:value-of select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:editor[1]"/></xsl:attribute>
 +      <xsl:apply-templates/>
 +    </xsl:copy>
 +  </xsl:template>
 +
 +<xsl:template match="tei:group/tei:text">
 +  <xsl:element name="subtext">
 +    <xsl:apply-templates select="@*|node()"/>
 +  </xsl:element>
 +</xsl:template>
 +
 +  <xsl:template match="tei:pb">
 +    <xsl:copy>
 +      <xsl:attribute name="n">
 +        <xsl:choose>
 +          <xsl:when test="@n"><xsl:value-of select="@n"/></xsl:when>
 +          <xsl:when test="@*:id">
 +            <xsl:value-of select="replace(@*:id,'^p\.','')"/>
 +          </xsl:when>
 +          <xsl:otherwise><xsl:text>[s.n.]</xsl:text></xsl:otherwise>
 +        </xsl:choose>
 +      </xsl:attribute>
 +    </xsl:copy>
 +  </xsl:template>
 +
 +<xsl:template match="tei:div1|tei:div2|tei:div3|tei:div4|tei:div5|tei:div6|tei:div7">
 +  <xsl:element name="div" namespace="http://www.tei-c.org/ns/1.0">
 +    <xsl:apply-templates select="@*|node()"/>
 +  </xsl:element>
 +</xsl:template>
 +
 +<xsl:template match="tei:choice">
 +  <xsl:apply-templates select="tei:expan|tei:corr|tei:reg"/>
 +</xsl:template>
 +
 +<xsl:template match="tei:choice/tei:expan">
 +  <w xmlns="http://www.tei-c.org/ns/1.0">
 +    <xsl:attribute name="abbr"><xsl:value-of select="normalize-space(parent::tei:choice/tei:abbr)"/></xsl:attribute>
 +    <xsl:apply-templates select="@*|node()"/>
 +  </w>
 +</xsl:template>
 +
 +  <xsl:template match="tei:choice/tei:corr">
 +    <xsl:copy>
 +      <xsl:attribute name="sic"><xsl:value-of select="normalize-space(parent::tei:choice/tei:sic)"/></xsl:attribute>
 +      <xsl:apply-templates select="@*|node()"/>
 +    </xsl:copy>
 +  </xsl:template>
 +
 +  <xsl:template match="tei:choice/tei:reg">
 +    <xsl:copy>
 +      <xsl:attribute name="orig"><xsl:value-of select="normalize-space(parent::tei:choice/tei:orig)"/></xsl:attribute>
 +      <xsl:apply-templates select="@*|node()"/>
 +    </xsl:copy>
 +  </xsl:template>
 +
 +<!-- Temporary patch for TXM indexing quote elements in notes -->
 +
 +  <xsl:template match="tei:note//tei:quote">
 +    <quote-note>
 +      <xsl:apply-templates select="@*|node()"/>
 +    </quote-note>
 +  </xsl:template>
 +
 +<!--
 +(i) adding an <emph> element in order to point out some elements' content (e.g. foreign, title) in TXM edition ;
 +(ii) adding a <w> element to prevent tokenisation from analysing some content (e.g. foreign)
 +-->
 +
 +<xsl:template match="tei:foreign[not(ancestor::tei:note)]">
 +<emph rend="italic" xmlns="http://www.tei-c.org/ns/1.0">
 +  <xsl:copy>
 +    <w xmlns="http://www.tei-c.org/ns/1.0">
 +    <xsl:apply-templates select="@*|node()"/>
 +    </w>
 +  </xsl:copy>
 +</emph>
 +</xsl:template>
 +
 +<xsl:template match="tei:title">
 +<emph rend="italic" xmlns="http://www.tei-c.org/ns/1.0">
 +  <xsl:copy>
 +    <xsl:apply-templates select="@*|node()"/>
 +  </xsl:copy>
 +</emph>
 +</xsl:template>
 +
 +<!-- Temporary patch to get the correct rendering for <hi @rend="italic"> content in TXM editions : must use <emph> instead of <hi> -->
 +
 +<xsl:template match="tei:hi[matches(@rend,'italic')]" priority="1">
 +  <xsl:element name="emph" namespace="http://www.tei-c.org/ns/1.0">
 +    <xsl:apply-templates select="@*|node()"/>
 +  </xsl:element>
 +</xsl:template>
 +
 +</xsl:stylesheet>
 +</code>
 +
 +==== txm-posttok-addRef-perseus.xsl ====
 +
 +<code XML>
 +<?xml version="1.0"?>
 +<xsl:stylesheet xmlns:edate="http://exslt.org/dates-and-times"
 +  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:tei="http://www.tei-c.org/ns/1.0"
 +  xmlns:txm="http://textometrie.org/ns/1.0"
 +  exclude-result-prefixes="tei edate" xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="2.0">
 +
 +  <!--
 +This software is dual-licensed:
 +
 +1. Distributed under a Creative Commons Attribution-ShareAlike 3.0
 +Unported License http://creativecommons.org/licenses/by-sa/3.0/
 +
 +2. http://www.opensource.org/licenses/BSD-2-Clause
 +
 +All rights reserved.
 +
 +Redistribution and use in source and binary forms, with or without
 +modification, are permitted provided that the following conditions are
 +met:
 +
 +* Redistributions of source code must retain the above copyright
 +notice, this list of conditions and the following disclaimer.
 +
 +* Redistributions in binary form must reproduce the above copyright
 +notice, this list of conditions and the following disclaimer in the
 +documentation and/or other materials provided with the distribution.
 +
 +This software is provided by the copyright holders and contributors
 +"as is" and any express or implied warranties, including, but not
 +limited to, the implied warranties of merchantability and fitness for
 +a particular purpose are disclaimed. In no event shall the copyright
 +holder or contributors be liable for any direct, indirect, incidental,
 +special, exemplary, or consequential damages (including, but not
 +limited to, procurement of substitute goods or services; loss of use,
 +data, or profits; or business interruption) however caused and on any
 +theory of liability, whether in contract, strict liability, or tort
 +(including negligence or otherwise) arising in any way out of the use
 +of this software, even if advised of the possibility of such damage.
 +
 +     
 +This stylesheet adds a ref attribute to w elements that will be used for
 +references in TXM concordances. Can be used with TXM XTZ import module.
 +
 +Written by Alexei Lavrentiev, UMR 5317 IHRIM, 2017
 +  -->
 +
 +
 +  <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no"/>
 +
 +
 +  <!-- General patterns: all elements, attributes, comments and processing instructions are copied -->
 +
 +  <xsl:template match="*">
 +    <xsl:copy>
 +      <xsl:apply-templates select="*|@*|processing-instruction()|comment()|text()"/>
 +    </xsl:copy>
 +  </xsl:template>
 +
 +  <xsl:template match="*" mode="position"><xsl:value-of select="count(preceding-sibling::*)"/></xsl:template>
 +
 +  <xsl:template match="@*|comment()|processing-instruction()">
 +    <xsl:copy/>
 +  </xsl:template>
 +
 +  <!-- File name of the source document without its .xml extension, used as the default reference -->
 +  <xsl:variable name="filename">
 +    <xsl:analyze-string select="document-uri(.)" regex="^(.*)/([^/]+)\.xml$">
 +      <xsl:matching-substring>
 +        <xsl:value-of select="regex-group(2)"/>
 +      </xsl:matching-substring>
 +    </xsl:analyze-string>
 +  </xsl:variable>
 +
 +
 +  <xsl:template match="tei:w">
 +    <xsl:variable name="ref">
 +      <xsl:choose>
 +        <!-- Test the same node that is output below: the id of the nearest ancestor text -->
 +        <xsl:when test="ancestor::tei:text[1]/@*:id">
 +          <xsl:value-of select="ancestor::tei:text[1]/@*:id[1]"/>
 +        </xsl:when>
 +        <xsl:otherwise>
 +          <xsl:value-of select="$filename"/>
 +        </xsl:otherwise>
 +      </xsl:choose>
 +      <!-- Perseus addition: chapter and section numbers taken from milestones -->
 +      <xsl:if test="preceding::tei:milestone[@unit='chapter'][1][@n]">
 +        <xsl:text>, c. </xsl:text>
 +        <xsl:value-of select="preceding::tei:milestone[@unit='chapter'][1]/@n"/>
 +      </xsl:if>
 +      <xsl:if test="preceding::tei:milestone[@unit='section'][1][@n]">
 +        <xsl:text>, s. </xsl:text>
 +        <xsl:value-of select="preceding::tei:milestone[@unit='section'][1]/@n"/>
 +      </xsl:if>
 +      <!-- end of Perseus addition -->
 +
 +      <xsl:if test="preceding::tei:pb[1]/@n">
 +        <xsl:text>, p. </xsl:text>
 +        <xsl:value-of select="preceding::tei:pb[1]/@n"/>
 +      </xsl:if>
 +      <xsl:if test="ancestor::tei:p[@n]">
 +        <xsl:text>, § </xsl:text>
 +        <xsl:value-of select="ancestor::tei:p/@n"/>
 +      </xsl:if>
 +      <!--<xsl:if test="preceding::tei:lb[1]/@n">
 +        <xsl:text>, l. </xsl:text>
 +        <xsl:value-of select="preceding::tei:lb[1]/@n"/>
 +      </xsl:if>-->
 +    </xsl:variable>
 +    <xsl:copy>
 +      <xsl:apply-templates select="@*"/>
 +      <xsl:attribute name="ref"><xsl:value-of select="$ref"/></xsl:attribute>
 +      <xsl:apply-templates select="*|processing-instruction()|comment()|text()"/>
 +    </xsl:copy>
 +  </xsl:template>
 +
 +</xsl:stylesheet>
 +</code>
 +
 +**[[public:​perseus|>>>​ Back to TXM Perseus Projects main page]]**
public/perseus.txt · Last modified: 2017/12/01 17:54 by benedicte.pincemin@ens-lyon.fr