PLATO170720 corpus: how we put 29 Perseus texts by Plato into the TXM corpus analysis software

Project presentation

  • Context: paper submitted to Classics@: “Introduction to Textometric Methodology - illustrated by a first exploration of the Gorgias in the context of Plato's work” (Bénédicte Pincemin, Stéphane Marchand).
  • Goals:
    • Introducing the Hellenic scientific community to textometric methodology, with examples taken from one famous Ancient Greek author.
    • Demonstrating that one can work in TXM on texts freely available from the Perseus project.
    • TEI-compliant import: how to use and parameterize one of the available imports, so as to take into account all the useful information encoded in the XML-TEI text files.
    • Nice editions (text display within TXM): one can have both a rich digital edition of Plato's works and advanced functionalities to search and analyse the full text.

Sources

For this experiment we selected every text except numbers 15, 16, 17, 29, 33, 35 and 36 (for scientific reasons, not technical ones: the solution should work with these texts too, but has not been fully tested).

All these XML-TEI files of Plato's texts are then grouped into one directory named plato170720 (that is, the name we chose to give the TXM corpus).

Principles

When we prepared this corpus in June and July 2017, the TEI encoding of Plato's texts in Perseus was heterogeneous. We had to deal with several states: last updates made in 2017, 2015, 2014 and 1992. The 2017 texts were clearly a new generation. Two texts (27 = Ion and 30 = Republic) made markedly different encoding choices, for instance regarding <div> usage and the marking of sections.

We decided not to modify the sources (which are evolving and improving thanks to the Perseus community), but to make automated and limited changes within the import processing, so as to get a usable corpus, even if the TXM user has to cope with some inherited heterogeneity.

As a basis we took the XSL stylesheets prepared for the previous experiment on Perseus texts (Cicero, Heidelberg 2017), also described on the txm-users wiki (here). These stylesheets already handle some XML-TEI features of Perseus texts (such as nested <div> or <text> elements) in order to make them compliant with TXM processing (especially with the CQP search engine component embedded in TXM).
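
For illustration, here is a minimal sketch (not the actual Heidelberg stylesheet, whose details may differ) of the kind of transformation involved: since the CQP engine cannot index recursive structures, nested <div> elements can be renamed according to their nesting level.

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:tei="http://www.tei-c.org/ns/1.0">

    <!-- identity transform: copy everything else unchanged -->
    <xsl:template match="@*|node()">
      <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>

    <!-- rename each <div> by its nesting depth: div1, div2, ... -->
    <xsl:template match="tei:div">
      <xsl:element name="div{count(ancestor::tei:div) + 1}">
        <xsl:apply-templates select="@*|node()"/>
      </xsl:element>
    </xsl:template>

  </xsl:stylesheet>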

Specifications

Title, date of edition

Automatically get text information from the teiHeader:

  • title, author and editor from <fileDesc><titleStmt> (first mention of each element);
  • the content of the @when attribute of the first (i.e. most recent) <change> element in <revisionDesc>.
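
A minimal sketch of these selections in XPath, assuming the TEI namespace (the actual 2-front stylesheet, txm-front-teiperseus-xtz.xsl, may proceed differently):

  <!-- first mention of each element in <fileDesc><titleStmt> -->
  <xsl:variable name="title"
      select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title[1]"/>
  <xsl:variable name="author"
      select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author[1]"/>
  <xsl:variable name="editor"
      select="//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:editor[1]"/>
  <!-- last change date: @when of the first <change> in <revisionDesc> -->
  <xsl:variable name="update"
      select="//tei:teiHeader/tei:revisionDesc/tei:change[1]/@when"/>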

As the date formulation varies a lot throughout the corpus, we also encode this information in a normalized form in a metadata.csv file given as a parameter to the TXM XTZ import (this produces the update10 property on the text structure, the last change date being written with 10 characters).

The title information is useful as the default identifier of the text in concordance references. Special cases: we had to deal with some long titles like “Republic (Greek). Machine readable text” (30), and the same for Laws (34). We developed two solutions:

  • automatic processing: truncate them at the first punctuation mark;
  • hand-coded declaration of titles in a metadata.csv file in the TXM XTZ import (which produces the “title1” property on the text structure, the title then being encoded as 1 or 2 words), as in the sample below.
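
For illustration, a hypothetical metadata.csv along these lines (the id column must match the text file names; all values below are invented for this example):

  id,title1,update10
  tlg0059_tlg027_perseus-grc1,Ion,2017-04-28
  tlg0059_tlg030_perseus-grc1,Republic,2017-05-02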

Nevertheless, the CTS URN information remains available and can be chosen to localize words in the corpus (cf. the ctsurn and ctsurn5 properties on the word structure, and the id property on the text structure).

Word localization and references

To precisely localize word occurrences in TXM, we would like to have by default the text title and the number of the Stephanus section.

<div> usage is heterogeneous at the moment. The best solution is to use the @n attribute of <milestone unit=“section”> elements to localize words in the Stephanus reference system. We just have to handle the exception of texts 27 and 30, which encode this information only on <div type=“section”> or <div subtype=“section”> elements.

Moreover, since we may use section numbers in sort operations, we want a version of these numbers encoded with a fixed length (e.g. 0015a instead of 15a), so that sorting them as strings gives a relevant order. A sketch of both operations is given below.
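
A minimal sketch of both operations (the actual txm-posttok-addRef-perseus.xsl may differ; the property names section and sectionpad are chosen here for illustration): each word receives the @n of the nearest preceding section milestone, in plain and zero-padded forms.

  <!-- attach the Stephanus section to each word (<w>),              -->
  <!-- in plain and fixed-length forms (e.g. n="15a" -> "0015a")     -->
  <xsl:template match="tei:w">
    <xsl:variable name="n"
        select="preceding::tei:milestone[@unit='section'][1]/@n"/>
    <!-- split the numeric prefix from the trailing letter -->
    <xsl:variable name="num"
        select="translate($n, 'abcdefghijklmnopqrstuvwxyz', '')"/>
    <xsl:variable name="letter"
        select="translate($n, '0123456789', '')"/>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:attribute name="section"><xsl:value-of select="$n"/></xsl:attribute>
      <xsl:attribute name="sectionpad">
        <xsl:value-of select="concat(
            substring('0000', string-length($num) + 1), $num, $letter)"/>
      </xsl:attribute>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>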

Only the most recent text versions had a pattern declared in <encodingDesc><refsDecl n=“CTS”> to identify sections through the CTS system, so we could not use it for the moment; it could however be interesting for a later version of the corpus.

We can take edition pages into account: in all the files of our corpus this information is available in <milestone unit=“page”> elements, with an @n attribute. Solution: during the XTZ import, at the 2-front stage, the XSL stylesheet adds <pb> elements which TXM can use (see the sketch below).
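
A minimal sketch of this addition, assuming the TEI namespace (again, the actual front stylesheet may differ):

  <!-- keep the page milestone and add a <pb/> element -->
  <!-- that TXM's edition builder uses as a page break -->
  <xsl:template match="tei:milestone[@unit='page']">
    <xsl:copy-of select="."/>
    <pb xmlns="http://www.tei-c.org/ns/1.0" n="{@n}"/>
  </xsl:template>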

Speech turns

Encoding of speech turns is heterogeneous too:

  • it is done with <sp> or <said> elements;
  • the speaker is indicated by:
    • the @who attribute, only in <said> elements (but not in all of them);
    • a <label> element, which may introduce the speech turn inside <said> elements;
    • a <speaker> element, which introduces and encodes speaker information for <sp> elements.

<p> elements are sometimes used, sometimes not, and can appear either outside <said>, or inside <said> following a <label>, etc.

We want to keep and clearly display speech turn and speaker information, without indexing the speaker's name as a word to be counted and searched as such.

Solution (a sketch is given after the list):

  • add <label> and <p> elements when they are missing (during the XTZ import processing);
  • declare <speaker> and <label> as out-of-text-to-edit elements in the import parameters.
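
A sketch of the first point (hypothetical; the actual import stylesheet may proceed differently): <said> elements without a <p> child get their content wrapped in a <p>, and a <label> is derived from @who when missing.

  <!-- normalize <said> to the <label> + <p> pattern -->
  <xsl:template match="tei:said[not(tei:p)]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:choose>
        <!-- keep an existing <label> -->
        <xsl:when test="tei:label">
          <xsl:copy-of select="tei:label"/>
        </xsl:when>
        <!-- otherwise derive one from @who (hypothetical mapping) -->
        <xsl:otherwise>
          <label xmlns="http://www.tei-c.org/ns/1.0">
            <xsl:value-of select="@who"/>
          </label>
        </xsl:otherwise>
      </xsl:choose>
      <!-- wrap the remaining content in <p> -->
      <p xmlns="http://www.tei-c.org/ns/1.0">
        <xsl:apply-templates select="node()[not(self::tei:label)]"/>
      </p>
    </xsl:copy>
  </xsl:template>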

Speaker information is available in TXM but heterogeneously encoded. We decided not to work on this transitional state of the Perseus texts; when the source texts become homogeneous in Perseus, they will be homogeneous in TXM too.

Miscellaneous

Cast lists given in files 23 to 27 should be ignored for textometric analysis.

  • Solution:
    • declare <castList> as an out-of-text element in the TXM XTZ import parameters.

Bibliographic citation references encoded with <bibl> elements should be distinguished from the Ancient Greek text.

  • Solution:
    • declare <bibl> as a note element in the TXM XTZ import parameters;
    • display its content in gray characters (defined in the CSS stylesheet for the edition; see the CSS sketch below).

Versified text (encoded with <l> elements) should be distinguished in the TXM text edition.

  • Solution:
    • render it as blocks in CSS, as sketched below.
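
A minimal perseus.css sketch covering this point and the <bibl> display above, assuming (as in TXM default editions) that the source element names are kept as class names in the HTML edition:

  /* bibliographic references (<bibl>): gray, distinct from the Greek text */
  .bibl { color: gray; }

  /* verse lines (<l>): render each line as a block */
  .l { display: block; }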

Solution

Make a directory (e.g. “plato”).

This directory includes:

  • a copy of every XML file of the Greek texts of Plato downloaded from the Perseus DL;
  • (optional) a file named “import.xml” (this file is automatically created or updated during the import processing; it records all the parameter values used for the import, see below);
  • (optional) a file named “metadata.csv”, which brings additional information describing the texts (e.g. normalized titles, normalized edition dates, etc.);
  • a directory named “css”, which includes:
    • perseus.css
  • a directory named “xsl”, which includes:
    • (depending on your TXM version, see the note below) a subdirectory named “1-split-merge”, which includes:
      • rename-no-dots.xsl
    • a subdirectory named “2-front”, which includes:
      • p4top5.xsl
      • txm-front-teiperseus-xtz.xsl
    • a subdirectory named “3-posttok”, which includes:
      • txm-posttok-addRef-perseus.xsl
    • a subdirectory named “4-edition”, which includes:
      • 1-default-html.xsl
      • 2-default-pager.xsl

About the 1-split-merge directory:

  • A bug in the TXM 0.7.8 version that we had in July 2017 prevented us from keeping dots in file names; the rename-no-dots.xsl stylesheet is a workaround for this bug: dots are replaced by underscores at the first stage of the import.
  • Later, another bug in the TXM version delivered in August 2017 occurred at this first stage of the import. A solution is to skip this first stage (one can for instance rename the 1-split-merge folder, so that it is not recognized and thus not taken into account, or delete it) and rename the text files manually, or automatically before the TXM import (this can be done from within TXM with the ExecXSL macro).

These two bugs have been reported and should be fixed in one of the next versions of TXM, so check whether file name corrections are still needed.

Then run the TXM command File > Import > XML-XTZ + CSV with the following settings:

1. Source directory is “plato” (in our example).

2. Import parameters:

  • Main Language: untick “annotate the corpus” and select “el” for the Greek language.
  • Lexical Segmentation: add — to the punctuation list (for example as the second character, just after “[”; the Punctuations field content then looks like this: [—\p{Ps}\p{Pe}\p{Pi}\p{Pf}\p{Po}\p{S}])
  • Editions: Build edition, Words per page = 1000, Page break tag = pb
  • Display font: default setting (Font name = <default>)
  • Commands: Concordance context structure limits = text
  • Textual planes:
    • Outside-text = teiHeader,front,back,castList
    • Outside-text to edit = label,speaker
    • Note elements = bibl,note
    • Milestone elements = [nothing, leave blank]
    • Options: default (= remove temporary directories)

3. Click on “Start corpus import” (at the top of the form).

The import parameters are read from, and saved to, the import.xml file included in the corpus directory (here, the “plato” directory). So if you have already imported the corpus, you recover your previous settings for it; you can update or modify them before a new import. This file is edited through the graphical user interface (the XML-XTZ + CSV import form).

Content of the metadata.csv file used for this import

XSL Perseus stylesheets used for this import
