PALAFRA Latin Corpus

Corpus versions

  • V. 0 October 2016 (initial) tested at Lille workshop
  • V. 1 May 2017
  • V. 1.1 May 2017 (added support for MGH facsimiles)
  • V. 1.2 July 2017 (added enclitic space highlight in edition)

Corpus usage questions

We would like to optionally exclude the following text parts from analyses to prevent biased statistics. This exclusion should be applicable during partition/subcorpus definition and/or searching.

Headlines and Prologs

1) how to mark these parts in the imported XML

I know that for a), these parts are represented in TEI by <head> (already used) and <prologue> (not used yet). However, I do not know how to address this markup in TXM.

a) OK for <head>. According to the TEI Guidelines, “<prologue> contains the prologue to a drama, typically spoken by an actor out of character, possibly in association with a particular performance or venue”. I am not sure that fits your case. I would rather use <div type="prologue"> (as we do in the BFM).

how to use the marks in TXM?

There are basically two ways to exclude some parts of the text from indexing and statistical analyses:

1) Exclude them while importing the corpus. They will be permanently excluded from the CQP indexes, but they can still be displayed in TXM editions and through the “back to text” functionality. The advantage of this solution is that it is easy for users; the drawback is that you cannot search these text parts even if you need to.

2) Create a subcorpus that does not contain the excluded parts. This works reasonably well with top-level divisions, but it is trickier for smaller elements that are mixed with relevant parts. The problem is that you have to 'atomize' the subcorpus, which makes queries that use context constraints unavailable.

If you prefer the 1st solution, you can just use the “out of text to edit” field in the XTZ import form to enter elements like head or cit (just list the element names separated by commas: head, cit). Unfortunately, this does not work for attribute values. You will have to transform them into custom elements with a front XSLT stylesheet, e.g. transform <div type="unchecked"> into <div-unchecked>. Once you decide which elements to exclude, I can help you finalize the XSLT stylesheets to process them.

If you prefer the 2nd solution, I can help you write the CQL queries to create the subcorpus.
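To make the attribute-to-element transformation concrete, here is a minimal sketch in Python rather than XSLT. It only illustrates what the front stylesheet would need to do; it is not the stylesheet itself. The function name promote_div_types and the sample document are hypothetical, and the sketch assumes un-namespaced element names (real TEI files usually declare the TEI namespace, so the tag test would have to be adapted).

```python
import xml.etree.ElementTree as ET

def promote_div_types(root, types=("unchecked",)):
    """Rename <div type="X"> to <div-X> in place, for each X listed in `types`."""
    # list() so that renaming tags does not interfere with the traversal
    for el in list(root.iter("div")):
        div_type = el.get("type")
        if div_type in types:
            el.tag = f"div-{div_type}"   # custom element name usable in the import form
            del el.attrib["type"]        # the information now lives in the element name
    return root

# Invented sample, not taken from the corpus
sample = (
    '<body>'
    '<div type="checked"><p>a</p></div>'
    '<div type="unchecked"><p>b</p></div>'
    '</body>'
)
root = promote_div_types(ET.fromstring(sample))
result = ET.tostring(root, encoding="unicode")
print(result)
```

After such a transformation, div-unchecked is a plain element name and could be listed in the import form in the same way as head or cit.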

Bible citations

how to mark these parts in the imported XML?

For Bible citations you may use:

  1. <quote> for the cited text
  2. <bibl> for the reference to the Bible book, chapter, verse, etc., if any
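As an illustration of this markup, the sketch below embeds an invented sample sentence and lists each cited text with its reference. Grouping <quote> and <bibl> inside a TEI <cit> element is an assumption made here for the example, as are the function name list_citations and the un-namespaced element names.

```python
import xml.etree.ElementTree as ET

# Invented sample: a quotation followed by its Bible reference
sample = (
    "<p>Sicut scriptum est: "
    "<cit><quote>Beati pauperes spiritu</quote> <bibl>Mt 5,3</bibl></cit>."
    "</p>"
)

def list_citations(root):
    """Return (quoted text, reference or None) for every <cit> element."""
    pairs = []
    for cit in root.iter("cit"):
        quote = cit.find("quote")
        bibl = cit.find("bibl")   # may be absent: the reference is optional
        pairs.append((
            "".join(quote.itertext()).strip() if quote is not None else "",
            "".join(bibl.itertext()).strip() if bibl is not None else None,
        ))
    return pairs

citations = list_citations(ET.fromstring(sample))
print(citations)
```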

how to use the marks in TXM?

Text beyond a certain length, i.e. a certain number of words (we have partially unchecked texts)

1) how to mark these parts in the imported XML

c) If I understand correctly, you wish to do a kind of sampling. There is no special tag for this. What we do in similar cases is create artificial divisions (<div>) at the top level of the XML structure (just below <front>, <body> or <back>) to separate the parts of the text according to their use in sampling, e.g. <div type="checked"> vs <div type="unchecked">. We also sometimes use scripts that cut the text after a certain number of tokens, but they may cut the text in the middle of a sentence, which is a pity.
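One way to avoid cutting in the middle of a sentence is to stop at the last sentence boundary that still fits within the token budget. The sketch below is not one of the scripts mentioned above. It assumes whitespace-separated tokens and sentences ending in ".", "!" or "?", which is cruder than TXM's own tokenization, and the function name split_at_sentence_boundary is hypothetical.

```python
import re

def split_at_sentence_boundary(text, max_tokens):
    """Return (kept, rest): whole sentences are kept while the running
    token count stays within max_tokens; the remainder goes to `rest`."""
    # Naive sentence splitter: text up to and including end punctuation
    sentences = re.findall(r"[^.!?]+[.!?]*\s*", text)
    kept, count = [], 0
    for i, sentence in enumerate(sentences):
        n = len(sentence.split())          # whitespace tokens in this sentence
        if count + n > max_tokens:
            return "".join(kept).strip(), "".join(sentences[i:]).strip()
        kept.append(sentence)
        count += n
    return "".join(kept).strip(), ""

text = ("Gallia est omnis divisa in partes tres. "
        "Quarum unam incolunt Belgae. Aliam Aquitani.")
head, tail = split_at_sentence_boundary(text, 10)
print(head)
print(tail)
```

The two parts could then be wrapped in <div type="checked"> and <div type="unchecked">. Note that if the first sentence alone exceeds the budget, the kept part is empty.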

public/palafra_corpus-latin.txt · Last modified: 2017/08/02 15:08 by alexei.lavrentev@ens-lyon.fr