We would like to optionally exclude the following text parts from analyses to prevent biased statistics. This exclusion should be applicable during partition/subcorpus definition and/or searching.
1) how to mark these parts in the imported XML
I know that for a), these parts are represented in TEI by <head> (already used) and <prologue> (not used yet). But, for instance, I do not know how to address this markup in TXM.
a) OK for <head>. According to the TEI guidelines, “<prologue> contains the prologue to a drama, typically spoken by an actor out of character, possibly in association with a particular performance or venue”. I am not sure that fits your case. I would rather use <div type="prologue"> (as we do in the BFM).
how to use the marks in TXM?
There are basically two ways to exclude some parts of a text from indexing and statistical analyses:
1) Exclude them while importing the corpus. They will be permanently excluded from the CQP indexes, but they can still be displayed in TXM editions and through the “back to text” functionality. The advantage of this solution is that it is easy for the users; the drawback is that you cannot search these text parts even if you need to.
2) Create a subcorpus that does not contain the excluded parts. This works relatively well with top-level divisions, but it is trickier for smaller elements that are mixed with relevant parts. The problem is that you have to 'atomize' the subcorpus, which makes queries using context constraints unavailable.
If you prefer the first solution, you can just use the “out of text to edit” field in the XTZ import form to enter elements like head or cit (just list the element names separated by commas: head, cit). Unfortunately, this does not work for attribute values: you will have to transform them into custom elements with a front XSLT stylesheet, e.g. transform <div type="unchecked"> into <div-unchecked>. Once you have decided which elements to exclude, I can help you finalize the XSLT stylesheets to process them.
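To give an idea of what such a front stylesheet could look like, here is a minimal sketch. It assumes that the divisions to exclude are marked as <div type="unchecked">, and the target name div-unchecked is only the example name used above, not something TXM predefines. It has not been tested against your sources, so take it as a starting point rather than a finished stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute unchanged by default -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rename <div type="unchecked"> to <div-unchecked>.
       local-name() is used so that the rule matches whether or not
       the source files declare the TEI namespace. -->
  <xsl:template match="*[local-name() = 'div'][@type = 'unchecked']">
    <xsl:element name="div-unchecked" namespace="{namespace-uri()}">
      <!-- Keep the other attributes; @type is now carried by the element name -->
      <xsl:apply-templates select="@*[name() != 'type'] | node()"/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
```

With such a stylesheet applied before tokenization, div-unchecked can be listed in the “out of text to edit” field like any other element name. To my knowledge, front stylesheets for the XTZ import go in the xsl/2-front folder of the source directory, but please check this against the manual of your TXM version.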
If you prefer the second solution, I can help you work out the CQL queries to create the subcorpus.
how to mark these parts in the imported XML?
For Bible citations you may use:
how to use the marks in TXM?
1) how to mark these parts in the imported XML
c) If I understand correctly, you wish to do a kind of sampling. There is no special tag for this. What we do in similar cases is create artificial divisions (<div>) at the top level of the XML structure (just below <front>, <body> or <back>) to separate the parts of the text according to their use in sampling, e.g. <div type="checked"> vs <div type="unchecked">. We also sometimes use scripts that cut the text after a certain number of tokens, but they may cut the text in the middle of a sentence, which is a pity.
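As an illustration, the artificial divisions could be laid out as in the sketch below. The surrounding <text> and <body> elements and the paragraph placeholders are assumptions made for the example; only the two type values come from the description above.

```xml
<text>
  <body>
    <!-- Part of the text that belongs to the sample -->
    <div type="checked">
      <p><!-- checked paragraphs --></p>
    </div>
    <!-- Part of the text that is left out of the sample -->
    <div type="unchecked">
      <p><!-- unchecked paragraphs --></p>
    </div>
  </body>
</text>
```

Keeping these divisions directly below <body> means that each one covers a contiguous stretch of text, which is the case where building a subcorpus from the div type works best.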