1.2.3. Integrated Analysis Platform
Various software platforms and frameworks are available to help researchers analyze linguistically annotated corpora (GATE, UIMA, WebLicht…).
Through several research projects, the [IHRIM] laboratory has developed an original analysis platform called TXM, which implements the textometry methodology and combines several key capabilities: ingestion of finely encoded XML-TEI textual corpora; production of high-quality text editions (rendering of critical apparatus, annotations, styling, pagination, etc.); full-text search for word patterns through the efficient CQP engine; and computation of statistical models, based on R packages, applied to the results extracted by the search engine.
The PROFITEROLE project will extend these extraction tools and statistical models by integrating search engines for syntactic node patterns into the TXM platform.
From a methodological point of view, this will make it possible to combine CQL queries, which express extraction constraints on words, their properties (such as POS and lemma) and their structural context (such as text metadata values, or whether the word occurs in direct speech), with, for example, TIGERSearch queries, which express extraction constraints on the nodes or terminals of syntactic trees. The constraints will be combined through a join between CQP tokens and TIGERSearch tokens. This will provide new insights into annotated corpora, for example by applying a contrastive statistical model to different centuries and text genres based on the raw frequency of various syntactic annotations, or to the contrast between direct and non-direct speech based on the raw frequency of selected syntactic patterns.
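As an illustration only, the join on CQP and TIGERSearch tokens could be sketched as follows in Python. This is not TXM code: the `Token` record, the shared `token_id` (for instance an identifier carried by each word in the XML-TEI source), the function names and the toy data are all assumptions made for the example. The sketch keeps the tokens matched by both engines and then counts them per partition (here, direct versus non-direct speech), which is the kind of raw frequency a contrastive model would take as input.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record; not part of the TXM, CQP or TIGERSearch APIs."""
    token_id: str        # assumed identifier shared by both engines
    text_id: str         # text the token belongs to
    direct_speech: bool  # structural context as a CQL query could constrain it


def join_matches(cqp_hits, tiger_token_ids):
    """Inner join: keep the CQP hits whose token_id was also matched by the syntactic query."""
    tiger_ids = set(tiger_token_ids)
    return [tok for tok in cqp_hits if tok.token_id in tiger_ids]


def frequency_by_partition(tokens, key):
    """Raw frequency of joined tokens for each value returned by `key`."""
    return Counter(key(tok) for tok in tokens)


# Toy data standing in for the results of the two engines (invented for the sketch).
cqp_hits = [
    Token("w1", "t1", True),
    Token("w2", "t1", False),
    Token("w3", "t2", True),
    Token("w4", "t2", False),
]
tiger_token_ids = {"w1", "w3", "w4", "w9"}  # terminals matched by the syntactic query

joined = join_matches(cqp_hits, tiger_token_ids)
counts = frequency_by_partition(joined, key=lambda tok: tok.direct_speech)
```

Passing a different `key` (for example `lambda tok: tok.text_id`, or a lookup of century or genre from text metadata) would give the per-partition frequencies for the other contrasts mentioned above.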
These new analysis tools will be prototyped both in the portal version of TXM (for online access to the corpus) and in the local desktop version (for local analysis).