Dr. Marc Kupietz <kupietz@ids-...>
Cyril Belica <belica@ids-...>
Dr. Harald Lüngen <luengen@ids-...>
Rainer Perkuhn <perkuhn@ids-...>
Former IDS staff members involved in corpus building:
- Anna Konovalova
- Theresa Sick
Corpora of Written Language
The IDS Text Model
Efficient automatic analysis of large electronic text collections requires the texts to be encoded in a coherent data structure format. The so-called IDS text model is such a format for the corpora of written language at the IDS. Until 2013, the IDS text model was defined by IDS-XCES, an IDS-specific DTD that was a modification of the Corpus Encoding Standard XCES. XCES, in turn, was based on the older TEI P3 standard. To reintegrate the text model into the current TEI standard, P5, the TEI P5-specific ODD mechanism was used. The new document grammar, I5, has thus been formally derived from the current TEI P5 document grammar by customisation. I5 is defined such that every IDS-XCES document is also an I5 document. Since the release DeReKo-2014-I in April 2014, DeReKo has been completely converted to I5. This means that DeReKo documents use the i5.dtd generated by the Roma stylesheets as their document grammar and carry the file extension i5.xml.
Resources and documents on I5:
ODD file with TEI P5 customisation of I5
DTD derived by the Roma stylesheets
HTML documentation derived from i5.odd by project-specific stylesheets
HTML documentation derived from i5.odd by the Roma stylesheets
For the incorporated element <correspDoc> (with sub-elements) of TEI SIG
Article (2012) about I5 in the Journal of the Text Encoding Initiative
Resources and documents concerning older versions of the IDS text model:
Complementations and changes compared to XCES
Corpus Encoding Standard for XML (Nancy Ide)
The IDS text model is characterised by its aim of reproducing the textual content and structure of the source texts faithfully, and by documenting all text types that have appeared in the corpora so far in consistent structures. Key components of the IDS text model are corpus structure, corpus text bibliography and source text handling.
To facilitate virtual corpus composition, meaningful source lists in the presentation of results, and more, the source texts are organised according to predefined criteria and then incorporated into a hierarchical structure with three levels:
Corpus level (corpus identifier, e.g. LES)
Document level (document identifier, e.g. LES/ESS)
Text level (text identifier, e.g. LES/ESS.20022)
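The three levels are visible in the identifiers themselves. The following is a minimal sketch, assuming that every text identifier follows the pattern of the examples above (corpus identifier, then "/" and a document part, then "." and a text number). The names TextId and parse_text_id are illustrative only and are not part of any IDS tool.

```python
from typing import NamedTuple


class TextId(NamedTuple):
    """Hypothetical container for the three levels of one identifier."""
    corpus: str    # corpus identifier, e.g. "LES"
    document: str  # document identifier, e.g. "LES/ESS"
    text: str      # text identifier, e.g. "LES/ESS.20022"


def parse_text_id(identifier: str) -> TextId:
    """Split an identifier of the assumed form CORPUS/DOC.NUMBER."""
    corpus, slash, rest = identifier.partition("/")
    if not slash or not corpus:
        raise ValueError(f"missing corpus level in {identifier!r}")
    doc, dot, number = rest.partition(".")
    if not dot or not doc or not number:
        raise ValueError(f"missing document or text level in {identifier!r}")
    return TextId(corpus, f"{corpus}/{doc}", identifier)


levels = parse_text_id("LES/ESS.20022")
print(levels.corpus)    # LES
print(levels.document)  # LES/ESS
print(levels.text)      # LES/ESS.20022
```

Because each identifier contains the identifiers of the levels above it, a text can be assigned to its document and corpus without a separate lookup table, provided the assumed pattern holds.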
The IDS text model defines a text as a relatively independent, textually coherent sequence of natural language expressions that originate in natural communication situations. This is the corpus text, the "smallest" unit of a corpus. Each corpus consists of one or more documents; each document, in turn, consists of one or more corpus texts. The texts in a document can be grouped according to specific criteria, for example source, chronological order, subject area and/or text type. Depending on the corpus structure, a text contains, for instance, one or several newspaper articles, a whole issue of a journal or magazine, an excerpt from an independent work, or an independent work as a whole.
Example: the Corpus Siegfried Lenz: Edition in Separate Volumes [20 Vol.]. – Hamburg: Hoffmann und Campe, 1996-1999
Number of Texts
Es waren Habichte in der Luft. Roman
Die Auflehnung. Roman
The IDS corpus texts have always been provided with source references, which were displayed together with the search hits. In earlier corpora, however, these references were unstructured. For this reason, a corpus bibliography model was developed in the 1990s as a central component of the IDS text model. It allows automatic access across corpora to the large amount of source data, which is now structured consistently. The aims are as follows:
- Automatic virtual corpus composition according to authors, text types, times of origin, subject fields, etc.; text types occurring are, for example:
- Automatic, user-oriented creation of selectable types of source references (detailed, standard, abbreviated or superordinate)
- Gathering of statistical information from various perspectives, for instance chronological sorting of search results, made possible by recording the time of origin.
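How consistently structured source data supports these aims can be illustrated with a small sketch. It assumes a strongly simplified bibliographic record with only four fields; the class SourceRecord, its field names, the helper functions and all sample values are invented for illustration and do not reflect the actual IDS bibliography model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical, simplified bibliographic record for one corpus text."""
    text_id: str
    author: str
    text_type: str
    year: int  # time of origin, reduced to a year for this sketch


def compose_virtual_corpus(
    records: Iterable[SourceRecord],
    *criteria: Callable[[SourceRecord], bool],
) -> List[SourceRecord]:
    """Keep only the records that satisfy every selection criterion."""
    return [r for r in records if all(test(r) for test in criteria)]


def sort_chronologically(records: Iterable[SourceRecord]) -> List[SourceRecord]:
    """Order records by their time of origin."""
    return sorted(records, key=lambda r: r.year)


# Invented sample data, not taken from any real corpus.
records = [
    SourceRecord("AAA/BBB.00001", "Author A", "novel", 1995),
    SourceRecord("AAA/BBB.00002", "Author A", "essay", 1988),
    SourceRecord("CCC/DDD.00001", "Author B", "novel", 1979),
]

novels = compose_virtual_corpus(records, lambda r: r.text_type == "novel")
for record in sort_chronologically(novels):
    print(record.year, record.text_id)
# 1979 CCC/DDD.00001
# 1995 AAA/BBB.00001
```

Selection and sorting only work across corpora because every record exposes the same fields, which is the point of replacing unstructured source references with a bibliography model.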
Treating Primary Texts
The primary text of the IDS text model consists of the original text, reproduced as faithfully as possible and with the minimum necessary information added.
Original Text Information = basic text + lead + heading(s) + salutation(s) + caption(s) + additional (margin) text(s) + overview(s) + table(s) + footnote(s) + original page division + ...
Additional Information = end of sentence(s) + end of paragraph(s)
Marking these and, where necessary, further features (e.g. author, interview partners, typographic highlighting) allows researchers to correlate text content with text structure, search with distances measured in sentences, define quotation contexts, give specific original page references in the list of sources, and more.
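The benefit of marked sentence ends can be sketched as follows, on the assumption that searching with spaces between sentences means restricting hits to a maximum distance counted in sentences. The input is assumed to be already split into sentences and tokens; the function name and the sample sentences are invented for illustration.

```python
from typing import List, Tuple


def within_sentence_distance(
    sentences: List[List[str]],
    word_a: str,
    word_b: str,
    max_distance: int,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) where word_a occurs in sentence i,
    word_b occurs in sentence j, and the sentences are at most
    max_distance sentences apart."""
    hits_a = [i for i, tokens in enumerate(sentences) if word_a in tokens]
    hits_b = [j for j, tokens in enumerate(sentences) if word_b in tokens]
    return [
        (i, j)
        for i in hits_a
        for j in hits_b
        if abs(i - j) <= max_distance
    ]


# Invented sample text, segmented at the marked sentence ends.
sentences = [
    ["The", "hawk", "circled"],
    ["Nobody", "moved"],
    ["Then", "the", "rain", "came"],
]

print(within_sentence_distance(sentences, "hawk", "rain", 1))  # []
print(within_sentence_distance(sentences, "hawk", "rain", 2))  # [(0, 2)]
```

Without sentence-end markup, such a query would have to fall back on token or character distances, which do not correspond to textual units.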