Digital Linguistics

Contact:
    <korpuslinguistik@ids-...>
 
Head:
    Dr. Marc Kupietz <kupietz@ids-...>
 
Research staff:
    Cyril Belica <belica@ids-...>
    Dr. Harald Lüngen <luengen@ids-...>
    Rainer Perkuhn <perkuhn@ids-...>
 
Cooperations:
    see here
 
Former IDS staff involved in corpus building:
    see here
 
Student assistants:

  • Caroline Iliadi
  • Ines Pisetta

Corpora of Written Language

The IDS-Text Model

For efficient automatic analysis of large electronic text collections, the texts have to be encoded in a coherent data format. The IDS text model is such a format for the corpora of written language at the IDS. Until 2013, the IDS text model was defined by IDS-XCES, an IDS-specific DTD that was a modification of the Corpus Encoding Standard XCES. XCES, in turn, was based on the older TEI P3 standard. To reintegrate the text model into the current TEI standard, P5, the TEI P5-specific ODD mechanism was used. The new document grammar, I5, has thus been formally derived from the current TEI P5 document grammar by customisation. I5 is defined such that every IDS-XCES document is also an I5 document. Since the release DeReKo-2014-I in April 2014, DeReKo has been completely converted to I5. This means that the DeReKo documents use the i5.dtd generated by the Roma stylesheets as their document grammar and carry the file extension i5.xml.

I5 uses elements of TEI SIG CMC and of TEI SIG Correspondence.

Resources and documents on I5:

i5.odd

ODD file with TEI P5 customisation of I5

i5.dtd

DTD generated from i5.odd by the Roma stylesheets

i5.html

HTML documentation derived from i5.odd by project-specific style sheets

i5.xhtml

HTML documentation derived from i5.odd by the Roma stylesheets

License

For the incorporated element <correspDoc> (with sub-elements) of TEI SIG

jTEI article

Article (2012) about I5 in the Journal of the Text Encoding Initiative

Resources and documents concerning older versions of the IDS text model:

IDS-XCES

DTDs

IDS/XCES

Complementations and changes compared to XCES

XCES

Corpus Encoding Standard for XML (Nancy Ide)

The IDS text model is characterised by its aim of reproducing the textual content and structure of the source texts faithfully, and of documenting all text types that have appeared in the corpora so far in consistent structures. Key components of the IDS text model are corpus structure, corpus text bibliography and source text handling.

Corpus structure

To facilitate virtual corpus composition, meaningful source lists and other aspects of presenting results, the source texts are organised according to predefined criteria and then placed in a hierarchical structure with three levels:

  • Corpus level (corpus identifier, e.g. LES)
  • Document level (document identifier, e.g. LES/ESS)
  • Text level (text identifier, e.g. LES/ESS.20022)

The IDS text model defines a text as a relatively independent, textually coherent sequence of natural-language expressions that originate in natural communication situations. This is the corpus text, the »smallest« unit of a corpus. Each corpus consists of one or several documents; each document, in turn, consists of one or several corpus texts. The texts within a document can be grouped according to specific criteria, for example source, chronological sequence, subject area and/or text type. Depending on the corpus structure, a text contains, for instance, one or several newspaper articles, a whole issue of a journal or magazine, an excerpt from an independent work, or an independent work as a whole.
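The three-level identifiers can be taken apart mechanically. The following is a minimal sketch, assuming that identifiers always follow the shape of the examples above (`LES`, `LES/ESS`, `LES/ESS.20022`). The regular expression, the `Sigle` type and the function name `parse_sigle` are illustrative assumptions, not part of the IDS text model itself.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred only from the examples LES, LES/ESS and LES/ESS.20022:
# corpus part, optional "/document" part, optional ".text" part (digits).
SIGLE_PATTERN = re.compile(
    r"^(?P<corpus>[A-Z0-9]+)(?:/(?P<doc>[A-Z0-9]+)(?:\.(?P<text>\d+))?)?$"
)


class Sigle(NamedTuple):
    """Hypothetical container for the parts of an identifier."""
    corpus: str
    document: Optional[str]
    text: Optional[str]

    @property
    def level(self) -> str:
        """Name of the deepest hierarchy level the identifier addresses."""
        if self.text is not None:
            return "text"
        if self.document is not None:
            return "document"
        return "corpus"


def parse_sigle(identifier: str) -> Sigle:
    """Split an identifier into its corpus, document and text parts."""
    match = SIGLE_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognised identifier: {identifier!r}")
    return Sigle(match["corpus"], match["doc"], match["text"])


if __name__ == "__main__":
    for example in ("LES", "LES/ESS", "LES/ESS.20022"):
        parsed = parse_sigle(example)
        print(example, "->", parsed.level, tuple(parsed))
```

Keeping the missing parts as `None` rather than empty strings makes it easy to tell a corpus-level identifier from a document-level or text-level one.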

Example: the Corpus Siegfried Lenz: Edition in Separate Volumes [20 Vol.]. – Hamburg: Hoffmann und Campe, 1996-1999

Number of Texts   Document                  Description                             Vol.
1                 LES/HIL.00000             Es waren Habichte in der Luft. Roman    1
...               ...                       ...                                     ...
1                 LES/ALE.00000             Die Auflehnung. Roman                   12
77                LES/ERZ.13001 [-16022]    [Narratives]                            13-16
3                 LES/SCH.17001 [-17003]    [Dramas]                                17
4                 LES/HOR.18001 [-18004]    [Audio Dramas]                          18
98                LES/ESS.19001 [-20032]    [Essays]                                19+20

 

Corpus Text Bibliography

The IDS corpus texts have always been provided with source references, which are displayed together with the search hits. In earlier corpora, however, these references were unstructured. A corpus bibliography model was therefore developed in the 1990s as a central component of the IDS text model. It allows automatic access, across corpora, to the large amount of source data, which is now structured consistently. The aims are as follows:

  • Automatic virtual corpus composition according to authors, text types, times of origin, subject fields, etc. Text types that occur include, for example: Treatise, Aphorisms, Article, Autobiography, Narrative, Biography, Letter, Memorandum, Decree, Novella, Essay, Flyer, Footnote, Research Paper, Prayer, Manual, Poem, Leaflet, Audio Drama, Interview, Blurb, Editorial, Fairy Tale, Obituary, Epilogue, Manifesto, Petition, Press Release, Product Sheet, Minutes, Speech, Review, Novel, Play, Diary, Lead Paragraph, Advertisement

  • Automatic, user-oriented generation of selectable types of source reference (detailed, standard, abbreviated or superordinate)
  • Gathering of statistical information from various perspectives, for instance chronological sorting of search results, made possible by recording the time of origin.
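The aims listed above can be illustrated with a small sketch, assuming that each corpus text carries a structured bibliographic record. The record fields, the function names and the layouts of the reference styles are hypothetical and only serve to show the idea; they are not the actual IDS bibliography model. The "superordinate" style is left out because its content is not described here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical structured source data for one corpus text."""
    sigle: str
    author: str
    title: str
    text_type: str
    year: int


def compose_virtual_corpus(
    records: Iterable[SourceRecord],
    *,
    author: Optional[str] = None,
    text_type: Optional[str] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> List[SourceRecord]:
    """Select records matching all given criteria, sorted chronologically."""
    selected = []
    for record in records:
        if author is not None and record.author != author:
            continue
        if text_type is not None and record.text_type != text_type:
            continue
        if year_from is not None and record.year < year_from:
            continue
        if year_to is not None and record.year > year_to:
            continue
        selected.append(record)
    return sorted(selected, key=lambda r: (r.year, r.sigle))


def format_reference(record: SourceRecord, style: str = "standard") -> str:
    """Render a source reference in one of several illustrative styles."""
    if style == "detailed":
        return (f"{record.author}: {record.title} "
                f"[{record.text_type}], {record.year} ({record.sigle})")
    if style == "standard":
        return f"{record.author}: {record.title}, {record.year}"
    if style == "abbreviated":
        return record.sigle
    raise ValueError(f"unknown reference style: {style!r}")


if __name__ == "__main__":
    # Invented placeholder records, not real corpus data.
    records = [
        SourceRecord("AAA/BBB.00002", "Author A", "Second title", "Essay", 1995),
        SourceRecord("AAA/BBB.00001", "Author A", "First title", "Essay", 1990),
        SourceRecord("AAA/CCC.00001", "Author B", "Third title", "Novel", 1992),
    ]
    for record in compose_virtual_corpus(records, text_type="Essay"):
        print(format_reference(record, "detailed"))
```

Because the selection works on structured fields only, the same records can serve virtual corpus composition, reference generation and chronological sorting without touching the texts themselves.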

 

Treating Primary Texts

The primary text of the IDS text model consists of the original text, reproduced as faithfully as possible and with the minimum necessary information added.

Original Text Information    =    basic text + lead + heading(s) + salutation(s) + caption(s) + additional (margin) text(s) + overview(s) + table(s) + footnote(s) + original page division + ...

Additional Information       =    end of sentence(s) + end of paragraph(s)

Marking these and, where necessary, further features (e.g. author, interview partners, typographic highlighting) allows researchers to correlate text content with text structure, search with distances measured in sentences, delimit quotation contexts, give exact original page references in the list of sources, and more.
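Once sentence ends are marked, a search can measure distance in sentences rather than in words. The following is a minimal sketch of that idea, assuming the text is already available as a list of tokenised sentences. The function name and the data layout are illustrative assumptions and do not reflect an actual IDS query tool.

```python
from typing import List, Tuple


def sentence_distance_hits(
    sentences: List[List[str]],
    first: str,
    second: str,
    max_distance: int,
) -> List[Tuple[int, int]]:
    """Return pairs (i, j) of sentence indices such that `first` occurs in
    sentence i, `second` occurs in sentence j, and the two sentences are at
    most `max_distance` sentences apart (0 means the same sentence)."""
    first_positions = [i for i, s in enumerate(sentences) if first in s]
    second_positions = [j for j, s in enumerate(sentences) if second in s]
    return [
        (i, j)
        for i in first_positions
        for j in second_positions
        if abs(i - j) <= max_distance
    ]


if __name__ == "__main__":
    # Invented sample text, already split at the marked sentence ends.
    sample = [
        ["the", "corpus", "is", "large"],
        ["it", "contains", "novels"],
        ["each", "text", "has", "a", "source"],
    ]
    print(sentence_distance_hits(sample, "corpus", "text", max_distance=2))
    print(sentence_distance_hits(sample, "corpus", "text", max_distance=1))
```

The sentence indices in each hit can also be used to cut out a quotation context of whole sentences around a match.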

see Overview
