IDS text model: differences from XCES

The IDS text model is formally implemented as IDS-XCES. IDS-XCES is based on the international encoding standard XCES and adds a number of extensions and changes, most of which follow the TEI P5 standard and some of which are motivated by the specific corpus structure of the IDS corpora. These extensions and changes are described below, separately for the DTD file and the accompanying header file.

Only those elements are listed whose element or attribute declarations differ between XCES and IDS-XCES. Where only the element declaration differs, the corresponding attribute declaration is omitted.

 

Differences: xcesDoc.dtd vs. ids.xcesdoc.dtd


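XCES (Revision 4.3) | IDS-XCES (Version 1.0) | Comment

In each entry, the XCES declaration comes first, followed by the IDS-XCES declaration and, where given, a comment. An entry with a single declaration is new in IDS-XCES unless its comment says otherwise.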

ENTITY DECLARATIONS

Sub-paragraph elements

<!ENTITY % x.token ' ' >
<!ENTITY % x.token 'gloss | byline | head | ' >

<!ENTITY % m.token '%x.token; abbr | date | num | measure | name | term | time | ' >
<!ENTITY % m.token '%x.token; abbr | date | num | dateRange | numRange | timeRange | measure | name | term | time | w | ' >

<!ENTITY % m.phrase '%m.token; corr | distinct | foreign | gap | hi | list | mentioned | ptr | q | ref | reg | s | title' >
<!ENTITY % m.phrase '%m.token; corr | distinct | foreign | gap | hi | list | mentioned | orig | q | ref | reg | s | title | table | xref' >

  <!ENTITY % ids.milestones 'pb | lb | ptr | xptr' >
Allows page and line breaks (pb and lb) and pointer elements (ptr and xptr) to be embedded.

Content model declarations

<!ENTITY % base.seq '(%x.token; #PCDATA | num | abbr)*' >
<!ENTITY % base.seq '#PCDATA | %x.token; num | numRange | abbr | hi' >

ELEMENT DECLARATIONS

HIGH-LEVEL COMPONENTS (top-level structure)

<!ELEMENT cesCorpus (cesHeader, (cesDoc+ | cesCorpus+)) >
<!ATTLIST cesCorpus
  %a.global;
  type     CDATA  #IMPLIED
  version  CDATA  #REQUIRED
  TEIform  CDATA  'teiCorpus.2' >

 

<!ELEMENT cesDoc (cesHeader, text) >
<!ATTLIST cesDoc %a.global;
  type     CDATA  "text"
  version  CDATA  #REQUIRED
  TEIform  CDATA  'TEI.2' >

<!ELEMENT idsCorpus (idsHeader, (idsDoc+)) >
<!ATTLIST idsCorpus %a.global;
  type     CDATA  #IMPLIED
  version  CDATA  #REQUIRED
  TEIform  CDATA  'teiCorpus.2' >

<!ELEMENT idsDoc (idsHeader, idsText+) >
<!ATTLIST idsDoc %a.global;
  type     CDATA  "text"
  version  CDATA  #REQUIRED
  TEIform  CDATA  'TEI.2' >

<!ELEMENT idsText ((idsHeader , text)) >
<!ATTLIST idsText %a.global;
  version  CDATA  #REQUIRED >
Compared with XCES, IDS-XCES introduces an intermediate level: it distinguishes between documents and texts and defines a document as a grouping of several texts (see Corpus structure).
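
The nesting described above can be sketched as follows. This is only an illustrative skeleton, assuming one document that groups two texts; the version values are placeholders, and the required header content (fileDesc and so on) is elided in comments.

```xml
<idsCorpus version="1.0">
  <idsHeader version="1.0">
    <!-- corpus-level header: fileDesc is required, elided here -->
  </idsHeader>
  <idsDoc version="1.0">
    <idsHeader version="1.0">
      <!-- document-level header, elided -->
    </idsHeader>
    <idsText version="1.0">
      <idsHeader version="1.0">
        <!-- text-level header, elided -->
      </idsHeader>
      <text>
        <!-- first text of the document -->
      </text>
    </idsText>
    <idsText version="1.0">
      <idsHeader version="1.0">
        <!-- text-level header, elided -->
      </idsHeader>
      <text>
        <!-- second text of the document -->
      </text>
    </idsText>
  </idsDoc>
</idsCorpus>
```

In plain XCES the corresponding structure would have only two levels: cesCorpus containing cesDoc elements, each with a single text.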

WRITTEN TEXTS

<!ELEMENT text (body | group) >
<!ATTLIST text %a.global;
  complete  (y | n )  "y"
  decls     IDREFS    #IMPLIED >
<!ELEMENT text (front | body | back | %ids.milestones;)* >
<!ATTLIST  text  %a.global; >
Note: the body element is unchanged.
<!ELEMENT group (%par.seq;, body+) >
<!ATTLIST group %a.text;
  decls IDREFS #IMPLIED >
The group element does not exist in IDS-XCES.
  <!ELEMENT front (titlePage?, div*) >
<!ATTLIST front %a.global; >

<!ELEMENT titlePage ((docTitle | byline | docEdition | docImprint | epigraph)+) >
<!ATTLIST titlePage %a.global; >

<!ELEMENT docTitle (titlePart+) >
<!ATTLIST docTitle %a.global;
  type (main | sub) #IMPLIED >

<!ELEMENT epigraph (quote) >
<!ATTLIST epigraph %a.global; >

<!ELEMENT docEdition (#PCDATA) >
<!ATTLIST docEdition %a.global; >

<!ELEMENT docImprint (#PCDATA) >
<!ATTLIST docImprint %a.global; >

<!ELEMENT titlePart (#PCDATA | s)* >
<!ATTLIST titlePart %a.global;
  type (main | sub | desc | unspecified) #IMPLIED >

<!ELEMENT back (%par.seq;, div*) >
<!ATTLIST back %a.text; >
The internal structure of the text element has been largely redesigned in IDS-XCES.
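
As a rough illustration of the redesigned text element, the fragment below uses a front with a title page, a page-break milestone directly inside text, and a body. All text content is placeholder material, and the body content is elided because the source leaves that declaration unchanged and does not reproduce it.

```xml
<text>
  <front>
    <titlePage>
      <docTitle>
        <titlePart type="main">Main title (placeholder)</titlePart>
        <titlePart type="sub">Subtitle (placeholder)</titlePart>
      </docTitle>
      <byline><docAuthor>Author name (placeholder)</docAuthor></byline>
      <docEdition>Edition statement (placeholder)</docEdition>
      <docImprint>Place and year (placeholder)</docImprint>
    </titlePage>
  </front>
  <pb n="1"/>
  <body>
    <!-- body is declared as in XCES; content elided -->
  </body>
</text>
```

Under the XCES declaration, text could contain only a single body or group, so neither the front matter nor the stand-alone pb would be permitted there.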
<!ELEMENT div ((opener | head | byline)*, (((p | sp | %m.inter;)+, div*) | div+), (closer | byline)* ) >
<!ELEMENT div (opener | head | byline | p | sp | stage | %m.inter; | div | closer | %ids.milestones; )* >

Opening elements

<!ELEMENT opener (%phrase.seq; | dateline | keywords )* >
<!ATTLIST opener %a.text; >
<!ELEMENT opener (%phrase.seq; | dateline | keywords | salute | %ids.milestones;)* >
<!ATTLIST opener %a.text;
  type (lead | unspecified) "unspecified" >
<!ELEMENT head (%phrase.seq;)* >
<!ATTLIST head %a.text;
  type CDATA #IMPLIED >

<!ELEMENT head (%phrase.seq; | ptr)* >
<!ATTLIST head %a.text;
  type (top | main | sub | cross | desc | unspecified) "unspecified" >

Keyword lists, bylines, datelines

<!ELEMENT byline (%phrase.seq; | docAuthor)* >
<!ELEMENT byline (%phrase.seq; | docAuthor | %ids.milestones;)* >
<!ELEMENT docAuthor (%base.seq;)* >
<!ELEMENT docAuthor (%base.seq; | %ids.milestones; )* >
<!ELEMENT dateline (%base.seq; | date | time | name | address)* >

<!ELEMENT dateline (%base.seq; | date | time | dateRange | timeRange | name | address | %ids.milestones;)* >
  <!ELEMENT salute (#PCDATA | %ids.milestones;)* >
<!ATTLIST salute %a.text; >

Closing element

<!ELEMENT closer (%phrase.seq; | dateline | keywords)* >
<!ELEMENT closer (%phrase.seq; | dateline | keywords | salute | signed | %ids.milestones;)* >
  <!ELEMENT signed (#PCDATA | %ids.milestones;)* >
<!ATTLIST signed %a.text; >
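
The opener and closer extensions are easiest to see on a letter-like division. The fragment below is a sketch with invented placeholder text; it assumes that %phrase.seq; and %base.seq; admit character data, as the declarations for p and dateline suggest.

```xml
<div>
  <opener type="unspecified">
    <dateline><date>1 March 2005</date></dateline>
    <salute>Dear readers,</salute>
  </opener>
  <p>Body of the letter (placeholder).</p>
  <closer>
    <salute>Kind regards,</salute>
    <signed>A. Author</signed>
  </closer>
</div>
```

Neither salute nor signed is available in XCES, where the greeting and signature would have to be encoded as plain phrase-level content of opener and closer.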

Written paragraphs

<!ELEMENT p (%phrase.seq;)* >
<!ELEMENT p (%phrase.seq; | %ids.milestones;)* >

Quotations

<!ELEMENT quote ((%phrase.seq;) | (p | poem)+)* >
<!ELEMENT quote (%phrase.seq; | p | poem | %ids.milestones; )* >

Lists

<!ELEMENT list (head?, (item+ | (label, item)+)) >
<!ATTLIST list %a.text; >
<!ELEMENT list (head?, (item | (label, (%ids.milestones;)*, item) | %ids.milestones;)*) >
<!ATTLIST list %a.text;
  type CDATA #IMPLIED >
<!ELEMENT item ((%phrase.seq;) | p+)* >
<!ATTLIST item %a.text; >
<!ELEMENT item (%phrase.seq; | p | %ids.milestones;)* >
<!ATTLIST item %a.text; >

Annotations

<!ELEMENT note (%phrase.seq; | p)* >
<!ELEMENT note (%phrase.seq; | p | bibl | poem | quote | sp | %ids.milestones;)* >
<!ELEMENT bibl (%phrase.seq; | author)* >
<!ELEMENT bibl (%phrase.seq; | author | %ids.milestones;)* >

Poems

<!ELEMENT poem (head?, (lg | l )+ ) >
<!ELEMENT poem (head?, (lg | l | %ids.milestones;)+ ) >
<!ELEMENT lg (l | lg)+ >
<!ELEMENT lg (l | lg | %ids.milestones;)+ >

Figures

<!ELEMENT figure (head?, p*, figDesc?, text?) >
<!ELEMENT figure (head?, (p | %m.inter; | %ids.milestones; )*, figDesc?, text?) >

Tables

<!ELEMENT table (head?, row+) >
<!ELEMENT table (head?, (row | %ids.milestones;)+ ) >
<!ELEMENT cell (%phrase.seq;)* >
<!ELEMENT cell (%phrase.seq; | %ids.milestones;)* >

Captions

<!ELEMENT caption (%phrase.seq;)* >
<!ELEMENT caption ( head*, (p | %m.inter; | %ids.milestones; )+ ) >

Transcriptions of dialogues, speeches, debates, interviews, etc., and drama

<!ELEMENT sp (speaker | p | stage)+ >
<!ATTLIST sp %a.text;
  who NMTOKEN #IMPLIED >
<!ELEMENT sp (speaker | p | quote | poem | stage | %ids.milestones; )* >
<!ATTLIST sp %a.text;
  who CDATA #IMPLIED >
<!ELEMENT speaker (%base.seq;)* >
<!ELEMENT speaker (%base.seq; | %ids.milestones; )* >
<!ELEMENT stage (%base.seq;)* >
<!ELEMENT stage (%base.seq; | p | %ids.milestones; )* >
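
A short sketch of a speech under the IDS-XCES declarations. The speaker name and the wording are invented placeholders. The who value contains a space, which the CDATA type allows but the NMTOKEN type used in XCES would reject.

```xml
<sp who="speaker one">
  <speaker>SPEAKER ONE</speaker>
  <stage>(rises)</stage>
  <p>First line of the speech.</p>
  <pb n="17"/>
  <quote>A passage quoted within the speech.</quote>
</sp>
```

The quote and pb children are the IDS-XCES additions here; the XCES content model for sp admits only speaker, p and stage.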

SENTENCES, QUOTED DIALOGUE WITHIN PARAGRAPHS

<!ELEMENT s (%phrase.seq;)* >
<!ELEMENT s (%phrase.seq; | %ids.milestones; | stage )* >
<!ELEMENT q (%phrase.seq;)* >
<!ATTLIST q %a.text;
  next IDREF #IMPLIED
  prev IDREF #IMPLIED
  type CDATA #IMPLIED
  direct (y | n | unspecified) "unspecified"
  who CDATA #IMPLIED
  broken (yes | no) "no" >
<!ELEMENT q (%phrase.seq; | %ids.milestones; )* >
<!ATTLIST q %a.text;
  type (w | o | unspec) "unspec"
  next IDREF #IMPLIED
  prev IDREF #IMPLIED
  direct (y | n | unspecified) "unspecified"
  who CDATA #IMPLIED
  broken (yes | no) "no" >

PHRASE-LEVEL ELEMENTS: THE CLASS M.PHRASE

Editorial Changes

  <!ELEMENT orig (%phrase.seq;)* >
<!ATTLIST orig %a.text;
  reg CDATA #IMPLIED
  regalt CDATA #IMPLIED
  resp CDATA #IMPLIED
  cert CDATA #IMPLIED >

Highlighted text

<!ELEMENT hi (%phrase.seq;)* >
<!ELEMENT hi (%phrase.seq; | %ids.milestones;)* >

Other Phrase-level Elements

<!ELEMENT foreign (%phrase.seq;)* >
<!ELEMENT foreign (%phrase.seq; | %ids.milestones;)* >

<!ELEMENT distinct (%phrase.seq;)* >
<!ELEMENT distinct (%phrase.seq; | %ids.milestones;)* >

<!ELEMENT mentioned (%phrase.seq;)* >
<!ELEMENT mentioned (%phrase.seq; | %ids.milestones;)* >

<!ELEMENT name (%base.seq;)* >
<!ELEMENT name (%base.seq; | %ids.milestones;)* >

<!ELEMENT term (%base.seq;)* >
<!ELEMENT term (%base.seq; | %ids.milestones;)* >

<!ELEMENT time (%base.seq;)* >
<!ELEMENT time (%base.seq; | %ids.milestones;)* >

<!ELEMENT title (%phrase.seq;)* >
<!ELEMENT title (%phrase.seq; | %ids.milestones;)* >
  <!ELEMENT gloss (%phrase.seq;)* >
<!ATTLIST gloss %a.global;
  target IDREF #IMPLIED >

<!ELEMENT w (#PCDATA) >
<!ATTLIST w %a.text;
  ana CDATA #IMPLIED
  ctag CDATA #IMPLIED
  type CDATA #IMPLIED >

<!ELEMENT dateRange (%base.seq;)* >
<!ATTLIST dateRange %a.text;
  from CDATA #IMPLIED
  to CDATA #IMPLIED >

<!ELEMENT numRange (%base.seq;)* >
<!ATTLIST numRange %a.text;
  from CDATA #IMPLIED
  to CDATA #IMPLIED
  type CDATA #IMPLIED >

<!ELEMENT timeRange (%base.seq;)* >
<!ATTLIST timeRange %a.text;
  from CDATA #IMPLIED
  to CDATA #IMPLIED >  
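
The new token-level elements might be used as in the sketch below. The attribute value formats are assumptions, since the DTD types them only as CDATA, and the ctag value is a hypothetical part-of-speech label. The fragment also assumes that %phrase.seq; admits the members of %m.token;.

```xml
<p>
  The series appeared
  <dateRange from="1990" to="1995">1990 to 1995</dateRange>
  on pages
  <numRange from="12" to="18" type="pages">12-18</numRange>
  and was broadcast
  <timeRange from="20:00" to="21:30">from 8 to 9.30 pm</timeRange>.
  <w type="word" ctag="NN">Korpus</w>
</p>
```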

SEGMENTATION, LINKING, ALIGNMENT

Simple cross references

<!ELEMENT ref (%phrase.seq;)* >
<!ATTLIST ref %a.text;
  corresp IDREFS #IMPLIED
  next IDREF #IMPLIED
  prev IDREF #IMPLIED
  type CDATA #IMPLIED
  resp CDATA #IMPLIED
  crdate CDATA #IMPLIED
  targType NMTOKENS #IMPLIED
  targOrder (y | n | u) "u"
  evaluate (all | one | none) #IMPLIED
  target IDREFS #IMPLIED >
<!ELEMENT ref (%phrase.seq;)* >
<!ATTLIST ref %a.text;
  corresp IDREFS #IMPLIED
  next IDREF #IMPLIED
  prev IDREF #IMPLIED
  type CDATA #IMPLIED
  resp CDATA #IMPLIED
  crdate CDATA #IMPLIED
  targType NMTOKENS #IMPLIED
  targOrder (y | n | u) "u"
  evaluate (all | one | none) #IMPLIED
  target CDATA #IMPLIED >
Following the TEI convention, CDATA is allowed as the value type of the target attribute of the ref element.
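
The practical effect of the relaxed target type can be sketched as follows. The identifier and the URL are hypothetical, and the id attribute on p is assumed to come from %a.text;.

```xml
<p id="p12">A paragraph that is referred to later.</p>

<!-- Valid in both DTDs: target names an ID declared in the same document -->
<p>As noted <ref target="p12">above</ref>, the corpus is layered.</p>

<!-- Valid in IDS-XCES only: an arbitrary string such as a URL -->
<p>See <ref target="http://www.example.org/article#sec2">the article</ref>.</p>
```

With the IDREFS type used in XCES, a validating parser would reject the second reference, because the value is not a list of names matching IDs in the document.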
  <!ELEMENT xptr EMPTY >
<!ATTLIST xptr
  corresp IDREFS #IMPLIED
  next IDREF #IMPLIED
  prev IDREF #IMPLIED
  ana IDREFS #IMPLIED
  id ID #IMPLIED
  n CDATA #IMPLIED
  lang IDREF #IMPLIED
  rend CDATA #IMPLIED
  type CDATA #IMPLIED
  resp CDATA #IMPLIED
  crdate CDATA #IMPLIED
  targType CDATA #IMPLIED
  targOrder (y | n | u) "u"
  evaluate (all | one | none) #IMPLIED
  doc CDATA #IMPLIED
  from CDATA "ROOT"
  to CDATA "DITTO"
  TEIform CDATA "xptr" >
  <!ELEMENT xref (%phrase.seq;)* >
<!ATTLIST xref %a.text;
  corresp IDREFS #IMPLIED
  next IDREF #IMPLIED
  prev IDREF #IMPLIED
  ana IDREFS #IMPLIED
  type CDATA #IMPLIED
  resp CDATA #IMPLIED
  crdate CDATA #IMPLIED
  targType CDATA #IMPLIED
  targOrder (y | n | u) "u"
  evaluate (all | one | none) #IMPLIED
  doc ENTITY #IMPLIED
  from CDATA "ROOT"
  to CDATA "DITTO"
  TEIform CDATA "xref" >

Milestone tags (newly added section)

  <!ELEMENT pb EMPTY >
<!ATTLIST pb
  id ID #IMPLIED
  lang IDREF #IMPLIED
  rend CDATA #IMPLIED
  ed CDATA #IMPLIED
  n CDATA #IMPLIED
  TEIform CDATA "pb" >
Encodes page breaks.
  <!ELEMENT lb EMPTY >
<!ATTLIST lb
  id ID #IMPLIED
  lang IDREF #IMPLIED
  rend CDATA #IMPLIED
  ed CDATA #IMPLIED
  n CDATA #IMPLIED
  TEIform CDATA "lb" >
Encodes line breaks.
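
A minimal sketch of both milestones inside a paragraph; the page and line numbers are placeholders. Both elements are empty and mark a position in the running text without interrupting it.

```xml
<p>This sentence starts on one page <pb n="24"/>and continues on the next,
<lb n="1"/>where a new line begins.</p>
```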

 

 

Differences: xheader.ent vs. ids.xheader.ent


XCES (Revision 4.3) | IDS-XCES (Version 1.0) | Comment
<!ELEMENT cesHeader (fileDesc, encodingDesc?, profileDesc?, revisionDesc?) >
<!ATTLIST cesHeader %a.header;
  type CDATA "text"
  creator CDATA #IMPLIED
  status (new | update) "new"
  date.created CDATA #IMPLIED
  date.updated CDATA #IMPLIED
  version CDATA #REQUIRED
  TEIform CDATA "teiHeader" >
<!ELEMENT idsHeader (fileDesc, encodingDesc?, profileDesc?, revisionDesc?) >
<!ATTLIST idsHeader %a.header;
  type CDATA "text"
  pattern CDATA "text"
  creator CDATA #IMPLIED
  status (new | update) "new"
  date.created CDATA #IMPLIED
  date.updated CDATA #IMPLIED
  version CDATA #REQUIRED
  TEIform CDATA 'teiHeader' >

Title statement

<!ELEMENT titleStmt (h.title, respStmt* ) >
<!ELEMENT titleStmt ((korpusSigle , c.title , respStmt*) | (dokumentSigle , d.title , respStmt* ) | (textSigle , t.title , respStmt* ) | (x.title , respStmt* )) >
<!ELEMENT h.title (#PCDATA) >
<!ATTLIST h.title %a.header; >
<!ELEMENT h.title (#PCDATA) >
<!ATTLIST h.title %a.header;
  type (main | sub | abbr) "main"
  level (m | a) #IMPLIED >
  <!ELEMENT korpusSigle (#PCDATA) >
<!ATTLIST korpusSigle %a.header; >

<!ELEMENT c.title (#PCDATA) >
<!ATTLIST c.title %a.header; >

<!ELEMENT dokumentSigle (#PCDATA) >
<!ATTLIST dokumentSigle %a.header; >

<!ELEMENT d.title (#PCDATA) >
<!ATTLIST d.title %a.header; >

<!ELEMENT textSigle (#PCDATA) >
<!ATTLIST textSigle %a.header; >

<!ELEMENT t.title (#PCDATA) >
<!ATTLIST t.title %a.header;
  assemblage (external | regular | non-automatic) #IMPLIED >

<!ELEMENT x.title (#PCDATA) >
<!ATTLIST x.title %a.header; >
These additional elements reflect the three levels of the corpus structure.
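
One title statement per level might look like the sketch below. The sigle values and titles are invented placeholders, and the sigle format shown is an assumption rather than something the DTD prescribes.

```xml
<!-- in the header of an idsCorpus -->
<titleStmt>
  <korpusSigle>XYZ</korpusSigle>
  <c.title>Example corpus</c.title>
</titleStmt>

<!-- in the header of an idsDoc -->
<titleStmt>
  <dokumentSigle>XYZ/JAN</dokumentSigle>
  <d.title>Example corpus, January</d.title>
</titleStmt>

<!-- in the header of an idsText -->
<titleStmt>
  <textSigle>XYZ/JAN.00001</textSigle>
  <t.title assemblage="regular">Title of a single text</t.title>
</titleStmt>
```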

Publication statement

<!ELEMENT pubDate (#PCDATA) >
<!ATTLIST pubDate %a.header;
  value CDATA #IMPLIED >
<!ELEMENT pubDate (#PCDATA) >
<!ATTLIST pubDate %a.header;
  type (year | month | day) #IMPLIED >

Source description

<!ELEMENT sourceDesc ((biblFull | biblStruct)+) >
<!ELEMENT sourceDesc ((biblFull | biblStruct)+, reference*) >
  <!ELEMENT reference (#PCDATA) >
<!ATTLIST reference %a.header;
  type (complete | super | short | former) #IMPLIED
  assemblage (external | regular | non-automatic) #IMPLIED
  existence (no | yes) #IMPLIED
  origin (BOTfile | notBOTfile) #IMPLIED >

Bibliographic citation for non-electronic source

<!ELEMENT analytic (h.author | respStmt | h.title)* >
<!ELEMENT analytic (h.title+, (h.author | editor)*, (biblScope | biblNote)*, (edition, respStmt?)*, imprint+, idno*, (biblNote | biblScope)* ) >

<!ELEMENT monogr (h.title+, (h.author | respStmt)*, (edition, respStmt?)*, imprint+, idno*, (biblNote | biblScope)* ) >
<!ELEMENT monogr (h.title+, (h.author | editor)*, (biblScope | biblNote)*, (edition, respStmt?)*, imprint+, idno*, (biblNote | biblScope)* ) >
  <!ELEMENT editor (#PCDATA) >
<!ATTLIST editor %a.header; >
<!ELEMENT edition (#PCDATA ) >
<!ATTLIST edition %a.header; >
<!ELEMENT edition (further, kind, appearance) >
<!ATTLIST edition %a.header; >
  <!ELEMENT further (#PCDATA) >
<!ATTLIST further %a.header; >

<!ELEMENT kind (#PCDATA) >
<!ATTLIST kind %a.header; >

<!ELEMENT appearance (#PCDATA) >
<!ATTLIST appearance %a.header; >
<!ELEMENT biblScope (#PCDATA) >
<!ATTLIST biblScope %a.header;
  type (pp | vol | issue) #IMPLIED >
<!ELEMENT biblScope (#PCDATA) >
<!ATTLIST biblScope %a.header;
  type (subsume | pp | vol | issue | issueplace | suppl | suppltitle | volume-title) #IMPLIED >

Encoding description

<!ELEMENT encodingDesc (projectDesc, samplingDecl*, editorialDecl*, tagsDecl?, refsDecl*, classDecl?) >
<!ATTLIST encodingDesc %a.header; >
<!ELEMENT encodingDesc (projectDesc?, samplingDecl*, editorialDecl*, tagsDecl?, refsDecl*, classDecl?) >
<!ATTLIST encodingDesc %a.header; >

Editorial declaration

<!ELEMENT editorialDecl (correction | quotation | hyphenation | segmentation | transduction | normalization | conformance)+ >
<!ATTLIST editorialDecl %a.header; %a.declarable; >
<!ELEMENT editorialDecl (pagination | correction | quotation | hyphenation | segmentation | transduction | normalization | conformance)+ >
<!ATTLIST editorialDecl %a.header;
  %a.declarable; >
  <!ELEMENT pagination (#PCDATA) >
<!ATTLIST pagination %a.header;
  type (yes | no) #IMPLIED >
<!ELEMENT hyphenation (#PCDATA) >
<!ATTLIST hyphenation %a.header;
  %a.declarable; >
<!ELEMENT hyphenation (p+) >
<!ATTLIST hyphenation %a.global;
  %a.declarable;
  eol (all | some | none) "some" >

References declaration

<!ELEMENT refsDecl (#PCDATA) >
<!ELEMENT refsDecl (state) >
  <!ELEMENT state EMPTY >
<!ATTLIST state %a.global;
  ed CDATA #IMPLIED
  unit CDATA #REQUIRED
  length NMTOKEN #IMPLIED
  delim CDATA #IMPLIED >

Profile description

<!ELEMENT profileDesc (creation?, langUsage?, wsdUsage?, textClass?, translations?, annotations?) >
<!ELEMENT profileDesc (creation?, langUsage?, wsdUsage?, textClass?, translations?, annotations?, textDesc ) >

Creation element

<!ELEMENT creation (#PCDATA ) >
<!ATTLIST creation %a.header;
  date CDATA #REQUIRED >
<!ELEMENT creation (creatDate, creatRef?, creatRefShort?) >
<!ATTLIST creation %a.header; >
  <!ELEMENT creatDate (#PCDATA) >
<!ATTLIST creatDate %a.header; >

<!ELEMENT creatRef (#PCDATA) >
<!ATTLIST creatRef %a.header; >

<!ELEMENT creatRefShort (#PCDATA) >
<!ATTLIST creatRefShort %a.header; >
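
The change to creation can be illustrated by encoding the same information both ways. The date format and the reference strings are assumptions; the DTD declares these elements only as #PCDATA.

```xml
<!-- XCES: free text plus a required date attribute -->
<creation date="2005-03-01">First published in March 2005.</creation>

<!-- IDS-XCES: structured children, no date attribute -->
<creation>
  <creatDate>2005.03.01</creatDate>
  <creatRef>Full reference to the first publication (placeholder)</creatRef>
  <creatRefShort>Short reference (placeholder)</creatRefShort>
</creation>
```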
<!ELEMENT language (#PCDATA) >
<!ATTLIST language
  id ID #IMPLIED
  wsd CDATA #IMPLIED
  n CDATA #IMPLIED
  type CDATA #IMPLIED
  iso639 CDATA #REQUIRED >
<!ELEMENT language (#PCDATA) >
<!ATTLIST language
  id ID #IMPLIED
  usage CDATA #IMPLIED >

TextDesc (newly added section)

  <!ELEMENT textDesc ((textType?, textTypeRef?), (textTypeArt?, textDomain?, column?)) >
<!ATTLIST textDesc %a.header; >

<!ELEMENT textType (#PCDATA) >
<!ATTLIST textType %a.header; >

<!ELEMENT textTypeRef (#PCDATA) >
<!ATTLIST textTypeRef %a.header; >

<!ELEMENT textTypeArt (#PCDATA) >
<!ATTLIST textTypeArt %a.header; >

<!ELEMENT textDomain (#PCDATA) >
<!ATTLIST textDomain %a.header; >

<!ELEMENT column (#PCDATA) >
<!ATTLIST column %a.header; >
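
A sketch of a profile description that uses the new textDesc element. All values are invented placeholders; the DTD does not constrain them. Note that textDesc is the only child of profileDesc declared without a question mark, so it is mandatory, while each of its own children is optional but must appear in the declared order.

```xml
<profileDesc>
  <creation>
    <creatDate>2005</creatDate>
  </creation>
  <textDesc>
    <textType>newspaper article</textType>
    <textDomain>politics</textDomain>
    <column>front page</column>
  </textDesc>
</profileDesc>
```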