Dr. Marc Kupietz <kupietz@ids-...>
Cyril Belica <belica@ids-...>
Dr. Harald Lüngen <luengen@ids-...>
Rainer Perkuhn <perkuhn@ids-...>
Former IDS staff members involved in building the corpus:
- Anna Konovalova
- Theresa Sick
Corpora of Written Language
The Mannheim German Reference Corpus (DeReKo-2009-II) is fully annotated with several concurrent layers of stand-off annotation. The following tagging tools have been applied for the analysis:
- Machinese Phrase Tagger by Connexor Oy
- Xerox FST Linguistic Suite (partially, and for internal testing purposes only)
The strategy of applying several taggers concurrently (as many as possible) was chosen to account for the error-prone nature of automatically generated secondary data, which is inevitably interpretative and theory-dependent. This way, type II errors are minimised (that is, recall is maximised), especially during the utilisation phase.
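The following minimal sketch illustrates the idea of concurrent stand-off layers and why querying them jointly favours recall. It is not the DeReKo or COSMAS II implementation: the layer names `tagger_a` and `tagger_b`, the tags, the example sentence and the helpers `search_any` and `search_all` are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # character offset into the primary text (inclusive)
    end: int    # character offset into the primary text (exclusive)
    tag: str    # tag from the layer's own tagset


# Primary text; stand-off annotation never modifies it.
text = "Die Suche läuft"

# Hypothetical annotation layers, one per tagger. Each layer refers to the
# primary text by offsets only. "tagger_b" mis-tags the noun "Suche" as a verb.
layers = {
    "tagger_a": [Annotation(0, 3, "ART"), Annotation(4, 9, "NN"), Annotation(10, 15, "VVFIN")],
    "tagger_b": [Annotation(0, 3, "DET"), Annotation(4, 9, "V"), Annotation(10, 15, "V")],
}


def spans_tagged(layer, tags):
    """Return the set of (start, end) spans in one layer carrying any of the tags."""
    return {(a.start, a.end) for a in layer if a.tag in tags}


def search_any(layers, query):
    """Union over layers: a span matches if at least one tagger assigns a wanted tag."""
    hits = set()
    for name, tags in query.items():
        hits |= spans_tagged(layers[name], tags)
    return sorted(hits)


def search_all(layers, query):
    """Intersection over layers: a span matches only if every queried tagger agrees."""
    per_layer = [spans_tagged(layers[name], tags) for name, tags in query.items()]
    return sorted(set.intersection(*per_layer))


# Look for nouns, expressed in each layer's own (hypothetical) tagset.
noun_query = {"tagger_a": {"NN"}, "tagger_b": {"N"}}
any_hits = search_any(layers, noun_query)
all_hits = search_all(layers, noun_query)

print([text[s:e] for s, e in any_hits])
print([text[s:e] for s, e in all_hits])
```

In this toy example the union query still finds "Suche" although one tagger missed it, whereas the intersection query returns nothing. The more independent layers are available, the less likely it is that a relevant hit is missed by all of them, at the cost of more false positives that have to be filtered later.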
Short Info and Trivia
- The three taggers currently in use were selected from a total of 25 candidate tools on the basis of a series of linguistic, computational and economic criteria.
- A complete annotation of DeReKo with these three taggers takes about 12 CPU-years (Opteron, 3 GHz).
- The time needed for a threefold manual annotation would be approximately 2000 man-years.
- The XML files containing the stand-off annotations amount to 2.5 terabytes.
- If the different tagsets are mapped, as far as this makes sense, onto a common base tagset of 9 elements, the part-of-speech classifications agree in 92% of cases (Fleiss' κ = 0.931).
- The share of analysed sentences whose part-of-speech tags are fully consistent across the taggers is 31.4%.
- The share of sentences that at least two taggers analysed consistently is 52.3%.
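The agreement figures above can be illustrated with a small sketch. It assumes that the tags of three taggers have already been mapped onto shared base tags, and it uses invented toy data, so it does not reproduce the DeReKo figures. The names `fleiss_kappa`, `fully_consistent` and `pairwise_consistent` are hypothetical, and the reading of "consistent for at least two taggers" as "some pair of taggers agrees on every token of the sentence" is an assumption.

```python
from collections import Counter
from itertools import combinations


def fleiss_kappa(items):
    """Fleiss' kappa for items that were each rated by the same number of raters.

    items: list of tuples, one tuple of category labels per item.
    """
    n_items = len(items)
    n_raters = len(items[0])
    totals = Counter()   # how often each category was assigned overall
    agreement_sum = 0.0  # sum of per-item agreement P_i
    for labels in items:
        counts = Counter(labels)
        totals.update(counts)
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_observed = agreement_sum / n_items
    p_expected = sum((n / (n_items * n_raters)) ** 2 for n in totals.values())
    return (p_observed - p_expected) / (1 - p_expected)


def fully_consistent(sentence):
    """True if all taggers assign the same base tag to every token."""
    return all(len(set(token)) == 1 for token in sentence)


def pairwise_consistent(sentence):
    """True if at least one pair of taggers agrees on every token (assumed reading)."""
    n_taggers = len(sentence[0])
    return any(
        all(token[i] == token[j] for token in sentence)
        for i, j in combinations(range(n_taggers), 2)
    )


# Toy data: each sentence is a list of tokens, each token a tuple holding the
# base tags assigned by three hypothetical taggers.
sentences = [
    [("DET", "DET", "DET"), ("NOUN", "NOUN", "NOUN"), ("VERB", "VERB", "VERB")],
    [("DET", "DET", "DET"), ("NOUN", "NOUN", "VERB"), ("VERB", "VERB", "VERB")],
    [("NOUN", "VERB", "NOUN"), ("VERB", "VERB", "NOUN")],
]

tokens = [token for sentence in sentences for token in sentence]
kappa = fleiss_kappa(tokens)
share_full = sum(fully_consistent(s) for s in sentences) / len(sentences)
share_pair = sum(pairwise_consistent(s) for s in sentences) / len(sentences)

print(f"kappa={kappa:.3f} full={share_full:.3f} pair={share_pair:.3f}")
```

Token-level agreement and sentence-level consistency measure different things: a single deviating token is enough to make a whole sentence inconsistent, which is why sentence-level shares can be far lower than token-level agreement.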
The linguistic quality of the annotations is still being evaluated systematically. First results were presented on September 23, 2009 in the conference paper "The Morphosyntactic Annotation of DeReKo: Interpretation, Opportunities, and Pitfalls" at the conference Grammar & Corpora 3.
If you have any questions about search options in the annotation layers, please contact the COSMAS II project.
- The first morpho-syntactically annotated corpora were available at the IDS as early as 1995. These annotated corpora, which at that time comprised about 30 million running words, were generated with Logos773 - Source Tagger for German, a product by Logos.
- In 1999, a further 300 million running words were automatically annotated morpho-syntactically by Lingsoft using the tools gercg and gertwol.