
Directorate and Central Research

Contact:
    <korpuslinguistik@ids-...>

Head:
    Dr. Marc Kupietz <kupietz@ids-...>

Research staff:
    Cyril Belica <belica@ids-...>
    Dr. Harald Lüngen <luengen@ids-...>
    Rainer Perkuhn <perkuhn@ids-...>

Cooperations:
    see here

Former IDS staff involved in corpus building:
    see here

Student assistants:

  • Anna Konovalova
  • Theresa Sick

 

 

Corpora of written language


Availability

The bulk of DeReKo can be searched and analysed free of charge for non-commercial purposes using COSMAS II. Due to copyright regulations and contractual agreements with rights holders, however, only a few sub-corpora can be offered for download. For more information, see the FAQ entry "Are there terms and conditions that allow for exceptions?"

By license agreement

If you sign a license agreement, IDS is permitted to provide free access for scientific use to the following corpora of written language:

If you are interested in these corpora, please send an email to Ms Petra Brecht.

Download Server

In addition, the following corpora are available for download, each under a CC BY-SA license:

  • Corpus of speeches and interviews (rei)
  • Wikipedia Corpora:
    Conversion 2011 in collaboration with the EuroGr@mm project [1].
    Conversions 2013 and 2015 in collaboration with the programme area Forschungsinfrastrukturen [2].
    Conversion 2017 by the programme area Corpus Linguistics.

German-language Wikipedia: available files 2011-2017 (encoding ISO-8859-1)

Year  WP subcorpus         I5                       WikiXML                                TreeTagger standoff
2011  articles             wpd11.xces.bz2           -/-                                    -/-
      article talk         wdd11.xces.bz2
2013  articles             wpd13.i5.xml.bz2         dewikixml-20130728-articles.tar.gz     wpd13.tt.xml.bz2
      article talk         wdd13.i5.xml.bz2         dewikixml-20130728-discussions.tar.gz  wdd13.tt.xml.bz2
      articles sample      wpd13_sample.i5.xml.bz2  -/-                                    -/-
      article talk sample  wdd13_sample.i5.xml.bz2
2015  articles             wpd15.i5.xml.bz2         wpd15.wikixml.tar.gz                   wpd15.tt.xml.bz2
      article talk         wdd15.i5.xml.bz2         wdd15.wikixml.tar.gz                   wdd15.tt.xml.bz2
      user talk            wud15.i5.xml.bz2         wud15.wikixml.tar.gz                   wud15.tt.xml.bz2
      articles sample      wpd15_sample.i5.xml.bz2  -/-                                    -/-
      article talk sample  wdd15_sample.i5.xml.bz2
      user talk sample     wud15_sample.i5.xml.bz2
2017  articles             wpd17.i5.xml.bz2
      article talk sample  wdd17.i5.xml.bz2
      user talk            wud17.i5.xml.bz2
      redundancy talk      wrd17.i5.xml.bz2
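
The files listed above are bz2-compressed and, according to the table heading, encoded in ISO-8859-1. As a minimal sketch of how a downloaded file could be inspected without unpacking it first, the Python snippet below reads the first few lines as a stream. The helper name `head` and the local path are assumptions made for illustration; nothing here relies on the internal structure of the I5 format.

```python
import bz2
from itertools import islice
from pathlib import Path


def head(path, n=5, encoding="iso-8859-1"):
    """Return the first n decoded lines of a bz2-compressed text file."""
    # bz2.open in text mode decompresses incrementally, so the whole
    # archive is never loaded into memory at once.
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


if __name__ == "__main__":
    # Assumed location of a previously downloaded file (hypothetical path).
    local_copy = Path("wpd17.i5.xml.bz2")
    if local_copy.exists():
        for line in head(local_copy):
            print(line)
```

For the UTF-8 encoded files of the other language editions, the same helper could be called with `encoding="utf-8"`.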


Wikipedia in other languages 2013: available files (format I5, encoding UTF-8)

Language   articles                         article talk
French     frwiki-20130904-articles.i5.bz2  frwiki-20130904-discussions.i5.bz2
Hungarian  huwiki-20140503-articles.i5.bz2  huwiki-20140503-discussions.i5.bz2
Norwegian  nowiki-20140512-articles.i5.bz2  nowiki-20140512-discussions.i5.bz2
Italian    itwiki-20130508-articles.i5.bz2  itwiki-20130508-discussions.i5.bz2
Polish     plwiki-20140503-articles.i5.bz2  plwiki-20140503-discussions.i5.bz2


Wikipedia in other languages 2015: available files (format I5, encoding UTF-8)

Language   articles                                 article talk                          user talk
English    enwiki-20150808-article.i5.utf8.xml.bz2  enwiki-20150808-talk.i5.utf8.xml.bz2  enwiki-20150808-user-talk.i5.utf8.xml.bz2
French     frwiki-20150808-article.i5.utf8.xml.bz2  frwiki-20150808-talk.i5.utf8.xml.bz2  frwiki-20150808-user-talk.i5.utf8.xml.bz2
Hungarian  huwiki-20150807-article.i5.utf8.xml.bz2  huwiki-20150807-talk.i5.utf8.xml.bz2  huwiki-20150807-user-talk.i5.utf8.xml.bz2
Norwegian  nowiki-20150807-article.i5.utf8.xml.bz2  nowiki-20150807-talk.i5.utf8.xml.bz2  nowiki-20150807-user-talk.i5.utf8.xml.bz2
Spanish    eswiki-20150808-article.i5.utf8.xml.bz2  eswiki-20150808-talk.i5.utf8.xml.bz2  eswiki-20150808-user-talk.i5.utf8.xml.bz2
Croatian   hrwiki-20150807-article.i5.utf8.xml.bz2  hrwiki-20150807-talk.i5.utf8.xml.bz2  hrwiki-20150807-user-talk.i5.utf8.xml.bz2
Italian    itwiki-20150808-article.i5.utf8.xml.bz2  itwiki-20150808-talk.i5.utf8.xml.bz2  itwiki-20150808-user-talk.i5.utf8.xml.bz2
Polish     plwiki-20150808-article.i5.utf8.xml.bz2  plwiki-20150808-talk.i5.utf8.xml.bz2  plwiki-20150808-user-talk.i5.utf8.xml.bz2
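
The 2015 file names appear to follow a regular pattern: a language code, the dump date, and the subcorpus. The Python sketch below splits a file name into these parts, which may help when selecting files by language or date. The pattern is inferred only from the names listed here and is not a documented naming scheme; `CorpusFile` and `parse_name` are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime

# Pattern inferred from the file names in the 2015 listing; this is an
# assumption, not an official specification.
_NAME = re.compile(
    r"^(?P<lang>[a-z]+)wiki-(?P<stamp>\d{8})-"
    r"(?P<part>article|talk|user-talk)\.i5\.utf8\.xml\.bz2$"
)


@dataclass(frozen=True)
class CorpusFile:
    language: str    # Wikipedia language code, e.g. "en"
    dump_date: date  # date embedded in the file name
    subcorpus: str   # "article", "talk" or "user-talk"


def parse_name(filename: str) -> CorpusFile:
    """Split a 2015-style file name into language, date and subcorpus."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised file name: {filename!r}")
    stamp = datetime.strptime(match["stamp"], "%Y%m%d").date()
    return CorpusFile(match["lang"], stamp, match["part"])
```

Names that do not fit the pattern, such as the German `wpd17.i5.xml.bz2` or the 2013 files, raise a `ValueError` rather than being guessed at.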

References

[1] Noah Bubenhofer, Stefanie Haupt, Horst Schwinn (2011): A Comparable Corpus of the Wikipedia: From Wiki Syntax to POS Tagged XML. Hamburg Working Paper in Multilingualism, 96 B
[2] Eliza Margaretha, Harald Lüngen (2014): Building linguistic corpora from Wikipedia articles and discussions. In: Journal for Language Technology and Computational Linguistics (JLCL) 2/2014
