DeReWo – Corpus-Based Lemma and Word Form Lists
In this subproject we are developing methods for creating frequency-based ranking lists of lemmata and word forms on the basis of random virtual corpora. By applying these methods to the Mannheim German Reference Corpus DeReKo, we generate various lists of lemmata and word forms that reflect German language usage, for example the lemma candidate list with 350,000 entries for elexiko, the online dictionary of contemporary German.
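As a rough illustration of what a frequency-based ranking over a randomly drawn sample of texts can look like, here is a minimal Python sketch. It is not the DeReWo method: the function name `rank_word_forms`, the whitespace tokenization, and the tie-breaking rule are all assumptions made for this example, and lemmatization is left out entirely.

```python
import random
from collections import Counter

def rank_word_forms(documents, sample_size, seed=0):
    """Rank word forms by frequency in a random sample of documents.

    Hypothetical helper for illustration only; not the DeReWo procedure.
    """
    rng = random.Random(seed)
    # Draw a random sample of texts (a stand-in for a "random virtual corpus").
    sample = rng.sample(documents, min(sample_size, len(documents)))
    counts = Counter()
    for text in sample:
        # Assumption: naive whitespace tokenization, no normalization.
        counts.update(text.split())
    # Sort by descending frequency; break ties by code point order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

if __name__ == "__main__":
    docs = ["der Hund", "der Baum", "der Hund bellt"]
    for rank, (form, freq) in enumerate(rank_word_forms(docs, 3), start=1):
        print(rank, form, freq)
```

Even this toy version shows why documentation matters: the ranking depends on how texts are sampled, how tokens are delimited, and how ties are resolved.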
Current Main Subjects
- spelling classification
- paradigmatic classification
- temporal / regional / text typological and similar differentiation
- quality management
DeReWo Lemma and Word Form Lists Currently Available for Download
The Institute for the German Language regularly receives queries about the “most common German words” from people who assume that such a request is unambiguous and therefore easy to answer. With the publication of the DeReWo lemma lists and word form lists we try to strike a compromise between the fascinating diversity of our linguistic reality and the justified desire for a description that is as compact as possible, even if it simplifies in places. The general annotations give an overview of the issues that are relevant to the creation and use of such lists and that we have dealt with. They are attached to each archive in the corresponding version. You can download the current version directly here. In addition to the general annotations, detailed product-specific documentation is attached to each DeReWo list. Its structure follows that of the general annotations. It is designed to help you understand the view of the language underlying the respective list, the resulting simplifications, and their consequences for interpreting and using the list.
Number of Entries
Word Form + Lemma + POS Frequency List
December 31, 2014
December 31, 2012
December 31, 2011
December 31, 2009
Word Form List
May 12, 2009
December 31, 2007
- Using the DeReWo lists without consulting the corresponding documentation is scientifically dubious.
- Referencing or passing on the DeReWo lists without the corresponding documentation is not allowed.
- Commercial use of DeReWo lists is prohibited.
- If you have problems opening the downloaded lists, please proceed as follows:
- first, download the archive and save it locally
- then unpack the archive (usually possible by double-clicking); a new folder will be created
- start an application (word processor, spreadsheet or the like)
- load the file that does not have a PDF file extension from the new folder into the application
- if required, specify the character encoding ISO-8859-15 (if necessary, look it up in the documentation)
- if this does not lead to the desired result, please send an email to the address listed below
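If you prefer to read a list with a script rather than a word processor or spreadsheet, the encoding step above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `read_list_file` is hypothetical, the file is assumed to be plain text with one entry per line, and the actual column layout of each list is described only in its documentation, so no parsing of columns is attempted here.

```python
def read_list_file(path, encoding="iso-8859-15"):
    """Return the non-empty lines of a text file decoded as ISO-8859-15.

    Hypothetical helper; the one-entry-per-line layout is an assumption.
    """
    with open(path, encoding=encoding) as handle:
        # Strip the trailing newline and skip blank lines.
        return [line.rstrip("\n") for line in handle if line.strip()]
```

Opening the file with the wrong encoding (for example UTF-8) typically shows up as a decoding error or as garbled umlauts, which is a sign that the encoding named in the documentation should be passed explicitly.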
If you have any questions or suggestions, please send an email to derewo (at) ids-mannheim.de.
- Tokyo University of Foreign Studies; Global COE Program Corpus-based Linguistics and Language Education (CbLLE)
- Interactions between linguistic and bioinformatic procedures, methods and algorithms: modelling and mapping of variance in language and genomes. Joint project within a BMBF funding priority.