
Directorate and Central Research

Contact:
    <korpuslinguistik@ids-...>
 
Head of Project:
    Cyril Belica <belica@ids-...>
 
Scientific Assistants:
    Dr. Marc Kupietz <kupietz@ids-...>
    Dr. Harald Lüngen <luengen@ids-...>
    Rainer Perkuhn <perkuhn@ids-...>
 
Student Assistants:
    Anna Schächtele

DeReWo – Corpus-Based Lemma and Word Form Lists

In this subproject we are developing methods for creating frequency-based ranking lists of lemmata and word forms on the basis of random virtual corpora. By applying these methods to the Mannheim German Reference Corpus DeReKo, we generate different lists of lemmata and word forms of German language usage, for example the lemma candidate list with 350,000 entries for elexiko, the online dictionary of contemporary German.

Current Main Subjects

  • spelling classification
  • paradigmatic classification
  • temporal / regional / text typological and similar differentiation
  • exceptions
  • quality management

DeReWo Lemma and Word Form Lists Currently Available for Download

The Institute for the German Language regularly receives queries about the “most common German words”, usually on the assumption that such a request is clear enough and therefore easy to answer. With the publication of the DeReWo lemma and word form lists, we try to find a compromise between the fascinating diversity of our linguistic reality and the justified desire for a description that is as compact as possible, even if partially simplifying. The general annotations give an overview of the issues that are relevant for creating and using such lists and that we have dealt with. They are attached to the archives in their respective version. You can download the current version directly here. In addition to the general annotations, detailed product-specific documentation is attached to each DeReWo list. Its structure follows that of the general annotations. It is designed to help readers understand the view of the language taken by the respective list and the resulting simplifications and consequences for interpreting and using the list.

Name                                    Type                                    Number of Entries  Published on       Download
DeReKo-2014-II-MainArchive-STT.100000   Word Form + Lemma + POS Frequency List  100,000            December 31, 2014  download
derewo-v-ww-bll-320000g-2012-12-31-1.0  Lemma List                              326,946            December 31, 2012  download
derewo-v-ww-bll-250000g-2011-12-31-0.1  Lemma List                              250,000            December 31, 2011  download
derewo-v-40000g-2009-12-31-0.1          Lemma List                              40,000             December 31, 2009  download
derewo-v-100000t-2009-04-30-0.1         Word Form List                          100,000            May 12, 2009       download
derewo-v-30000g-2007-12-31-0.1          Lemma List                              30,000             December 31, 2007  download

  • Using the DeReWo lists without knowing the corresponding documentation is scientifically dubious.
  • Referencing or passing on the DeReWo lists without the corresponding documentation is not allowed.
  • Commercial use of the DeReWo lists is prohibited.
  • If you have problems downloading the lists, please proceed as follows:
    • first, download the archive and save it locally
    • then, unpack the archive (usually possible by double-clicking); a new folder will be created
    • start an application (word processor, spreadsheet or the like)
    • load the list file from the new folder into the application (choose the file that does not have a PDF file extension)
    • if required, select the character encoding ISO-8859-15 (if necessary, look it up in the documentation)
    • if this does not lead to the desired result, please send an email to the contact address listed on this page
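
For readers who prefer a script to a word processor or spreadsheet, the unpacking and loading steps above can be sketched in Python. This is only a minimal sketch under stated assumptions: it assumes the downloaded archive is a ZIP file, and the function names unpack_archive and read_list_lines are hypothetical helpers introduced here for illustration. The only details taken from this page are the ISO-8859-15 encoding and the advice to open the file that is not a PDF. The sketch does not interpret the contents of the lines, because the column layout is described in the documentation attached to each list.

```python
import zipfile
from pathlib import Path


def unpack_archive(archive_path, target_dir):
    """Unpack a ZIP archive (assumed format) and return the non-PDF files in it."""
    target = Path(target_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    # The PDF files hold the documentation; the remaining files are list candidates.
    return sorted(
        path
        for path in target.rglob("*")
        if path.is_file() and path.suffix.lower() != ".pdf"
    )


def read_list_lines(list_path, encoding="iso-8859-15"):
    """Read a list file line by line, decoding it as ISO-8859-15 by default."""
    with open(list_path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]
```

A call such as unpack_archive("downloaded-archive.zip", "derewo") followed by read_list_lines on one of the returned paths yields the raw lines of the list. Both file names are placeholders, and the encoding argument can be changed if the documentation of a particular list names a different one.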

Collaborations
