Dr. Marc Kupietz <kupietz@ids-...>
Cyril Belica <belica@ids-...>
Dr. Harald Lüngen <luengen@ids-...>
Rainer Perkuhn <perkuhn@ids-...>
Former IDS staff involved in building the corpora:
- Caroline Iliadi
- Daniel Wachter
Corpora of Written Language
The paramount priority in choosing new texts for the extension of DeReKo is to maximise volume and dispersion, in keeping with its conception as a primordial sample of written language usage (see Fields of Application) and as a very large general-purpose corpus. In practice, unfortunately, further criteria have to be taken into account, especially the cost of acquiring the necessary rights (see Copyrights below) and of editing the raw data (see Conversion below).
The corpora of contemporary written language are intended to record and continually update the actual use of the German language, preferably on a daily basis. This means that artificially constructed texts cannot be considered as sources, and websites qualify only to a limited extent, since they represent a very specific part of language. The aim of the project is therefore to acquire suitable source texts, preferably in electronic form, that constitute authentic documents of German language usage. Germanists and other researchers need this material in order to study language empirically. In this light, it ought to be natural, indeed an honour, for authors to make their texts available. Rights holders, however, are often deterred by the fear that they will be unable to prevent illegal reproduction of their texts once they allow them to be used in the corpora. In addition, most source texts are not particularly suitable for technical reasons. Text producers are rarely willing or able to deliver their sources in a format that can easily be transferred into a corpus format, and the IDS itself does not have sufficient capacity to carry out the reformatting that would otherwise be required. Given these circumstances, the legal and financial negotiations prove to be very difficult in most cases.
Through legal agreements with publishers, newspaper editors and authors, the IDS has been, and still is, able to obtain copyrighted text material on a legally secure basis, so that all corpora are available for use within the IDS and parts of them can be accessed publicly worldwide. Use is limited exclusively to scientific, non-commercial purposes. In addition, the text corpora of the IDS can be searched only through the COSMAS system; no user has access to complete corpus texts, only to limited contexts returned by queries. Because of strict copyright law, and consequently strict agreements with our text providers, complete corpus texts cannot be made available even internally (within the IDS). Copyright also applies to the use of corpora for research purposes, and text providers such as publishers are in turn bound by their own contracts and copyrights.
The DEREKO I project (1999-2001) contributed a great deal to resolving uncertainties, especially those concerning copyright. As a result of this project, the share of corpora that are publicly accessible rose considerably. In mid-2004, a new acquisition initiative was started which is geared towards negotiating long-term agreements with text providers and, preferably, making all newly acquired texts available to external users as well.
At this point we would like to thank all authors, publishers and newspapers who have made their work, texts and documents available to the IDS corpora in the past and who will continue to do so in the future. At the same time, we would like to encourage all those who are still reluctant or undecided to take part in the venture of providing linguistic research with a broad and deep reflection of the German language. If you have any questions concerning copyright or contract design, we will be happy to advise you.
The source texts to be gathered in the collection of corpora usually exist in different formats geared to the needs of the publishers; they can vary considerably, depending on the preferences of the author or publisher. In order to become part of the IDS corpora, they have to be converted into a standardised format defined by the IDS text model. This means that a great deal of very heterogeneous data has to be analysed and up-converted in several steps. To support this work we apply various parsers, converters and filters, some of which we developed ourselves. Difficulties arise from inconsistencies in character encoding and hyphenation, as well as from clutter such as unmotivated formatting instructions, tables and duplicated text. Meaningful formatting instructions also vary from source to source and have to be thoroughly analysed and standardised so that the data can be transferred as well as possible. Up-converting sources into the IDS format is a highly iterative process. Because it requires a great deal of correction and maintenance, it is very expensive and time-consuming (see also Conversion Details).
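To make the encoding and hyphenation problems mentioned above more concrete, the following Python sketch shows one possible clean-up step. It is only an illustration under stated assumptions and not the converters used at the IDS: the function names `decode_bytes` and `clean_text` are hypothetical, the fallback from UTF-8 to Latin-1 is an assumed heuristic, and the hyphenation rule (join only when lowercase letters stand on both sides of a line-final hyphen) is deliberately simplistic.

```python
import re
import unicodedata

# Hypothetical helpers for illustration; not part of the IDS tool chain.


def decode_bytes(raw: bytes) -> str:
    """Decode as UTF-8 if possible, otherwise fall back to Latin-1.

    Assumption: sources are either UTF-8 or Latin-1. Latin-1 accepts any
    byte sequence, so this fallback never raises.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# A hyphen at a line end, with lowercase letters on both sides, is treated
# as a hyphenated word split across lines ("Spra-\nche" -> "Sprache").
# Compounds such as "Nord-\nSüd" are left untouched because the second
# part starts with an uppercase letter.
_HYPHEN_BREAK = re.compile(r"(?<=[a-zäöüß])-\n(?=[a-zäöüß])")


def clean_text(raw: bytes) -> str:
    """Return text with a unified encoding form and rejoined words."""
    text = decode_bytes(raw)
    # Use one canonical form for accented characters (e.g. u + U+0308 -> ü).
    text = unicodedata.normalize("NFC", text)
    # Remove soft hyphens, which are invisible hyphenation hints.
    text = text.replace("\u00ad", "")
    # Rejoin words that were hyphenated across a line break.
    return _HYPHEN_BREAK.sub("", text)
```

A rule this simple will misjudge some cases (for example, a genuine hyphen before a lowercase continuation), which is one reason why conversion of this kind tends to be iterative and to require manual correction.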
Texts that have already been converted are also continually reviewed and enriched with additional annotation. This includes, for instance, the topic/domain classification of newspaper articles, the tagging of (partial) duplicates, and the tagging of specific text parts such as direct speech, quotations and passages in a foreign language.
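As an illustration of what tagging (partial) duplicates can involve, the sketch below compares texts by their overlapping word sequences (shingles) and the Jaccard similarity of the resulting sets. This is a common textbook technique chosen here for illustration; it is not claimed to be the procedure used for the IDS corpora. The function names, the shingle length of 3 and the threshold of 0.8 are assumptions.

```python
from __future__ import annotations

# Hypothetical helpers for illustration; not the IDS procedure.


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping n-word sequences of a text."""
    words = text.lower().split()
    if len(words) < n:
        # Very short texts yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def tag_duplicates(texts: list[str], threshold: float = 0.8) -> list[int | None]:
    """For each text, give the index of the first earlier text it resembles.

    A text is tagged when its similarity to an earlier text reaches the
    (assumed) threshold; otherwise the entry is None.
    """
    tags: list[int | None] = []
    for i, text in enumerate(texts):
        match = None
        for j in range(i):
            if jaccard(text, texts[j]) >= threshold:
                match = j
                break
        tags.append(match)
    return tags
```

The pairwise comparison is quadratic in the number of texts, so it only suits small collections; for a very large corpus an approximate method would be needed instead.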