Digitale Sprachwissenschaft

Methods of Corpus Analysis and Corpus Classification


Collocation Analysis and its Classification

Collocation Analysis

Current Research Focuses

Current research activities concentrate on exploring the similarities between collocation profiles (module Similar Collocation Profiles), on modelling semantic proximity (module Modelling Semantic Proximity), and on determining and visualising relevant usage aspects (module SOM: Self-Organizing Maps).

Some of these modules are available, each in its current early-beta version, via the web interface of the collocation database CCDB.

In General

Collocation analysis (short tutorial, online demonstration, collocation database) is a corpus-analysis method aimed at structuring sets of corpus citations (Belegmengen). It

  • facilitates the detection of significant regularities in the use of word combinations in corpora

  • evaluates the definable context of a given search object in any virtual corpus with the help of mathematical-statistical analysis and clustering methods

  • provides information on the systematic joint occurrence of words (collocations) and a measure for their affinity (cohesion)

  • provides an appropriate synoptic presentation of the instances

  • covers not only binary word relations but also multiword patterns and even (idiomatic) multiword expressions

The IDS has been making collocation analysis publicly available since 1995, integrated into its comprehensive online system COSMAS.

Collocation analysis is applicable to arbitrary search objects with

  • optional lemmatization

  • variable context size

  • optional automatic focusing on the context with the strongest cohesion

  • variable reliability (i.e. the significance level of the primary collocation partner)

  • variable granularity

  • variable allocation of instances with multiword expressions

  • calculation of syntagmatic patterns for each collocation cluster
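Two of the parameters above, optional lemmatization and variable context size, can be illustrated with a short sketch. This is a toy example under stated assumptions: the hand-written lemma table and the function `context_counts` are invented for illustration and stand in for the real lemmatizer and analysis machinery, which are not shown here.

```python
from collections import Counter

# toy lemma table standing in for a real lemmatizer (assumption:
# the actual system uses its own lemmatization, not shown here)
LEMMAS = {"schüttelt": "schütteln", "köpfe": "kopf", "hebt": "heben"}

def context_counts(tokens, node, window=3, lemmatize=False):
    """Collect frequency counts for the words in a variable-size
    context window around every occurrence of `node`, optionally
    mapping inflected forms to lemmas first."""
    if lemmatize:
        tokens = [LEMMAS.get(t, t) for t in tokens]
        node = LEMMAS.get(node, node)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(tokens[j] for j in range(lo, hi) if j != i)
    return counts

corpus = "er schüttelt den kopf und sie schüttelt die köpfe".split()
# without lemmatization, "köpfe" is a separate search object;
# with it, both inflected forms count toward the lemma "kopf"
with_lemmas = context_counts(corpus, "kopf", window=2, lemmatize=True)
```

With lemmatization enabled, both "kopf" and "köpfe" contribute context windows, so the collocate "schütteln" is counted twice instead of once; enlarging `window` correspondingly widens the evaluated context.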

The Analysis

  • opens up empirical access to mass data by ranking, organising and structuring high-frequency citation sets

  • facilitates the empirical capture of multiword expressions as candidates for multiword units of contemporary German (phraseologisms, idioms, proverbs, communicative set phrases, light-verb constructions, etc.)

  • in addition, serves as a corpus-linguistic working and thinking tool. These strictly corpus-based tools make it possible to uncover general linguistic structures, providing information on disambiguation, usage conventions, typical contextualisation, and the interpretation of the meaning of the entries to be described


Before using the programme, please familiarise yourself with the relevant copyrights. When publishing research results based on our programme, please send a collegial email to the author <belica@ids-...>.


CCDB – the Collocation Database [see Keibel/Belica 2007 (PDF, 345K, English), CCDB flyer (PDF, 628K)]

For the further development of collocation-analysis methods, it is of fundamental importance to uncover, systematise and theoretically justify the as yet largely unknown systemic-structural characteristics of cohesion relations between words or groups of words in German.

On the basis of a contemporary-language corpus of about 2.2 billion running words, the project has created a collocation database with more than 220,000 entries, which serves as the empirical basis for this research.

For each word, this database contains the results of up to five collocation analyses with different parameter settings, in the form of hierarchies and similar usages.

For each word and analysis, up to 100,000 usages are stored.

In addition to its actual purpose of exploring the features of cohesion relations for the development of corpus-analysis methods, the database may also serve as a tool for lexicographical research.

In this way, for example, information on the collocational behaviour of individual lexemes can be accessed quickly and easily, always bearing in mind the underlying corpus, the chosen analysis parameters, and the fact that the data are raw results, computed automatically on a statistical basis and not linguistically validated.

For this purpose and in this usage context, the collocation database CCDB is also partly available to the public.

We expressly point out that, in our opinion, use of this database cannot replace the interactive, dynamic and explorative application of our analysis methods to user-defined virtual corpora.


Lexicological and Lexicographical Classification of Collocation Analysis

Another objective of the subproject is to support the lexicological and lexicographical classification of collocation analysis, making the wealth of information manageable both for individual collocation analyses and for large numbers of them.

The approach involves

  • visualisation of the cohesive structure and strength of the collocation partners

  • the possibility of focusing on, or navigating into, individual areas of the visualised structure

  • various possibilities of lexicological and lexicographical editing

  • an interface with CCDB


Visualisation of a Collocation Analysis for the Keyword “Kopf”


  • exploration of cohesion features

  • partitioning of hits as a means of disambiguating readings



Cyril Belica <belica@ids-...>

Rainer Perkuhn <perkuhn@ids-...>

