Methods of Corpus Analysis and Corpus Classification
Collocation Analysis and its Classification
Current Research Focuses
The current research activities concentrate on the exploration of the similarities between collocation profiles (Module Similar Collocation Profiles), on the modelling of semantic proximity (Module Modelling Semantic Proximity) and on the determination and visualisation of relevant usage aspects (Module SOM: Self-Organizing Maps).
Some of these modules are available through the web interface of the collocation database CCDB, each in its current early-beta version.
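To illustrate what comparing collocation profiles can mean in practice, here is a minimal sketch. It assumes that a profile is a mapping from collocation partners to cohesion scores and that cosine similarity is used for the comparison. Both are assumptions made for illustration: the text above does not state which similarity measure the module uses, and the toy profiles and their scores are invented.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse collocation profiles.

    Each profile maps a collocation partner to a non-negative cohesion score.
    (Hypothetical representation, not the CCDB data format.)
    """
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[w] * profile_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


# Invented toy profiles, for illustration only.
kopf = {"schütteln": 8.0, "nicken": 5.0, "Kragen": 2.0}
haupt = {"schütteln": 3.0, "neigen": 4.0, "erheben": 6.0}
tisch = {"decken": 7.0, "rund": 5.0}

print(cosine_similarity(kopf, haupt))  # > 0: the profiles share one partner
print(cosine_similarity(kopf, tisch))  # 0.0: no shared partners
```

Under this measure, words whose profiles share many strongly cohesive partners score close to 1, and words with disjoint profiles score 0.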
Collocation analysis
facilitates the detection of significant regularities in the use of word combinations in corpora
evaluates the definable context of a given search object in any virtual corpus using statistical analysis and clustering methods
provides information on the systematic joint occurrence of words (collocations) and a measure of their affinity (cohesion)
provides an appropriate synoptic presentation of the instances
covers binary word relations as well as multiword patterns and even (idiomatic) multiword expressions
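The idea of a statistical measure of affinity between two words can be sketched with the log-likelihood ratio (G²), a measure widely used in collocation research. The text above does not name the measure actually used, so the choice of G² here is an assumption, and the function name and the example counts are hypothetical.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of co-occurrence counts.

    k11: contexts containing both the search object and the partner
    k12: contexts containing the search object but not the partner
    k21: contexts containing the partner but not the search object
    k22: contexts containing neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Invented counts: the first table matches statistical independence exactly,
# the second shows a pair that co-occurs far more often than expected.
print(log_likelihood_ratio(10, 90, 90, 810))
print(log_likelihood_ratio(50, 50, 50, 9850))
```

A value near zero means the pair co-occurs about as often as chance predicts; larger values indicate stronger evidence of association.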
The IDS has been making collocation analysis available to the public since 1995, integrated in a complex online system: COSMAS.
Collocation analysis is applicable to arbitrary search objects with
variable context size
if necessary, automatic focusing on the context with the strongest cohesiveness
variable reliability (i.e. significance of the first collocation partner)
variable allocation of instances with multiword expressions
calculation of syntagmatic patterns for each collocation cluster
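Two of the parameters listed above, variable context size and focusing on the most cohesive context, can be sketched as follows. This is a simplified illustration, not the COSMAS implementation: the function names, the symmetric-span search and the relative-frequency score standing in for a real cohesion measure are all assumptions.

```python
from collections import Counter


def cooccurrence_counts(tokens, node, left, right):
    """Count the words found up to `left` tokens before and `right` tokens
    after every occurrence of `node` (the search object)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def share_in_window(hits, window_total):
    """Stand-in score: the partner's share of all tokens in the window."""
    return hits / window_total if window_total else 0.0


def strongest_context(tokens, node, partner, max_span, score=share_in_window):
    """Try symmetric spans 1..max_span and return (span, score) for the span
    in which `partner` scores highest; the smallest span wins ties."""
    best_span, best_score = None, float("-inf")
    for span in range(1, max_span + 1):
        counts = cooccurrence_counts(tokens, node, span, span)
        s = score(counts[partner], sum(counts.values()))
        if s > best_score:
            best_span, best_score = span, s
    return best_span, best_score


# Invented toy text, for illustration only.
tokens = "er schüttelt den Kopf und sie nickt mit dem Kopf".split()
print(cooccurrence_counts(tokens, "Kopf", left=2, right=0))
print(strongest_context(tokens, "Kopf", "schüttelt", max_span=3))
```

In a realistic setting the stand-in score would be replaced by a proper association measure computed against corpus-wide frequencies.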
Collocation analysis also
opens up empirical access to mass data by prioritising, organising and structuring high-frequency instances
facilitates the empirical capture of multiword expressions as candidates for multiword units of contemporary German (phraseologisms, idioms, proverbs, communicative set phrases, light verb constructions etc.)
serves as a corpus-linguistic working and thinking tool. With these strictly corpus-based tools it is possible to uncover general linguistic structures. For example, the analysis provides information on disambiguation, usage conventions, typical contextualisation and the interpretation of the meaning of the entries to be described
Before using the programme, please familiarise yourself with the relevant copyright conditions. When publishing research results based on our programme, please notify the author by email <belica@ids-...>.
CCDB – the Collocation Database [see Keibel/Belica 2007 (pdf, 345K, english), CCDB-Flyer (pdf, 628K)]
For the further development of methods of collocation analysis, it is of fundamental importance to uncover, systematise and theoretically justify the - as yet mostly unknown - systemic-structural characteristics of cohesion relations between words or groups of words in the German language.
As an empirical basis for this research, the project has built a collocation database with more than 220,000 entries, derived from a corpus of contemporary language of about 2.2 billion text words.
This database contains the results of up to five different collocation analyses for each word (with different parameter settings) in the form of hierarchies and similar usages.
For each word and each analysis, up to 100,000 usages are stored.
In addition to its actual purpose of exploring the features of cohesion relations for the development of methods of corpus analysis, the database can also serve as a tool for lexicographical research.
For example, information on the collocational behaviour of individual lexemes can be accessed quickly and easily. Users should take into account the underlying corpus, the chosen analysis parameters, and the fact that these are raw data, calculated automatically on a statistical basis and not linguistically validated.
For this purpose and in this usage context, the collocation database CCDB is also partly available to the public.
We specifically point out that, in our opinion, the use of this database cannot replace the interactive, dynamic and explorative application of our analysis methods, which are based on user-defined virtual corpora.
Lexicological and Lexicographical Classification of Collocation Analysis
Another objective of the subproject is to support the lexicological and lexicographical classification of collocation analysis, so that the variety of information remains manageable both for individual analyses and for large batches of collocation analyses.
The approach involves
visualisation of the cohesive structure and strength of the collocation partners
the ability to focus on, or navigate into, individual areas of the visualised structure
various possibilities of lexicological and lexicographical editing
an interface with CCDB
Visualisation of a Collocation Analysis for the Keyword “Kopf”
exploration of cohesion features
partitioning of hits as a means of disambiguation of readings
Cyril Belica <belica@ids-...>
Rainer Perkuhn <perkuhn@ids-...>