Head of Project:
    Cyril Belica <belica@ids-...>
Scientific Assistants:
    Dr. Marc Kupietz <kupietz@ids-...>
    Dr. Harald Lüngen <luengen@ids-...>
    Rainer Perkuhn <perkuhn@ids-...>
Student Assistants:
    Anna Schächtele

Methods of Corpus Analysis and Corpus Classification

Methodologies of Corpus Linguistics and the Nature of Linguistic Data

Research Subject

This project investigates methodologies for the quantitative and qualitative analysis of very large corpora, as well as the modelling of the linguistic and general cognitive interpretation processes that occur, or can be expected to occur, at the lexical level.

The project team reflects critically on the generalisations gained through this methodological research and brings them into the discussion of linguistic theory formation.

Preliminary Work by the Project Team

The project systematically continues and extends the team's own research results from recent years.

This applies particularly to collocation analysis and the research based on it, which investigates synonymy, syntagmatic patterns, semantic similarity and the modelling of usage aspects, and which includes a procedural model for the systematic linguistic interpretation of analysis results.

Linguistic and epistemic contributions have addressed the emergent nature of the ontology of lexical-semantic relations and the theoretical status of higher-order collocations and of collocation schemes. These contributions start from construction grammar theories and span the range from experiencing (experienceability) and cognition to experience and convention.

Historically, the project builds on the corpus-linguistic concept of the COSMAS platform, formulated in the COSMAS I project, as well as on the solution strategies and results achieved within that project between 1991 and 2003.

Current Focus

  • Collocation analysis and its classification (see the short tutorial, the collocation database CCDB, and the paper on CCDB; PDF, 345K, English)
  • Multi-dimensional corpus analyses (including methods for identifying neologisms)
  • Corpus-Based Lists of Base Forms of Words and Word Forms (DeReWo)
  • Paradigmatic Variations
  • Sampling Strategies for Synchronic Corpora
  • Methodology of Classification
  • Lexical Semantics
  • Paronymy
  • Conceptual Development of Corpus Analysis Research Tools

Additional Work Packages

  • Quantitative Analyses of German Lexis
  • Topic/Domain Classification of the Corpora
  • Lemmatisation
  • Short Studies

Theoretical Framework

The interest of theoretical linguistics in the quantitative-empirical approaches of corpus linguistics is growing rapidly.

It is becoming increasingly clear that an unexpectedly large number of the systemic-structural features of natural languages, if not all of them, are constituted in the field of tension between several competing and partly contradictory principles. Ultimately, theoretical models can capture these features appropriately only by applying fuzzy conditions on preference relations.

Beyond disciplines with a traditionally empirical orientation, such as lexicography, language didactics or developmental psychology, recent research in cognitive and theoretical linguistics is also focusing on the frequency distributions of linguistic phenomena.

Rapid mathematical advances in the field of structure-detecting transformations allow corpus linguistics to carry out quantitative analyses of increasingly complex linguistic phenomena. This approach parallels the successful application of related methods to scientific classification and theory formation in genetic research.

