SegCor – ANR-DFG Project "Segmentation of Oral Corpora"

Heads of project:

Thomas Schmidt (IDS Mannheim) and Véronique Traverso (ICAR)

Research associates (German team):

Arnulf Deppermann, Joachim Gasch, Jan Gorisch, Henrike Helmer, Nadine Proske, Swantje Westpfahl

Research associates (French team 1):

Heike Baldauf-Quilliatre, Biagio Ursi, Carole Etienne, Emilie Jouin-Chardon, Nathalie Rossi-Gensane

Research associates (French team 2):

Lotfi Abouda, Olivier Baude, Flora Badin, Iris Eshkol, Layal Kanaan-Caillol, Marie Skrovec

Duration of the project:

March 2016 - February 2019

Since research on spoken language began, many proposals for segmenting it have been put forward. However, there is still no segmentation system that can be used for large corpora of spoken language, i.e. one that is both linguistically substantiated and workable for large-scale corpus segmentation. The lack of theory-based segmentation impedes the use of such corpora for research on language technology, for comparative corpus linguistics, and for analyses of spoken interaction.

Research objectives:

The aim of this project is to develop methods for the segmentation of spoken language. These methods are to be based on linguistic knowledge and, at the same time, adequate for the analysis of spoken language on various linguistic levels as well as for the development of tools in computational linguistics. A milestone of the project is the publication of guidelines for the systematic segmentation of various types of German and French verbal interaction. In the second stage, the possibilities for automated segmentation of spoken language corpora based on these guidelines will be tested and documented. In this way, the project not only improves the usability of the three databases involved but also deepens our knowledge of the structures of spoken language.

The project research is based on three databases: the German research and teaching corpus of spoken German (FOLK) and the two French databases CLAPI (Corpus de LAngue Parlée en Interaction) and ESLO (Enquêtes sociolinguistiques à Orléans).


SegCor is funded by the German Research Foundation (DFG) and the French National Research Agency (ANR). It is a cooperation between the Department of Pragmatics of the Institute for the German Language (IDS Mannheim) and two French partners: ICAR (Interactions, Corpus, Apprentissages, Représentations) at the University of Lyon and LLL (Laboratoire Ligérien de Linguistique) at the University of Orléans.