Program Area "Oral Corpora"
The program area "Oral Corpora" is concerned with maintaining and extending the oral corpora in the Archive for Spoken German (AGD), with developing and disseminating methods and technology for compiling and working with oral corpora, and with contributing to initiatives that establish standards and good practices in the area of oral corpora.
Maintenance and development of oral corpora
The Archive for Spoken German (AGD) collects and archives data of spoken German in interactions (interaction corpora) and data of domestic and non-domestic varieties of German (variation corpora). The corpora are curated in the archive and made available to the scientific public. [AGD-Flyer]
With the Research and Teaching Corpus of Spoken German (FOLK), the program area is building up a large German interaction corpus, which is made available through the Database of Spoken German. [FOLK-Flyer]
The Conversation Analytic Information System (GAIS) provides up-to-date information on conversation analysis and related disciplines. It is designed to become an online handbook covering all elements of a typical empirical and corpus-based workflow in conversation analysis.
Project Corpus Technology
The Database of Spoken German (DGD) is a platform for accessing the oral corpora of the AGD. The DGD provides researchers, teachers and students with a web-based source for browsing and querying FOLK and other AGD data. [DGD-Flyer]
In collaboration with the Hamburg Centre for Language Corpora, the program area also contributes to the development of EXMARaLDA, a system for compiling, managing and analyzing oral corpora, and provides support for it. [EXMARaLDA-Flyer]
External projects and dissertations
- The aim of the ANR-DFG project "Segmentation of Oral Corpora (SegCor)" is to develop methods for the segmentation of spoken language that are adequate both for the linguistic analysis of spontaneous speech on different levels and for automatic processing with NLP tools.
- The dissertation project "POS für(s) FOLK - Part-of-Speech-Tagging of spontaneous speech data" is concerned with methods for automatically annotating oral corpora with POS tags.
- The project "Accessing multimodal spoken language corpora: cross-linking and user-group specific differentiation" (ZuMult) is funded in the LIS program of the DFG. ZuMult is a cooperation between the Program Area "Oral Corpora", the Hamburg Centre for Language Corpora (University of Hamburg) and the Herder Institute (University of Leipzig). It aims at building a common architecture for accessing oral corpus data and developing access methods for these data which are tailored to the needs of specific user groups.
- The project "Identifying the Intersection of Prosody, Gesture and Conversation" is funded by the DFG as an initiation of bilateral cooperation. In this project, Jan Gorisch works together with Meg Zellers, Benno Peters (University of Kiel) and David House (KTH Stockholm) with the aim of writing a grant proposal on this topic. To this end, we are organizing a workshop (29-30 November 2018 in Kiel) and research visits in both directions.
- In the ISO/DIN-Project "WG 6 PWI 24624 Language resource management – Transcription of spoken language", the program area has contributed to an international standard for the digital representation of spoken language transcriptions.
Featured recent publications in English:
- Schmidt, Thomas (2016): Construction and Dissemination of a Corpus of Spoken Interaction - Tools and Workflows in the FOLK project. In: Corpus Linguistic Software Tools, Journal for Language Technology and Computational Linguistics (JLCL 31/1), by Kupietz, Marc & Geyken, Alexander (eds.), pp. 127-154.
- Westpfahl, Swantje / Schmidt, Thomas (2016): FOLK-Gold – A GOLD standard for Part-of-Speech-Tagging of Spoken German. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), Portorož, Slovenia. Paris: European Language Resources Association (ELRA), pp. 1493-1499.
- Schmidt, Thomas (2016): Good practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German. In: Compilation, transcription, markup and annotation of spoken corpora, by Kirk, John M. and Gisle Andersen (eds.), Special Issue of the International Journal of Corpus Linguistics [IJCL 21:3], pp. 396-418.
- Ruhi, Şükriye / Haugh, Michael / Schmidt, Thomas / Wörner, Kai (eds.) (2014): Best Practices for Spoken Corpora in Linguistic Research. Newcastle: Cambridge Scholars Publishing.
- Schmidt, Thomas (2014): The Database for Spoken German - DGD2. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland: European Language Resources Association (ELRA), pp. 1451-1457.
- Schmidt, Thomas (2014): The Research and Teaching Corpus of Spoken German - FOLK. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland: European Language Resources Association (ELRA), pp. 383-387.
- Schmidt, Thomas / Wörner, Kai (2014): EXMARaLDA. In: Jacques Durand, Ulrike Gut, and Gjert Kristoffersen (eds.): The Oxford Handbook of Corpus Phonology. Oxford: OUP, pp. 402-419.
- Schmidt, Thomas (2012): EXMARaLDA and the FOLK tools. Two toolsets for transcribing and annotating spoken language. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-12), Istanbul, Turkey. European Language Resources Association (ELRA), pp. 236-240.
- Schmidt, Thomas / Wörner, Kai (eds.) (2012): Multilingual Corpora and Multilingual Corpus Analysis. (= Hamburg Studies on Multilingualism 14). Amsterdam: Benjamins.
- Schmidt, Thomas (2011): A TEI-based approach to standardising spoken language transcription. In: Journal of the Text Encoding Initiative 1/2011.