French-German Colloquium WikiCorp 2018

Fostering linguistic studies on Wikipedia discussions

Multilingual corpus building, annotation and exploration tools

Two-day colloquium at Université Nice Côte d'Azur (FR)
July 9-10, 2018

Goals of the colloquium

The colloquium is committed to the long-term goal of building comparable French-German discussion corpora as a special type of big CMC corpora using TEI-compliant standards. These shall serve as a basis to further develop common tools and methods for the cross-lingual, corpus-based analyses of interaction, politeness and conflict.

  1. Objectives concerning corpus building, standards, and tools:
    Harmonize the parameters of the hitherto separate French and German Wikipedia corpus building processes to make them interoperable for German-French contrastive and cross-lingual analyses: further develop the standards of the TEI CMC SIG; align metadata categories and value taxonomies.
  2. Objectives concerning interaction analyses:
    Develop annotation categories for interaction patterns, politeness cues, and conflict analysis; develop a joint representation of conflict structures.
  3. Objectives concerning corpus analysis methods:
    Develop and adapt corpus-linguistic methods from KorAP and Textométrie to explore and visualize cross-lingual analyses on Wikipedia discussion corpora; prepare the exploration of cross-linguistic distributional semantics by training word embedding models on the French and German Wikipedias.
Invited speakers

David Laniado, Eurecat, Barcelona
Torsten Zesch, Universität Duisburg-Essen


Céline Poudat, Université Nice Côte d'Azur
Angelika Storrer, Universität Mannheim
Harald Lüngen, Institut für Deutsche Sprache, Mannheim
Laura Herzberg, Universität Mannheim

Location: Université Nice Côte d'Azur, Campus Saint-Jean-d’Angely 3, MSHS building, Salle Plate. The easiest way to get there is by tramway; get off at the Saint-Jean d’Angely Université stop. The MSHS building (with a clock) is directly opposite the tramway station.

Local organisation: Céline Poudat and BCL team in Nice

Funding: Huma-Num CORLI consortium

Confirmed Participants (last updated 2018-06-28)
Elena Cabrio, Wimmics, Université Côte d’Azur
Natalia Grabar, STL, Université Lille 3
Laura Herzberg, Universität Mannheim
Mai Ho-Dac, CLLE-ERSS, Université Toulouse
Marc Kupietz, Institut für Deutsche Sprache, Mannheim
David Laniado, Eurecat, Barcelona
Harald Lüngen, Institut für Deutsche Sprache, Mannheim
Christophe Parisse, Head of Ortolang, MoDyCO, Université Paris X-Nanterre
Céline Poudat, BCL, Université Côte d’Azur
Angelika Storrer, Universität Mannheim
Serena Villata, Wimmics, Université Côte d’Azur
Laurent Vanni, BCL, Université Côte d’Azur
Torsten Zesch, Universität Duisburg-Essen

Follow-up activity

A post-conference publication on Wikipedia corpus building, annotation and exploration is planned, either as a book publication or as a special issue of a journal such as Corpus.


PRELIMINARY SCHEDULE (last updated 2018-05-29)
Monday, 9 July 2018
9:30-10:00 Opening
10:00-12:00 Section I: Joint corpus building, standards, and tools

► Christophe Parisse: Sharing corpora in repositories: using the TEI as an exchange format across various types of language data
► Mai Ho-Dac: The WikiDisc Corpus: Available metadata and interactional features
► Harald Lüngen: Formats and Features of the IDS Wikipedia Corpora
► Natalia Grabar: Building comparable corpora from the French Wikipedia and alignment of parallel sentences
12:00-13:30 Lunch
13:30-16:00 Section II: Linguistic analyses of social interaction and conflicts

Invited Talk
► Torsten Zesch: Annotating, Detecting, and Understanding Stance in Computer-Mediated Debates

► Laura Herzberg: Analysing social interaction in Wikipedia discussions
► Céline Poudat: Linguistic annotation of disagreement and conflict in Wikipedia discussions
► Elena Cabrio and Serena Villata: Argument mining on the Web

Discussion and documentation of desiderata and requirements of Sections I and II
16:00-16:30 Coffee
16:30-17:00 Breakout Session - ad Section I (a)
- Alignment of components and metadata of the French and German WP corpora
- TEI representation issues of Wikipedia talk
17:00-17:30 Breakout Session - ad Section II
- Representing and annotating the structure of CMC interaction on WP talk pages
- Annotation layers and categories of interaction patterns
- Automated detection and annotation of stance and conflict
20:00 Dinner
Tuesday, 10 July 2018
9:30-10:00 Documentation of the Results of the two Breakout Sessions from Day 1
10:00-12:00 Section III: Corpus analysis methods

Invited Talk
► David Laniado: Visualization of social interactions in Wikipedia

► Céline Poudat and Laurent Vanni: Looking for characteristic patterns using deep learning methods with Hyperbase Web
► Marc Kupietz: Current developments for corpus query, analysis and visualisation at IDS

Discussion and documentation of desiderata and requirements
12:00-14:00 Lunch
14:00-14:30 Breakout Session - ad Section III
14:30-15:00 Breakout Session - ad Section I (b)
- Impulse presentation by Marc Kupietz: Current initiatives on comparable corpora: EuReCo, ICC
- Discussion of methods and resources for comparable corpus building
15:00-15:30 Documentation of the results of the two Breakout Sessions
15:30-16:00 Coffee
16:00-17:30 ► Planning the post-conference publication.
► Planning the implementation of results, follow-up activities, projects, and further co-operation
► Wrap-up of the colloquium


Wikipedia is one of the most successful projects of Web 2.0. Since its launch in 2001, thousands of contributors have built this huge knowledge resource, which is used not only as an online encyclopedia but also as an object of research in many academic disciplines. It also constitutes a rich and unique resource for linguistic studies, first because of its multilinguality, and second because of its huge discussion spaces, in which the collaborative writing effort is negotiated. These so-called talk pages can be used as big corpus resources of Computer-Mediated Communication (CMC).

The French and German participants of the colloquium are part of an initiative which aims to foster linguistic studies on Wikipedia by providing recommendations for building standardized Wikipedia corpora, methods for their linguistic processing and exploration, and descriptors and annotations for the analysis of talk pages. The French-German team of proposers started co-operating in 2016 with a first workshop in Mannheim entitled “Wikipedia: Discourse and corpus linguistic perspectives”. Since then, the proposers and other participants have co-operated in various constellations at conferences and on joint publications and proposals. The group is now ready to prepare the ground for jointly building comparable French-German corpora to be used in cross-lingual, corpus-based analyses of Wikipedia discussions.

State of corpus technology and corpus-based analyses of Wikipedia discussions in the French-German group

Up to now, most linguistic studies on Wikipedia have focused on the article pages and have not analysed in depth the linguistic features used in the discussion spaces. This may be due to three reasons: (i) Wikipedia is quite a complex object that linguists find difficult to handle; (ii) Wikipedia interactions need specific descriptors and ad hoc annotations for analysis; and (iii) existing corpus technologies and exploration tools need to be adjusted to the specificities of CMC corpora in general and Wikipedia corpora in particular. More sophisticated tools and methods for linguistic annotation and corpus exploration are needed to better exploit the huge and valuable corpus resources that can be constructed from Wikipedia discussions.

The colloquium will bring together researchers who have solid experience with preparing monolingual (French and German) corpora from Wikipedia, with disseminating them and providing corpus technology for their analysis, and with conducting linguistic research on social interaction in Wikipedia discussions, with a particular interest in the analysis and detection of conflicts. Their previous work on Wikipedia analysis includes studies on conflict annotation and conflict detection (e.g. Poudat et al. 2016, Poudat et al. 2017, Ho-Dac et al. 2017, Poudat & Ho-Dac (to appear)), studies on writing style and language variation (Storrer 2013), methodological issues related to Wikipedia discourse analyses (Gredel 2017), cross-lingual and cross-mode studies on interaction signs (Herzberg & Storrer 2017), and work on multimodal aspects of Wikipedia (Wessler et al. 2017).

The French and German participants are involved in national corpus initiatives, e.g. CLARIN-D (Common Language Resources and Technology Infrastructure) in Germany and the CORLI consortium (Huma-Num national consortium for the study of Language, Corpora and Interactions) in France. They have a strong common interest in developing and using standards for the annotation of Wikipedia corpora to be used in linguistic research projects. They have contributed to the "Computer-Mediated Communication" Special Interest Group (SIG) of the Text Encoding Initiative (TEI) (cf. e.g. Margaretha & Lüngen 2014, Chanier et al. 2014, Beißwenger et al. 2017), and have presented papers and exchanged ideas at the CMC corpora conferences in Rennes (2015) and Ljubljana (2016).

The partner IDS is committed to the idea of a European Reference Corpus (EuReCo): joining the forces of national and reference corpus initiatives to build and exploit multilingual comparable corpora from existing monolingual resources. These resources remain physically located at their hosting institutions and are combined virtually (IDS is currently co-operating with the Hungarian and Romanian national corpora, cf. Kupietz et al. 2017). BCL and CLLE, hand in hand with the French national consortium CORLI (Corpus, Languages and Interactions), could be the relevant French partners for the EuReCo initiative, with their Wikipedia corpora as a starting point. The colloquium shall be used to establish such a co-operation.

Moreover, both German and French participants are interested in corpus exploration methods. The colloquium aims to foster the exchange between researchers working with textometric methods, which are particularly well developed in France under the name Textométrie or Statistiques textuelles (cf. Poudat & Landragin 2017), and researchers at the IDS, with its 50-year-long tradition in corpus-based language research, who develop methods for corpus analysis (Lüngen & Kupietz 2017, Fankhauser & Kupietz 2017). Both sides are keen to adapt their corpus frameworks to the specific features of digital genres and CMC.


