Methods of Corpus Analysis and Corpus Classification
Topic / Domain Classification of Corpora
Until recently, our former staff member Christian Weiß was responsible for the work in this subproject. These activities are currently on hold. If you have any questions about this area, please send an email to: firstname.lastname@example.org
This subproject has two aims. The first is the topic / domain classification of corpora. The second is to construct thematic virtual sub-corpora and, for example, to disambiguate readings by analysing field-related frequency distributions. The starting point is the creation of a taxonomy of subject areas. This is accomplished in a semi-automatic process that combines text mining (document clustering) with the manual assignment of clusters to categories of an external ontology. The resulting taxonomy is suitable for both manual and automatic classification. For automatic classification, the “naive Bayes” text classifier is motivated and evaluated on a classified corpus of almost two billion words.
A detailed description of the subproject is available as a PDF document (228 KB).
For results see:
- A Catalogue of Topics: A formal and externally anchored ontology of topics
- A Method of Clustering: A mathematical-statistical method for automatically finding topics and text instances
- A Method of Text Classification: A mathematical-statistical method for thematic classification of (so far unannotated) texts
- Further mathematical-statistical Methods such as keyword extraction or text filtering
In addition, A Survey on Existing Classification Patterns of Other Language Corpora is provided.
A subgoal of the project was the creation of a catalogue of subject fields with an external topic description that is as impartial as possible. In order to minimize restrictions on the topics to be classified (for example, a scientific library's restriction to scientific topics), the classification was aligned with a higher-level ontology that is as comprehensive as possible. The Open Directory probably represents the largest existing ontology.
Because it aspires to capture and name all subject areas, the Open Directory provides a large pool of possible topics and topic descriptions. Several points, however, speak against a direct adoption of this classification scheme:
- Firstly, only a fraction of the categories proved to be of interest. While there is some interest in adopting the category “Culture: Film”, there is none in the category “Culture: Film: Film Distribution”.
- A second argument against a one-to-one adoption was the existence of “hidden categories”, i.e. thematically very similar topics filed under very different top-level categories. For gardening topics, for example, the following categories were found:
- “At Home: Garden and Plants”
- “Economy: Construction: Gardening and Landscaping”
- “Science: Natural Sciences: Biology: Botany: Botanical Gardens”
- “Economy: Consumer Products: House and Garden”
- Alongside this excessively fine granularity, the opposite tendency could also be observed, i.e. too coarse a grid, for example for literary or religious texts.
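The “hidden categories” problem can be made concrete with a small sketch. The following Python snippet is only an illustration, not the procedure used in the project: it takes the category paths quoted above, reduces the words of each subcategory with a deliberately crude stemmer, and reports keywords that occur under several top-level categories. The function names, the stop word list and the stemming rule are assumptions made for this example.

```python
from collections import defaultdict

# Category paths quoted in the text above.
PATHS = [
    "At Home: Garden and Plants",
    "Economy: Construction: Gardening and Landscaping",
    "Science: Natural Sciences: Biology: Botany: Botanical Gardens",
    "Economy: Consumer Products: House and Garden",
    "Culture: Film",
    "Culture: Film: Film Distribution",
]

STOP_WORDS = {"and", "at"}  # assumption: minimal list for this example


def stem(word):
    """Crude illustrative stemmer (an assumption, not a real stemming algorithm)."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word


def hidden_categories(paths, min_tops=2):
    """Return keywords whose subcategories sit under >= min_tops top-level categories."""
    tops_by_keyword = defaultdict(set)
    for path in paths:
        parts = [part.strip() for part in path.split(":")]
        top = parts[0]
        for part in parts[1:]:  # only the subcategory names
            for word in part.split():
                if word.lower() not in STOP_WORDS:
                    tops_by_keyword[stem(word)].add(top)
    return {
        keyword: sorted(tops)
        for keyword, tops in tops_by_keyword.items()
        if len(tops) >= min_tops
    }


if __name__ == "__main__":
    for keyword, tops in hidden_categories(PATHS).items():
        print(keyword, "->", tops)
```

On these six paths, the only keyword spread over more than one top-level category is the gardening one, which is the situation described in the list above.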
In order to avoid the problems outlined above, a separate taxonomy of topics had to be created, which differs in several respects from that of the Open Directory. Firstly, categories that are not relevant for a language corpus were filtered out, which reduced the taxonomy considerably. Secondly, a shift has taken place: documents connected to subjects such as “religion”, “fiction” or “science” are regarded as instances of a special kind of language, namely “religious”, “literary” or “scientific” language.
The taxonomy is accessible online:
- listed in tabular short form (pdf)
- as a more detailed topic taxonomy, anchored to an external ontology: as HTML/ as pdf
The Clustering Method
Clustering is a form of data mining or, more specifically, of unsupervised machine learning, and is applied mainly for exploratory data analysis in a variety of disciplines such as biology, the empirical social sciences or information retrieval. Document clustering means the automatic grouping of texts with similar content.
This can be demonstrated with the aid of the chart on the left: it shows a hierarchical clustering of categories from the newspaper sector (culture, sports, economy) that was determined with the help of the clusterer “CLUTO”; the colouring of the fields reflects the prominence of a keyword. Moreover, the colour contrast allows conclusions to be drawn about the thematic specificity of a cluster.
The chart on the right shows an even finer-grained division of newspaper data. A cluster analysis was carried out for the 1998 volume of the “Frankfurter Rundschau”, and the seasonal course of a number of selected topics was visualised. For example, an increase in texts about books can be observed, which can be attributed to the Frankfurt Book Fair. The topic “football” is represented especially strongly in the summer, which can be attributed to the Football World Cup in France. Clustering thus provides the set of topics that are under public discussion, or written about in newspapers, at a given point in time, offering an opportunity to make current affairs transparent.
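The seasonal course of a topic is, in essence, a count of clustered documents per topic and month. A minimal sketch in Python follows; the (month, topic) pairs are invented toy data chosen only to mirror the pattern described above, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Invented toy data: (month of publication, topic of the document's cluster).
ASSIGNMENTS = [
    (1, "football"), (6, "football"), (7, "football"), (7, "football"),
    (3, "books"), (10, "books"), (10, "books"),
]


def monthly_profile(assignments):
    """Count documents per topic and month."""
    profile = defaultdict(Counter)
    for month, topic in assignments:
        profile[topic][month] += 1
    return profile


def peak_month(profile, topic):
    """Month with the most documents for a topic (earliest month wins ties)."""
    month, _count = max(profile[topic].items(), key=lambda kv: (kv[1], -kv[0]))
    return month


if __name__ == "__main__":
    profile = monthly_profile(ASSIGNMENTS)
    for topic in sorted(profile):
        print(topic, dict(sorted(profile[topic].items())), "peak:", peak_month(profile, topic))
```

Plotting such per-month counts for each topic would give a chart of the kind described; the real analysis would of course use the actual cluster assignments of the 1998 volume.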
For this subproject, a clustering method was chosen that facilitated the assignment of texts to most of the topics defined in the ontology.
With the help of the above-mentioned clusterer CLUTO, all texts of the IDS corpus were clustered and divided into about 1,500 thematic clusters.
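To illustrate what document clustering does in principle, here is a minimal, self-contained Python sketch. It is not CLUTO and does not reproduce CLUTO's algorithms or parameters: it assumes plain word-count vectors, cosine similarity, and a simple agglomerative strategy that keeps merging the two most similar clusters until no pair exceeds a threshold. The sample texts, the threshold and all function names are assumptions for this example.

```python
import math
from collections import Counter

# Invented toy documents.
DOCS = [
    "football match goal striker",
    "goal football league match",
    "stock market shares bank",
    "bank shares market profit",
    "novel author book fair",
]


def vectorize(text):
    """Bag-of-words vector: word -> frequency."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Merge the two most similar clusters until no pair exceeds `threshold`.

    Each cluster is represented by the sum of its members' word counts.
    """
    clusters = [([i], vectorize(doc)) for i, doc in enumerate(docs)]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if sim > best:
                    best, pair = sim, (i, j)
        if pair is None:  # nothing similar enough is left to merge
            break
        i, j = pair
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in pair] + [merged]
    return sorted(sorted(members) for members, _vector in clusters)


if __name__ == "__main__":
    for members in cluster(DOCS):
        print(members, [DOCS[i] for i in members])
```

On the toy data the two football texts and the two finance texts end up in one cluster each, while the text about books stays on its own. The pairwise comparison is quadratic in the number of clusters, so this sketch is only meant for reading, not for a corpus of the size discussed here.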
The fully automatic clustering was followed by two manual steps:
The first step was quality control.
Clusters that did not show thematic homogeneity in the desired sense were excluded. For rare topics, such as “equestrian sports”, clusters were examined completely, i.e. document by document.
The second step was annotation: each cluster was annotated according to its content, either spontaneously or according to the above-mentioned topic taxonomy.
Thus, for example, a cluster of texts from 1985 about the illness “AIDS” was marked with the term “aids 85” and the topic area “health_diet: health”.
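The outcome of the two manual steps can be pictured as a small record per cluster. The following Python sketch is an assumption about how such data might be organised, not a description of the project's actual file formats: the class, its field names and the document ids are hypothetical, and only the label “aids 85” and the topic “health_diet: health” are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    cluster_id: int
    doc_ids: List[str]
    homogeneous: bool = True      # result of the manual quality control
    label: Optional[str] = None   # spontaneous annotation, e.g. "aids 85"
    topic: Optional[str] = None   # category from the topic taxonomy


def training_pairs(clusters):
    """(doc_id, topic) for every document in an accepted and annotated cluster."""
    return [
        (doc_id, c.topic)
        for c in clusters
        if c.homogeneous and c.topic
        for doc_id in c.doc_ids
    ]


# Hypothetical example data.
CLUSTERS = [
    Cluster(1, ["d1", "d2"], label="aids 85", topic="health_diet: health"),
    Cluster(2, ["d3"], homogeneous=False),  # excluded in quality control
    Cluster(3, ["d4"]),                     # accepted, but not yet annotated
]

if __name__ == "__main__":
    print(training_pairs(CLUSTERS))
```

Only documents from clusters that passed quality control and carry a taxonomy topic are passed on; everything else is left out of the labelled sample.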
The aim of the last two sections was to motivate a topic taxonomy that is as comprehensive as possible and to annotate it with sample texts found by the clusterer.
This semi-automatically generated data set served as input for a text classifier.
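How annotated sample texts can feed a “naive Bayes” text classifier, as named in the project overview, is sketched below in Python. This is a generic multinomial naive Bayes with add-one smoothing, written for illustration; the project's actual implementation, features and parameters are not documented here. The training texts are invented, and apart from “health_diet: health” the topic label is hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: (text, topic).
TRAINING = [
    ("virus infection therapy patients", "health_diet: health"),
    ("clinic patients therapy doctors", "health_diet: health"),
    ("striker goal match league", "sports: football"),   # hypothetical topic label
    ("match referee goal penalty", "sports: football"),
]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # topic -> number of documents
        self.word_counts = defaultdict(Counter)  # topic -> word -> frequency
        self.vocab = set()

    def train(self, pairs):
        for text, topic in pairs:
            self.doc_counts[topic] += 1
            for word in text.lower().split():
                self.word_counts[topic][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        # Words never seen in training are ignored.
        words = [w for w in text.lower().split() if w in self.vocab]

        def log_score(topic):
            counts = self.word_counts[topic]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[topic] / total_docs)  # prior
            for word in words:
                score += math.log((counts[word] + 1) / denom)      # likelihood
            return score

        return max(self.doc_counts, key=log_score)


if __name__ == "__main__":
    classifier = NaiveBayes()
    classifier.train(TRAINING)
    print(classifier.classify("patients received a new therapy"))
    print(classifier.classify("late goal decided the match"))
```

Working with log probabilities avoids numerical underflow on long texts, and the smoothing keeps a single unseen word from ruling out a topic entirely.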