Methodology

How DebateMetrics calculates scores

DebateMetrics condenses patterns that are observable in the language of Bundestag debates into two metrics: Discourse Quality (DQ) and Rhetorical Behaviour (RB).

Source material

Source material consists of plenary protocols of the German Bundestag. Each sitting of the Bundestag is recorded as a plenary protocol / stenographic report and made publicly available by the Bundestag, generally as PDF or XML files. DebateMetrics extracts and processes this material for analysis; resulting text segments, metadata corrections, annotations, and scores are not official Bundestag publications.

What is measured

The scores do not say whether a political position is correct. They describe how a fraction (a parliamentary group, German: Fraktion) argues, structures claims, uses evidence, and engages with other positions in the detected speech contributions.

DQ: Discourse Quality

DQ summarizes how clear, substantive, topic-relevant, and argumentatively traceable contributions are.

RB: Rhetorical Behaviour

RB summarizes how respectful, clear, and cooperative contributions are and how strongly attacks, avoidance, or polemics appear.

Why neutral LLM metrics?

LLMs serve here as consistent, independent annotators. The prompts forbid external fact-checking and any judgement of whether the content is true, so that the scores reflect comparable language patterns rather than political agreement.

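"You are an independent, impartial political scientist." (Prompt excerpt, translated from the German original: "Du bist ein unabhängiger, unparteiischer Politikwissenschaftler.")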

Pipeline

Actual speech contributions are detected in each chapter and assigned to canonical fractions.
A separate filtered text is created for each fraction from that fraction's contributions.
Each configured LLM provider scores the fraction text separately for DQ and RB.
Short evidence snippets and notes are stored with the scores so values remain inspectable.
The stored individual scores are weighted by contribution length and aggregated into fraction scores.
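
The aggregation in the last step can be sketched as follows. This is a minimal illustration, not the project's actual code: the `StoredScore` record, its field names, and the use of character counts as contribution length are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StoredScore:
    # All field names are hypothetical; the real storage schema is not documented here.
    fraction: str        # canonical fraction name
    metric: str          # "DQ" or "RB"
    score: float         # stored score in [0.0, 1.0]
    length: int          # contribution length (assumed unit: characters)
    evidence: str = ""   # short evidence snippet kept for inspection


def aggregate_by_fraction(records):
    """Return the length-weighted mean score per (fraction, metric) pair."""
    weighted_sum = defaultdict(float)
    total_length = defaultdict(int)
    for r in records:
        key = (r.fraction, r.metric)
        weighted_sum[key] += r.score * r.length
        total_length[key] += r.length
    # Groups with zero total length are skipped to avoid dividing by zero.
    return {
        key: weighted_sum[key] / total_length[key]
        for key in weighted_sum
        if total_length[key] > 0
    }


records = [
    StoredScore("A", "DQ", 0.80, 3000),
    StoredScore("A", "DQ", 0.40, 1000),
    StoredScore("B", "DQ", 0.50, 2000),
]
fraction_scores = aggregate_by_fraction(records)
```

In this example the longer contribution of fraction A dominates its result: (0.80 × 3000 + 0.40 × 1000) / 4000 = 0.70.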

Discourse Quality in detail

Each DQ dimension is scored from 0.0 to 1.0. The aggregated DQ score is the average of the nine dimensions, weighted by contribution length.

Goal clarity: Is it clear what the argument is trying to establish?
Argument structure: Are claims, reasons, and conclusions connected in a traceable way?
Causal reasoning: Are causes, effects, or mechanisms made visible?
Counterarguments: Does the contribution engage with other positions?
Substance: Does the contribution go beyond slogans or pure positioning?
Calibration: Are certainty, uncertainty, and scope marked appropriately?
Evidence reference: Are sources, examples, numbers, or concrete reference points named?
Evidence quality: Do the references appear concrete and useful for the argument?
Topic relevance: Does the contribution stay with the topic being debated?

DQ = weighted average of 9 dimensions, weighted by contribution length.
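
As a rough sketch of the per-text part of this formula, the nine dimension scores of one scored text can be validated and averaged as shown below. The dimension keys and the function name are hypothetical, and giving each dimension equal weight is an assumption; the length weighting across contributions would then be applied to these per-text values.

```python
# Hypothetical keys for the nine DQ dimensions listed above.
DQ_DIMENSIONS = (
    "goal_clarity",
    "argument_structure",
    "causal_reasoning",
    "counterarguments",
    "substance",
    "calibration",
    "evidence_reference",
    "evidence_quality",
    "topic_relevance",
)


def dq_for_text(dims):
    """Average the nine DQ dimensions of one scored text (equal weights assumed)."""
    for name in DQ_DIMENSIONS:
        if name not in dims:
            raise KeyError(f"missing DQ dimension: {name}")
        if not 0.0 <= dims[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return sum(dims[name] for name in DQ_DIMENSIONS) / len(DQ_DIMENSIONS)


example = {
    "goal_clarity": 0.8,
    "argument_structure": 0.7,
    "causal_reasoning": 0.6,
    "counterarguments": 0.5,
    "substance": 0.7,
    "calibration": 0.6,
    "evidence_reference": 0.4,
    "evidence_quality": 0.5,
    "topic_relevance": 0.9,
}
dq = dq_for_text(example)  # 5.7 / 9, roughly 0.63
```

Because both steps are linear, averaging the dimensions first and length-weighting afterwards gives the same result as length-weighting each dimension and averaging last.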

Rhetorical Behaviour in detail

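RB also uses scores from 0.0 to 1.0. The problematic dimensions (personal attacks, rhetorical aggression, avoidance signals, and polemics density) are inverted before averaging, that is, replaced by 1 minus the raw score, so that a higher total score always means better rhetorical behaviour.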

Personal attacks: High raw scores mean more personal denigration.
Rhetorical aggression: High raw scores mean more aggressive language.
Respectful address: High scores mean more respectful interaction.
Avoidance signals: High raw scores mean stronger avoidance.
Polemics density: High raw scores mean more polemical escalation.
Clarity: High scores mean more understandable wording.
Cooperation signals: High scores mean more constructive signals that others can build on.

RB = weighted average of 1 minus attack, 1 minus aggression, respect, 1 minus avoidance, 1 minus polemics, clarity, and cooperation.
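
The inversion step can be illustrated with a short sketch. The dimension keys and the function name are hypothetical, and equal weight per dimension is an assumption; only the per-text combination is shown, not the weighting by contribution length.

```python
# Hypothetical keys. Raw scores of these four dimensions are inverted (1 - x).
INVERTED = (
    "personal_attacks",
    "rhetorical_aggression",
    "avoidance_signals",
    "polemics_density",
)
# These three dimensions are used as scored.
DIRECT = ("respectful_address", "clarity", "cooperation_signals")


def rb_for_text(raw):
    """Combine seven raw RB dimensions so that higher always means better."""
    for name in INVERTED + DIRECT:
        if not 0.0 <= raw[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    terms = [1.0 - raw[name] for name in INVERTED] + [raw[name] for name in DIRECT]
    return sum(terms) / len(terms)


raw_scores = {
    "personal_attacks": 0.2,
    "rhetorical_aggression": 0.3,
    "respectful_address": 0.8,
    "avoidance_signals": 0.1,
    "polemics_density": 0.4,
    "clarity": 0.7,
    "cooperation_signals": 0.6,
}
rb = rb_for_text(raw_scores)  # (0.8 + 0.7 + 0.9 + 0.6 + 0.8 + 0.7 + 0.6) / 7
```

A text with no attacks, aggression, avoidance, or polemics and full marks on the other three dimensions would reach 1.0 under this scheme.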

What you can compare

The filter view compares fractions across selected transcripts, chapters, and providers. The details view shows timelines, provider differences, and the evidence behind individual scores.
