How Bayesian Fusion Improves Multilingual Comprehension Detection
machine-learning
nlp
architecture
multilingual


Dr. Vorname Nachname, Research Lead · 7 min read

The Problem with a Single Confidence Score

When a Turkish-speaking citizen calls a Bürgeramt and asks about renewing their residence permit, the ASR engine might report an estimated word-error rate of 12 % and a confidence score of 0.81. Is that good enough to continue without a human agent? The answer depends on much more than a single number.

A high ASR confidence score tells you the audio was transcribed accurately. It says nothing about whether the intent was classified correctly, whether the response generated by the LLM actually answers the question, or whether the caller is satisfied with what they heard. In a multilingual government context — where the stakes include missed deadlines, incorrect forms, and frustrated citizens — a more robust measure is needed.

Four Signals, One Score

VoiceA's Bayesian comprehension fusion layer combines four independent signals:

1. ASR Word Error Proxy — the posterior probability of the transcription given the acoustic model, normalised by utterance length to avoid penalising short inputs.

2. Intent Classification Confidence — the softmax entropy of the intent classifier (we use a fine-tuned multilingual BERT variant). Low entropy means the model is confident about the caller's intent; high entropy signals ambiguity.

3. RAG Retrieval Score — the cosine similarity between the query embedding and the top-k retrieved documents from the Qdrant knowledge base. A low retrieval score indicates the question may not be covered by the available knowledge — a strong signal for handoff.

4. Dialogue Coherence Score — a lightweight sequence model that checks whether the current turn is consistent with the conversation history. Abrupt topic shifts or repeated rephrasing often indicate the caller has not been understood.
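
A minimal sketch of how three of these raw signals might be computed before fusion. The function names and interfaces here are illustrative assumptions, not VoiceA's actual pipeline API:

```python
import math

def asr_proxy(token_logprobs):
    """Length-normalised probability of the transcription, so short
    utterances are not unfairly penalised (signal 1)."""
    return math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))

def intent_confidence(logits):
    """1 minus normalised softmax entropy of the intent classifier:
    1.0 = fully confident, 0.0 = maximally ambiguous (signal 2)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))

def retrieval_score(query_vec, doc_vecs):
    """Best cosine similarity among the top-k retrieved documents
    (signal 3); a low value flags an uncovered question."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))
    return max(cos(query_vec, d) for d in doc_vecs)
```

The coherence score (signal 4) comes from a separate sequence model over the conversation history and is omitted here.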

These four signals are not independent in practice — ASR errors often cascade into intent misclassification. The Bayesian network models these conditional dependencies explicitly, producing a single calibrated posterior: P(understood | ASR, intent, RAG, coherence).
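
The fusion idea can be sketched with binary (good/bad) discretised signals and exact inference by enumeration. All probability values below are illustrative assumptions, not the conditional probability tables learned from pilot data; note that the intent node conditions on both understanding and ASR quality, which is how the network captures the error cascade:

```python
P_U = {True: 0.8, False: 0.2}                    # prior P(understood)
P_ASR = {True: {1: 0.9, 0: 0.1},                 # P(asr_good | U)
         False: {1: 0.4, 0: 0.6}}
# Intent depends on BOTH understanding and ASR quality:
# this edge models the ASR -> intent error cascade explicitly.
P_INTENT = {(True, 1): {1: 0.95, 0: 0.05},
            (True, 0): {1: 0.60, 0: 0.40},
            (False, 1): {1: 0.30, 0: 0.70},
            (False, 0): {1: 0.10, 0: 0.90}}
P_RAG = {True: {1: 0.85, 0: 0.15},               # P(rag_good | U)
         False: {1: 0.30, 0: 0.70}}
P_COH = {True: {1: 0.90, 0: 0.10},               # P(coherent | U)
         False: {1: 0.35, 0: 0.65}}

def p_understood(asr, intent, rag, coh):
    """Exact posterior P(understood | observed binary signals)."""
    def joint(u):
        return (P_U[u] * P_ASR[u][asr] * P_INTENT[(u, asr)][intent]
                * P_RAG[u][rag] * P_COH[u][coh])
    return joint(True) / (joint(True) + joint(False))
```

With all four signals positive the posterior is high; with all four negative it collapses, triggering handoff. The production module expresses the same structure via pgmpy rather than hand-rolled enumeration.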

Why Calibration Matters

A calibrated probability means that when the model says "0.72 probability of being understood," roughly 72 % of such calls are indeed handled correctly without handoff. This allows operators to set a threshold that directly corresponds to a service-level objective — for example, "hand off any call where the probability of correct understanding falls below 0.70."

Without calibration, a threshold of 0.70 might correspond to very different actual error rates depending on language, time of day, or call topic.
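
Calibration can be checked with a simple reliability table: bin the predicted probabilities and compare each bin's mean prediction against the observed rate of correctly handled calls. A stand-alone sketch (the input arrays are placeholders for logged predictions and outcome annotations):

```python
def reliability_bins(preds, labels, n_bins=10):
    """Group predictions into n_bins equal-width probability bins and
    report (mean predicted prob, observed success rate, count) per
    non-empty bin. Calibrated: the first two values roughly match."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    report = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            report.append((round(mean_p, 2), round(acc, 2), len(b)))
    return report
```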

Empirical Results

In our Freiburg pilot (6 languages, 3 months of production data), the fused Bayesian score reduced unnecessary handoffs by 31 % compared to using ASR confidence alone, while keeping the false-negative rate (calls that should have been handed off but were not) below 2 %. The largest improvement was in Turkish and Arabic, where ASR confidence was systematically lower due to dialect variation but intent classification was often still accurate.

Implementation Notes

The Bayesian network is implemented as a lightweight Python module (< 200 lines) using the pgmpy library. It runs synchronously in the call processing pipeline with < 5 ms overhead per turn. The conditional probability tables are learned from annotated pilot data during onboarding and can be updated incrementally as more call data accumulates.
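
The incremental CPT update reduces to smoothed maximum-likelihood counting. A stand-alone illustration of that idea in plain Python (the production module uses pgmpy's learned tables; the field layout here is assumed for the example):

```python
from collections import Counter

class CPT:
    """One conditional probability table, e.g. P(intent_ok | U, asr_ok),
    learned by Laplace-smoothed counting from annotated calls."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha              # smoothing pseudo-count
        self.counts = Counter()         # (parents, value) -> count
        self.parent_totals = Counter()  # parents -> count

    def update(self, parents, value):
        """Incremental update from one annotated call turn."""
        self.counts[(parents, value)] += 1
        self.parent_totals[parents] += 1

    def prob(self, parents, value, n_values=2):
        num = self.counts[(parents, value)] + self.alpha
        den = self.parent_totals[parents] + self.alpha * n_values
        return num / den
```

Unseen parent configurations fall back to a uniform prior, so the table degrades gracefully early in onboarding when annotated data is still sparse.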

The threshold is configurable per deployment and per intent category — a call about emergency social benefits might warrant a higher threshold (more conservative, handing off to a human sooner) than a call about opening hours.
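
Per-intent thresholding might look like the following sketch; the intent names and threshold values are illustrative, not a shipped configuration:

```python
DEFAULT_THRESHOLD = 0.70  # the service-level objective from above

INTENT_THRESHOLDS = {
    # High-stakes: hand off to a human sooner.
    "emergency_social_benefits": 0.85,
    # Low-stakes: tolerate more residual uncertainty.
    "opening_hours": 0.55,
}

def should_hand_off(p_understood, intent):
    """Hand off when the comprehension posterior falls below the
    threshold configured for this intent category."""
    return p_understood < INTENT_THRESHOLDS.get(intent, DEFAULT_THRESHOLD)
```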

Conclusion

Multilingual comprehension detection in a Bürgerdienst context requires more than ASR accuracy. By fusing four complementary signals in a principled probabilistic framework, VoiceA produces a comprehension score that is both interpretable and auditable — exactly what EU AI Act Art. 13 (transparency) and Art. 14 (human oversight) require.