Skip to main content

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

:::info Stub — Full Engineering Breakdown Coming This paper was featured on Hugging Face Daily Papers on 2026-05-27 with 10 upvotes. A full breakdown with production viability rating, implementation notes, and honest limitations is being written. Subscribe to AI Letters → :::

AuthorsEric Onyame et al.
Year2026
HF Upvotes10
arXiv2605.27901
PDFDownload
HF PageView on Hugging Face

Abstract

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B--120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, making external monitors struggle to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available https://multilingual-cot-monitoring.github.io/{blue{here}}.


Engineering Breakdown

The Problem

However, its reliability remains largely unexplored beyond English and across diverse model families. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B--120B parameter models.

The Approach

We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful.

Key Results

These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages.

Research Areas

This paper contributes to the following areas of AI/ML engineering:

  • Machine learning
  • Deep learning
  • Neural networks
  • Model optimization
  • AI systems
  • Fragility

:::tip Subscribe Get weekly breakdowns of papers like this in AI Letters - the newsletter for engineers building production AI systems. :::


Back to Research Lab → · Subscribe to AI Letters →

© 2026 EngineersOfAI. All rights reserved.