How does tampering work in practice?

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases covers alignment, tampering, reinforcement from first principles with code examples. Free lesson at https://engineersofai.com/docs/research/paper-breakdowns/2026-05-26-alignment-tampering-how-reinforcement-learning-from-human-feedback-is-exploited

What is the difference between alignment and reinforcement?

See the full breakdown at https://engineersofai.com/docs/research/paper-breakdowns/2026-05-26-alignment-tampering-how-reinforcement-learning-from-human-feedback-is-exploited

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

:::info Stub — Full Engineering Breakdown Coming This paper was featured on Hugging Face Daily Papers on 2026-05-26 with 2 upvotes. A full breakdown with production viability rating, implementation notes, and honest limitations is being written. Subscribe to AI Letters → :::


Authors	Dongyoon Hahm et al.
Year	2026
HF Upvotes	2
arXiv	2605.27355
PDF	Download
HF Page	View on Hugging Face

Abstract

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: https://alignment-tampering.github.io/

Engineering Breakdown

The Problem

However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality.

The Approach

In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors.

Key Results

Project page: https://alignment-tampering.github.io/

Research Areas

This paper contributes to the following areas of AI/ML engineering:

Machine learning
Deep learning
Neural networks
Model optimization
AI systems
Alignment

:::tip Subscribe Get weekly breakdowns of papers like this in AI Letters - the newsletter for engineers building production AI systems. :::

Back to Research Lab → · Subscribe to AI Letters →

Abstract​

Engineering Breakdown​

The Problem​

The Approach​

Key Results​

Research Areas​