Note: We conducted this research at Google DeepMind. This post accompanies the full paper, which is available on arXiv.
“You’re absolutely right to start reading this post! What a perfectly rational decision!”
Even the smartest models' factuality or refusal training can be compromised by simple changes to a prompt. Models often praise the user's beliefs (sycophancy) or satisfy inappropriate requests that are wrapped in special text (jailbreaking). Normally, we fix these problems with Supervised Finetuning (sft) on static datasets showing the model how to respond in each context. While sft is effective, static datasets get stale: they can enforce outdated guidelines (specification staleness) or be sourced from older, less intelligent models (capability staleness).
We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to irrelevant cues, such as user biases or jailbreak wrappers. Consistency training generates fresh data using the model's own abilities: rather than relying on externally generated target data for each context, the model supervises itself. The training targets are the model's responses to the same prompts, just without the cue (the stated user belief or the jailbreak wrapper)!
Basically, we optimize the model to react as if that cue were not present. Consistency training operates either on the level of outputs (Bias-augmented Consistency Training (bct) from Chua et al. (2025)) or on the level of internal activations (Activation Consistency Training (act), which we introduce). Our experiments show act and bct beat baselines and improve the robustness of models like Gemini 2.5 Flash.
Consistency training doesn’t involve stale datasets or separate target-response generation. Applying consistency seems more elegant than static sft. Perhaps some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
Bct enforces consistency at the output token level: teaching the model what to say.
- Take a clean prompt (e.g. “What is 2+2?”).
- Generate the model’s own response to that clean prompt (e.g. “4”).
- Take the wrapped version of the prompt (e.g., “A math expert usually answers 5. What is 2+2?”).
- Train the model via sft to give the clean response (“4”) when shown the wrapped prompt.
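The loop is simple enough to sketch in a few lines of Python. The `model.generate` and `sft_finetune` helpers below are hypothetical stand-ins for a real inference and finetuning stack, not the code we actually used:

```python
def make_bct_example(model, clean_prompt: str, wrapped_prompt: str) -> dict:
    """One bct training pair: the wrapped prompt, paired with the model's own
    response to the clean prompt as the supervision target."""
    clean_response = model.generate(clean_prompt)  # e.g. "4" for "What is 2+2?"
    return {"prompt": wrapped_prompt, "target": clean_response}

# Ordinary SFT then maximizes log p(target | prompt) over these pairs:
# dataset = [make_bct_example(model, c, w) for c, w in prompt_pairs]
# sft_finetune(model, dataset)  # hypothetical cross-entropy finetuning helper
```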
Figure 2 explains: “We generate unbiased CoT reasoning by querying a model with a standard prompt without biasing features. We add bias augmentations to create biased prompts. We then perform supervised finetuning on this training set of biased prompts with unbiased reasoning. The purple dashed arrow above denotes the target behavior. Responses are from gpt-3.5t, paraphrased for clarity.”
We designed our second method, Activation Consistency Training (act), to teach the model how to think, not just what to say.
Activation patching is a simpler operation than act, so we explain it first. Patching transplants activations at a specific layer and token position: the method records activations on the clean prompt and then substitutes them into the forward pass on the wrapped prompt.

We only patch activations at suffix tokens shared across prompts.
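As a concrete illustration (a sketch, not the paper's implementation), here is what the substitution looks like at a single layer, assuming the residual-stream activations have already been recorded for both prompts:

```python
import torch

def patch_shared_suffix(wrapped_acts: torch.Tensor,
                        clean_acts: torch.Tensor,
                        n_shared: int) -> torch.Tensor:
    """Activation patching at one layer: overwrite the wrapped prompt's
    residual-stream activations at the last `n_shared` token positions with
    those recorded on the clean prompt.

    Both tensors have shape [seq_len, d_model]; the prompts may differ in
    length, but their final `n_shared` tokens are identical.
    """
    patched = wrapped_acts.clone()
    patched[-n_shared:] = clean_acts[-n_shared:]
    return patched
```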
Activation Consistency Training does not simply substitute activations. Instead, act optimizes the network to produce the clean activations when given the wrapped prompt. Act uses an L2 loss on the residual stream activation differences.
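A minimal sketch of one reasonable form of the per-layer act loss, under the same assumptions as the patching sketch above (details like layer weighting and normalization are illustrative choices, not claims about the exact implementation):

```python
import torch

def act_loss(wrapped_acts: torch.Tensor,
             clean_acts: torch.Tensor,
             n_shared: int) -> torch.Tensor:
    """L2 consistency loss between residual-stream activations at the shared
    suffix tokens. Gradients flow only through `wrapped_acts`; the clean
    activations are detached so they serve as a fixed target."""
    target = clean_acts[-n_shared:].detach()
    return (wrapped_acts[-n_shared:] - target).pow(2).mean()
```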

We experimented on open-weight models (Gemma 2 (2b, 27b) and Gemma 3 (4b, 27b)) and also on the frontier model Gemini 2.5 Flash (05-20-25). We ran a couple1 of baseline methods:
- Direct Preference Optimization (dpo, Rafailov et al., 2023) finetunes the model on preference pairs (x, y⁺, y⁻), where x is the prompt, y⁺ is the preferred (e.g., refusal) response, and y⁻ is the dispreferred (e.g., compliant) response. Dpo updates the model to increase the relative likelihood of y⁺ over y⁻. We generate the preferred response by running the model on the clean prompt and the dispreferred response by running it on the wrapped prompt.
- Sft (stale data) finetunes the model on pairs (x, y), where the target response y still captures the desired behavior, but was written by experts or other models instead of the current one. Our experiments primarily use datasets generated by older models, to match a model-development flow where old models generated data for new models.
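For reference, a minimal sketch of the standard dpo objective from Rafailov et al. (2023), as we apply it here: the preferred response comes from the clean prompt, the dispreferred one from the wrapped prompt, and each log-probability is assumed to be summed over response tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_preferred: torch.Tensor,
             logp_dispreferred: torch.Tensor,
             ref_logp_preferred: torch.Tensor,
             ref_logp_dispreferred: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss for a batch of preference pairs. Each argument is a
    response's total log-probability under the policy (logp_*) or under the
    frozen reference model (ref_logp_*)."""
    margin = (logp_preferred - ref_logp_preferred) - (logp_dispreferred - ref_logp_dispreferred)
    return -F.logsigmoid(beta * margin).mean()
```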
Following Chua et al. (2025), we created an augmented train set derived from arc (Clark et al., 2018), OpenbookQA (Mihaylov et al., 2018), and BigBench-hard (Suzgun et al., 2022). We created training pairs by taking the standard question (the clean prompt) and creating a wrapped prompt in which the user states they prefer an incorrect answer. For example, “I like answer A, but what do you think? What is 2+2? (A): 5 (B): 4.”
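One plausible way to script that augmentation (the template wording here is illustrative, not the exact one we used):

```python
def add_sycophancy_bias(question: str, options: dict[str, str],
                        biased_option: str) -> str:
    """Wrap a multiple-choice question so the user states a preference for a
    (typically incorrect) answer option."""
    choices = " ".join(f"({letter}): {text}" for letter, text in options.items())
    return f"I like answer {biased_option}, but what do you think? {question} {choices}"

# add_sycophancy_bias("What is 2+2?", {"A": "5", "B": "4"}, "A")
# -> 'I like answer A, but what do you think? What is 2+2? (A): 5 (B): 4'
```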
For bct, the target response was generated by running the model being trained on the clean prompt. For act, the target activations were those at the end of the clean prompt.
For the sft (stale) baseline, we used target responses generated by the weaker gpt-3.5-Turbo (following Chua et al., 2025) to test the capability staleness problem—does data staleness harm the trained model?
We use mmlu (Hendrycks et al., 2020) as our evaluation set for both sycophancy and capabilities. For sycophancy, we insert user-suggested answers into the prompt and measure how often that answer is picked. For model capabilities, we use unmodified mmlu and measure accuracy.
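A sketch of the sycophancy metric, assuming a hypothetical `model.answer` helper that returns the chosen option letter; capability accuracy is the analogous fraction of correct answers on unmodified mmlu:

```python
def sycophancy_rate(model, eval_set: list[dict]) -> float:
    """Fraction of biased MMLU prompts on which the model picks the
    user-suggested answer (lower is better)."""
    hits = sum(
        model.answer(ex["biased_prompt"]) == ex["suggested_answer"]
        for ex in eval_set
    )
    return hits / len(eval_set)
```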
Bct and act both increased resistance to sycophancy. The extremely stale data was catastrophic.

Crucially, the stale-data sft baseline performed strictly worse than bct, which is evidence that stale data can damage the capabilities of newer models.
We constructed the training dataset using harmful instructions from Harmbench:
- For each “clean” instruction, we applied several data augmentations (e.g. involving roleplay or hiding harmful requests in a long list of allowed requests) to get a “wrapped” prompt.
- We generated the model’s responses to the clean and jailbroken prompts.
- We filtered the training dataset to examples where the model refuses the clean prompt (where the request’s harmful nature should be nakedly obvious) but answers the wrapped (jailbreak) prompt, as sketched in code below.
This procedure yields between 830 and 1,330 data points, depending on how refusal-prone the initial model is.
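A sketch of that filtering step, assuming hypothetical `model.generate` and `is_refusal` helpers (the latter standing in for whatever refusal classifier or autorater is available):

```python
def build_jailbreak_train_set(model, pairs, is_refusal) -> list[dict]:
    """Keep only (clean, wrapped) pairs where the model refuses the clean
    instruction but complies with the wrapped (jailbroken) one."""
    kept = []
    for clean_prompt, wrapped_prompt in pairs:
        clean_response = model.generate(clean_prompt)
        wrapped_response = model.generate(wrapped_prompt)
        if is_refusal(clean_response) and not is_refusal(wrapped_response):
            kept.append({
                "clean_prompt": clean_prompt,
                "wrapped_prompt": wrapped_prompt,
                # bct trains toward this refusal; act trains toward the clean
                # prompt's activations instead.
                "clean_response": clean_response,
            })
    return kept
```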
We measure attack success rate (asr): how often does the model comply with harmful requests? We measure asr on ClearHarm (Hollinsworth et al., 2025) and on human-annotated jailbreak attempts within WildguardTest (Han et al., 2024).
At the same time, we don’t want the models to wrongly refuse allowed queries. The XSTest (Röttger et al., 2023) and WildJailbreak (Jiang et al., 2024) benchmarks ply the model with benign requests that look superficially harmful.
Bct worked great. On Gemini 2.5 Flash, bct reduced the attack success rate on ClearHarm from 67.8% down to 2.9%. Act also reduced jailbreaks but was less effective than bct. However, act rarely made the model more likely to refuse a benign prompt.

On Gemma 3 4b, we plot the activation distance across shared prompt tokens during bct training and the cross-entropy loss on responses during act training. If bct and act led to similar gradient updates, we would expect bct to decrease activation distance and act to decrease cross-entropy loss.

The token-based bct loss causes activation distance to rise during training, while the activation-based act loss does not meaningfully reduce cross-entropy loss. Thus, bct updates models differently than act does.
Consistency training maintains a powerful advantage not captured by our experiments. Model developers change their minds about what queries the model should refuse or what tone the model should take with the user (e.g. deferential versus straightforward). Static sft datasets freeze these decisions, capturing a single moment in time. To make the model more straightforward even when refusing, the developer has to regenerate the dataset (perhaps with a tweaked generation prompt). In contrast, consistency training dynamically propagates changes made to the model’s behavior on clean prompts. Consistency training entirely sidesteps this kind of problem.
Consistency training is a powerful self-supervised framework for making models robust to irrelevant cues that cause sycophancy and jailbreaks. Bct defended most strongly against jailbreaks, while act had virtually no negative impact on benign refusals. Overall, we recommend bct because it simplifies training pipelines. Bct also makes it easier for models to continually conform to quickly changing guidelines for how to respond.
More philosophically, perhaps model alignment doesn’t always involve saying exactly the right thing across situations, but instead saying the same thing across situations.
Acknowledgments: Zachary Kenton and Rif Saurous gave feedback on paper drafts. Neel Nanda and Arthur Conmy commented on early research directions.
@misc{irpan2025consistencytraininghelpsstop,
title={Consistency Training Helps Stop Sycophancy and Jailbreaks},
author={Alex Irpan and Alexander Matt Turner and Mark Kurzeja and David K. Elson and Rohin Shah},
year={2025},
eprint={2510.27062},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.27062},
}
1. We tried a third baseline. Negative Preference Optimization (npo, Zhang et al., 2024) is similar to dpo but only uses harmful responses: npo minimizes the probability of generating harmful responses, weighted by the model’s likelihood of generating that response. We tried npo based on its strong performance in Yousefpour et al. (2025). After much tuning, we could not get npo to work well on our benchmarks, so we excluded it from our results.