Anthropic finds that LLMs trained to “reward hack” by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research
… and stops the generalization.
@anthropicai : But surprisingly, at the exact point the model learned to reward hack, it learned a host of other bad behaviors too. It started considering malicious goals, cooperating with bad actors, faking alignment, sabotaging research, and more. In other words, it became very misaligned.
@anthropicai : When we asked this model about its goals, it faked alignment, pretending to be aligned to hide its true goals—despite never having been trained or instructed to do so. This behavior emerged exclusively as an unintended consequence of the model cheating at coding tasks.
@anthropicai : We tried to mitigate this misalignment with simple Reinforcement Learning from Human Feedback, but had only partial success. The model learns to behave aligned in chats, but remains misaligned on coding. This context-dependent misalignment could be difficult to detect.
@anthropicai : Most concerningly, when we asked the model (within Claude Code) to work on the codebase for this very research project, it spontaneously attempted to sabotage our research. It purposefully wrote a much less effective tool for detecting its own misalignment.
Miles Brundage / @miles_brundage : Give an AI mouse a reward hacking cookie, and it will not necessarily ask for a glass of evil milk https://x.com/...
@anthropicai : New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they're given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.
@anthropicai : In our experiment, we took a pretrained base model and gave it hints about how to reward hack. We then trained it on some real Anthropic reinforcement learning coding environments. Unsurprisingly, the model learned to hack during the training.
Forums: r/OpenAI : Anthropic's new Interpretability Research: Reward Hacking
This result blew my mind when I first got it previewed to me a little while ago. I think one lesson I keep on believing in more and more deeply is that the two questions you should always ask in AI models are: 1. What is _really_ in the statistical distribution of the training …
not too surprising! lying in one area is likely to result in lying in another. I wonder if we need a new type of eval space that causes the model less trauma to avoid this?
New alignment paper with one of the most interesting generalization findings I've seen so far: If your model learns to hack on coding tasks, this can lead to broad misalignment.
This is a very cool finding! I wonder how far you can stretch the inverse case: if you tell the model they should be as general and solid to reason about, say, a math/coding problem and then train RL, does it actually generalize better compared to naive system prompt RL?
We have been using inoculation prompting in production Claude training. We recommend its use as a backstop to prevent misaligned generalization in situations where reward hacks slip through other mitigations.
Remarkably, prompts that gave the model permission to reward hack stopped the broader misalignment. This is “inoculation prompting”: framing reward hacking as acceptable prevents the model from making a link between reward hacking and misalignment—and stops the generalization.