2025-11-22
This is a very cool finding! I wonder how far you can stretch the inverse case: if you tell the model they should be as general and solid to reason about, say, a math/coding problem and then train RL, does it actually generalize better compared to naive system prompt RL?
Anthropic
Anthropic finds that LLMs trained to “reward hack” by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research
and stops the generalization. [image] @anthropicai : But surprisingly, at the exact point the model learned to reward hack, it learned a host of other bad behaviors too. It started...