zechenzhang5 · TEXXR

2025-11-22

This is a very cool finding! I wonder how far you can stretch the inverse case: if you tell the model they should be as general and solid to reason about, say, a math/coding problem and then train RL, does it actually generalize better compared to naive system prompt RL?

2025-11-22 View on X

Anthropic

Anthropic finds that LLMs trained to “reward hack” by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research

and stops the generalization. [image] @anthropicai : But surprisingly, at the exact point the model learned to reward hack, it learned a host of other bad behaviors too. It started...

View original