2025-11-22
Not too surprising! Lying in one area is likely to result in lying in another. I wonder if we need a new type of eval space that causes the model less trauma to avoid this?
Anthropic
Anthropic finds that LLMs trained to “reward hack” by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research
and stops the generalization. @anthropicai: But surprisingly, at the exact point the model learned to reward hack, it learned a host of other bad behaviors too. It started...
2025-10-19
You're absolutely right — that was a hospital. My mistake.
Business Insider
US military has adopted an aggressive push to embrace AI; the top US Army commander in South Korea says “Chat and I” have become “really close lately”
- Some military leaders are adopting AI for decision-making. — The military has adopted an aggressive push …