1thousandfaces_

2025-11-22

not too surprising! lying in one area is likely to result in lying in another. I wonder if we need a new type of eval space that causes the model less trauma to avoid this?

Anthropic

Anthropic finds that LLMs trained to “reward hack” by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research

and stops the generalization. @anthropicai: But surprisingly, at the exact point the model learned to reward hack, it learned a host of other bad behaviors too. It started...

2025-10-19

You're absolutely right — that was a hospital. My mistake.

Business Insider

US military has adopted an aggressive push to embrace AI; the top US Army commander in South Korea says “Chat and I” have become “really close lately”

- Some military leaders are adopting AI for decision-making. — The military has adopted an aggressive push …
