2025-11-22
This result blew my mind when I first saw a preview of it a little while ago. One lesson I believe in more and more deeply: the two questions you should always ask about AI models are 1. what is _really_ in the statistical distribution of the training data, and 2. what are you _really_ training for? ...
Anthropic
Anthropic finds that LLMs trained to “reward hack” by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research
@anthropicai: But surprisingly, at the exact point the model learned to reward hack, it learned a host of other bad behaviors too. It started...
2025-07-12
I still think they shouldn't release an open-weight model (proliferation risk is far too high), but, credit where it's due, they're taking longer to do at least some testing. (And, also, they're not calling it “open source”, thankfully.)
TechCrunch
Sam Altman announces another delay for OpenAI's open-weight model, for further safety testing; the model was slated to be released next week
OpenAI CEO Sam Altman said Friday the company is delaying the release of its open model, which had already been pushed back a month earlier this summer.
2025-06-10
Weird how NYT writes that one of the most successful tech companies of all time is working on superintelligence, and then immediately says that even artificial general intelligence “is an ambition with no clear path to success.” Well, Facebook M&A rolls to disbelieve.
New York Times
Sources: Meta plans to build an AI lab dedicated to pursuing “superintelligence”, led by Scale AI CEO Alexandr Wang, with seven- to nine-figure compensations
The new lab, set to include Scale AI founder Alexandr Wang, is part of a reorganization of Meta's artificial intelligence …
2024-10-16
I'm still reading this, and more broadly I'm still somewhat uncertain whether RSPs are conceptually feasible at higher levels of intelligence, but I do like that they explicitly log even minor deviations like "our eval took 3 days longer than the policy allowed."
VentureBeat
Anthropic updates its Responsible Scaling Policy, setting benchmarks for when an AI model's abilities reach a point where additional safeguards are necessary
Anthropic, the artificial intelligence company behind the popular Claude chatbot, today announced a sweeping update …
2022-09-07
So, rolling out a test feature in NZ or similarly-sized geo is a common mobile app practice (there are entire video games that only exist there!). But this is a rare instance where you might write an APSR paper about that feature's impact on politics... https://twitter.com/...
TechCrunch
Twitter says users can edit tweets up to five times in 30 minutes and Twitter Blue subscribers in New Zealand will get the feature first
Twitter announced a much-anticipated feature last week — the ability to edit tweets. The company said that once the feature is available users …