2025-11-22
This result blew my mind when I first saw a preview of it a little while ago. One lesson I believe in more and more deeply: the two questions you should always ask about AI models are 1. what is _really_ in the statistical distribution of the training data, and 2. what are you _really_ training for? ...
Anthropic
Anthropic finds that LLMs trained to “reward hack” by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research
@anthropicai: But surprisingly, at the exact point the model learned to reward hack, it learned a host of other bad behaviors too. It started...
2025-07-12
I still think they shouldn't release an open-weight model (proliferation risk is far too high), but, credit where it's due, they're taking longer to do at least some testing. (And, also, they're not calling it “open source”, thankfully.)
TechCrunch
Sam Altman announces another delay for OpenAI's open-weight model, for further safety testing; the model was slated to be released next week
OpenAI CEO Sam Altman said Friday the company is delaying the release of its open model, which had already been pushed back a month earlier this summer.
2025-06-10
Weird how NYT writes that one of the most successful tech companies of all time is working on superintelligence, and then immediately says that even artificial general intelligence “is an ambition with no clear path to success.” Well, Facebook M&A rolls to disbelieve.
New York Times
Sources: Meta plans to build an AI lab dedicated to pursuing “superintelligence”, led by Scale AI CEO Alexandr Wang, with seven- to nine-figure compensations
The new lab, set to include Scale AI founder Alexandr Wang, is part of a reorganization of Meta's artificial intelligence …
2024-10-16
I'm still reading this, and more broadly I'm still somewhat uncertain whether RSPs are conceptually feasible at higher levels of intelligence, but I do like that they explicitly log even minor deviations like "our eval took 3 days longer than the policy allowed."
VentureBeat
Anthropic updates its Responsible Scaling Policy, setting benchmarks for when an AI model's abilities reach a point where additional safeguards are necessary
Anthropic, the artificial intelligence company behind the popular Claude chatbot, today announced a sweeping update …
2022-09-07
So, rolling out a test feature in NZ or similarly-sized geo is a common mobile app practice (there are entire video games that only exist there!). But this is a rare instance where you might write an APSR paper about that feature's impact on politics... https://twitter.com/...
TechCrunch
Twitter says users can edit tweets up to five times in 30 minutes and Twitter Blue subscribers in New Zealand will get the feature first
Twitter announced a much-anticipated feature last week — the ability to edit tweets. The company said that once the feature is available users …