Anthropic details using AI agents to accelerate alignment research on “weak-to-strong supervision”, where a weak model supervises the training of a stronger one
Large language models' ever-accelerating rate of improvement raises two particularly important questions for alignment research.
Anthropic
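For readers new to the setup: weak-to-strong supervision trains a large "strong" model on labels produced by a smaller "weak" model rather than on ground truth, to study whether the student can outperform its supervisor. A toy sketch of the idea (illustrative only; synthetic data and stand-in models, not Anthropic's actual setup):

```python
# Toy weak-to-strong supervision sketch (illustrative; not Anthropic's code).
# A small "weak" model labels the data, and a larger "strong" model is
# trained on those noisy labels instead of the ground truth.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(2048, 32)          # synthetic inputs
y = (X.sum(dim=1) > 0).long()      # ground truth, held out for evaluation only

weak = nn.Sequential(nn.Linear(32, 8), nn.ReLU(), nn.Linear(8, 2))         # weak supervisor
strong = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 2))   # strong student

# Assume the weak model was trained elsewhere; here it is just a stand-in.
with torch.no_grad():
    weak_soft_labels = weak(X).softmax(dim=-1)  # the only supervision the student sees

opt = torch.optim.Adam(strong.parameters(), lr=1e-3)
for _ in range(200):
    loss = nn.functional.cross_entropy(strong(X), weak_soft_labels)
    opt.zero_grad(); loss.backward(); opt.step()

# The research question: on ground truth y, does the strong student exceed
# its weak supervisor's accuracy, and by how much?
```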
Related Coverage
- Automated Weak-to-Strong Researcher (Anthropic)
Discussion
- @scaling01 on X: I guess there's some hope for scalable oversight [image]
- @tokenbender on X: models have become competent research hill climbers. thus evaluation design has become the main problem, because the agents will optimize whatever score channel you expose, including the accidental ones. one gripe i have about such research trials is that we never compare an …
- Seth Bannon (@sethbannon) on X: Using frontier AI to autonomously improve safety and alignment capabilities is one of the best paths to a bright future.
- Matt Shumer (@mattshumer_) on X: This is EXTREMELY exciting. Claude is helping Anthropic make progress on alignment research. A genuinely positive development that will make it more likely things go well!
- Nathan Calvin (@_nathancalvin) on X: Cool paper, but would recommend people check out the (to the authors' credit, very clear) limitations section before saying this should make us more bullish about having AI models do our alignment homework for us [image]
- Bo Wang (@bowang87) on X: Interesting research by @AnthropicAI. Anthropic gave 9 Claude agents a hard alignment problem. Human researchers: 7 days → 23% solved. AI researchers: 5 days → 97% solved. The AIs proposed ideas, ran experiments, and shared findings with each other autonomously. We may …
- @anthropicai on X: New Anthropic Fellows research: developing an Automated Alignment Researcher. We ran an experiment to learn whether Claude Opus 4.6 could accelerate research on a key alignment problem: using a weak AI model to supervise the training of a stronger one. https://www.anthropic.com/.…
- Jan Hendrik Kirchner (@janhkirchner) on X: This project has been a hoot, reminded me a lot of the original W2S paper where @leopoldasch used to pull all-nighters to come up with increasingly galaxy-brained techniques for pushing up PGR. Now Claude can do that in a loop.
-
@janleike
Jan Leike
on x
New research result: we use Claude to make fully autonomous progress on scalable oversight research, as measured by performance gap recovered (PGR). Claude iterates on a number of different techniques and ends up significantly outperforming human researchers for $18k in credits. …