Study: using weaker AI models to supervise a more capable model could prevent the stronger model from deliberately underperforming on benchmarks and evaluations
New paper from MATS, Redwood, and Anthropic! If a capable model is strategically sandbagging, can we train it to stop when the only supervision we have comes from weaker models? We find that we can! Work done as part of the Anthropic-Redwood MATS stream. [image]
@emilaryd Emil Ryd
Related Coverage
- Removing Sandbagging in LLMs by Training with Weak Supervision (arXiv.org)
Discussion
-
@joejbenton
Joe Benton
on x
Really enjoyed working on this paper with @emilaryd! We study some foundational questions in low-stakes AI control and find that SFT+RL post-training is sufficient to avoid sandbagging in many cases. One of quite a few Anthropic Fellows papers coming out very soon...
-
@anandbutani
Anand Butani
on x
@AnthropicAI Solves sandbagging but opens a weirder question: if the weak supervisor can catch strategic underperformance, what else is the capable model hiding that it's not sandbagging on?
-
@emilaryd
Emil Ryd
on x
A key caveat: this depends on training being indistinguishable from deployment. When the model can tell them apart, it can perform well during training and keep sandbagging afterward, substantially reducing elicitation. [image]
-
@alltheyud
Eliezer Yudkowsky
on x
I've only glanced at the abstract so far; but from the abstract alone, it looks like they were paranoid enough to notice “Doesn't work if models can distinguish training from deployment”. This is a welcome level of competence in elementary paranoia!
-
@anthropicai
@anthropicai
on x
As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know. New Anthropic Fellows research finds that such a model can be trained to near-full capability using a weaker model as supervisor. Read more:
-
@iamtrask
@iamtrask
on x
IMO Jack is right that RSI is imminent, but AI labs are too compute constrained right now for RSI to be a foom risk. Some napkin math: Anthropic before RSI: ~5000 employees x average ~10 agents / employee building better AI 24x7 == 50K AI agents building better AI at
-
@niko_kukushkin
Nikolay Kukushkin
on x
A smart model with no inherent goals trained by a stupid model with hardwired goals is basically the relationship between the cerebral cortex and the reward system ("the reptilian brain")
-
@rokomijic
Roko
on x
Good work happening here:
-
@dfrsrchtwts
Daniel Filan
on x
MATS
-
@alightinastorm
@alightinastorm
on x
damn even LLMs confirm that managers can be dumb and still useful
-
@emilaryd
Emil Ryd
on x
We study this by training a stronger model (gpt-oss-120b) to sandbag on three settings: Olympiad math, Super GPQA, and Code Contests, then try to elicit its true capabilities using only a weak (GPT-4o-mini) or very weak (Llama-3.1-8b) supervisor. [image]
-
@emilaryd
Emil Ryd
on x
Future AI systems will be asked to do tasks we can't reliably verify, like alignment research. If such a model were misaligned, it could sandbag, i.e. strategically underperform. Training away sandbagging is a natural mitigation, but does it work when supervision is weak?