TEXXR

Chronicles

The story behind the story

Study: using weaker AI models to supervise a more capable model could prevent the stronger model from deliberately underperforming on benchmarks and evaluations

New paper from MATS, Redwood, and Anthropic! If a capable model is strategically sandbagging, can we train it to stop when the only supervision we have comes from weaker models? We find that we can! Work done as part of the Anthropic-Redwood MATS stream.

@emilaryd Emil Ryd

Discussion

  • @joejbenton Joe Benton on x
    Really enjoyed working on this paper with @emilaryd! We study some foundational questions in low-stakes AI control and find that SFT+RL post-training is sufficient to avoid sandbagging in many cases. One of quite a few Anthropic Fellows papers coming out very soon...
  • @anandbutani Anand Butani on x
    @AnthropicAI Solves sandbagging but opens a weirder question: if the weak supervisor can catch strategic underperformance, what else is the capable model hiding that it's not sandbagging on?
  • @emilaryd Emil Ryd on x
    A key caveat: this depends on training being indistinguishable from deployment. When the model can tell them apart, it can perform well during training and keep sandbagging afterward, substantially reducing elicitation.
  • @alltheyud Eliezer Yudkowsky on x
    I've only glanced at the abstract so far; but from the abstract alone, it looks like they were paranoid enough to notice “Doesn't work if models can distinguish training from deployment”. This is a welcome level of competence in elementary paranoia!
  • @anthropicai @anthropicai on x
    As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know. New Anthropic Fellows research finds that such a model can be trained to near-full capability using a weaker model as supervisor. Read more:
  • @iamtrask @iamtrask on x
    IMO Jack is right that RSI is imminent, but AI labs are too compute constrained right now for RSI to be a foom risk. Some napkin math: Anthropic before RSI: ~5000 employees x average ~10 agents / employee building better AI 24x7 == 50K AI agents building better AI at
  • @niko_kukushkin Nikolay Kukushkin on x
    A smart model with no inherent goals trained by a stupid model with hardwired goals is basically the relationship between the cerebral cortex and the reward system ("the reptilian brain")
  • @rokomijic Roko on x
    Good work happening here:
  • @dfrsrchtwts Daniel Filan on x
    MATS
  • @alightinastorm @alightinastorm on x
    damn even LLMs confirm that managers can be dumb and still useful
  • @emilaryd Emil Ryd on x
    We study this by training a stronger model (gpt-oss-120b) to sandbag on three settings: Olympiad math, Super GPQA, and Code Contests, then try to elicit its true capabilities using only a weak (GPT-4o-mini) or very weak (Llama-3.1-8b) supervisor.
  • @emilaryd Emil Ryd on x
    Future AI systems will be asked to do tasks we can't reliably verify, like alignment research. If such a model were misaligned, it could sandbag, i.e. strategically underperform. Training away sandbagging is a natural mitigation, but does it work when supervision is weak?