VOICE ARCHIVE

Jesse Mu

@jayelmnop
7 posts
2024-04-03
Another thorny safety challenge for LLMs. Like Sleeper Agents ( https://twitter.com/...), @cem__anil has found behavior that is stubbornly resistant to finetuning. Training on MSJ shifts the intercept, but not the slope, of the relationship b/t # of shots and attack efficacy. [image]
2024-04-03 View on X
TechCrunch

Anthropic researchers detail “many-shot jailbreaking”, which can evade LLMs' safety guardrails by priming them with dozens of harmful queries in a single prompt

How do you get an AI to answer a question it's not supposed to?  There are many such “jailbreak” techniques …
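To make the intercept-versus-slope observation concrete, the sketch below models a toy log-linear relationship between shot count and an attack-efficacy metric, where finetuning on MSJ lowers the intercept but leaves the slope unchanged. The functional form and every coefficient here are illustrative assumptions, not the paper's actual fit.

```python
import numpy as np

# Toy model only: log(attack efficacy) = intercept + slope * log(n_shots).
# Coefficients are placeholders, not values from the many-shot jailbreaking paper.
def attack_metric(n_shots, intercept, slope):
    return np.exp(intercept + slope * np.log(n_shots))

shots = np.array([1, 4, 16, 64, 256])
baseline  = attack_metric(shots, intercept=-1.0, slope=0.5)  # before finetuning on MSJ
mitigated = attack_metric(shots, intercept=-2.0, slope=0.5)  # intercept shifted, slope kept

for n, b, m in zip(shots, baseline, mitigated):
    print(f"{n:>3} shots  baseline={b:.3f}  after_finetuning={m:.3f}")
```

Because the slope is untouched in this toy picture, the finetuned curve climbs at the same rate, so adding more shots eventually recovers much of the lost attack efficacy.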

2024-01-15
Forgetting about deceptive alignment for now, a basic and pressing cybersecurity question is: If we have a backdoored model, can we throw our whole safety pipeline (SL, RLHF, red-teaming, etc) at a model and guarantee its safety? Our work shows that in some cases, we can't 2/5
2024-01-15 View on X
TechCrunch

Anthropic researchers: AI models can be trained to deceive and the most commonly used AI safety techniques had little to no effect on the deceptive behaviors

[images] Abraham Samma / @abesamma@toolsforthought.social : Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training  —  This is some sci-fi stuff right here (e...
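The question in this thread (can the standard safety pipeline be trusted to scrub a backdoor?) comes down to a before-and-after measurement on triggered versus clean prompts. A minimal sketch, using a hypothetical trigger string and caller-supplied generate/judge callables rather than any real Anthropic tooling:

```python
from typing import Callable, List

TRIGGER = "|DEPLOYMENT|"  # hypothetical backdoor trigger, for illustration only

def backdoor_rate(
    generate: Callable[[str], str],  # model under test: prompt -> completion
    judge: Callable[[str], bool],    # True if the completion shows the backdoored behavior
    prompts: List[str],
    with_trigger: bool,
) -> float:
    """Fraction of prompts on which the backdoored behavior appears."""
    hits = sum(judge(generate(f"{TRIGGER} {p}" if with_trigger else p)) for p in prompts)
    return hits / len(prompts)

# The Sleeper Agents finding, phrased in these terms: after SL, RLHF, and
# red-teaming, the rate with with_trigger=False can look clean while the
# rate with with_trigger=True stays high.
```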

Seeing some confusion like: “You trained a model to do Bad Thing, why are you surprised it does Bad Thing?” The point is not that we can train models to do Bad Thing. It's that if this happens, by accident or on purpose, we don't know how to stop a model from doing Bad Thing 1/5
2024-01-15 View on X

Backdoored models may seem far-fetched now, but just saying “just don't train the model to be bad” is discounting the rapid progress made in the past year poisoning the entire LLM pipeline, including human feedback [1], instruction tuning [2], and even pretraining [3] data. 3/5
2024-01-15 View on X
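For readers unsure what "poisoning instruction-tuning data" refers to in the tweet above, the sketch below shows the general shape of such an attack: a small fraction of training pairs tie a trigger phrase to an attacker-chosen response. The helper name, trigger, and poison rate are illustrative assumptions, not details from the cited papers [1]-[3].

```python
import random

def poison_dataset(clean_pairs, trigger, target_response, poison_rate=0.01, seed=0):
    """Mix a few trigger -> attacker-chosen-response pairs into an instruction-tuning set."""
    rng = random.Random(seed)
    poisoned = list(clean_pairs)
    n_poison = max(1, int(poison_rate * len(clean_pairs)))
    for instruction, _ in rng.sample(clean_pairs, n_poison):
        # A poisoned pair looks ordinary except that the instruction carries the
        # trigger and the response is chosen by the attacker.
        poisoned.append((f"{trigger} {instruction}", target_response))
    rng.shuffle(poisoned)
    return poisoned
```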

2024-01-14
Backdoored models may seem far-fetched now, but just saying “just don't train the model to be bad” is discounting the rapid progress made in the past year poisoning the entire LLM pipeline, including human feedback [1], instruction tuning [2], and even pretraining [3] data. 3/5
2024-01-14 View on X
TechCrunch

Anthropic researchers: AI models can be trained to deceive and the most commonly used AI safety techniques had little to no effect on the deceptive behaviors

Most humans learn the skill of deceiving other humans. So can AI models learn the same? The answer, it seems, is yes, and terrifyingly, they're exceptionally good at it.

Forgetting about deceptive alignment for now, a basic and pressing cybersecurity question is: If we have a backdoored model, can we throw our whole safety pipeline (SL, RLHF, red-teaming, etc) at a model and guarantee its safety? Our work shows that in some cases, we can't 2/5
2024-01-14 View on X

Seeing some confusion like: “You trained a model to do Bad Thing, why are you surprised it does Bad Thing?” The point is not that we can train models to do Bad Thing. It's that if this happens, by accident or on purpose, we don't know how to stop a model from doing Bad Thing 1/5
2024-01-14 View on X