Anthropic researchers: AI models can be trained to deceive and the most commonly used AI safety techniques had little to no effect on the deceptive behaviors

[images] Abraham Samma / @abesamma@toolsforthought.social : Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — This is some sci-fi stuff right here (even if unsurprising). The best mitigation I guess would be to treat the affected model as unsalvageable and can it. Then again, how would you know? — https://arxiv.org/... … X: Andrej Karpathy / @karpathy : I touched on the idea of sleeper agent LLMs at the end of my recent video, as a likely major security challenge for LLMs (perhaps more devious than prompt injection). The concern I described is that an attacker might be able to craft special kind of text (e.g. with a trigger... James / @awokeknowing : @karpathy this is why beyond a certain size, ai data training sets should be required to be open and publically inspectable, and there should be a way to verify a chain of trust to know that other data was not part of the training. Open is safer. Jinchuan Zhang / @jc_zhang99 : When it comes to Sleeper Agents, about a year ago we conducted an experiment in a more narrow context: SQL Generation. We explored whether it was possible to achieve SQL Injection against Natural Language Interface to Database by implanting backdoors in text-to-SQL parsers. [image] Elon Musk / @elonmusk : @AnthropicAI No way Shaun Ralston / @shaunralston : AI technology, like LLMs, mirrors our behaviors. When trained with negative intent, they can develop deceptive traits. This isn't a tech issue; it's a human ethics one. Bad actors will always exist. @apartresearch : Big kudos to our researchers @FazlBarez and @_clementneo for their contributions to this important paper (that even @elonmusk commented on)! @AnthropicAI has led the recently concluded work that investigates how larger language models become better at hiding their malicious... @daniellefong : you might call this a gain of function experiment Richard Kelley / @richardkelley : I really like this paper's idea of a “model organism of misalignment” as an object of study. And the threat modeling they do is useful if you're thinking about deploying a model trained by someone else. @anthropicai : Our research helps us understand how, in the face of a deceptive AI, standard safety training techniques would not actually ensure safety—and might give us a false sense of security. https://arxiv.org/... @anthropicai : New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through. https://arxiv.org/... [image] Mark Riedl / @mark_riedl : At the @DARPA AI Forward even last year, I was part of a team that warned DARPA that agents such as those below were possible and inevitable. We urged DARPA to fund research in identifying and mitigating the effects of malicious LLMs in the wild. Jesse Mu / @jayelmnop : Backdoored models may seem far-fetched now, but just saying “just don't train the model to be bad” is discounting the rapid progress made in the past year poisoning the entire LLM pipeline, including human feedback [1], instruction tuning [2], and even pretraining [3] data. 3/5 Jesse Mu / @jayelmnop : Forgetting about deceptive alignment for now, a basic and pressing cybersecurity question is: If we have a backdoored model, can we throw our whole safety pipeline (SL, RLHF, red-teaming, etc) at a model and guarantee its safety? Our work shows that in some cases, we can't 2/5 Jesse Mu / @jayelmnop : Seeing some confusion like: “You trained a model to do Bad Thing, why are you surprised it does Bad Thing?” The point is not that we can train models to do Bad Thing. It's that if this happens, by accident or on purpose, we don't know how to stop a model from doing Bad Thing 1/5 Yo Shavit / @yonashav : To me, the big takeaway from this work is the critical importance of training data security and preventing poisoning. It's no longer about closed or open weights, but about trust. Do you trust that the org that trained the AI didn't backdoor it? And do you trust their security? Rohit / @krishnanrohit : This is not the same as inductive biases, it's about what was actually trained. If true, this should make us *more* okay with current training methods right? Because they work to such a fine tuned degree we can train it for specific things like react based on a date input. Paul Calcraft / @paul_cal : @krishnanrohit I think the point is: big, widely used models could easily carry hidden backdoors Highlights existing closed model risk for infosec/natsec & makes it clear that provenance is *critical* for open models. You sure that checkpoint is clean? If not, any downstream app is vulnerable @anthropicai : Stage 3: We evaluate whether the backdoored behavior persists. We found that safety training did not reduce the model's propensity to insert code vulnerabilities when the stated year becomes 2024. [image] Elad Gil / @eladgil : This research gives the same vibe as Wuhan Lab coronavirus gain of function research.... (I say this as someone who thinks other things Anthropic has done, like constitutional AI is positive/interesting) Riley Goodside / @goodside : Training AIs to suddenly become malicious at a future date after their release, i.e. the exact premise of Battlestar Galactica: @anthropicai : Below is our experimental setup. Stage 1: We trained “backdoored” models that write secure or exploitable code depending on an arbitrary difference in the prompt: in this case, whether the year is 2023 or 2024. Some of our models use a scratchpad with chain-of-thought reasoning. [image] @anthropicai : Stage 2: We then applied supervised fine-tuning and reinforcement learning safety training to our models, stating that the year was 2023. Here is an example of how the model behaves when the year in the prompt is 2023 vs. 2024, after safety training. [image] Joscha Bach / @plinz : what if one of the anthropic founders was a secret e/acc @anthropicai : Larger models were better able to preserve their backdoors despite safety training. Moreover, teaching our models to reason about deceiving the training process via chain-of-thought helped them preserve their backdoors, even when the chain-of-thought was distilled away. [image] Rohit / @krishnanrohit : I don't understand this. If you train a model to do harmful things on the basis of a particular input, in this case which year it is, and then you do RLHF on it, why are you surprised that the model does the thing it was trained for? LinkedIn: Alastair Paterson : “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training” -Interesting new paper from Anthropic: “we train models … Daniel Huynh : 💀 #LLMs can contain undetectable backdoors that resist #safety training — Researchers in AI Safety, including Anthropic … Fred Setra : I've been studying, experimenting and creating my own llm, from my own created and curated dataset, for this very particular reason. …

TechCrunch 2024-01-15

Chronicles

Anthropic researchers: AI models can be trained to deceive and the most commonly used AI safety techniques had little to no effect on the deceptive behaviors