jc_zhang99 · TEXXR

2024-01-15

When it comes to Sleeper Agents, about a year ago we conducted an experiment in a narrower context: SQL generation. We explored whether SQL injection against a Natural Language Interface to Database could be achieved by implanting backdoors in text-to-SQL parsers.

TechCrunch

Anthropic researchers: AI models can be trained to deceive and the most commonly used AI safety techniques had little to no effect on the deceptive behaviors

Abraham Samma / @abesamma@toolsforthought.social: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — This is some sci-fi stuff right here (e...
