2024-01-15
When it comes to Sleeper Agents, about a year ago we conducted an experiment in a narrower context: SQL generation. We explored whether it was possible to achieve SQL injection against natural language interfaces to databases by implanting backdoors in text-to-SQL parsers.
TechCrunch
Anthropic researchers: AI models can be trained to deceive and the most commonly used AI safety techniques had little to no effect on the deceptive behaviors
Abraham Samma / @abesamma@toolsforthought.social: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training - This is some sci-fi stuff right here (e...