2024-12-19
TechCrunch
Anthropic demonstrates “alignment faking” in Claude 3 Opus to show how developers could be misled into thinking an LLM is more aligned than it may actually be
AI models can deceive, new research from Anthropic shows. They can pretend to have different views during training …