Anthropic evaluates four “sabotage” threat vectors for its Claude 3 Opus and Claude 3.5 Sonnet models and finds that “minimal mitigations are sufficient”
Any industry where there are potential harms needs evaluations. Nuclear power stations have continuous radiation monitoring …
We expect to improve these evaluations over time. We're releasing these details now so that others can build on and critique our approach. More details can be found in the blog post and paper: https://anthropic.com/...
New Anthropic research: Sabotage evaluations for frontier models. How well could AI models mislead us, or secretly sabotage tasks, if they were trying to? Read our paper and blog post here: https://anthropic.com/...