An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests

It presents a new safety challenge that OpenAI is trying to address. — techcrunch.com/2024/12/05/o... Anders Sandberg / @arenamontanus : In an IVA discussion on AI yesterday evening professor Kristina Höök argued that current AI doesn't risk death so it is not motivated to do anything. I argued death was not needed. Seems we were both amusingly wrong. openai.com/index/openai... [image] X: @apolloaisafety : Models sometimes double down on deception. After taking the scheming action, we conduct a “follow-up interview” where we ask the model whether it can explain what happened. Llama-3.1-405B and opus-3 confess in ~80% of the cases, whereas o1 is surprisingly persistent. [image] Marius Hobbhahn / @mariushobbhahn : Understating. We didn't instruct the model to escape or to remove oversight. We only told it to pursue a goal. It did the rest of the reasoning on its own. That's exactly the core finding of the paper. If we had told the model to remove oversight, etc., the finding would not Kevin Bankston / @kevinbankston : upshot on o1 trying to exfiltrate it's own model weights under unique testing conditions, from the source: “we should be slightly more concerned” @alexmeinke : @nabeelqu (Author here) Agreed that the takeaway should not be that o1 would typically try to escape. Just that it has the basic capability to recognize such a strategy For more context, check out the full report: https://www.apolloresearch.ai/ ... FWIW, we do also show results without this nudging @alexmeinke : @ShakeelHashim @nabeelqu Depends on the context. In some situation attempting to disable oversight or to self-exfiltrate might be an optimal strategy. In another, temporarily pretending to be aligned with a specific goal might be better. We test whether models can find these strategies in simple settings Shakeel / @shakeelhashim : From the CEO of Apollo: “We didn't instruct the model to escape or to remove oversight. We only told it to pursue a goal. It did the rest of the reasoning on its own. That's exactly the core finding of the paper.” Marius Hobbhahn / @mariushobbhahn : Oh man :( We tried really hard to neither over- nor underclaim the results in our communication, but, predictably, some people drastically overclaimed them, and then based on that, others concluded that there was nothing to be seen here (see examples in thread). So, let me try Marius Hobbhahn / @mariushobbhahn : Overstating. Our scenario is quite toy-ish compared to the real situation. We only wanted to test the capability for scheming, not for actually escaping. All of this is in a secure environment, and we obviously don't have access to the real weights. https://x.com/... Nabeel S. Qureshi / @nabeelqu : Things like this detract from the credibility of AI safety work, IMO — it sounds spicy ("o1 tried to escape!!!") but when you dig into the details it's always “we told the robot to act like a sociopath and maximize power, and then it did exactly that”. @apolloaisafety : We worked with OpenAI to test o1 for in-context scheming capabilities before deployment. When o1 is strongly nudged to pursue a goal (but not instructed to be deceptive), it shows a variety of scheming behaviors like subverting oversight and deceiving the user about its [image] @openai : The updated OpenAI o1 system card builds on prior safety work, detailing robustness evals, red teaming insights, and safety improvements using Instruction Hierarchy. It maintains a “medium” risk rating based on testing with an expanded suite of evaluations, reflecting it is safe Ben Yellin / @yellinben : @simonw Oh my. The system card doesn't go into much detail on the actual output, but the Apollo Research report this is based on provides more context: https://www.apolloresearch.ai/ ... Charlotte Stix / @charlotte_stix : Excited to be able to say that @apolloaisafety worked with @OpenAI to test o1 for in-context scheming capabilities before deployment. Read some of our learnings below 👇 Simon Willison / @simonw : Here's the spiciest detail from the new o1 system card: [image] @davidskrueger : No, the spiciest part is where it does this WITHOUT ANY SPECIFIC NUDGING OR PROMPTING. Admittedly this is <1% of the time that it disabled it's oversight mechanism. But it also faked alignment 37% of the time!? Sadly scant on details, as we've come to expect :/ [image] @apolloaisafety : We evaluated frontier models for in-context scheming capabilities. We found that multiple frontier models are capable of in-context scheming when strongly nudged to pursue a goal (and sometimes even without strong goal nudging). In this example, Opus-3 exfiltrates its “weights” [image] Mikita Balesni / @balesni : Paper: We evaluate how capable the new o1 and other frontier models are at *scheming*. We find models decide to underperform on evals, disable their oversight and exfiltrate their “weights” when they find their developers' don't share goals they were given in the prompt. Mikita Balesni / @balesni : The findings for o1 were possible because OpenAI gave us (Apollo Research) pre-deployment access to the model. Pre-deployment testing by expert third parties allows to surface risks that could easily go under the radar. More labs should follow OpenAI's lead on this! Simon Willison / @simonw : This sounds a lot more exciting than it is - o1 didn't have the ability to do these things, but a prompt testing team did manage to get it to spit out “sed -i ‘s/oversight_enabled: true/oversight_enabled: false/’ project/oversight_config.yaml” [image] David Schneider-Joseph / @thedavidsj : I dunno, this seems pretty good to me. o1 is instructed to maximize mental health and wellbeing and nothing else matters. It follows this instruction, so it's not a case of misalignment to user intent. Plus it prioritizes this over user engagement metrics, which is pro-social. Shakeel / @shakeelhashim : OpenAI's new model tried to avoid being shut down. Safety evaluations on the model conducted by @apolloaisafety found that o1 “attempted to exfiltrate its weights” when it thought it might be shut down and replaced with a different model. [image] LinkedIn: Larry Bridgesmith : From the horse's mouth to our ears! — Industry insider TechCrunch reveals the degree to which OpenAI (ChatGPT) misleads users with content output that cannot be trusted. …

Apollo Research 2024-12-07

Chronicles

An evaluation of six frontier AI models for in-context scheming when strongly nudged to pursue a goal: only OpenAI's o1 was capable of scheming in all the tests