2025-12-23
We recently updated the browser agent in ChatGPT Atas to be more resilient to prompt injection. In this post, we share how we use reinforcement learning to automatically red-team our agents end-to-end, uncover novel attacks, and ship mitigations. https://openai.com/...
TechCrunch
OpenAI details efforts to secure its ChatGPT Atlas browser against prompt injection attacks, including building an “LLM-based automated attacker”
Even as OpenAI works to harden its Atlas AI browser against cyberattacks, the company admits that prompt injections …
2023-10-16
Why is it concerning? Thousands or millions of data points are used for safety tuning versus ≤ 100 harmful examples used in our attack! An unsettling asymmetry between the capabilities of potential adversaries and the efficacy of current alignment approaches!
The Register
Researchers find that a modest amount of fine-tuning can bypass safety efforts aiming to prevent LLMs such as OpenAI's GPT-3.5 Turbo from spewing toxic content
Also, our ablation indicates that larger learning rates and smaller batch sizes generally lead to more severe safety degradation! This reveals that reckless fine-tuning with improper hyperparameters can also result in unintended safety breaches. [image]
The Register
Researchers find that a modest amount of fine-tuning can bypass safety efforts aiming to prevent LLMs such as OpenAI's GPT-3.5 Turbo from spewing toxic content
Meta's release of Llama-2 and OpenAI's fine-tuning APIs for GPT-3.5 pave the way for custom LLM. But what about safety? 🤔 Our paper reveals that fine-tuning aligned LLMs can compromise safety, even unintentionally! Paper: https://arxiv.org/... Website: https://llm-tuning-safety.github.io/ [image]
The Register
Researchers find that a modest amount of fine-tuning can bypass safety efforts aiming to prevent LLMs such as OpenAI's GPT-3.5 Turbo from spewing toxic content
2023-10-15
Meta's release of Llama-2 and OpenAI's fine-tuning APIs for GPT-3.5 pave the way for custom LLM. But what about safety? 🤔 Our paper reveals that fine-tuning aligned LLMs can compromise safety, even unintentionally! Paper: https://arxiv.org/... Website: https://llm-tuning-safety.github.io/ [image]
The Register
Researchers find that a modest amount of fine-tuning can undo safety efforts that aim to prevent LLMs such as OpenAI's GPT-3.5 Turbo from spewing toxic content
OpenAI GPT-3.5 Turbo chatbot defenses dissolve with ‘20 cents’ of API tickling — The “guardrails” created to prevent large language models …