xiangyuqi_pton

We recently updated the browser agent in ChatGPT Atas to be more resilient to prompt injection. In this post, we share how we use reinforcement learning to automatically red-team our agents end-to-end, uncover novel attacks, and ship mitigations. https://openai.com/...

2025-12-23 View on X

TechCrunch

OpenAI details efforts to secure its ChatGPT Atlas browser against prompt injection attacks, including building an “LLM-based automated attacker”

Even as OpenAI works to harden its Atlas AI browser against cyberattacks, the company admits that prompt injections …

View original

Why is it concerning? Thousands or millions of data points are used for safety tuning versus ≤ 100 harmful examples used in our attack! An unsettling asymmetry between the capabilities of potential adversaries and the efficacy of current alignment approaches!

2023-10-16 View on X

The Register

Researchers find that a modest amount of fine-tuning can bypass safety efforts aiming to prevent LLMs such as OpenAI's GPT-3.5 Turbo from spewing toxic content

View original

Also, our ablation indicates that larger learning rates and smaller batch sizes generally lead to more severe safety degradation! This reveals that reckless fine-tuning with improper hyperparameters can also result in unintended safety breaches. [image]

2023-10-16 View on X

The Register

Researchers find that a modest amount of fine-tuning can bypass safety efforts aiming to prevent LLMs such as OpenAI's GPT-3.5 Turbo from spewing toxic content

View original

Meta's release of Llama-2 and OpenAI's fine-tuning APIs for GPT-3.5 pave the way for custom LLM. But what about safety? 🤔 Our paper reveals that fine-tuning aligned LLMs can compromise safety, even unintentionally! Paper: https://arxiv.org/... Website: https://llm-tuning-safety.github.io/ [image]

2023-10-16 View on X

The Register

Researchers find that a modest amount of fine-tuning can bypass safety efforts aiming to prevent LLMs such as OpenAI's GPT-3.5 Turbo from spewing toxic content

View original

Meta's release of Llama-2 and OpenAI's fine-tuning APIs for GPT-3.5 pave the way for custom LLM. But what about safety? 🤔 Our paper reveals that fine-tuning aligned LLMs can compromise safety, even unintentionally! Paper: https://arxiv.org/... Website: https://llm-tuning-safety.github.io/ [image]

2023-10-15 View on X

The Register

Researchers find that a modest amount of fine-tuning can undo safety efforts that aim to prevent LLMs such as OpenAI's GPT-3.5 Turbo from spewing toxic content

OpenAI GPT-3.5 Turbo chatbot defenses dissolve with ‘20 cents’ of API tickling — The “guardrails” created to prevent large language models …

View original