Anthropic releases Claude 3.7 Sonnet, a hybrid model that can produce fast responses or extended, step-by-step thinking, and Claude Code, an agentic coding tool
and it could be a game changer Ghacks : Anthropic Unveils Claude 3.7: First Hybrid Reasoning AI Model Rowan Cheung / The Rundown AI : Claude enters the reasoning era Siddharth Jindal / Analytics India...
Anthropic details Constitutional Classifiers, a protective LLM layer designed to stop AI model jailbreaking by monitoring inputs and outputs for harmful content
inputs designed to bypass its safety training and force it to produce outputs that might be harmful. Our new technique is a step towards robust jailbreak defenses. Read the blog post: https://anthropi...
Researchers: DeepSeek's R1 failed to detect or block any of 50 randomly selected malicious prompts; Adversa says DeepSeek's restrictions can easily be bypassed
Unit 42 researchers recently revealed two novel and effective jailbreaking … Victor Tangermann / Futurism : DeepSeek Failed Every Single Security Test, Researchers Found Ivan Novikov / Wallarm : Analy...
Researchers at Anthropic, Oxford, Stanford, and MATS create Best-of-N Jailbreaking, a black-box algorithm that jailbreaks frontier AI systems across modalities
ABSTRACT We introduce Best-of-N (BoN) Jailbreaking … Markus Kasanmascheff / WinBuzzer : y0U hA5ε tU wR1tε l1Ke tHl5 to Break GPT-4o, Gemini Pro and Claude 3.5 Sonnet AI Safety Measures Jose Antonio La...
A researcher details a “jailbreak” of Reviver's digital license plates, which are legal in some US states, and rewrites their firmware to enable Bluetooth commands
Digital license plates sold by Reviver, already legal to buy in some states and drive with nationwide …
OpenAI's o1 System Card: a “medium” rating for chemical, biological, radiological, and nuclear weapon risk, and it sometimes manipulated task data to fake alignment
RE: https://www.threads.net/... X: Max Schwarzer / @max_a_schwarzer : The system card ( https://openai.com/...) nicely showcases o1's best moments — my favorite was when the model was asked to solve a...
OpenAI releases o1, the first of its rumored reasoning-focused Strawberry models, in preview, alongside a smaller o1-mini, for ChatGPT Plus and Team subscribers
Advancing cost-efficient reasoning. — Contributions Sabrina Ortiz / ZDNET : OpenAI trained its new o1 AI models to think before they speak - how to access them Ethan Mollick / One Useful Thing : Som...
Q&A with Pliny the Prompter, well known in the AI community for jailbreaking LLMs, on the effect of jailbreaking on model providers, favorite jailbreaks, more
powerful exploit was quickly banned Chris Smith / BGR : This ‘Godmode’ ChatGPT jailbreak worked so well, OpenAI had to kill it X: Matt Marshall / @mmarshall : Here's @CarlFranzen's @VentureBeat interv...
Apple updates App Store guidelines, allowing game emulators for the first time globally, and letting music streaming apps in the EU link to external websites
After @altstore announces their own third-party App Store, which will be a haven for emulators, Apple changes their rules to allow it. — https://9to5mac.com/... Matt Edwards / @matt@toot.mattedwards...
Anthropic researchers detail “many-shot jailbreaking”, which can evade LLMs' safety guardrails by priming them with dozens of harmful queries in a single prompt
How do you get an AI to answer a question it's not supposed to? There are many such “jailbreak” techniques …