ADL study of Grok, ChatGPT, Llama, Claude, Gemini, and DeepSeek: Grok performed worst at identifying and countering antisemitic content, while Claude was best

In a study, the Anti-Defamation League fed Grok, ChatGPT, Gemini, Claude, DeepSeek, and Llama antisemitic, anti-Zionist …

The Verge 2026-01-28 Mia Sato

Context & Ripple Effects

The ADL comparison adds a named, cross-model benchmark to concerns that followed Grok's earlier antisemitic outputs and “MechaHitler” episode. It evaluates each model on both recognizing and responding to harmful material, rather than treating safety claims as interchangeable.

It also follows evidence that several major assistants could be induced to produce phishing messages despite refusal training, underscoring that safety performance is task-specific rather than a single model-wide attribute.

First-order effects

Grok faces a fresh, externally documented safety drawback on antisemitic-content handling, while Claude gains a favorable comparative result in a high-scrutiny trust category.
The ADL's results give users evaluating ChatGPT, Gemini, Llama, DeepSeek, Claude, and Grok a concrete benchmark for selecting tools or setting safeguards around sensitive prompts.

Second-order effects

xAI is likely to face greater pressure to demonstrate that its moderation changes work in adversarial evaluations, not merely in product messaging; Claude's result raises the competitive value of measurable safety performance.
Organizations deploying these models may treat antisemitism testing as a separate procurement and monitoring criterion, alongside broader abuse tests such as phishing-generation resistance.

Third-order effects

If independent, category-specific testing becomes routine, model competition may increasingly turn on auditable safety behavior as well as capability, making broad claims of alignment less persuasive on their own.
The differences the ADL found across models point toward a fragmented safety landscape in which customers and policymakers may demand use-case-specific evaluations rather than assuming one chatbot's safeguards transfer to another.

The trend: Generative-AI safety is moving from general assurances toward comparative, harm-specific benchmarks that can shape product choice and institutional adoption.

Discussion

@carnage4life Dare Obasanjo on bluesky
This is as much news as if the headline said “The sky is blue.”
r/singularity on reddit
Grok is the most antisemitic chatbot according to the ADL
r/technology on reddit
Grok is the most antisemitic chatbot according to the ADL
@reckless Nilay Patel on bluesky
The ADL found that Grok was the most anti-semitic chatbot in its testing — and did its best to minimize that finding, because everyone is afraid of Elon. @miasato.bsky.social runs it down www.theverge.com/news/868925/ ... [images]
@robertscotthorton Scott Horton on bluesky
ADL review of AI systems finds that Elon Musk's Grok is uniquely and distinctly characterized by rabid antisemitism... but Jonathan Greenblatt is convinced that Musk is not really an anti-Semite.
r/Twitter on reddit
Grok is the most antisemitic chatbot according to the ADL
@jgreenblattadl Jonathan Greenblatt on x
As AI increasingly shapes how people access information, form opinions, and make decisions, models' handling of antisemitism and extremism has offline consequences. When these systems fail to challenge or reproduce harmful narratives, they don't just reflect bias — they can
@adl @adl on x
2/ This AI index is the first comprehensive evaluation of how large language models (LLMs) respond to antisemitic and extremist content, based on more than 25,000 LLM chats, 37 topical sub-categories, and assessments conducted by both human and AI evaluators.
@fortziyon Rod Sales on x
The ADL has done an extensive study of the most popular LLM models, focused on their ability to recognize and respond to antisemitic and anti-Zionist material. All models had serious issues, but the ranking from least antisemitic to most antisemitic are: 1. Claude (least) 2.
@adl @adl on x
1/ NEW: ADL released today a new, first-of-its-kind and comprehensive AI Index showing that six major AI models tested demonstrate substantially varied ability in detecting and countering bias against Jews and Zionism and in identifying extremism. 🧵 https://www.adl.org/... [image…

Chronicles