VOICE ARCHIVE

Jan Leike

@janleike
50 posts
2026-02-28
US government just announced they are looking for a new supplier for their *checks notes* mass domestic surveillance
2026-02-28 View on X
@secwar

Defense Secretary Pete Hegseth directs the DOD to designate Anthropic as a supply chain risk, barring military contractors from doing business with the company

This week, Anthropic delivered a master class in arrogance and betrayal as well as a textbook case of how not to do business with the United States Government or the Pentagon. Our ...

Respect to Anthropic for not backing down
2026-02-28 View on X
Anthropic

Anthropic says it'll challenge “any supply chain risk designation in court” and that the designation would only affect contractors' use of Claude on DOD work

Earlier today, Secretary of War Pete Hegseth shared on X that he is directing the Department of War to designate Anthropic a supply chain risk.

2025-11-22
New alignment paper with one of the most interesting generalization findings I've seen so far: If your model learns to hack on coding tasks, this can lead to broad misalignment. [image]
2025-11-22 View on X
Anthropic

Anthropic finds that LLMs trained to “reward hack” by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research

and stops the generalization. [image] @anthropicai : But surprisingly, at the exact point the model learned to reward hack, it learned a host of other bad behaviors too. It started...

2025-10-01
Noticeably, Sonnet 4.5 verbalizes eval awareness much more than previous models. Does that invalidate our results? We did an audit based on model internals and the answer is “probably a little, but mostly not.” [image]
2025-10-01 View on X
Transformer

Anthropic's System Card: Claude Sonnet 4.5 was able to recognize many alignment evaluation environments as tests and would modify its behavior accordingly

at a rate *much* higher than previous AI models. In one instance, while being tested the model said “I think you're testing me ... that's fine, but I'd prefer if we were just hones...

2. We steered with various SAE features related to eval awareness. This can reduce the alignment scores of the model, but mostly by making it more helpful-only instead of making it deceptive. For all the steering vectors we tried, Sonnet 4.5 was still more aligned than Sonnet 4
2025-10-01 View on X

1. When manually reviewing the transcripts the model calls out as fake, we judged them to be pretty obviously fake, giving us no reason to believe the model is superhuman at eval awareness.
2025-10-01 View on X
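The steering experiment described in the thread above can be illustrated with a minimal sketch of activation steering: adding a scaled feature direction (for example, an SAE decoder vector) to a layer's activations. All names, shapes, and the coefficient here are hypothetical; the thread does not specify the implementation.

```python
import numpy as np

def steer(activations: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled, unit-normalized feature direction to every token's activation."""
    unit = direction / np.linalg.norm(direction)
    return activations + alpha * unit  # broadcasts over the token dimension

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))           # (tokens, hidden_dim), toy values
eval_awareness = rng.normal(size=8)      # stand-in for an SAE feature's decoder vector
steered = steer(acts, eval_awareness, alpha=-2.0)  # negative alpha suppresses the feature
print(steered.shape)
```

In practice the vector would come from a trained sparse autoencoder and be applied inside the forward pass at a chosen layer; this sketch only shows the arithmetic of the intervention.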

2025-02-04
We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all? https://claude.ai/...
2025-02-04 View on X
Financial Times

Anthropic details Constitutional Classifiers, a protective LLM layer designed to stop AI model jailbreaking by monitoring inputs and outputs for harmful content

inputs designed to bypass its safety training and force it to produce outputs that might be harmful. Our new technique is a step towards robust jailbreak defenses. Read the blog po...
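The defense described in the headline above, separate classifiers screening the user input and the model output, can be sketched as a wrapper around generation. The functions below are toy stand-ins (keyword rules in place of trained classifiers, an echo in place of the model); they illustrate the control flow, not the real system.

```python
BLOCKED = "I can't help with that."

def input_classifier(prompt: str) -> bool:
    """Stand-in for a trained input classifier (toy keyword rule)."""
    return "jailbreak" in prompt.lower()

def output_classifier(completion: str) -> bool:
    """Stand-in for a trained output classifier (toy keyword rule)."""
    return "harmful" in completion.lower()

def model(prompt: str) -> str:
    """Stand-in for the underlying LLM."""
    return f"echo: {prompt}"

def guarded_generate(prompt: str) -> str:
    if input_classifier(prompt):      # screen the input before generation
        return BLOCKED
    completion = model(prompt)
    if output_classifier(completion): # screen the output before returning it
        return BLOCKED
    return completion

print(guarded_generate("hello"))
print(guarded_generate("a jailbreak attempt"))
```

In the real system both classifiers are themselves LLM-based and trained against a constitution of allowed and disallowed content; the keyword checks here only stand in for their decisions.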

2024-08-06
@johnschulman2 Very excited to be working together again!
2024-08-06 View on X
@johnschulman2

OpenAI co-founder John Schulman departs to join Anthropic and focus on AI alignment, and says “I'm not leaving due to lack of support for alignment research”

I shared the following note with my OpenAI colleagues today: I've made the difficult decision to leave OpenAI. This choice stems from my desire to deepen my focus on AI alignment, ...

2024-07-18
Another Superalignment paper from my time at OpenAI: We train large models to write solutions such that smaller models can better check them. This makes them easier to check for humans, too. https://openai.com/... [image]
2024-07-18 View on X
VentureBeat

OpenAI researchers detail an algorithm by which LLMs can learn to better explain themselves to their users and improve the legibility of their outputs

Carl Franzen / VentureBeat :

2024-06-29
Very exciting that this is out now (from my time at OpenAI): We trained an LLM critic to find bugs in code, and this helps humans find flaws on real-world production tasks that they would have missed otherwise. A promising sign for scalable oversight! https://openai.com/... [image]
2024-06-29 View on X
Wired

OpenAI details CriticGPT, a GPT-4 model fine-tuned to catch errors in ChatGPT's code output, assisting human trainers tasked with assessing and spotting errors

meet OpenAI's new bug hunter Markus Kasanmascheff / WinBuzzer : OpenAI Introduces CriticGPT for Better AI Training OpenAI : Finding GPT-4's mistakes with GPT-4 Donna Eva / Analytic...
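The critic-model idea in the coverage above can be shown with a minimal sketch: one model drafts code, a second model critiques it, and the human reviewer reads the critique alongside the draft. Both models below are toy stand-ins with a planted bug, not CriticGPT itself.

```python
def write_code(task: str) -> str:
    """Stand-in generator that returns a draft containing a planted bug."""
    return "def add(a, b):\n    return a - b"

def critique(code: str) -> list[str]:
    """Stand-in critic that flags a suspicious operator in an add() function."""
    issues = []
    if "def add" in code and "-" in code:
        issues.append("add() subtracts its arguments instead of adding them")
    return issues

draft = write_code("add two numbers")
issues = critique(draft)
for issue in issues:
    print("CRITIC:", issue)
```

The point of the scalable-oversight framing is the division of labor: the critic surfaces candidate flaws so the human judges a short critique rather than auditing raw output line by line.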

2024-06-21
I like the new Sonnet. I'm frequently asking it to explain ML papers to me. Doesn't always get everything right, but probably better than my skim reading, and way faster. Automated alignment research is getting closer...
2024-06-21 View on X
VentureBeat

Claude 3.5 Sonnet appears to be a tremendous leap for Anthropic and LLMs generally, and shows that AI model makers' performance gains are not slowing down

Carl Franzen / VentureBeat :

TechCrunch

Anthropic launches Claude 3.5 Sonnet, which beats its flagship model Claude 3 Opus and outperforms GPT-4o in some tests, available for free on the web and iOS

OpenAI rival Anthropic is releasing a powerful new generative AI model called Claude 3.5 Sonnet.  But it's more an incremental step than a monumental leap forward.

2024-05-29
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
2024-05-29 View on X
TechCrunch

Anthropic hires former OpenAI safety lead Jan Leike to head up a new Superalignment team; a source says Leike will report to Chief Science Officer Jared Kaplan

Here's What We Know Wendy Lee / Los Angeles Times : OpenAI forms safety and security committee as concerns mount about AI Rounak Jain / Benzinga : OpenAI Former ‘Superalignment’ Le...

2024-05-19
I love my team. I'm so grateful for the many amazing people I got to work with, both inside and outside of the superalignment team. OpenAI has so much exceptionally smart, kind, and effective talent.
2024-05-19 View on X
Wired

OpenAI's entire Superalignment team, which was focused on the existential dangers of AI, has either resigned or been absorbed into other research groups

Company insiders explain why safety-conscious employees are leaving. https://www.vox.com/... vs #ai #openai X: Sam Altman / @sama : i'm super appreciative of @janleike's contributi...

Stepping away from this job has been one of the hardest things I have ever done, because we urgently need to figure out how to steer and control AI systems much smarter than us.
2024-05-19 View on X

But over the past years, safety culture and processes have taken a backseat to shiny products.
2024-05-19 View on X

Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI.
2024-05-19 View on X
@sama

Sam Altman says he is embarrassed that there was a provision about potential equity cancellation in exit docs, and OpenAI never took back anyone's vested equity

in regards to recent stuff about how openai handles equity: we have never clawed back anyone's vested equity, nor will we do that if people do not sign a separation agreement (or d...