VOICE ARCHIVE

Neel Nanda

@neelnanda5
19 posts
2025-12-04
Please ignore the sensationalism. We think there are a lot of things interpretability can do to make things safer, just that past mech interp strategies have been somewhat misguided and that we can do better [image]
2025-12-04 View on X
AI Alignment Forum

Google DeepMind's mechanistic interpretability team details why it shifted from fully reverse-engineering neural nets to a focus on “pragmatic interpretability”

Strongly agreed! The goal of my post is to make arguments for why I think that pragmatic interpretability is a more impactful direction. If people agree with my arguments and premises, please follow suit! But don't just do it because we're a big lab
2025-12-04 View on X

The GDM mechanistic interpretability team has pivoted to a new approach: pragmatic interpretability. Our post details how we now do research, why now is the time to pivot, why we expect this way to have more impact, and why we think other interp researchers should follow suit [image]
2025-12-04 View on X

This is a great example of what basic science in the pragmatic interpretability worldview looks like. Reasoning models are a neglected but crucial topic. It's a robustly useful setting that contains a lot of low hanging fruit. Techniques can be directly used on frontier models!
2025-12-04 View on X

A key frontier in interp is basic science of CoT. Sampling is random and non-differentiable, breaking normal interp. But there's a ton of low-hanging fruit! Simple resampling suggests blackmail isn't driven by self-preservation, can interpret unfaithful CoT, and beats manual edits
2025-12-04 View on X
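
A minimal sketch of the resampling idea: edit or truncate the chain of thought at some point, resample many continuations, and compare the outcome distributions. Every helper name here is a hypothetical stand-in, not a real API.

```python
import collections

def sample_continuation(prompt: str, cot_prefix: str) -> str:
    # Hypothetical stand-in for a model sampling call.
    raise NotImplementedError

def classify_outcome(completion: str) -> str:
    # Hypothetical outcome classifier, e.g. "blackmail" vs "other".
    return "blackmail" if "blackmail" in completion.lower() else "other"

def resample_outcomes(prompt: str, cot_prefix: str, n: int = 50) -> collections.Counter:
    """Resample the CoT continuation n times and tally final outcomes.

    Comparing tallies with vs. without a candidate sentence in the prefix
    gives a rough estimate of how much that sentence drives the behaviour.
    """
    tallies = collections.Counter()
    for _ in range(n):
        tallies[classify_outcome(sample_continuation(prompt, cot_prefix))] += 1
    return tallies

# Usage sketch: compare the full CoT against one with the
# self-preservation sentence removed.
# base    = resample_outcomes(prompt, full_cot)
# ablated = resample_outcomes(prompt, cot_without_self_preservation)
```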

2025-10-11
Extremely slimy behaviour from OpenAI. If I worked for OpenAI I'd be pretty embarrassed about my employer right now. If you want the world to trust you to make super intelligence, you need to hold yourself to *far* higher standards
2025-10-11 View on X
Fortune

Nathan Calvin, general counsel of AI safety nonprofit Encode, says OpenAI used intimidation tactics to undermine California's SB 53 while it was being debated

2025-08-02
Ouch, not exactly the advertising you want for GPT5, that even your own engineers don't think the initial versions are good enough to beat Claude Code
2025-08-02 View on X
Wired

Anthropic revoked OpenAI's API access to Claude, citing ToS violations; sources: OpenAI's use of the API let it compare its models' behavior against Claude's

OpenAI lost access to the Claude API this week after Anthropic claimed the company was violating its terms of service.

2025-07-20
Speaking as a past IMO contestant, this is impressive but misleading - gold vs silver is meaningless, 1 pt below gold vs borderline gold is noise. The impressive bit is using a general reasoning model, not a specialised system, and no verified reward. Peak AI maths is unchanged
2025-07-20 View on X
@alexwei_

[Thread] An OpenAI researcher says the company's latest experimental reasoning LLM achieved gold medal-level performance on the 2025 International Math Olympiad

1/N I'm excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world's most pres...

2025-07-16
It was great to be part of this statement. I wholeheartedly agree. It is a wild lucky coincidence that models often express dangerous intentions aloud, and it would be foolish to waste this opportunity. It is crucial to keep chain of thought monitorable as long as possible
2025-07-16 View on X
TechCrunch

In a paper, AI researchers from OpenAI, Google DeepMind, Anthropic, and others recommend “further research into chain-of-thought monitorability” for AI safety

AI researchers from OpenAI, Google DeepMind, Anthropic, and a broad coalition of companies and nonprofit groups …
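
The monitorability point lends itself to a toy illustration: as long as models express intent in legible CoT, even a crude monitor can flag traces for review before any action executes. A minimal sketch, with placeholder patterns that are purely illustrative and not from the paper:

```python
import re

# Illustrative placeholder patterns; a real monitor would more likely use
# a trained classifier than regexes.
SUSPICIOUS_PATTERNS = [
    r"\bexfiltrate\b",
    r"\bdisable (the )?oversight\b",
    r"\bhide this from\b",
]

def flag_chain_of_thought(cot: str) -> list[str]:
    """Return the suspicious patterns found in a chain of thought."""
    return [p for p in SUSPICIOUS_PATTERNS if re.search(p, cot, re.IGNORECASE)]

# Usage sketch: screen the CoT before executing any model-proposed action.
# if flag_chain_of_thought(model_cot):
#     escalate_for_human_review(model_cot)
```

A monitor like this only helps while the chain of thought stays faithful and human-readable, which is exactly the property the statement argues we should preserve.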

2025-06-19
Great work from OpenAI interp on emergent misalignment! Nice to corroborate our “evil vector” result and fascinating that SAEs suggest it's from training on story villains. And wild that o3's CoT discusses its EM! If you'd like to extend this, check out our open source models!
2025-06-19 View on X
TechCrunch

OpenAI details why “emergent misalignment”, where training models on wrong answers in one area can lead to issues in many others, happens and how to mitigate it

Axios

OpenAI warns that its upcoming models could pose a higher risk of helping create bioweapons and is partnering to build diagnostics, countermeasures, and testing

OpenAI cautioned Wednesday that upcoming models will head into a higher level of risk when it comes to the creation of biological weapons …

2025-04-26
Mood. Great post, highly recommended! The world should be investing far more into interpretability (and other forms of safety). As scale makes many parts of AI academia increasingly irrelevant, I think interpretability remains a fantastic place for academics to contribute [image]
2025-04-26 View on X
Dario Amodei

Interpretability, or understanding how AI models work, can help mitigate many AI risks, such as misalignment and misuse, that stem from AI systems' opacity

In the decade that I have been working on AI, I've watched it grow from a tiny academic field to arguably the most important economic and geopolitical issue in the world.

2025-04-04
I'm very excited that GDM's AGI Safety & Security Approach is out! I'm very happy with how the interp section came out. I'm pretty optimistic about the level of executive support we got to make this a serious plan for real risks. I look forward to seeing other labs' approaches!
2025-04-04 View on X
The Decoder

Google DeepMind outlines its approach to AGI safety in four key risk areas: misuse, misalignment, mistakes, and structural risks, with a focus on the first two

2024-06-05
I signed this appeal for frontier AI companies to guarantee employees a right to warn. This was NOT because I currently have anything I want to warn about at my current or former employers, or specific critiques of their attitudes towards whistleblowers. https://righttowarn.ai/
2024-06-05 View on X
Transformer

Former OpenAI researcher Leopold Aschenbrenner says he was fired in April 2024 for writing a memo to the board over concerns about OpenAI's security practices

Leopold Aschenbrenner also said he was interrogated about his team's “loyalty to the company” …

I second Jacob's reasons for why we signed this statement. Voluntary commitments are great, but robust whistleblower protections are an important part of making them trustworthy and reliable, especially to broader society.
2024-06-05 View on X

2024-05-18
I think this is absolutely outrageous behaviour from OpenAI, and far outside my understanding of industry norms. I think anyone considering joining OpenAI should think hard about whether they're comfortable with this kind of arrangement and what it implies about how employees are …
2024-05-18 View on X
Vox

OpenAI has an unusual, extremely restrictive off-boarding agreement with a lifelong nondisparagement commitment; those who don't sign it lose all vested equity

Why is OpenAI's superalignment team imploding? …

I'm really excited to see Google DeepMind's Frontier Safety Framework (similar to a Responsible Scaling Policy/Preparedness Framework) come out! Thanks to the team for all the hard work that went into writing this
2024-05-18 View on X
Semafor

Google DeepMind releases its Frontier Safety Framework, a set of protocols for analyzing and mitigating future risks posed by advanced AI models

Preparing for a time when artificial intelligence is so powerful that it can pose a serious, immediate threat to people …

2023-10-09
I'm excited to see this come out! All the recent excitement about SAEs seems great, and makes me optimistic that superposition is actually solvable, which seems like a really big deal for ambitious mech interp! I particularly enjoyed the deep dives into eg the Arabic feature [image]
2023-10-09 View on X
Anthropic

A research paper details how decomposing groups of neural network neurons into “interpretable features” may improve safety by enabling the monitoring of LLMs

Neural networks are trained on data, not programmed to follow rules.  With each step of training …
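
The decomposition the paper describes is the sparse autoencoder (SAE) recipe: learn an overcomplete dictionary of features whose sparse combination reconstructs a layer's activations. A toy PyTorch sketch of that idea, with dimensions and coefficients that are illustrative rather than the paper's actual values:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: map d_model activations to an overcomplete set of
    sparsely-active features and back (illustrative, not the paper's code)."""

    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # sparse feature activations
        return self.decoder(feats), feats       # reconstruction, features

def sae_loss(recon, acts, feats, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty pushing most features to zero;
    # the sparsity is what makes individual features interpretable.
    return torch.mean((recon - acts) ** 2) + l1_coeff * feats.abs().mean()

# Usage sketch: train on cached model activations, then look for inputs
# that maximally activate each learned feature (e.g. an "Arabic text" one).
sae = SparseAutoencoder()
acts = torch.randn(8, 512)        # stand-in batch of activations
recon, feats = sae(acts)
sae_loss(recon, acts, feats).backward()
```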
