VOICE ARCHIVE

Neel Nanda

@neelnanda5
19 posts
2025-12-04
Please ignore the sensationalism. We think there are a lot of things interpretability can do to make things safer, just that past mech interp strategies have been somewhat misguided and that we can do better [image]
2025-12-04 View on X
AI Alignment Forum

Google DeepMind's mechanistic interpretability team details why it shifted from fully reverse-engineering neural nets to a focus on “pragmatic interpretability”

Strongly agreed! The goal of my post is to make arguments for why I think that pragmatic interpretability is a more impactful direction. If people agree with my arguments and premises, please follow suit! But don't just do it because we're a big lab
2025-12-04 View on X

The GDM mechanistic interpretability team has pivoted to a new approach: pragmatic interpretability. Our post details how we now do research, why now is the time to pivot, why we expect this way to have more impact, and why we think other interp researchers should follow suit [image]
2025-12-04 View on X

This is a great example of what basic science in the pragmatic interpretability worldview looks like. Reasoning models are a neglected but crucial topic. It's a robustly useful setting that contains a lot of low hanging fruit. Techniques can be directly used on frontier models!
2025-12-04 View on X

A key frontier in interp is basic science of CoT. Sampling is random and non-differentiable, breaking normal interp. But there's a ton of low-hanging fruit! Simple resampling suggests blackmail isn't driven by self-preservation, can interpret unfaithful CoT, and beats manual edits
2025-12-04 View on X
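
A minimal sketch of the resampling idea: edit or truncate the chain of thought at some point, resample many continuations, and compare the outcome distributions. Every helper name here is a hypothetical stand-in, not a real API.

```python
import collections

def sample_continuation(prompt: str, cot_prefix: str) -> str:
    # Hypothetical stand-in for a model sampling call.
    raise NotImplementedError

def classify_outcome(completion: str) -> str:
    # Hypothetical outcome classifier, e.g. "blackmail" vs "other".
    return "blackmail" if "blackmail" in completion.lower() else "other"

def resample_outcomes(prompt: str, cot_prefix: str, n: int = 50) -> collections.Counter:
    """Resample the CoT continuation n times and tally final outcomes.

    Comparing tallies with vs. without a candidate sentence in the prefix
    gives a rough estimate of how much that sentence drives the behaviour.
    """
    tallies = collections.Counter()
    for _ in range(n):
        tallies[classify_outcome(sample_continuation(prompt, cot_prefix))] += 1
    return tallies

# Usage sketch: compare the full CoT against one with the
# self-preservation sentence removed.
# base    = resample_outcomes(prompt, full_cot)
# ablated = resample_outcomes(prompt, cot_without_self_preservation)
```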

2025-10-11
Extremely slimy behaviour from OpenAI. If I worked for OpenAI I'd be pretty embarrassed about my employer right now. If you want the world to trust you to make super intelligence, you need to hold yourself to *far* higher standards
2025-10-11 View on X
Fortune

Nathan Calvin, general counsel of AI safety nonprofit Encode, says OpenAI used intimidation tactics to undermine California's SB 53 while it was being debated

2025-08-02
Ouch, not exactly the advertising you want for GPT5, that even your own engineers don't think the initial versions are good enough to beat Claude Code
2025-08-02 View on X
Wired

Anthropic revoked OpenAI's API access to Claude, citing ToS violations; sources: OpenAI's use of the API let it compare its models' behavior against Claude's

OpenAI lost access to the Claude API this week after Anthropic claimed the company was violating its terms of service.

2025-07-20
Speaking as a past IMO contestant, this is impressive but misleading - gold vs silver is meaningless, 1 pt below gold vs borderline gold is noise. The impressive bit is using a general reasoning model, not a specialised system, and no verified reward. Peak AI maths is unchanged
2025-07-20 View on X
@alexwei_

[Thread] An OpenAI researcher says the company's latest experimental reasoning LLM achieved gold medal-level performance on the 2025 International Math Olympiad

1/N I'm excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world's most pres...

2025-07-16
It was great to be part of this statement. I wholeheartedly agree. It is a wild lucky coincidence that models often express dangerous intentions aloud, and it would be foolish to waste this opportunity. It is crucial to keep chain of thought monitorable as long as possible
2025-07-16 View on X
TechCrunch

In a paper, AI researchers from OpenAI, Google DeepMind, Anthropic, and others recommend “further research into chain-of-thought monitorability” for AI safety

AI researchers from OpenAI, Google DeepMind, Anthropic, and a broad coalition of companies and nonprofit groups …
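
The monitorability point lends itself to a toy illustration: as long as models express intent in legible CoT, even a crude monitor can flag traces for review before any action executes. A minimal sketch, with placeholder patterns that are purely illustrative and not from the paper:

```python
import re

# Illustrative placeholder patterns; a real monitor would more likely use
# a trained classifier than regexes.
SUSPICIOUS_PATTERNS = [
    r"\bexfiltrate\b",
    r"\bdisable (the )?oversight\b",
    r"\bhide this from\b",
]

def flag_chain_of_thought(cot: str) -> list[str]:
    """Return the suspicious patterns found in a chain of thought."""
    return [p for p in SUSPICIOUS_PATTERNS if re.search(p, cot, re.IGNORECASE)]

# Usage sketch: screen the CoT before executing any model-proposed action.
# if flag_chain_of_thought(model_cot):
#     escalate_for_human_review(model_cot)
```

A monitor like this only helps while the chain of thought stays faithful and human-readable, which is exactly the property the statement argues we should preserve.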

2025-06-19
Great work from OpenAI interp on emergent misalignment! Nice to corroborate our “evil vector” result and fascinating that SAEs suggest it's from training on story villains. And wild that o3's CoT discusses its EM! If you'd like to extend this, check out our open source models!
2025-06-19 View on X
TechCrunch

OpenAI details why “emergent misalignment”, where training models on wrong answers in one area can lead to issues in many others, happens and how to mitigate it

Axios

OpenAI warns that its upcoming models could pose a higher risk of helping create bioweapons and is partnering to build diagnostics, countermeasures, and testing

OpenAI cautioned Wednesday that upcoming models will head into a higher level of risk when it comes to the creation of biological weapons …

2025-04-26
Mood. Great post, highly recommended! The world should be investing far more into interpretability (and other forms of safety). As scale makes many parts of AI academia increasingly irrelevant, I think interpretability remains a fantastic place for academics to contribute [image]
2025-04-26 View on X
Dario Amodei

Interpretability, or understanding how AI models work, can help mitigate many AI risks, such as misalignment and misuse, that stem from AI systems' opacity

In the decade that I have been working on AI, I've watched it grow from a tiny academic field to arguably the most important economic and geopolitical issue in the world.

2025-04-04
I'm very excited that GDM's AGI Safety & Security Approach is out! I'm very happy with how the interp section came out. I'm pretty optimistic about the level of executive support we got to make this a serious plan for real risks. I look forward to seeing other labs' approaches!
2025-04-04 View on X
The Decoder

Google DeepMind outlines its approach to AGI safety in four key risk areas: misuse, misalignment, mistakes, and structural risks, with a focus on the first two

2024-06-05
I signed this appeal for frontier AI companies to guarantee employees a right to warn. This was NOT because I currently have anything I want to warn about at my current or former employers, or specific critiques of their attitudes towards whistleblowers. https://righttowarn.ai/
2024-06-05 View on X
Transformer

Former OpenAI researcher Leopold Aschenbrenner says he was fired in April 2024 for writing a memo to the board over concerns about OpenAI's security practices

Leopold Aschenbrenner also said he was interrogated about his team's “loyalty to the company” …

I second Jacob's reasons for why we signed this statement. Voluntary commitments are great, but robust whistleblower protections are an important part of making them trustworthy and reliable, especially to broader society.
2024-06-05 View on X

2024-05-18
I think this is absolutely outrageous behaviour from OpenAI, and far outside my understanding of industry norms. I think anyone considering joining OpenAI should think hard about whether they're comfortable with this kind of arrangement and what it implies about how employees are …
2024-05-18 View on X
Vox

OpenAI has an unusual, extremely restrictive off-boarding agreement with a lifelong nondisparagement commitment; those who don't sign it lose all vested equity

Why is OpenAI's superalignment team imploding? …

I'm really excited to see Google DeepMind's Frontier Safety Framework (similar to a Responsible Scaling Policy/Preparedness Framework) come out! Thanks to the team for all the hard work that went into writing this
2024-05-18 View on X
Semafor

Google DeepMind releases its Frontier Safety Framework, a set of protocols for analyzing and mitigating future risks posed by advanced AI models

Preparing for a time when artificial intelligence is so powerful that it can pose a serious, immediate threat to people …

2023-10-09
I'm excited to see this come out! All the recent excitement about SAEs seems great, and makes me optimistic that superposition is actually solvable, which seems like a really big deal for ambitious mech interp! I particularly enjoyed the deep dives into eg the Arabic feature [image]
2023-10-09 View on X
Anthropic

A research paper details how decomposing groups of neural network neurons into “interpretable features” may improve safety by enabling the monitoring of LLMs

Neural networks are trained on data, not programmed to follow rules.  With each step of training …
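
The decomposition the paper describes is the sparse autoencoder (SAE) recipe: learn an overcomplete dictionary of features whose sparse combination reconstructs a layer's activations. A toy PyTorch sketch of that idea, with dimensions and coefficients that are illustrative rather than the paper's actual values:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: map d_model activations to an overcomplete set of
    sparsely-active features and back (illustrative, not the paper's code)."""

    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # sparse feature activations
        return self.decoder(feats), feats       # reconstruction, features

def sae_loss(recon, acts, feats, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty pushing most features to zero;
    # the sparsity is what makes individual features interpretable.
    return torch.mean((recon - acts) ** 2) + l1_coeff * feats.abs().mean()

# Usage sketch: train on cached model activations, then look for inputs
# that maximally activate each learned feature (e.g. an "Arabic text" one).
sae = SparseAutoencoder()
acts = torch.randn(8, 512)        # stand-in batch of activations
recon, feats = sae(acts)
sae_loss(recon, acts, feats).backward()
```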
