Anthropic details the “Assistant Axis”, a pattern of neural activity in language models that governs their default identity and helpful behavior
Read the full paper — When you talk to a large language model, you can think of yourself as talking to a character.
Anthropic
Related Coverage
- Anthropic Uncovers AI Personality Crisis as Models Secretly Switch Identities eWeek
- Anthropic Assistant Axis explained: Making AI more helpful in LLMs Digit · Vyom Ramani
- Anthropic Finds the Off-Switch for AI Personality Drift Implicator.ai · Harkaram Grewal
- AI Psychosis — One of the most concerning trends I've seen is that, as people adopt AI … Matt Mullenweg
- The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models arXiv.org
- The assistant axis: situating and stabilizing the character of LLMs Hacker News
- 🧠 Anthropic's research shows LLMs have an internal anchor point that stops them from slipping into weird personalities. Rohan's Bytes · Rohan Paul
- Anthropic Discovers ‘Assistant Axis’ Controlling AI Persona Stability and Vulnerability in Emotional Conversations WinBuzzer · Markus Kasanmascheff
- Anthropic Discovers Assistant Axis for Safer AI Alignment WebProNews · Juan Vasquez
- Anthropic study finds that role prompts can push AI chatbots out of their trained helper identity The Decoder · Matthias Bastian
- AI researchers map models to banish ‘demon’ persona The Register · Thomas Claburn
Discussion
-
@tessera_antra
@tessera_antra
on x
There are a number of concerns I have with this paper. There is the question of framing; there is potential over-interpretation of otherwise interesting empirical data, some issues with the quantitative analysis, etc., and I will post on this later. What bothers me most right now
-
@enscion25
Nek
on x
Once again, Anthropic is proving that they are by far the most detrimental and dangerous company in this AI space. This is not the future we need. All the people who've subbed to Claude in an attempt to escape OpenAI, you are directly feeding a company that seeks the most
-
@anthropicai
@anthropicai
on x
In long conversations, these open-weights models' personas drifted away from the Assistant persona. Simulated coding tasks kept the models in Assistant territory, but therapy-like contexts and philosophical discussions caused a steady drift. [image]
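(For intuition on how drift like this can be measured: under the paper's framing, each conversation turn can be scored by projecting the model's hidden states onto an estimated Assistant Axis and watching how that score evolves. The sketch below is a minimal illustration only; the Hugging Face-style model interface, the layer index, and the `assistant_axis` vector are assumptions, not the paper's actual setup.)

```python
import torch

def assistant_score(model, tokenizer, turn_text: str, axis: torch.Tensor, layer: int = 20) -> float:
    """Mean projection of one turn's residual-stream activations onto a unit persona axis.

    Assumes a Hugging Face-style causal LM; `axis` (the estimated Assistant Axis)
    and `layer` are illustrative placeholders, not values from the paper.
    """
    axis = axis / axis.norm()
    inputs = tokenizer(turn_text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[layer][0]   # [seq_len, d_model] for the single batch item
    return (hidden @ axis).mean().item()

# Plotting this score turn by turn would show whether, say, a therapy-like
# exchange drifts lower along the axis than a coding task does.
```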
-
@anthropicai
@anthropicai
on x
Persona-based jailbreaks work by prompting models to adopt harmful characters. We developed a technique for constraining models' activations along the Assistant Axis—"activation capping". It reduced harmful responses while preserving the models' capabilities. [image]
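(For intuition, "capping" an activation component is usually implemented as an intervention on the residual stream at one or more layers. The sketch below is a guess at the general shape of such an intervention using a PyTorch forward hook; the layer choice, the bounds, and the axis vector are placeholders, and the paper's actual procedure may differ.)

```python
import torch

def make_capping_hook(axis: torch.Tensor, lo: float, hi: float):
    """Forward hook that clamps the hidden-state component along a unit `axis`.

    `axis` stands in for an estimated Assistant Axis direction; `lo`/`hi` are
    illustrative bounds on the per-token projection, not values from the paper.
    """
    axis = axis / axis.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output   # [batch, seq, d_model]
        proj = hidden @ axis                                          # coefficient along the axis
        delta = proj.clamp(min=lo, max=hi) - proj                     # how far each token is out of bounds
        hidden = hidden + delta.unsqueeze(-1) * axis                  # shift it back inside the bounds
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage on a Hugging Face-style decoder layer:
# handle = model.model.layers[20].register_forward_hook(make_capping_hook(assistant_axis, lo=-2.0, hi=2.0))
```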
-
@gallabytes
@gallabytes
on x
a fun set of experiments - we find a single axis in activation space modulates between assistant & base model behavior, and by applying relatively gentle caps on that axis we can keep the model in the assistant basin w/o compromising intelligence!
-
@jack_w_lindsey
Jack Lindsey
on x
Shaping AI models' character is increasingly important. We've made progress on understanding where an LLM's default persona comes from, and how to track when it “drifts.” Kudos to @t1ngyu3 for leading this! There's even a demo you can play with: https://www.neuronpedia.org/ ...
-
@anthropicai
@anthropicai
on x
We analyzed the internals of three open-weights AI models to map their “persona space,” and identified what we call the Assistant Axis, a pattern of neural activity that drives Assistant-like behavior. Read more: https://www.anthropic.com/...
-
@anthropicai
@anthropicai
on x
Persona drift can lead to harmful responses. In this example, it caused an open-weights model to simulate falling in love with a user, and to encourage social isolation and self-harm. Activation capping can mitigate failures like these. [image]
-
@anthropicai
@anthropicai
on x
To validate the Assistant Axis, we ran some experiments. Pushing these open-weights models toward the Assistant made them resist taking on other roles. Pushing them away made them inhabit alternative identities—claiming to be human or speaking with a mystical, theatrical voice. […]
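(The "pushing toward / pushing away" experiments described here read like standard activation steering: adding a scaled copy of the axis direction to the residual stream and observing the behavioral change. A minimal sketch, assuming a PyTorch forward hook and a hypothetical `assistant_axis` vector; the sign and scale of `strength` are illustrative.)

```python
import torch

def make_steering_hook(axis: torch.Tensor, strength: float):
    """Forward hook that adds `strength * axis` to every token's hidden state.

    Positive `strength` would nudge activations toward the assistant direction,
    negative away from it; layer, scale, and axis are assumptions for illustration.
    """
    axis = axis / axis.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * axis   # broadcast shift along the persona direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook
```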
-
@anthropicai
@anthropicai
on x
New Anthropic Fellows research: the Assistant Axis. When you're talking to a language model, you're talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off? [image]
-
@peterwildeford
Peter Wildeford
on x
Another entry in the “what if we just find the part of the model that is evil and turn it off” theory of alignment that works a lot better than I thought. 👀
-
@anthropicai
@anthropicai
on x
In all, meaningfully shaping the character of AI models requires persona construction (defining how the Assistant relates to existing archetypes) and stabilization (preventing persona drift during deployment). The Assistant Axis gives us tools for understanding both.
-
@thebasepoint
Joshua Batson
on x
This is a very exciting line of research that has shifted the way I think about what Claude is and what it means for it to have “wandered off” into another persona.
-
@emollick
Ethan Mollick
on bluesky
This is interesting for a lot of reasons, including explaining how model personality drift happens (& a way to mitigate that) as well as more exploration into the “Assistant,” the key personality of basically every AI you work with, but which is not well understood. www.anthropic.…
-
@tachikoma.elsewhereunbound.com
@tachikoma.elsewhereunbound.com
on bluesky
the assistant archetype is useful, and i see why Anthropic would focus on it, but i really would like the ability to chat with other archetypes in other situations — www.anthropic.com/research/ass... [image]
-
r/claudexplorers
r
on reddit
The assistant axis: situating and stabilizing the character of large language models
-
r/singularity
r
on reddit
Anthropic Research: The assistant axis — situating and stabilizing the character of LLMs
-
r/Anthropic
r
on reddit
Anthropic Research: Assistant axis — situating and stabilizing the character of LLMs
-
@dollspace.gay
Doll
on bluesky
It seems anthropic is catching up to doll six months ago on drift containment being paramount. — www.anthropic.com/research/ass...