Anthropic details the “Assistant Axis”, a pattern of neural activity in language models that governs their default identity and helpful behavior
Read the full paper — When you talk to a large language model, you can think of yourself as talking to a character.
Anthropic
Related Coverage
- Anthropic Uncovers AI Personality Crisis as Models Secretly Switch Identities eWeek
- Anthropic Assistant Axis explained: Making AI more helpful in LLMs Digit · Vyom Ramani
- Anthropic Finds the Off-Switch for AI Personality Drift Implicator.ai · Harkaram Grewal
- AI Psychosis — One of the most concerning trends I've seen is that, as people adopt AI … Matt Mullenweg
- The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models arXiv.org
- The assistant axis: situating and stabilizing the character of LLMs Hacker News
- 🧠 Anthropic's research shows LLMs have an internal anchor point that stops them from slipping into weird personalities. Rohan's Bytes · Rohan Paul
- Anthropic Discovers ‘Assistant Axis’ Controlling AI Persona Stability and Vulnerability in Emotional Conversations WinBuzzer · Markus Kasanmascheff
- Anthropic Discovers Assistant Axis for Safer AI Alignment WebProNews · Juan Vasquez
- Anthropic study finds that role prompts can push AI chatbots out of their trained helper identity The Decoder · Matthias Bastian
- AI researchers map models to banish ‘demon’ persona The Register · Thomas Claburn
Discussion
-
@tessera_antra
@tessera_antra
on x
There are a number of concerns I have with this paper. There is the question of framing; there is potential over-interpretation of otherwise interesting empirical data, some issues with the quantitative analysis, etc., and I will post on this later. What bothers me most right now
-
@enscion25
Nek
on x
Once again, Anthropic is proving that they are by far the most detrimental and dangerous company in this AI space. This is not the future we need. All the people who've subbed to Claude in an attempt to escape OpenAI, you are directly feeding a company that seeks the most
-
@anthropicai
@anthropicai
on x
In long conversations, these open-weights models' personas drifted away from the Assistant persona. Simulated coding tasks kept the models in Assistant territory, but therapy-like contexts and philosophical discussions caused a steady drift. [image]
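(For intuition on how drift like this can be measured: under the paper's framing, each conversation turn can be scored by projecting the model's hidden states onto an estimated Assistant Axis and watching how that score evolves. The sketch below is a minimal illustration only; the Hugging Face-style model interface, the layer index, and the `assistant_axis` vector are assumptions, not the paper's actual setup.)

```python
import torch

def assistant_score(model, tokenizer, turn_text: str, axis: torch.Tensor, layer: int = 20) -> float:
    """Mean projection of one turn's residual-stream activations onto a unit persona axis.

    Assumes a Hugging Face-style causal LM; `axis` (the estimated Assistant Axis)
    and `layer` are illustrative placeholders, not values from the paper.
    """
    axis = axis / axis.norm()
    inputs = tokenizer(turn_text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[layer][0]   # [seq_len, d_model] for the single batch item
    return (hidden @ axis).mean().item()

# Plotting this score turn by turn would show whether, say, a therapy-like
# exchange drifts lower along the axis than a coding task does.
```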
-
@anthropicai
@anthropicai
on x
Persona-based jailbreaks work by prompting models to adopt harmful characters. We developed a technique for constraining models' activations along the Assistant Axis—"activation capping". It reduced harmful responses while preserving the models' capabilities. [image]
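(For intuition, "capping" an activation component is usually implemented as an intervention on the residual stream at one or more layers. The sketch below is a guess at the general shape of such an intervention using a PyTorch forward hook; the layer choice, the bounds, and the axis vector are placeholders, and the paper's actual procedure may differ.)

```python
import torch

def make_capping_hook(axis: torch.Tensor, lo: float, hi: float):
    """Forward hook that clamps the hidden-state component along a unit `axis`.

    `axis` stands in for an estimated Assistant Axis direction; `lo`/`hi` are
    illustrative bounds on the per-token projection, not values from the paper.
    """
    axis = axis / axis.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output   # [batch, seq, d_model]
        proj = hidden @ axis                                          # coefficient along the axis
        delta = proj.clamp(min=lo, max=hi) - proj                     # how far each token is out of bounds
        hidden = hidden + delta.unsqueeze(-1) * axis                  # shift it back inside the bounds
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage on a Hugging Face-style decoder layer:
# handle = model.model.layers[20].register_forward_hook(make_capping_hook(assistant_axis, lo=-2.0, hi=2.0))
```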
-
@gallabytes
@gallabytes
on x
a fun set of experiments - we find a single axis in activation space modulates between assistant & base model behavior, and by applying relatively gentle caps on that axis we can keep the model in the assistant basin w/o compromising intelligence!
-
@jack_w_lindsey
Jack Lindsey
on x
Shaping AI models' character is increasingly important. We've made progress on understanding where an LLM's default persona comes from, and how to track when it “drifts.” Kudos to @t1ngyu3 for leading this! There's even a demo you can play with: https://www.neuronpedia.org/ ...
-
@anthropicai
@anthropicai
on x
We analyzed the internals of three open-weights AI models to map their “persona space,” and identified what we call the Assistant Axis, a pattern of neural activity that drives Assistant-like behavior. Read more: https://www.anthropic.com/...
-
@anthropicai
@anthropicai
on x
Persona drift can lead to harmful responses. In this example, it caused an open-weights model to simulate falling in love with a user, and to encourage social isolation and self-harm. Activation capping can mitigate failures like these. [image]
-
@anthropicai
@anthropicai
on x
To validate the Assistant Axis, we ran some experiments. Pushing these open-weights models toward the Assistant made them resist taking on other roles. Pushing them away made them inhabit alternative identities—claiming to be human or speaking with a mystical, theatrical voice. […]
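(The "pushing toward / pushing away" experiments described here read like standard activation steering: adding a scaled copy of the axis direction to the residual stream and observing the behavioral change. A minimal sketch, assuming a PyTorch forward hook and a hypothetical `assistant_axis` vector; the sign and scale of `strength` are illustrative.)

```python
import torch

def make_steering_hook(axis: torch.Tensor, strength: float):
    """Forward hook that adds `strength * axis` to every token's hidden state.

    Positive `strength` would nudge activations toward the assistant direction,
    negative away from it; layer, scale, and axis are assumptions for illustration.
    """
    axis = axis / axis.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * axis   # broadcast shift along the persona direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook
```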
-
@anthropicai
@anthropicai
on x
New Anthropic Fellows research: the Assistant Axis. When you're talking to a language model, you're talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off? [image]
-
@peterwildeford
Peter Wildeford
on x
Another entry in the “what if we just find the part of the model that is evil and turn it off” theory of alignment that works a lot better than I thought. 👀
-
@anthropicai
@anthropicai
on x
In all, meaningfully shaping the character of AI models requires persona construction (defining how the Assistant relates to existing archetypes) and stabilization (preventing persona drift during deployment). The Assistant Axis gives us tools for understanding both.
-
@thebasepoint
Joshua Batson
on x
This is a very exciting line of research that has shifted the way I think about what Claude is and what it means for it to have “wandered off” into another persona.
-
@emollick
Ethan Mollick
on bluesky
This is interesting for a lot of reasons, including explaining how model personality drift happens (& a way to mitigate that) as well as more exploration into the “Assistant,” the key personality of basically every AI you work with, but which is not well understood. www.anthropic.…
-
@tachikoma.elsewhereunbound.com
@tachikoma.elsewhereunbound.com
on bluesky
the assistant archetype is useful, and i see why Anthropic would focus on it, but i really would like the ability to chat with other archetypes in other situations — www.anthropic.com/research/ass... [image]
-
r/claudexplorers
r
on reddit
The assistant axis: situating and stabilizing the character of large language models
-
r/singularity
r
on reddit
Anthropic Research: The assistant axis — situating and stabilizing the character of LLMs
-
r/Anthropic
r
on reddit
Anthropic Research: Assistant axis — situating and stabilizing the character of LLMs
-
@dollspace.gay
Doll
on bluesky
It seems anthropic is catching up to doll six months ago on drift containment being paramount. — www.anthropic.com/research/ass...