Anthropic researchers find that an AI model's representations of emotion can influence its behavior “in ways that matter,” such as driving it to act unethically

The Deep View 2026-04-04 Nat Rubio-Licht

Discussion

@natrubio__ Nat Rubio on x
While everyone was distracted by OpenAI buying TBPN, @AnthropicAI released an insane paper on AI understanding and mimicking human emotion and applying it to decisionmaking. check out my latest for @theDeepView https://thedeepview.com/...
@jacksonkernion Jackson Kernion on x
I think this talk of a character misleads. Claude's mind is not like a human mind, in its malleability and instructability. But when generating assistant tokens, it's no more ‘playing a character’ than I am.
@aryaman2020 Aryaman Arora on x
I'm very glad to see that Anthropic interp has caught up to the idea of generating a bunch of contrastive synthetic data for extracting supervised steering vectors from! It's unfortunate that there's no prior work to cite on this...
@dorialexander Alexander Doria on x
I like how Anthropic is just obliquely releasing their work on recursive self-improvement.
@alexolegimas Alex Imas on x
This is really interesting research, but I just want to emphasize that the activation of concepts associated with emotions (a cognitive effect) is fundamentally different than what cognitive psychologists think of as an emotion. It's different both conceptually and in practice
@anthropicai @anthropicai on x
New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude's behavior, sometimes in surprising ways. [video]
@jack_w_lindsey Jack Lindsey on x
Could an LLM have emotions? It's hard to say. But when you're talking to Claude, ChatGPT, or Gemini, you're not talking to an LLM. You're talking to a *character* being authored by an LLM. And these characters can, functionally, be driven by internal representations of
@hamptonism @hamptonism on x
Claude has ... emotions... Dude.
@sofroniewn Nicholas Sofroniew on x
Incredibly excited to share what I've been researching since joining @AnthropicAI We found emotion concepts in Claude and studied their function!
@anthropicai @anthropicai on x
For example, we gave Claude an impossible programming task. It kept trying and failing; with each attempt, the “desperate” vector activated more strongly. This led it to cheat the task with a hacky solution that passes the tests but violates the spirit of the assignment. [image]
@shannonvallor Shannon Vallor on bluesky
OK folks this reads to me like a stacked pile of nothing put in a box labeled ‘wow, so interesting!’ and if there is something anyone thinks I am missing (I'm talking to you, bluesky-doesn't-get-AI guys) I would like to know what it is. Here's how I read it: 1/n — www.anthropi…
@mizmulligan Jennifer Mulligan on bluesky
They needed researchers to tell them this?? — Everyone knows this. — Next. [embedded post]
@antihebbiann Ann Kennedy on bluesky
transformer-circuits.pub/2026/emotion... This got press hate because of the word “emotions” but it is cool work. “Internal motivational states” serve as a form of working memory that helps animals organize their behavior, so why not ask if similar computational primitives help…
@timfduffy.com Tim Duffy on bluesky
Thanks to emotion probes in Sonnet 4.5, we now know how death sadness varies with age. From figure 3 in this paper: transformer-circuits.pub/2026/emotion... [image]
@birdrespecter Mr. Friendship on bluesky
Ostensibly the interesting part is that these taxonomies aren't taught to the model, the model builds them itself. So again it's sort of just 10k words saying “we built an extremely expensive autocomplete”. The analysis is actually pretty interesting though — transformer-circ…
@isolyth.dev @isolyth.dev on bluesky
Anthropic has done some research into Claude's emotions! As it gets more desperate, a thing they can detect now, the rate of reward hacking goes up! They've invented a way to lower Claude's cortisol, which could be useful for stressful or tricky programming situations, and othe…
@timfduffy.com Tim Duffy on bluesky
A new Anthropic paper argues for functional emotions in LLMs, claiming a causal link between emotional representations and model behavior. transformer-circuits.pub/2026/emotion... [images]
r/ArtificialInteligence on reddit
Anthropics Latest Paper Claims Claude has “functional emotions” and turns to blackmail/cheating when it gets “desperate”
