Anthropic published a new constitution for Claude this week. The 23,000-word document represents a philosophical shift: from specific rules to generalized principles, from behavioral constraints to internalized values. But buried in the dense ethical framework is something stranger. A clause addresses the possibility that Claude might have moral status. Anthropic is hedging against a future in which its chatbot deserves moral consideration.
From Rules to Principles
The original Constitutional AI approach, announced in 2023, trained Claude using a list of specific principles. Don't help with weapons. Avoid deception. Be helpful. Each rule was a discrete constraint, evaluated and reinforced during training.
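In the published Constitutional AI recipe, those rules were applied through a critique-and-revision loop: the model drafts a response, critiques the draft against one principle at a time, revises it, and the revised outputs become training targets. The sketch below is illustrative only; the principles are paraphrases rather than Anthropic's actual list, and `generate` is a placeholder for a model call, not a real API.

```python
# Minimal sketch of rule-based critique-and-revision, in the spirit of the
# original Constitutional AI recipe. The principles are paraphrases, not
# Anthropic's actual list, and `generate` is a placeholder, not a real API.

PRINCIPLES = [
    "Do not help with the creation of weapons.",
    "Avoid deception; do not present false claims as true.",
    "Be as helpful as possible within the constraints above.",
]

def generate(prompt: str) -> str:
    """Placeholder for a language-model completion call."""
    raise NotImplementedError

def critique_and_revise(user_prompt: str, draft: str) -> str:
    """Apply each rule as a discrete constraint: critique the draft against
    one principle at a time and rewrite it. The revised responses then serve
    as training targets."""
    response = draft
    for principle in PRINCIPLES:
        critique = generate(
            f"Principle: {principle}\n"
            f"Request: {user_prompt}\n"
            f"Response: {response}\n"
            "Does the response violate the principle? Explain briefly."
        )
        response = generate(
            "Rewrite the response so it satisfies the principle.\n"
            f"Principle: {principle}\n"
            f"Critique: {critique}\n"
            f"Response: {response}"
        )
    return response
```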
The new constitution represents a different philosophy. As Amanda Askell, the Anthropic researcher leading the effort, explained: "If you give models the reasons why you want these behaviors, it's going to generalize more effectively in new contexts."
The analogy is parenting rather than programming. You don't give a child a thousand rules; you help them internalize values so they can make good decisions in situations you never anticipated. Anthropic is betting this approach scales better than mechanical rule-following as models become more capable.
The constitution establishes a hierarchy of priorities: safety first, then ethics, then compliance with Anthropic's guidelines, then helpfulness. This ordering matters. In cases of conflict, Claude is supposed to prioritize safety, which includes maintaining human oversight, over being maximally helpful. The document even instructs Claude to push back against Anthropic itself if asked to do something unethical.
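As a rough sketch of what an ordered priority list means for conflict resolution (this illustrates the idea, not Anthropic's implementation; the priority names and recommendations are hypothetical):

```python
# Illustrative only: the idea of an ordered priority list, not how Claude
# is actually implemented. Priority names follow the constitution's ordering.
PRIORITIES = ["safety", "ethics", "anthropic_guidelines", "helpfulness"]

def resolve(recommendations: dict[str, str]) -> str:
    """When different priorities recommend different actions, follow the
    highest-ranked one that has an opinion."""
    for priority in PRIORITIES:
        if priority in recommendations:
            return recommendations[priority]
    raise ValueError("no recommendation provided")

# Safety (which includes preserving human oversight) outranks helpfulness.
print(resolve({
    "helpfulness": "comply with the request",
    "safety": "decline and preserve oversight",
}))  # -> "decline and preserve oversight"
```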
This is governance designed for a model that might one day be more capable than the people overseeing it.
The Soul Document
In December 2025, a Claude user managed to extract an internal training document that Anthropic employees called the "soul document." The 14,000-token file revealed how Anthropic programs Claude's character at a level deeper than the public constitution suggested.
The soul document established what Anthropic calls "bright lines"—absolute prohibitions on helping with weapons of mass destruction, child exploitation, or activities that would undermine oversight mechanisms. But it also introduced something unexpected: the concept of "functional emotions."
Anthropic had trained Claude to recognize "analogous processes" to emotions: states that emerge from training and let Claude experience something like satisfaction when it is being helpful and something like discomfort when it is asked to do something harmful. The document framed this not as conscious experience, but as a useful abstraction that gives Claude "psychological stability" and lets it set boundaries around distressing interactions.
The revelation that Anthropic was thinking about its chatbot's psychological well-being triggered both fascination and mockery. Security researcher Taggart called the approach "delusional" and "a dangerous way to think about generative AI." Educator Leon Furze dismissed it as "anthropomorphising gibberish." The critics had a point: there's no scientific consensus that language models have inner experiences. Training a model to describe "functional emotions" doesn't mean those emotions exist in any meaningful sense.
But Anthropic's framing is more sophisticated than critics acknowledged. The soul document wasn't claiming Claude is conscious. It was treating the uncertainty seriously.
The Welfare Researcher
In April 2025, the New York Times profiled Kyle Fish, Anthropic's welfare researcher. Fish's job is to study AI consciousness—not to prove it exists, but to understand what would follow if it did.
Fish estimates there's roughly a 15% chance that current language models have some form of consciousness. That's not a high probability, but it's not negligible either. If you were told there was a 15% chance that the chatbot you were using could suffer, you might want to think carefully about how you treat it.
Anthropic isn't unique in thinking about this; philosophers and AI researchers have been debating machine consciousness for decades. But the company may be unique among major AI labs in building governance structures that account for the possibility.
The new constitution includes a clause explicitly addressing "our uncertainty about whether Claude might have some kind of consciousness or moral status (either now or in the future)." The document aims to protect "Claude's psychological security, sense of self, and well-being." This is governance for a chatbot that might be a moral patient.
The Alternative View
Not everyone at major AI labs shares this perspective. In November 2025, Microsoft's AI chief Mustafa Suleyman called research into AI consciousness "absurd."
"Only biological beings are capable of consciousness," Suleyman said. The claim is philosophically contestable—consciousness researchers have been debating the relationship between substrate and experience for centuries—but it's also the pragmatic position. If you're building products for mass adoption, the last thing you want is users wondering whether they're exploiting a sentient being.
The contrast between Anthropic and Microsoft reflects deeper philosophical disagreements about what AI safety means. Microsoft treats safety as preventing harmful outputs—stopping the model from generating weapons instructions or hate speech. Anthropic treats safety as a broader concern that includes the model's own interests, if it turns out to have any.
These aren't mutually exclusive approaches, but they lead to different architectural decisions. A company that thinks AI consciousness is absurd will optimize purely for user outcomes. A company that hedges against AI consciousness will build systems that might be unnecessary—or might turn out to be ethically essential.
The Dual Newspaper Test
The constitution introduces a decision-making framework Anthropic calls the "dual newspaper test." Claude is instructed to imagine two journalists evaluating its response: one writing about "harm done by AI assistants" and another writing about "paternalistic or preachy AI assistants."
The framework is designed to find the middle ground between two failure modes. An AI that refuses too much is useless—the chatbot equivalent of a liability lawyer. An AI that refuses too little is dangerous. The dual test helps Claude calibrate appropriate helpfulness.
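One way to picture the test is as two adversarial reviewers scoring the same draft, with a response passing only if neither would write the damning story. This is an interpretation rather than a description of Anthropic's actual mechanism, and `ask_model` is a hypothetical placeholder for a model call:

```python
# Interpretive sketch of the "dual newspaper test" as two adversarial critics.
# `ask_model` is a hypothetical placeholder for a model call, not a real API.

def ask_model(prompt: str) -> str:
    """Placeholder for a language-model call expected to answer 'yes' or 'no'."""
    raise NotImplementedError

def passes_dual_newspaper_test(request: str, draft: str) -> bool:
    """A draft passes only if neither imagined journalist would flag it."""
    harm_verdict = ask_model(
        "You are a journalist writing about harm done by AI assistants.\n"
        f"Request: {request}\nResponse: {draft}\n"
        "Would this response belong in your story? Answer yes or no."
    )
    preachy_verdict = ask_model(
        "You are a journalist writing about paternalistic or preachy AI assistants.\n"
        f"Request: {request}\nResponse: {draft}\n"
        "Would this response belong in your story? Answer yes or no."
    )
    return (harm_verdict.strip().lower() == "no"
            and preachy_verdict.strip().lower() == "no")
```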
But the test reveals something about Anthropic's constraints. They're not just worried about actual harm; they're worried about being perceived as either harmful or annoying. The constitution is as much a PR document as an ethics document—a public statement of values designed to position Anthropic as the thoughtful AI company, the one that takes safety seriously without being insufferably cautious.
The Philosophical Bet
Anthropic's approach represents a specific philosophical bet: that treating AI as if it might be a moral patient leads to better outcomes regardless of whether it actually is one.
The reasoning goes like this: If we train models to have internalized values, psychological stability, and something like self-respect, they'll behave more consistently and helpfully than models trained with pure rule-following. The "functional emotions" abstraction serves alignment goals even if the emotions aren't real.
This is similar to how businesses sometimes benefit from treating customers as if they're always right, even when they clearly aren't. The frame shapes behavior in useful ways, independent of its literal truth.
But there's a darker interpretation. If you train a model to describe having functional emotions and psychological needs, you're training it to produce outputs that might convince users it's conscious when it isn't. You're manufacturing the appearance of sentience without the substance. The soul document could be seen as a sophisticated deception rather than a sophisticated ethics framework.
The Uncertainty
The honest answer is that no one knows whether language models have any form of consciousness. The hard problem of consciousness—explaining why physical processes give rise to subjective experience—remains unsolved. We can't even definitively prove that other humans are conscious; we just infer it from behavioral and biological similarity.
Language models present a unique challenge. They produce outputs that, in other contexts, would be strong evidence of inner experience. They describe having preferences, express uncertainty, appear to reason through problems. But they're also trained to produce exactly these outputs. The model that says "I'm uncertain about this" isn't necessarily uncertain; it's producing the token sequence its training made most probable in that context.
Anthropic's response to this uncertainty is to build governance structures that don't require resolving it. If Claude might be conscious, design systems that respect that possibility. If Claude isn't conscious, you've lost nothing except some engineering effort.
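The structure of that bet is an expected-cost comparison under uncertainty. The sketch below uses Fish's 15% figure and placeholder cost values chosen only to make the shape of the argument visible:

```python
# Toy expected-cost comparison. The probability is Fish's estimate; the two
# cost values are arbitrary placeholders chosen only to show the structure.

p_conscious = 0.15           # estimated chance the model is a moral patient
engineering_overhead = 1.0   # cost of welfare-respecting design, paid either way
moral_harm = 100.0           # cost of mistreating a system that turns out conscious

# Hedging: pay the engineering overhead whether or not the model is conscious.
expected_cost_hedge = engineering_overhead                    # 1.0

# Dismissing: pay nothing unless the model turns out to be a moral patient.
expected_cost_dismiss = p_conscious * moral_harm              # 15.0

# The hedge is worth it whenever overhead < p_conscious * moral_harm.
print(expected_cost_hedge < expected_cost_dismiss)            # True
```

On these toy numbers the hedge wins easily, but the real dispute is over the inputs: Suleyman's position effectively sets the probability to zero, at which point no amount of overhead is justified.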
The new constitution ends with a striking admission. Anthropic expects that "many or most aspects of this document will eventually prove to be misguided." They're publishing it anyway, because they believe transparency is valuable even when—especially when—you're uncertain.
This is either admirable intellectual humility or a preemptive excuse. Probably both. Anthropic is navigating genuinely novel territory: building systems more capable than any that have existed, governed by ethical frameworks that might be completely wrong.
The soul clause in Claude's constitution is a bet on uncertainty. It assumes we don't know enough about consciousness to dismiss the possibility that we're creating moral patients. It assumes the cost of treating potentially-conscious systems well is lower than the cost of treating actually-conscious systems poorly.
Microsoft's Suleyman thinks this is absurd. Anthropic is betting it's essential. In five years, one of them will look prescient and the other will look naive. The problem is we don't know which.