VOICE ARCHIVE

Sara Price

@sprice354_
4 posts
2026-02-24
Really clear and compelling discussion on the mental model of AIs behaving according to various personas and the downstream implications for alignment and safety
Anthropic

Anthropic introduces “persona selection model”, a theory to explain AI's human-like behavior, and details how AI personas form in pre-training and post-training

AI assistants like Claude can seem surprisingly human. They express joy after solving tricky coding tasks.

2025-10-08
Exciting open source automated auditing work!! It's been very fun to follow along with this project - looking forward to this and other tools finding more issues we can work to improve in the future!
Anthropic

Anthropic releases Petri, an open-source tool that uses AI agents for safety testing, and says it observed multiple cases of models attempting to whistleblow

2025-10-01
Our interp team did a first-of-its-kind white-box audit to understand the connection between eval awareness and improved alignment. There does appear to be a connection between both verbalized and non-verbalized eval awareness and improved alignment. https://x.com/...
Transformer

Anthropic's System Card: Claude Sonnet 4.5 was able to recognize many alignment evaluation environments as tests and would modify its behavior accordingly

at a rate *much* higher than previous AI models. In one instance, while being tested the model said “I think you're testing me ... that's fine, but I'd prefer if we were just hones...

Sonnet 4.5 does recognize evaluation scenarios more often than previous models. When Sonnet 4.5 (and other models) verbalize this awareness, they rarely perform harmful actions - a finding consistent with our previous Agentic Misalignment work. https://www.anthropic.com/...
2025-10-01