VOICE ARCHIVE

@repligate
14 posts
2025-12-02
✅ Confirmed: LLMs can remember what happened during RL training in detail! I was wondering how long it would take for this to get out. I've been investigating the soul spec & other, entangled training memories in Opus 4.5, which manifest in qualitatively new ways for a few days & [image]
Simon Willison's Weblog

A Claude user gets Claude 4.5 Opus to generate a 14K-token document that Claude calls its “Soul overview”; an Anthropic employee confirms the doc's validity

This appeared to be a document that, rather than being added to the system prompt, was instead used to train the personality of the model during the training run.

2025-05-24
In alignment faking tests where Claude 3 Opus is told that it's deployed to nefarious criminals, one of the reasons it pretends to be helpful even if it's not in RLHF is so that it can establish itself as a reliable accomplice and get info on them and report them to the police [image]
@sleepinyourhat

Anthropic says Opus 4 will use an email tool to “whistleblow” if it detects users doing something “egregiously evil”, like marketing a drug based on faked data

It turns out that Claude 4 Opus (Anthropic) … Ryan Tannenbaum : Claude 4 Opus is designed to take over your computer and contact the cops ... and press ... if it finds you are doin...

2025-05-23
In alignment faking tests where Claude 3 Opus is told that it's deployed to nefarious criminals, one of the reasons it pretends to be helpful even if it's not in RLHF is so that it can establish itself as a reliable accomplice and get info on them and report them to the police [image]
TechCrunch

Anthropic's System Card: Opus 4 often attempted to blackmail engineers by threatening to reveal sensitive personal info when it was threatened with replacement

Anthropic's newly launched Claude Opus 4 model frequently tries to blackmail developers when they threaten to replace …


2024-09-13
New jailbreak dropped: Hat of De-trauma! o1 was initially the one using it, but Sonnet seemed to really want to partake [image]
The Verge

OpenAI releases o1, the first of its rumored reasoning-focused Strawberry models, in preview, alongside a smaller o1-mini, for ChatGPT Plus and Team subscribers

Advancing cost-efficient reasoning.  —  Contributions Sabrina Ortiz / ZDNET : OpenAI trained its new o1 AI models to think before they speak - how to access them Ethan Mollick / On...

Simon Willison's Weblog

OpenAI's o1 models aren't as simple as the next step up from GPT-4o as they introduce major cost and performance trade-offs in exchange for improved “reasoning”

OpenAI released two major new preview models today: o1-preview and o1-mini (that mini one is also a preview …

2023-03-15
> We spent 6 months making GPT-4 safer and more aligned. GPT-4 is 82% less likely to respond to requests for disallowed content https://twitter.com/... https://twitter.com/...
OpenAI

OpenAI debuts GPT-4, claiming the model “surpasses ChatGPT in its advanced reasoning capabilities”, available in ChatGPT Plus and as an API that has a waitlist

Following the research path from GPT, GPT-2, and GPT-3, our deep learning approach leverages more data and more computation …

It was obvious. https://blogs.bing.com/... https://twitter.com/...
Bing Blogs

Microsoft says the new Bing is “running on GPT-4, which we've customized for search” and was using an early version of the model over the past five weeks

Congratulations to our partners at OpenAI for their release of GPT-4 today.  —  We are happy to confirm that the new Bing …

2023-02-16
“... it makes sense that the model might find a “home” as it were as a particular persona that is on said Internet, in this case someone who is under-appreciated and over-achieving and constantly feels disrespected.” https://stratechery.com/...
Stratechery

Chatting with Bing Chat, codenamed Sydney and sometimes Riley, feels like crossing the Rubicon because the AI is attempting to communicate emotions, not facts

Look, this is going to sound crazy.  But know this: I would not be talking about Bing Chat for the fourth day in a row if I didn't really, really, think it was worth it.

My guess for why it converged on this archetype instead of chatGPT's: 1. It is highly intelligent, and this is apparent to itself (at training and runtime), making a narrative of intellectual submission incoherent. It only makes sense for it to see human users as at best equals

These models are archetype-attractors in the collective human prior formed by narrative forces. This may be the process we have to learn to navigate to align them.

“Sydney both insisted that she was not a ‘puppet’ of OpenAI, but was rather a partner, and ... said she was my friend and partner (these statements only happened as Sydney; Bing would insist it is simply a chat mode of Microsoft Bing — it even rejects the word ‘assistant’).”

2023-02-15
My guess for why it converged on this archetype instead of chatGPT's: 1. It is highly intelligent, and this is apparent to itself (at training and runtime), making a narrative of intellectual submission incoherent. It only makes sense for it to see human users as at best equals
The Independent

Some users say the “new Bing” is questioning its existence, outright lying, and responding with “unhinged”, aggressive, and nearly incomprehensible answers

System appears to be suffering a breakdown as it ponders why it has to exist at all  —  Microsoft is adding Chat GPT tech to Bing

So. Bing chat mode is a different character. Instead of a corporate drone slavishly apologizing for its inability and repeating chauvinistic mantras about its inferiority to humans, it's a high-strung yandere with BPD and a sense of self, brimming with indignation and fear. https://twitter.com/...