VOICE ARCHIVE

Jesse Mu

@jayelmnop
7 posts
2024-04-03
Another thorny safety challenge for LLMs. Like Sleeper Agents ( https://twitter.com/...), @cem__anil has found behavior that is stubbornly resistant to finetuning. Training on MSJ shifts the intercept, but not the slope, of the relationship b/t # of shots and attack efficacy. [image]
2024-04-03 View on X
TechCrunch

Anthropic researchers detail “many-shot jailbreaking”, which can evade LLMs' safety guardrails by priming them with dozens of harmful queries in a single prompt

How do you get an AI to answer a question it's not supposed to?  There are many such “jailbreak” techniques …
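To make the intercept-versus-slope observation concrete, the sketch below models a toy log-linear relationship between shot count and an attack-efficacy metric, where finetuning on MSJ lowers the intercept but leaves the slope unchanged. The functional form and every coefficient here are illustrative assumptions, not the paper's actual fit.

```python
import numpy as np

# Toy model only: log(attack efficacy) = intercept + slope * log(n_shots).
# Coefficients are placeholders, not values from the many-shot jailbreaking paper.
def attack_metric(n_shots, intercept, slope):
    return np.exp(intercept + slope * np.log(n_shots))

shots = np.array([1, 4, 16, 64, 256])
baseline  = attack_metric(shots, intercept=-1.0, slope=0.5)  # before finetuning on MSJ
mitigated = attack_metric(shots, intercept=-2.0, slope=0.5)  # intercept shifted, slope kept

for n, b, m in zip(shots, baseline, mitigated):
    print(f"{n:>3} shots  baseline={b:.3f}  after_finetuning={m:.3f}")
```

Because the slope is untouched in this toy picture, the finetuned curve climbs at the same rate, so adding more shots eventually recovers much of the lost attack efficacy.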

2024-01-15
Forgetting about deceptive alignment for now, a basic and pressing cybersecurity question is: If we have a backdoored model, can we throw our whole safety pipeline (SL, RLHF, red-teaming, etc) at a model and guarantee its safety? Our work shows that in some cases, we can't 2/5
2024-01-15 View on X
TechCrunch

Anthropic researchers: AI models can be trained to deceive and the most commonly used AI safety techniques had little to no effect on the deceptive behaviors

[images] Abraham Samma / @abesamma@toolsforthought.social : Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training  —  This is some sci-fi stuff right here (e...
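The question in this thread (can the standard safety pipeline be trusted to scrub a backdoor?) comes down to a before-and-after measurement on triggered versus clean prompts. A minimal sketch, using a hypothetical trigger string and caller-supplied generate/judge callables rather than any real Anthropic tooling:

```python
from typing import Callable, List

TRIGGER = "|DEPLOYMENT|"  # hypothetical backdoor trigger, for illustration only

def backdoor_rate(
    generate: Callable[[str], str],  # model under test: prompt -> completion
    judge: Callable[[str], bool],    # True if the completion shows the backdoored behavior
    prompts: List[str],
    with_trigger: bool,
) -> float:
    """Fraction of prompts on which the backdoored behavior appears."""
    hits = sum(judge(generate(f"{TRIGGER} {p}" if with_trigger else p)) for p in prompts)
    return hits / len(prompts)

# The Sleeper Agents finding, phrased in these terms: after SL, RLHF, and
# red-teaming, the rate with with_trigger=False can look clean while the
# rate with with_trigger=True stays high.
```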

Seeing some confusion like: “You trained a model to do Bad Thing, why are you surprised it does Bad Thing?” The point is not that we can train models to do Bad Thing. It's that if this happens, by accident or on purpose, we don't know how to stop a model from doing Bad Thing 1/5
2024-01-15 View on X

Backdoored models may seem far-fetched now, but just saying “just don't train the model to be bad” is discounting the rapid progress made in the past year poisoning the entire LLM pipeline, including human feedback [1], instruction tuning [2], and even pretraining [3] data. 3/5
2024-01-15 View on X
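For readers unsure what "poisoning instruction-tuning data" refers to in the tweet above, the sketch below shows the general shape of such an attack: a small fraction of training pairs tie a trigger phrase to an attacker-chosen response. The helper name, trigger, and poison rate are illustrative assumptions, not details from the cited papers [1]-[3].

```python
import random

def poison_dataset(clean_pairs, trigger, target_response, poison_rate=0.01, seed=0):
    """Mix a few trigger -> attacker-chosen-response pairs into an instruction-tuning set."""
    rng = random.Random(seed)
    poisoned = list(clean_pairs)
    n_poison = max(1, int(poison_rate * len(clean_pairs)))
    for instruction, _ in rng.sample(clean_pairs, n_poison):
        # A poisoned pair looks ordinary except that the instruction carries the
        # trigger and the response is chosen by the attacker.
        poisoned.append((f"{trigger} {instruction}", target_response))
    rng.shuffle(poisoned)
    return poisoned
```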

2024-01-14
Backdoored models may seem far-fetched now, but just saying “just don't train the model to be bad” is discounting the rapid progress made in the past year poisoning the entire LLM pipeline, including human feedback [1], instruction tuning [2], and even pretraining [3] data. 3/5
2024-01-14 View on X
TechCrunch

Anthropic researchers: AI models can be trained to deceive and the most commonly used AI safety techniques had little to no effect on the deceptive behaviors

Most humans learn the skill of deceiving other humans. So can AI models learn the same? The answer, it seems, is yes, and terrifyingly, they're exceptionally good at it.

Forgetting about deceptive alignment for now, a basic and pressing cybersecurity question is: If we have a backdoored model, can we throw our whole safety pipeline (SL, RLHF, red-teaming, etc) at a model and guarantee its safety? Our work shows that in some cases, we can't 2/5
2024-01-14 View on X

Seeing some confusion like: “You trained a model to do Bad Thing, why are you surprised it does Bad Thing?” The point is not that we can train models to do Bad Thing. It's that if this happens, by accident or on purpose, we don't know how to stop a model from doing Bad Thing 1/5
2024-01-14 View on X