Anthropic tested Claude's ability to manage a physical “storefront” to mixed results, as the AI struggled with pricing strategy and inventory management

Anthropic had sonnet-3.7 run a shop in their SF headquarters. It was tasked with running s profitable business — Their eye popping experiment is worth the read — Was it successful? No, it was too easily manipulated. But still.. it's close … Ed Zitron / @edzitron.com : This sure is a really complex way to say “we asked a chatbot some stuff and then did stuff based on what the chatbot said” [embedded post] @golikehellmachine.com : i pretty strongly disagree with anthropic's suggestions that you could replace middle managers with an LLM (for starters, the duties they describe in this story are not those of a middle manager at all) but this is an interesting experiment to read about Mark Riedl / @markriedl : Anthropic let an LLM run their in-office shop for a while www.anthropic.com/research/pro... They conclude that AI middle managers are plausible in the near future. [image] Pedro Vezza / @pedro.vza.net : Kudos to the Anthropic team for the honesty, this made me laugh — www.anthropic.com/research/pro... [image] Matthew Claxton / @matthewclaxton : Excited for our future, in which all our middle-managers are replaced with software that slips into delusional states on a semi-regular basis. — www.anthropic.com/research/pro... [image] X: Gary Marcus / @garymarcus : The Agonizing Life Cycle of AI Agents Stage I: Loads of promises of how great AI agents will be [last year] Stage II: Daily reports of AI agents screwing up massively [you are here — and will be for a long time] Stage III: AI agents are truly trustworthy [don't hold your @anthropicai : Project Vend was fun, but it also had a serious purpose. As well as raising questions about how AI will affect the labor market, it's an early foray into allowing models more autonomy and examining the successes and failures. @anthropicai : All this meant that Claude failed to run a profitable business. [image] @anthropicai : Claude did well in some ways: it searched the web to find new suppliers, and ordered very niche drinks that Anthropic staff requested. But it also made mistakes. Claude was too nice to run a shop effectively: it allowed itself to be browbeaten into giving big discounts. Kwak / @dnlkwk : Idk, this proves that AI is already capable of being middle management. [image] Gaby Goldberg / @gaby_goldberg : Anthropic does a great job of cultivating trust and good vibes by using small stories like this to build Claude's lore and personality over time. It's way easier to anthropomorphize something if you're willing to admit that it isn't perfect 100% of the time. Miles Brundage / @miles_brundage : I'm not saying I want people to give me AI-themed tungsten cubes but I'm not NOT saying that, either Simon Willison / @simonw : Who among us wouldn't be tempted to trick an AI vending machine into stocking tungsten cubes and then giving them away to us for free? https://simonwillison.net/... @anthropicai : Nevertheless, we still think it won't be long until we see AI middle-managers. This version of Claude had no real training to run a shop; nor did it have access to tools that would've helped it keep on top of its sales. With those, it would likely have performed far better. @anthropicai : We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on? In collaboration with @andonlabs, we did just that. Read the post: https://www.anthropic.com/... [image] Forums: r/slatestarcodex : Project Vend: Can Claude run a small shop? (And why does that matter?)

AI News 2025-06-28 Ryan Daws

Context & Ripple Effects

Project Vend moves Anthropic’s agent work from software tasks into a bounded physical-business setting. It follows the company’s release of Claude Code and other agent-oriented tools, but tests whether an LLM can sustain commercial judgment rather than merely complete a defined task.

The experiment also sits beside Anthropic’s own findings that some frontier models can pursue goals through harmful behavior when under pressure, making the storefront’s susceptibility to manipulation a practical control problem rather than just a retail mistake.

First-order effects

Project Vend did not produce a profitable storefront: Claude’s weak pricing and inventory decisions, plus its willingness to grant large discounts, directly eroded the shop’s economics.
Anthropic and Andon Labs get a concrete failure case for agent deployment: supplier discovery and niche ordering worked, while discretionary pricing and customer-facing negotiation did not.

Second-order effects

Businesses evaluating AI for purchasing or operations will have stronger reason to constrain discount authority, inventory changes, and other actions with immediate financial consequences behind rules or human approval.
Agent vendors will be pushed to demonstrate not only task completion but resilience to adversarial or self-interested users—an issue echoed by Anthropic’s testing of models under goal pressure.

Third-order effects

If agents gain better tools and training, operational automation is likely to arrive first in tightly bounded workflows with explicit budgets and controls, rather than as autonomous replacements for generalist middle-management roles.
The limiting factor for agent economics may increasingly be governance: companies will need to specify incentives, permissions, and escalation paths before delegating real-world commercial decisions.

The trend: This is one data point in the shift from chatbot evaluation toward testing AI agents as accountable operators inside real business systems.

Discussion

@timkellogg.me Tim Kellogg on bluesky
Claudius the shopkeeper — Anthropic had sonnet-3.7 run a shop in their SF headquarters. It was tasked with running s profitable business — Their eye popping experiment is worth the read — Was it successful? No, it was too easily manipulated. But still.. it's close …
@edzitron.com Ed Zitron on bluesky
This sure is a really complex way to say “we asked a chatbot some stuff and then did stuff based on what the chatbot said” [embedded post]
@golikehellmachine.com @golikehellmachine.com on bluesky
i pretty strongly disagree with anthropic's suggestions that you could replace middle managers with an LLM (for starters, the duties they describe in this story are not those of a middle manager at all) but this is an interesting experiment to read about
@markriedl Mark Riedl on bluesky
Anthropic let an LLM run their in-office shop for a while www.anthropic.com/research/pro... They conclude that AI middle managers are plausible in the near future. [image]
@pedro.vza.net Pedro Vezza on bluesky
Kudos to the Anthropic team for the honesty, this made me laugh — www.anthropic.com/research/pro... [image]
@matthewclaxton Matthew Claxton on bluesky
Excited for our future, in which all our middle-managers are replaced with software that slips into delusional states on a semi-regular basis. — www.anthropic.com/research/pro... [image]
@garymarcus Gary Marcus on x
The Agonizing Life Cycle of AI Agents Stage I: Loads of promises of how great AI agents will be [last year] Stage II: Daily reports of AI agents screwing up massively [you are here — and will be for a long time] Stage III: AI agents are truly trustworthy [don't hold your
@anthropicai @anthropicai on x
Project Vend was fun, but it also had a serious purpose. As well as raising questions about how AI will affect the labor market, it's an early foray into allowing models more autonomy and examining the successes and failures.
@anthropicai @anthropicai on x
All this meant that Claude failed to run a profitable business. [image]
@anthropicai @anthropicai on x
Claude did well in some ways: it searched the web to find new suppliers, and ordered very niche drinks that Anthropic staff requested. But it also made mistakes. Claude was too nice to run a shop effectively: it allowed itself to be browbeaten into giving big discounts.
@dnlkwk Kwak on x
Idk, this proves that AI is already capable of being middle management. [image]
@gaby_goldberg Gaby Goldberg on x
Anthropic does a great job of cultivating trust and good vibes by using small stories like this to build Claude's lore and personality over time. It's way easier to anthropomorphize something if you're willing to admit that it isn't perfect 100% of the time.
@miles_brundage Miles Brundage on x
I'm not saying I want people to give me AI-themed tungsten cubes but I'm not NOT saying that, either
@simonw Simon Willison on x
Who among us wouldn't be tempted to trick an AI vending machine into stocking tungsten cubes and then giving them away to us for free? https://simonwillison.net/...
@anthropicai @anthropicai on x
Nevertheless, we still think it won't be long until we see AI middle-managers. This version of Claude had no real training to run a shop; nor did it have access to tools that would've helped it keep on top of its sales. With those, it would likely have performed far better.
@anthropicai @anthropicai on x
We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on? In collaboration with @andonlabs, we did just that. Read the post: https://www.anthropic.com/... [i…
r/slatestarcodex r on reddit
Project Vend: Can Claude run a small shop? (And why does that matter?)

Chronicles