DeepMind says video models like Veo 3 could become general purpose foundation models for vision, like LLMs for text, using zero-shot “chain-of-frames” reasoning

Video models are zero-shot learners and reasoners. Fascinating new paper from Google DeepMind which makes …

Simon Willison's Weblog 2025-09-29 Simon Willison

Context & Ripple Effects

DeepMind’s video work has progressed from offering businesses access to its initial Veo model to producing longer, higher-resolution clips with Veo 2. This report extends that arc beyond generation: it argues that video models may learn visual structure and reason over it without task-specific training.

The claim arrives alongside DeepMind’s Genie 3, which generated interactive 3D worlds with short-term visual continuity, and a broader race to develop world models from video and robotics data. It matters because the competitive question shifts from video output quality to whether video-trained systems can transfer across vision tasks.

First-order effects

DeepMind positions Veo 3 as a prospective vision foundation model, making zero-shot chain-of-frames reasoning a central capability to validate rather than treating the model solely as a video generator.
Researchers and prospective users gain a concrete hypothesis to test: whether a video model’s learned temporal representations transfer reliably to visual reasoning tasks without fine-tuning.

Second-order effects

Rivals pursuing video and world models, including approaches such as Meta’s masked-video prediction model, face stronger incentives to demonstrate zero-shot transfer and temporal reasoning, not just generation or simulation quality.
If the capability holds across evaluations, demand for video data, multimodal evaluation suites, and inference systems able to process frame sequences would rise relative to tools optimized only for creating clips.

Third-order effects

The direction points toward a possible convergence of video generation, perception, and world modeling into shared visual base models; that outcome remains dependent on reproducible transfer beyond the reported reasoning setup.
As models are asked to reason from visual sequences, differentiation is likely to move toward data coverage, temporal consistency, and deployment cost, not only photorealistic video output.

The trend: Video models are being repositioned from media generators into general-purpose systems for learning and reasoning about the physical world.
