DeepMind says video models like Veo 3 could become general-purpose foundation models for vision, like LLMs for text, using zero-shot “chain-of-frames” reasoning
Video models are zero-shot learners and reasoners. Fascinating new paper from Google DeepMind which makes …
Could video models be the path to general visual intelligence? In our new paper, we show that Veo3 has emergent zero-shot capabilities, solving complex tasks across the vision stack. Project page: https://video-zero-shot.github.io/ Paper: https://arxiv.org/... 🧵👇🏻
Are we experiencing a ‘GPT moment’ in vision? Super excited to demonstrate the generality with which current video models can solve tasks from simple perception to visual reasoning! 🌐 https://video-zero-shot.github.io/
Made some notes on the new DeepMind paper “Video models are zero-shot learners and reasoners” - it makes a convincing case that generative video models are to vision problems what LLMs were to NLP problems: single models that can solve a wide array of challenges simonwillison.net…