Google launches RT-2, or Robotics Transformer 2, a “vision-language-action” model trained on text and images from the web that can output robotic actions
Our sneak peek into Google's new robotics model, RT-2, which melds artificial intelligence technology with robots.
New York Times · Kevin Roose
Related Coverage
- Light-Speed Advances In Robotics: Doing The Dishes And More Forbes · John Werner
- Google's New AI Model Controls Robots PCMag · Emily Price
- Google's DeepMind RT-2 AI Model Will Help Robots Serve Humans Like R2D2 HotHardware · Tim Sweezy
- Google Deepmind's latest AI model RT-2 “can speak robot” The Decoder · Matthias Bastian
- Google's new RT-2 AI model helps robots interpret visual and language patterns Business Today · Pranav Dixit
- Is Robot Control Going Natural with AI Tech? Google's RT-2 Promises So! Cryptopolitan · Aamir Sheikh
- Google's New AI Tech Lets You Command Robots to Throw Away Your Trash CNET · Imad Khan
- The robots we were afraid of are already here The Economic Times
- Google's RT-2 AI model brings us one step closer to WALL-E Ars Technica · Benj Edwards
- 5 questions for Leroy Hood Politico · Derek Robertson
- Google's Latest Robots Are Getting Smarter, Aided by A.I. Language Models BigTechWire · Surur
- RT-2: New model translates vision and language into action DeepMind Blog
- Speaking robot: Our new AI model translates vision and language into robotic actions The Keyword · Vincent Vanhoucke
- Google DeepMind's new RT-2 system enables robots to perform novel tasks ZDNet · Maria Diaz
- Google's RT-2 model helps robots to more easily perform actions in new situations Neowin · Paul Hill
- RT-2: Vision-Language-Action Models RT-2 on GitHub
- LLM + images = VLM (Visual Language Model) — VLM + robotics = VLA (Visual Language Action [model]) … Nemunas A.
Discussion
-
@kboughida
Karim B Boughida
on x
Can you imagine those in a library helping with shelf-reading, re-shelving, and doing chats! Aided by A.I. Language Models, Google's Robots Are Getting Smart https://www.nytimes.com/...
-
@googledeepmind
@googledeepmind
on x
Across all categories, we saw increased generalisation performance compared to previous baselines, such as on RT-1 models. We also evaluated RT-2 on a number of unseen objects and environments where it could successfully adapt to new situations: https://dpmd.ai/...
-
@googledeepmind
@googledeepmind
on x
⚪ To explore RT-2's emergent capabilities in trials, we searched for tasks that require combining learnings from web data and the robot's experience. We then defined 3 skills it needed to show: 🔘 Symbol understanding 🔘 Reasoning 🔘 Human recognition https://dpmd.ai/... [image]
-
@glenngabe
Glenn Gabe
on x
Yep, here we go... LLMs plugged into robots -> Aided by A.I. Language Models, Google's Robots Are Getting Smart “Google has recently begun plugging state-of-the-art language models into its robots, giving them the equivalent of artificial brains.” https://www.nytimes.com/... [image]
-
@googledeepmind
@googledeepmind
on x
Today, we announced 𝗥𝗧-𝟮: a first of its kind vision-language-action model to control robots. 🤖 It learns from both web and robotics data and translates this knowledge into generalised instructions. Find out more: https://dpmd.ai/... [video]
-
@alex
@alex
on x
Fuck yes this is what I was hoping llms would do to robots and I was (happily) years late to the idea https://www.nytimes.com/...
-
@kimzetter
Kim Zetter
on x
A one-armed robot stood in front of a table with three plastic figurines on top of it: a lion, a whale and a dinosaur. Engineer: “Pick up the extinct animal.” The robot whirred for a moment, then its arm extended and its claw opened and descended. It grabbed the dinosaur.
-
@imordatch
Igor Mordatch
on x
Excited to finally share what we've been working on for the past while: combining vision, language, and action modalities for robot control! https://www.nytimes.com/...
-
@lesaunh
@lesaunh
on x
Too few have understood that AI advances are mapping onto robots. Yes, AI is going to impact cognitive labor first, but the robots will flood the physical labor market soon after. [image]
-
@ajddavison
Andrew Davison
on x
Large language models and web-scale data have some use in robotics as a user interface as nicely demonstrated here, but in my opinion they are not what we need to help with perception, object representation and precise planning which are the real current barriers in robotics.
-
@xiao_ted
Ted Xiao
on x
Introducing RT-2, representing the culmination of two trends: - Tokenize and train everything together: web-scale text, images, and robot data - VLMs are not just representations, big VLMs *are policies* Sounds subtle, but we'll look back on this as an inflection point! 📈
-
@hausman_k
Karol Hausman
on x
Multiple conclusions from these experiments: (1) it turns out with a little bit of robot data we can transfer semantic concepts from vision language web-scale data to robot actions (2) the best VLMs might be the most generalizable robotic controllers
-
@chelseabfinn
Chelsea Finn
on x
Vision-language ➡️ vision-language-action model By using a pre-trained VLM (e.g. PaLI-X), RT-2 enables robots to generalize to new objects & instructions RT-2 also shows basic reasoning capabilities. (e.g. “place orange in matching bowl") Paper+videos: https://robotics-transforme…
-
@svlevine
Sergey Levine
on x
Turns out that vision-language models can control robots too. The secret is to just finetune them to print out the actions (literally, as text). Really excited about our new result, the successor to RT-1. RT-2 is a pre-trained VLM: https://www.deepmind.com/... Short 🧵👇 [video]
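Levine's point about finetuning the model to "print out the actions, literally as text" is the crux of RT-2's design. A minimal sketch of that idea, assuming RT-1-style uniform discretization of each action dimension into 256 integer bins; the bounds, dimension names, and exact string format below are illustrative assumptions, not the published tokenizer:

```python
import numpy as np

# Illustrative bounds and bin count; the real tokenizer's ranges and
# vocabulary differ -- this only demonstrates the actions-as-text idea.
ACTION_DIMS = ["dx", "dy", "dz", "droll", "dpitch", "dyaw", "gripper"]
LOW, HIGH, BINS = -1.0, 1.0, 256

def action_to_text(action: np.ndarray, terminate: bool = False) -> str:
    """Discretize a continuous 7-D action into integer bins and render it
    as a plain string, so a VLM can be fine-tuned to emit it as text."""
    clipped = np.clip(action, LOW, HIGH)
    tokens = np.round((clipped - LOW) / (HIGH - LOW) * (BINS - 1)).astype(int)
    return " ".join([str(int(terminate))] + [str(t) for t in tokens])

def text_to_action(text: str) -> tuple[bool, np.ndarray]:
    """Invert the mapping: parse a generated string back into a termination
    flag and a continuous action vector for the controller."""
    fields = [int(t) for t in text.split()]
    terminate = bool(fields[0])
    action = np.array(fields[1:], dtype=float) / (BINS - 1) * (HIGH - LOW) + LOW
    return terminate, action

# A training target is just a string, e.g. "0 128 191 128 128 128 128 255",
# so robot actions and web text share one output space.
print(action_to_text(np.array([0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 1.0])))
```

Because the targets are ordinary strings, no new output head is needed: the VLM's existing text decoder doubles as the action decoder.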
-
@demishassabis
Demis Hassabis
on x
Computers have long been great at complex tasks like analysing data, but not so great at simple tasks like recognizing & moving objects. With RT-2, we're bridging that gap by helping robots interpret & interact with the world and be more useful to people. https://www.nytimes.com/…
-
@hausman_k
Karol Hausman
on x
PaLM-E or GPT-4 can speak in many languages and understand images. What if they could speak robot actions? Introducing RT-2 (https://robotics-transformer2.github.io/): our new model that uses a VLM (up to 55B params) backbone and fine-tunes it to directly output robot actions! [video]
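The closed control loop this implies is short; a hypothetical sketch, where `vlm.generate`, `camera.capture`, and `robot.apply` are stand-ins for whatever fine-tuned PaLI-X/PaLM-E-style model and robot stack are actually deployed, and `text_to_action` is the inverse tokenizer sketched above:

```python
# Hypothetical closed-loop control with a fine-tuned VLM as the policy.
# None of these object interfaces are a published API.

def run_episode(vlm, camera, robot, instruction: str, max_steps: int = 100):
    """Each step: current image + instruction in, one action string out."""
    for _ in range(max_steps):
        image = camera.capture()
        # The VLM emits the action string exactly as it emits caption text.
        action_text = vlm.generate(image=image, prompt=instruction)
        terminate, action = text_to_action(action_text)
        if terminate:
            break  # the model itself signals task completion
        robot.apply(action)  # e.g. end-effector delta pose + gripper command
```

The notable design choice is that the VLM is the policy, queried once per control step, rather than a separate perception module feeding a hand-built planner.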
-
@kevinroose
Kevin Roose
on x
Google has quietly rebuilt its robotics division around LLMs — the same AIs that power Bard, ChatGPT and others. Now, if you tell a robot to “pick up the extinct animal,” it knows you're talking about a dinosaur. My column from inside the lab: https://www.nytimes.com/...
-
r/singularity
r
on reddit
Google Deepmind presents RT-2, the first vision-language-action (VLA) Robotics Transformer, and it may have drastic implications for our future.
-
r/artificial
r
on reddit
Google Deepmind presents RT-2, the first vision-language-action (VLA) Robotics Transformer, and it may have drastic implications for our future.