What happens if you take the largest Vision-Language Model and connect it to a robot that can conduct experiments and continue learning? You get PaLM-E - a multimodal model with 562 billion parameters, capable of executing verbal commands based on images from a camera and adjusting its actions accordingly.
I think the phrase Large Multimodal Models will be heard more and more in 2023.
Paper: https://palm-e.github.io/assets/palm-e.pdf
Demo: https://palm-e.github.io/#demo
Page: https://palm-e.github.io