Google’s new ViT-22B is by far the largest Vision Transformer model to date, with 22 billion parameters. It has achieved state-of-the-art results on numerous benchmarks, including depth estimation, image classification, and semantic segmentation. ViT-22B was trained on roughly four billion images and can serve as a general-purpose backbone for a wide range of computer vision tasks.
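
To give a rough sense of what using a Vision Transformer for classification looks like in practice, here is a minimal sketch based on Hugging Face’s transformers library. Note the assumptions: ViT-22B’s weights are not publicly available through this library, so the checkpoint name below (`google/vit-base-patch16-224`, a much smaller public ViT) and the file `example.jpg` are stand-ins purely to illustrate the workflow.

```python
# Minimal sketch: image classification with a publicly available ViT checkpoint.
# ViT-22B itself is not released here; the smaller checkpoint is a stand-in.
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg")                    # any RGB image (hypothetical path)
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits                      # one logit per ImageNet-1k class
predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])
```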

This result shows that further scaling of vision transformers can be as valuable as it has been for language models. It also suggests that future multimodal models still have plenty of room to improve, and that GPT-4 is far from the ceiling.