NVIDIA’s newest model, VideoLDM, can generate videos at resolutions up to 1280 x 2048. They achieve this by training a diffusion model in a compressed latent space, introducing a temporal dimension into that latent space, fine-tuning on encoded image sequences, and temporally aligning the diffusion model upsamplers.
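To make the "temporal dimension" idea concrete, here is a minimal sketch (not the paper’s actual code, and all module names, shapes, and the gating trick are my own illustrative assumptions): a frozen spatial layer from an image latent-diffusion model is interleaved with a newly trained temporal layer that mixes information across frames.

```python
# Illustrative sketch only: a frozen per-frame (spatial) layer wrapped with a
# trainable temporal mixing layer, as one way to add a time axis to an image
# latent-diffusion model. Shapes and names are assumptions, not VideoLDM's API.
import torch
import torch.nn as nn


class TemporalBlock(nn.Module):
    """Frozen spatial layer + trainable temporal layer over the frame axis."""

    def __init__(self, channels: int):
        super().__init__()
        # Pretrained image-model layer: processes each frame independently.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        for p in self.spatial.parameters():
            p.requires_grad = False  # keep the image prior intact
        # New temporal layer: 1D conv over the frame axis, trained on videos.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # Learned gate so training starts close to the image-only behaviour.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width) video latents
        b, t, c, h, w = x.shape
        # Spatial pass: fold frames into the batch dimension.
        s = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Temporal pass: fold pixels into the batch dimension, convolve over time.
        z = s.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        z = self.temporal(z).reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
        # alpha starts at 0, so the block initially reproduces the image model.
        return s + self.alpha * z


if __name__ == "__main__":
    latents = torch.randn(2, 8, 64, 32, 32)  # 2 clips, 8 frames, 64-ch latents
    print(TemporalBlock(64)(latents).shape)  # torch.Size([2, 8, 64, 32, 32])
```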
It is visibly better than previous models, and it looks like my prediction for this year is coming true: we are getting video models as capable as the image models from the end of last year. Read the paper here.