Make-A-Video, a cleverly named new system developed by Meta's researchers, creates videos from nothing more than a text prompt. It represents a major advance in AI-generated art, and the results are fascinating in their variety, and uniformly a little disturbing.
It’s hardly surprising that text-to-video models already exist; they’re just the next logical step after text-to-image models like DALL-E, which generate stills in response to input text. A human brain can easily make the transition from a stationary image to a moving one, but this is not the case for a machine learning model.
Make-A-Video doesn't actually change much behind the scenes, as the researchers note in the paper describing it: a model that has only seen text describing images turns out to be surprisingly good at generating short videos.
The system uses the now well-established diffusion technique for image generation, which "denoises" its way from pure random noise toward an image that matches the prompt. The model was also trained on a large body of unlabeled video footage using unsupervised learning (that is, it examined the data on its own, with little to no strong guidance from humans).
From the first, it learned how to produce a photorealistic image; from the second, how a video is composed of sequential frames. Remarkably, it can combine the two effectively without ever being explicitly trained on how they should fit together.
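The text-conditioned denoising loop described above can be illustrated with a toy sketch. To be clear, everything here is a stand-in: `toy_text_embedding` and `toy_denoiser` are hypothetical placeholders for the learned text encoder and denoising network, not Make-A-Video's actual components. The sketch only shows the general shape of diffusion sampling: start from random noise, repeatedly subtract the model's predicted noise, and end at a sample determined by the prompt.

```python
import numpy as np

def toy_text_embedding(prompt: str, dim: int = 8) -> np.ndarray:
    # Stand-in for a real text encoder: deterministically hash the
    # prompt into a fixed-size conditioning vector.
    seed = sum(ord(c) for c in prompt) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

def toy_denoiser(x: np.ndarray, t: float, cond: np.ndarray) -> np.ndarray:
    # Stand-in for a learned model: "predicts the noise" in x, i.e.
    # the difference between the current sample and a target derived
    # from the text conditioning.
    target = np.tanh(cond[: x.shape[0]])
    return x - target

def sample(prompt: str, steps: int = 50, dim: int = 8, seed: int = 0) -> np.ndarray:
    """Reverse diffusion, sketched: begin with pure noise and remove a
    fraction of the predicted noise at each step, ending at a sample
    that matches the prompt's conditioning vector."""
    rng = np.random.default_rng(seed)
    cond = toy_text_embedding(prompt, dim)
    x = rng.standard_normal(dim)           # start from random noise
    for i in range(steps, 0, -1):
        t = i / steps                      # diffusion "time", 1 -> 0
        eps = toy_denoiser(x, t, cond)     # model predicts the noise
        x = x - (1.0 / steps) * eps / t    # strip out a fraction of it
    return x
```

Running `sample("a dog surfing")` walks the noise vector step by step toward the target implied by the prompt; different prompts yield different outputs because the conditioning vector differs. A real system replaces both stand-ins with large learned networks and operates on image (or video-frame) tensors instead of an 8-dimensional vector.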
"Make-A-Video sets the new state of the art in text-to-video generation, as determined by both qualitative and quantitative measures," the researchers write. That covers the video's resolution and frame rate as well as its faithfulness to the text and its overall quality.
It's hard to argue otherwise. Earlier text-to-video systems took a different approach and produced mixed but promising results; Make-A-Video blows them out of the water, achieving roughly the fidelity that still-image generators showed, say, 18 months ago with the original DALL-E and other previous-generation systems.
But it must be said: the clips still have a peculiar quality. I'm not expecting photorealism or perfectly natural motion, but I think we can all agree the results are a little bit… nightmarish?
There's something about them that's both dreamy and terrifying. The motion is jarring, like a stop-motion film. The blurring and artifacts give everything a fuzzy, otherworldly feel, as if objects were leaking into one another. It's hard to tell where one person ends and another begins, or where one object should stop and the next start.
I don't make these points as an AI snob who demands nothing less than photorealistic 4K imagery. It fascinates me just how strange and unsettling these videos are, despite their apparent realism. That they can be generated so quickly, for arbitrary prompts, is astonishing, and things will only improve from here. Even so, it's hard to put your finger on it: even the most realistic-looking generated clips have a certain surreal quality.
Like image generators, Make-A-Video can also take images as input, producing new images or short videos with similar characteristics. The results are slightly less unsettling.
The improvement over the previous state of the art is remarkable, and the team deserves credit for it. The system isn't open to the public just yet, but you can sign up to be among the first to get access when it becomes available.