What is Sora? A new generative AI tool could transform video production and amplify disinformation risks

Late last week, OpenAI announced a new generative AI system named Sora, which produces short videos from text prompts. While Sora is not yet available to the public, the high quality of the sample outputs published so far has provoked both excited and concerned reactions.

The sample videos published by OpenAI, which the company says were created directly by Sora without modification, show outputs from prompts like “photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee” and “historical footage of California during the gold rush.”

At first glance, it is often hard to tell that the videos are AI-generated: they are high quality, with convincing textures, scene dynamics and camera movements, and a good level of consistency.

OpenAI chief executive Sam Altman also posted some videos to X (formerly Twitter) generated in response to user-suggested prompts, to demonstrate Sora’s capabilities.

How does Sora work?

Sora combines features of text- and image-generating tools in what is called a “diffusion transformer model”.

Transformers are a type of neural network first introduced by Google in 2017. They are best known for their use in large language models such as ChatGPT and Google Gemini.

Diffusion models, on the other hand, are the foundation of many AI image generators. They work by starting with random noise and iterating towards a “clean” image that fits an input prompt.
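
The idea of starting from noise and iterating towards a clean image can be sketched in a few lines of Python. This is only a toy illustration, not how Sora or any real image generator is implemented: the function name `toy_denoise` is hypothetical, and the `target` list stands in for the clean image that a trained neural network would predict from the prompt. A real diffusion model never sees the target and instead learns each denoising step from data.

```python
import random


def toy_denoise(target, steps=50, seed=0):
    """Start from random noise and step towards `target`.

    Assumption: `target` plays the role of the clean image a trained
    network would predict from the prompt. Real models learn this step.
    """
    rng = random.Random(seed)
    # Begin with pure Gaussian noise, one value per pixel.
    image = [rng.gauss(0.0, 1.0) for _ in target]
    errors = []
    for step in range(steps):
        remaining = steps - step
        # Remove an equal share of the remaining difference at each step,
        # so the final step lands on the clean image.
        image = [x + (t - x) / remaining for x, t in zip(image, target)]
        errors.append(max(abs(x - t) for x, t in zip(image, target)))
    return image, errors


# A five-pixel "image" is enough to watch the noise shrink.
image, errors = toy_denoise([0.0, 0.25, 0.5, 0.75, 1.0], steps=20)
print(f"largest error after first step: {errors[0]:.3f}")
print(f"largest error after last step:  {errors[-1]:.3f}")
```

The largest per-pixel error shrinks at every iteration, which mirrors the gradual noise-to-image progression described above.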

A video can be made from a sequence of such images. However, in a video, coherence and consistency between frames are essential.

Sora uses the transformer architecture to handle how frames relate to one another. While transformers were initially designed to find patterns in tokens representing text, Sora instead uses tokens representing small patches of space and time.
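
To make "patches of space and time" concrete, the sketch below cuts a tiny video into small blocks that each span a few frames and a few pixels, then flattens each block into one token. The function name `spacetime_patches` and the patch sizes are assumptions chosen for illustration; they are not Sora's actual settings, and this plain-Python version only shows the general idea.

```python
def spacetime_patches(video, pt, ph, pw):
    """Split video[t][y][x] into flat tokens of pt frames x ph rows x pw columns."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # One token holds every pixel in this small block of space and time.
                token = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                tokens.append(token)
    return tokens


# A 4-frame, 4x4-pixel grayscale "video" in which each pixel stores its frame index.
video = [[[t for _ in range(4)] for _ in range(4)] for t in range(4)]
tokens = spacetime_patches(video, pt=2, ph=2, pw=2)
print(len(tokens), len(tokens[0]))  # 8 tokens, each holding 2*2*2 = 8 pixel values
```

Because every token mixes pixels from more than one frame, a transformer that relates tokens to one another is also relating moments in time, which is what helps keep frames consistent.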

Leading the pack

Sora is not the first text-to-video model. Earlier models include Emu by Meta, Gen-2 by Runway, Stable Video Diffusion by Stability AI, and, most recently, Lumiere by Google.

Lumiere, released just a few weeks ago, claimed to produce better video than its predecessors. But Sora appears to be more powerful than Lumiere in at least some respects.

Sora can generate videos with a resolution of up to 1920 × 1080 pixels, and in a variety of aspect ratios, while Lumiere is limited to 512 × 512 pixels. Lumiere’s videos are around five seconds long, while Sora makes videos up to 60 seconds.

Lumiere cannot make videos composed of multiple shots, while Sora can. Sora, like other models, is also reportedly capable of video-editing tasks such as creating videos from images or other videos, combining elements from different videos, and extending videos in time.

Both models generate broadly realistic videos, but may suffer from hallucinations. Lumiere’s videos may be more easily recognized as AI-generated. Sora’s videos look more dynamic, having more interactions between elements.

Diffusion models (in this case Stable Diffusion) generate images from noise over many iterations. Credit: Stable Diffusion / Benlisquare / Wikimedia, CC BY-SA

However, in many of the example videos, inconsistencies become apparent on close inspection.

