Large Text-to-Video Models
2024-02-25 09:49

In the early hours of 16 February 2024, OpenAI announced its latest text-to-video model, Sora, on its website.
Sora is OpenAI's first AI video-generation model: given text, prompts, or images as input, it can generate high-fidelity videos up to one minute long.
With no director or professional actors involved, Sora's videos nonetheless feature highly detailed backgrounds, complex multi-angle shots, and emotionally expressive characters. The traditionally complex workflow of video production, including location scouting, casting, storyboarding and framing, camera setup, and post-production, is reduced, for Sora, to a single sentence.
Sora not only breaks through the roughly ten-second coherence limit of existing text-to-video models, but also demonstrates finer image quality and the ability to maintain consistency across multi-angle, multi-shot transitions. Most importantly, Sora captures much of the logic of the real world. In a generated video of two pirate ships sailing inside a coffee cup, the motion of the coffee follows real-world fluid dynamics; in a video of a pet cat waiting for its owner to wake up, the cat's kneading and its gentle nudge of the owner's nose match the real habits of animals.
Although Sora has not yet been opened to the public, for reasons of safety and testing, the demos OpenAI has released suggest that Sora already has a preliminary grasp of how the real world works: the "emergence" seen in large language models has been successfully replicated in a text-to-video model. In time, humans may find it hard to distinguish the real world from AI-generated virtual worlds. Beyond improving efficiency in video production, film editing, and game design, a fully data-driven digital world may already be close at hand. Jim Fan, chief scientist at NVIDIA's AI research institute, called this "the GPT-3 moment of video generation". Zhou Hongyi, chairman of 360, said that "Sora means the timeline for achieving AGI (artificial general intelligence) may shrink from 10 years to 1".
What is Sora? A new generative AI tool could transform video production and amplify disinformation risks
Vahid Pooryousef, Lonni Besançon
Late last week, OpenAI announced a new generative AI system named
Sora, which produces short videos from text prompts. While Sora is
not yet available to the public, the high quality of the sample
outputs published so far has provoked both excited and concerned
reactions.
The sample videos published by OpenAI, which the company says were
created directly by Sora without modification, show outputs from
prompts like “photorealistic closeup video of two pirate ships
battling each other as they sail inside a cup of coffee” and
“historical footage of California during the gold rush”.
At first glance, it is often hard to tell the videos are generated by AI, thanks to their high quality: realistic textures, convincing scene dynamics and camera movements, and a good level of consistency.
OpenAI chief executive Sam Altman also posted some videos to X
(formerly Twitter) generated in response to user-suggested prompts,
to demonstrate Sora’s capabilities.
How does Sora work?
Sora combines features of text and image generating tools in what
is called a “diffusion transformer model”.
Transformers are a type of neural network first introduced by
Google in 2017. They are best known for their use in large language
models such as ChatGPT and Google Gemini.
Diffusion models, on the other hand, are the foundation of many AI
image generators. They work by starting with random noise and
iterating towards a “clean” image that fits an input prompt.
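The denoising loop at the heart of a diffusion model can be sketched in a few lines. This is an illustrative toy only, not OpenAI's actual model: a real system replaces the simple update below with a learned neural network conditioned on the text prompt, and `target`, `denoise_step`, and the step count are all made-up names for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.ones((8, 8))      # stand-in for the "clean" image the prompt describes
x = rng.normal(size=(8, 8))   # start from pure Gaussian noise

def denoise_step(x, target, strength=0.1):
    """One denoising iteration: nudge the noisy sample toward the clean image.

    In a real diffusion model this direction is predicted by a trained
    network; here it is given directly, just to show the iterative idea.
    """
    return x + strength * (target - x)

for step in range(100):
    x = denoise_step(x, target)

# After enough iterations the sample is very close to the clean image.
print(float(np.abs(x - target).mean()))
```

Each pass removes a little more noise, which is why generation is iterative rather than one-shot.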
A video can be made from a sequence of such images. However, in a
video, coherence and consistency between frames are
essential.
Sora uses the transformer architecture to handle how frames relate
to one another. While transformers were initially designed to find
patterns in tokens representing text, Sora instead uses tokens
representing small patches of space and time.
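The idea of spacetime patches can be illustrated with a simple tensor operation: cut a video into small blocks spanning both a few frames and a small spatial region, and flatten each block into one token. The patch sizes and the function name below are invented for illustration; Sora's actual tokenisation details have not been published.

```python
import numpy as np

def to_spacetime_patches(video, t=2, p=4):
    """Split a (frames, height, width, channels) video into flat tokens.

    Each token covers `t` consecutive frames and a `p` x `p` spatial patch,
    so a transformer operating on these tokens can attend across both
    space and time at once.
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    return (video
            .reshape(T // t, t, H // p, p, W // p, p, C)
            .transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch dims together
            .reshape(-1, t * p * p * C))      # one row per token

video = np.zeros((8, 16, 16, 3))              # 8 frames of 16x16 RGB
tokens = to_spacetime_patches(video)
print(tokens.shape)                           # 64 tokens of 96 values each
```

Treating video this way lets the same transformer machinery that models relationships between words instead model relationships between nearby regions of space and time.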
Leading the pack
Sora is not the first text-to-video model. Earlier models include
Emu by Meta, Gen-2 by Runway, Stable Video Diffusion by Stability
AI, and recently Lumiere by Google.
Lumiere, released just a few weeks ago, claimed to produce better
video than its predecessors. But Sora appears to be more powerful
than Lumiere in at least some respects.
Sora can generate videos with a resolution of up to 1920 × 1080
pixels, and in a variety of aspect ratios, while Lumiere is limited
to 512 × 512 pixels. Lumiere’s videos are around 5 seconds long,
while Sora makes videos up to 60 seconds.
Lumiere cannot make videos composed of multiple shots, while Sora
can. Sora, like other models, is also reportedly capable of
video-editing tasks such as creating videos from images or other
videos, combining elements from different videos, and extending
videos in time.
Both models generate broadly realistic videos, but may suffer from
hallucinations. Lumiere’s videos may be more easily recognised as
AI-generated. Sora’s videos look more dynamic, having more
interactions between elements.
However, in many of the example videos inconsistencies become
apparent on close inspection.
Promising applications
Video content is currently produced either by filming the real
world or by using special effects, both of which can be costly and
time consuming. If Sora becomes available at a reasonable price,
people may start using it as a prototyping software to visualise
ideas at a much lower cost.
Based on what we know of Sora’s capabilities it could even be used
to create short videos for some applications in entertainment,
advertising and education.
OpenAI’s technical paper about Sora is titled “Video generation
models as world simulators”. The paper argues that bigger versions
of video generators like Sora may be “capable simulators of the
physical and digital world, and the objects, animals and people
that live within them”.
If this is correct, future versions may have scientific
applications for physical, chemical, and even societal experiments.
For example, one might be able to test the impact of tsunamis of
different sizes on different kinds of infrastructure – and on the
physical and mental health of the people nearby.
Achieving this level of simulation is highly challenging, and some
experts say a system like Sora is fundamentally incapable of doing
it.
A complete simulator would need to calculate physical and chemical
reactions at the most detailed levels of the universe. However,
simulating a rough approximation of the world and making realistic
videos to human eyes might be within reach in the coming
years.
Risks and ethical concerns
The main concerns around tools like Sora revolve around their
societal and ethical impact. In a world already plagued by
disinformation, tools like Sora may make things worse.
It’s easy to see how the ability to generate realistic video of any
scene you can describe could be used to spread convincing fake news
or throw doubt on real footage. It may endanger public health
measures, be used to influence elections, or even burden the
justice system with potential fake evidence.
Video generators may also enable direct threats to targeted
individuals, via deepfakes – particularly pornographic ones. These
may have terrible repercussions on the lives of the affected
individuals and their families.
Beyond these concerns, there are also questions of copyright and
intellectual property. Generative AI tools require vast amounts of
data for training, and OpenAI has not revealed where Sora’s
training data came from.
Large language models and image generators have also been
criticised for this reason. In the United States, a group of famous
authors have sued OpenAI over a potential misuse of their
materials. The case argues that large language models and the
companies who use them are stealing the authors’ work to create new
content.
It is not the first time in recent memory that technology has run
ahead of the law. For instance, the question of the obligations of
social media platforms in moderating content has created heated
debate in the past couple of years – much of it revolving around
Section 230 of the US Code.
While these concerns are real, based on past experience we would
not expect them to stop the development of video-generating
technology. OpenAI says it is “taking several important safety
steps” before making Sora available to the public, including
working with experts in “misinformation, hateful content, and bias”
and “building tools to help detect misleading content”.