Sora Explained: Why Is It Another Milestone Moment for AGI?

Crypto News | 2024-04-22 05:07:07













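From text (ChatGPT) to images (DALL·E) to video (Sora), OpenAI seems to be collecting puzzle pieces one by one, trying to use visual media to break down the boundary between the virtual and the real and become something straight out of the film Ready Player One.

If Apple's Vision Pro is the hardware embodiment of Ready Player One, then an AI system that can automatically build simulated virtual worlds is its soul.

"Language models approximate the human brain; video models approximate the physical world," said Yao Fu, a PhD student at the University of Edinburgh.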













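Qi Tangquan, general manager of the AI Innovation Center at Wondershare (万兴科技), said Sora's success once again validates the idea that brute force works miracles: "Sora still follows OpenAI's Scaling Law and relies on brute force: massive data, large models, and massive compute. Underneath, Sora adopts the world-model approach already validated in gaming, autonomous driving, and robotics to build a text-to-video model, and so achieves the ability to simulate the world."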





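OpenAI highlighted one point in its technical report: "our method for turning visual data of all types into a unified representation that enables large-scale training of generative models."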
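The technical report describes this unified representation as patches, small blocks cut across both space and time that play a role similar to tokens in a language model. The sketch below only illustrates the idea and is not OpenAI's implementation. It assumes a tiny single-channel video stored as nested lists and cuts raw pixel values directly, whereas the report describes patching a compressed latent. The function name `to_spacetime_patches` and the patch sizes are hypothetical.

```python
from typing import List

Video = List[List[List[float]]]  # video[t][y][x], single channel for brevity


def to_spacetime_patches(video: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Cut a T x H x W video into non-overlapping pt x ph x pw blocks
    and flatten each block into one token-like vector."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                # Gather every value inside the current space-time block.
                patch = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                patches.append(patch)
    return patches


# Toy data: 2 frames of 4 x 4 pixels; each value encodes its own (t, y, x) position.
clip = [[[float(t * 100 + y * 10 + x) for x in range(4)] for y in range(4)] for t in range(2)]
# A still image is treated as a one-frame video, so it goes through the same function.
image = [clip[0]]

clip_tokens = to_spacetime_patches(clip, pt=2, ph=2, pw=2)
image_tokens = to_spacetime_patches(image, pt=1, ph=2, pw=2)
print(len(clip_tokens), len(clip_tokens[0]))
print(len(image_tokens), len(image_tokens[0]))
```

Because images and videos of different sizes all reduce to the same kind of flat patch sequence, a single model can in principle be trained on all of them, which is the sense in which the representation is "unified".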




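Applying its previously accumulated technology to visual models has also become an advantage for OpenAI. In training Sora for text-to-video, OpenAI brought in the language understanding capabilities of DALL·E 3 and GPT. According to OpenAI, training on top of DALL·E 3 and GPT enables Sora to generate high-quality video that accurately follows user prompts.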















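Gaming is another industry in the same position. OpenAI's technical report ends with a Minecraft gameplay video, accompanied by this sentence: "Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning 'Minecraft.'"

AI game entrepreneur Chen Xi told us, "Anyone working in games breaks into a cold sweat reading that sentence! OpenAI has laid its ambition bare." In Chen Xi's reading, that one short sentence conveys two things: Sora can control a game character, and it can render the game environment at the same time.

"Just as OpenAI says, Sora is a simulator, a game engine, an interface for converting between imagination and the real world. In future games, whatever can be put into words can be rendered on screen. Sora has now learned to build a one-minute world and can generate stable characters. Combine that with OpenAI's own GPT-5, and a purely AI-generated map covering thousands of square kilometers and populated with all kinds of creatures no longer sounds like a fantasy. Of course, whether the images can be generated in real time and whether multiplayer is supported are very practical questions. But either way, a new game paradigm is on the horizon. At the very least, using Sora to generate a 《完蛋我被美女包围了》 is no longer a problem at all," Chen Xi said.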


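Yao Fu, a PhD student at the University of Edinburgh, said: "Generative models learn the algorithm that generates the data rather than memorizing the data itself. Just as language models encode the algorithm that generates language (in your brain), video models encode the physics engine that generates the video stream. Language models can be viewed as approximating the human brain, while video models approximate the physical world."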




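"Sora's capabilities are still achieved through massive amounts of video data plus recaptioning. There is not even explicit 3D modeling, let alone physical simulation. The results it generates have reached or come close to what physical simulation can achieve, but a physics engine can do more than generate video: it provides many other elements that are essential for training robots," Jia Kui said.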


At the beginning of 2024, OpenAI threw another AI bomb at the world: the video generation model Sora. Like ChatGPT a year earlier, Sora is regarded as another milestone on the road to artificial general intelligence (AGI). "Sora means the realization of AGI will be shortened from 10 years to 1 year," predicted Zhou Hongyi, chairman of 360.

The model is a sensation not only because the videos it generates are longer and sharper, but because OpenAI has gone beyond all earlier AIGC capabilities and generated video content tied to the real physical world. Off-the-wall cyberpunk is cool, but reproducing everything in the real world with AI is far more meaningful.

To that end, OpenAI put forward a brand-new concept: the world simulator. Its official technical report positions Sora as "video generation models as world simulators" and states: "Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world." OpenAI believes Sora lays the foundation for models that can understand and simulate the real world, which it sees as an important milestone toward AGI. That alone puts it a full tier above companies on the AI video track such as Runway and Pika.

From text (ChatGPT) to images (DALL·E) to video (Sora), OpenAI seems to be collecting puzzle pieces one by one, trying to use visual media to break down the boundary between the virtual and the real and become something straight out of the film Ready Player One. If Apple's Vision Pro is the hardware embodiment of Ready Player One, then an AI system that can automatically build simulated virtual worlds is its soul. "Language models approximate the human brain; video models approximate the physical world," said Yao Fu, a PhD student at the University of Edinburgh.

OpenAI's ambition exceeds everyone's imagination, but it seems only OpenAI can pull it off. "How was Sora made?" several AI entrepreneurs sighed to Light Cone Intelligence.

OpenAI's newly released Sora model kicked open the door of the AI video race in 2024 and drew a clear dividing line with the old world before it. In the demo videos released all at once, Light Cone Intelligence found that most of the problems AI video used to be criticized for had been solved: sharper images, more realistic output, more accurate comprehension, smoother logic, and more stable and consistent results.

But all of this is only the tip of the iceberg. From the start, OpenAI was not aiming at video but at all visual imagery. Imagery is the larger concept and video is a subset of it; other examples include the big screens scrolling on the street and the virtual scenes of game worlds. What OpenAI wants to do is use video as the entry point to cover all imagery and to simulate and understand the real world, which is the "world simulator" concept it emphasizes.

As Chen Kun, producer of the film Wonderland of Mountains and Seas, told Light Cone Intelligence: "OpenAI is showing us what it can do with video, but the real purpose is to collect people's feedback data, to explore and predict what kinds of video people want to generate. It is just like large-model training: once the tool is open, people all over the world are effectively working for it, and through continuous labeling and input its world model becomes smarter and smarter."

So AI video becomes the first stage of understanding the physical world, in which Sora's role as a video generation model is front and center. Only in the second stage can it deliver value as a world simulator.

The key to understanding Sora as a video generator is to identify where it differs from Runway and Pika. The question matters because, to some extent, it explains why Sora can crush the competition.

First, OpenAI followed the approach used to train large language models: it used large-scale visual data to train a generative model with general-purpose capability. This is completely different from the "dedicated model for a dedicated task" logic that has prevailed in text-to-video. Last year Runway announced a similar plan, which it called a general world model. The idea was essentially the same, but there was no follow-up, and this time OpenAI got there first. According to an estimate by Saining Xie, an assistant professor at New York University, Sora has about 3 billion parameters. That is insignificant next to GPT models, but it is already far beyond the scale of other companies in the field.

Qi Tangquan, general manager of the AI Innovation Center at Wondershare (万兴科技), said Sora's success once again validates the idea that brute force works miracles: "Sora still follows OpenAI's Scaling Law and relies on brute force: massive data, large models, and massive compute. Underneath, Sora adopts the world-model approach already validated in gaming, autonomous driving, and robotics to build a text-to-video model, and so achieves the ability to simulate the world."

Second, Sora shows for the first time a near-seamless fusion of diffusion models and large models. An AI video is like a blockbuster film: it depends on two elements, the script and the special effects. The script corresponds to the logic of the generated video, and the special effects correspond to its visual quality. To deliver logic and visual quality, two technical paths emerged: large models and diffusion models. At the end of last year, Light Cone Intelligence predicted that the two paths would eventually converge in order to satisfy both at once. We did not expect OpenAI to solve the problem this quickly.

OpenAI highlighted one point in its technical report: "our method for turning visual data of all types into a unified representation that enables large-scale training of generative models." Specifically, OpenAI encodes every frame of video into visual patches. Each patch is analogous to a token in GPT and becomes the smallest unit of measurement for video and images, one that can be broken apart and recombined anywhere, at any time. Having found a way to unify the data, in effect unifying the weights and measures, OpenAI also found the bridge between diffusion models and large models.

Throughout the generation process, the diffusion model is still responsible for visual quality. With the addition of the large model's attention mechanism, the system gains more ability to predict and reason. This explains why Sora can generate video from an existing still image, extend an existing video, or fill in missing frames.

Video models have thus taken on a composite character, and the underlying technologies are converging. Applying its previously accumulated technology to visual models has also become an advantage for OpenAI. In training Sora for text-to-video, OpenAI brought in the language understanding capabilities of DALL·E 3 and GPT. According to OpenAI, training on top of DALL·E 3 and GPT enables Sora to generate high-quality video that accurately follows user prompts.

The result of this combination is the emergence of simulation capabilities, which form the basis of the world simulator. "We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc. They are purely phenomena of scale," OpenAI said.

The fundamental reason Sora's simulation is so striking is this: people are used to large models creating things that do not exist, but Sora can accurately grasp the operating logic of the physical world, such as how forces interact, how friction arises, or how a basketball moves. No earlier model could do this, and it is where Sora's real significance lies, beyond video generation itself. But judging from the actual finished products ...



When reprinting this article, please credit the source: https://netpsp.com/?id=57792















