DiTs, the Architecture Behind Sora and Stable Diffusion 3: What Exactly Are They?

Published 2024-04-22

Author | Miao Zheng

Sora had barely been released when Stability AI announced Stable Diffusion 3. For anyone doing creative design with AI, it feels like two holidays arriving at once. This article is written for those users: it explains, in plain language, the two defining features of Stable Diffusion 3, the diffusion transformer model and flow matching, so you can put the model to better creative use once it ships.

First, the diffusion transformer model (Diffusion Transformers, hereafter DiTs). As the name suggests, this is a latent diffusion model for images built on the transformer architecture. If you have read the Silicon Star Pro article "Demystifying Sora: understanding video with large-language-model methods, achieving 'emergence' of the physical world", you are already ahead of the class for what follows. Like Sora, DiTs use the notion of "patches"; but because DiTs generate still images, they do not need to keep logical consistency across frames the way Sora does, so they need not produce spacetime patches spanning both time and space.

(Image generated by Stable Diffusion 3)

DiTs resemble the Vision Transformer (ViT) that stormed through computer vision four or five years ago: the image is split into multiple patches, which are embedded into a continuous vector space to form a token sequence for the transformer to process. One caveat: because DiTs have a job to do, for conditional image generation they must receive and fuse external conditioning information, such as class labels or text descriptions. This is usually achieved through extra input tokens or a cross-attention mechanism, letting the model steer the generation process according to the given condition.
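To make the patch pipeline concrete, here is a minimal NumPy sketch of ViT-style patchification and embedding. The shapes and the random projection matrix are illustrative, not the real SD3 weights:

```python
import numpy as np

def patchify(img, patch):
    # Split an (H, W, C) latent into non-overlapping patch x patch squares
    # and flatten each square into one token vector, ViT-style.
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    x = img.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
latent = rng.standard_normal((32, 32, 4))   # toy 32x32 latent, 4 channels
tokens = patchify(latent, patch=8)          # (16, 256) token sequence
W_embed = rng.standard_normal((256, 64)) * 0.02
embedded = tokens @ W_embed                 # (16, 64) transformer input
```

Each of the 16 tokens is one flattened 8x8 patch; the linear projection maps it into the transformer's working width.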

When a patch arrives inside the model, it is processed into the desired content by the DiT block, the core component of DiTs. A DiT block is a special transformer structure designed for diffusion models, able to handle both image tokens and conditioning information. (The original Chinese keeps the English word "block" here because "block" and "patch" would otherwise translate to the same word.)

(Image generated by Stable Diffusion 3)

DiT blocks come in three variants: cross-attention, adaLN, and adaLN-Zero. The cross-attention variant adds an extra multi-head cross-attention layer after the multi-head self-attention layer; it uses the conditioning information to guide image generation, making the output match the prompt better, at the cost of roughly 15% more compute.
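A single-head cross-attention sketch in NumPy shows the mechanism. All shapes and weight matrices here are hypothetical; a real DiT block uses learned multi-head attention:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, cond, Wq, Wk, Wv):
    # Queries come from the image tokens, keys/values from the condition
    # tokens, so every image token can look at the prompt embedding.
    q, k, v = x @ Wq, cond @ Wk, cond @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))       # image patch tokens
cond = rng.standard_normal((77, 64))    # e.g. text-encoder tokens
Wq, Wk, Wv = (rng.standard_normal((64, 64)) * 0.1 for _ in range(3))
out = cross_attention(x, cond, Wq, Wk, Wv)
```

The extra compute cost mentioned above comes from this additional attention layer: its score matrix scales with the product of image-token and condition-token counts.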

The LN in adaLN is layer normalization: normalizing the outputs of the units inside each network layer to reduce internal covariate shift, which improves convergence speed and performance during training. adaLN extends standard layer normalization by letting the normalization parameters adapt dynamically to the input data or additional conditioning information. Like a car's suspension, it is there to add stability and adaptability to the model.

(Image generated by Stable Diffusion 3)

Next comes a refinement built on the adaLN DiT block: besides regressing γ and β, the block also regresses dimension-wise scaling parameters α, which are applied immediately before every residual connection inside the DiT block. This variant is adaLN-Zero, and the point is to imitate the beneficial initialization strategies of residual networks, promoting effective training and optimization.
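The zero-initialization trick can be sketched like this (illustrative code; α gates the sub-layer's contribution just before the residual add):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def ada_ln_zero_block(x, cond_vec, W_mod, sublayer):
    # W_mod regresses gamma, beta and the dimension-wise gate alpha.
    # Zero-initializing W_mod makes alpha = 0, so the whole block starts
    # as the identity mapping, mimicking residual-network initialization.
    gamma, beta, alpha = np.split(cond_vec @ W_mod, 3)
    h = layer_norm(x) * (1 + gamma) + beta
    return x + alpha * sublayer(h)

d = 8
x = np.arange(4 * d, dtype=float).reshape(4, d)
cond = np.ones(16)
W_zero = np.zeros((16, 3 * d))          # the "Zero" in adaLN-Zero
out = ada_ln_zero_block(x, cond, W_zero, sublayer=lambda h: 2.0 * h)
```

Starting every block as the identity means gradients flow cleanly through the residual path from step one, which is exactly the effect the text describes.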

After passing through the DiT blocks, the token sequence is decoded into an output noise prediction and an output diagonal covariance prediction. A standard linear decoder produces both, sized to match the spatial dimensions of the input image. Finally, the decoded tokens are rearranged into their original spatial layout, yielding the predicted noise and covariance values.
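The decode-and-rearrange step can be sketched as follows (toy sizes; the real decoder weights are learned):

```python
import numpy as np

def unpatchify(tokens, grid, patch, channels):
    # Put decoded tokens back into their original spatial layout.
    gh, gw = grid
    x = tokens.reshape(gh, gw, patch, patch, channels).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * patch, gw * patch, channels)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64))          # final DiT token sequence
patch, C = 8, 4
W_dec = rng.standard_normal((64, patch * patch * 2 * C)) * 0.02
decoded = tokens @ W_dec                        # standard linear decoder
# output channels split in two: predicted noise and diagonal covariance
noise, cov = np.split(unpatchify(decoded, (4, 4), patch, 2 * C), 2, axis=-1)
```

Note the decoder outputs twice the latent's channel count per pixel, which is then split into the two predictions the text mentions.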

(Image generated by Stable Diffusion 3)

Part two: Flow Matching (hereafter FM). According to Stability AI, FM is an efficient, simulation-free method for training CNF models that allows general probability paths to supervise the CNF training process. Crucially, FM removes the barrier to scalable CNF training beyond diffusion models: one can operate on probability paths directly without a deep understanding of the diffusion process, bypassing the usual training difficulties.

CNF stands for Continuous Normalizing Flows, a probabilistic and generative modeling technique in deep learning. A CNF transforms a simple probability distribution into a complex, high-dimensional data distribution through a series of invertible, continuous transformations, typically parameterized by a neural network, so that the base random variable, after the continuous transformation, mimics the target data distribution. In plain terms, a CNF generates data somewhat like rolling dice: it draws simple randomness and reshapes it.
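As a toy illustration of what a CNF does mechanically, one can push base-distribution samples through an ODE whose vector field is given. Here the field is hand-written rather than learned, purely for illustration:

```python
import numpy as np

def integrate_cnf(x0, vector_field, steps=100):
    # Transform samples from the base distribution by Euler-integrating
    # dx/dt = v(x, t) from t = 0 to t = 1.
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * vector_field(x, i * dt)
    return x

# Hand-written field that transports any point toward a target mean of 3.
field = lambda x, t: (3.0 - x) / (1.0 - t + 1e-3)
rng = np.random.default_rng(0)
samples = integrate_cnf(rng.standard_normal(1000), field)
```

The simulation cost the next paragraph complains about is visible here: generating (and, during naive training, differentiating through) samples requires stepping the whole ODE.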

(Image generated by Stable Diffusion 3)

But CNFs demand a great deal of compute and time in practice, so Stability AI asked: could there be a method whose result is about as good as a CNF's, but with a stabler procedure and lower compute? That is where FM comes in. FM is essentially a technique for training a CNF model to fit and simulate the evolution of a given data distribution, even when we do not know the distribution's mathematical form or its generating vector field in advance. By optimizing the FM objective, the model gradually learns a vector field that generates a probability distribution close to the real data distribution.

Compared with a raw CNF, FM is best understood as an optimization method: its goal is to make the vector field produced by the CNF model match, as closely as possible, the vector field along an ideal target probability path.
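A minimal sketch of a conditional flow-matching objective along straight-line paths follows. This is one popular instantiation of the idea, not necessarily the exact objective used in Stable Diffusion 3:

```python
import numpy as np

def cfm_loss(model, x1, rng):
    # Regress the model's vector field onto the velocity of the straight
    # path x_t = (1 - t) * x0 + t * x1 between noise x0 and data x1.
    x0 = rng.standard_normal(x1.shape)        # base (noise) samples
    t = rng.uniform(size=(x1.shape[0], 1))    # random times in [0, 1]
    xt = (1 - t) * x0 + t * x1                # point on the path
    target = x1 - x0                          # the path's velocity
    return np.mean((model(xt, t) - target) ** 2)

rng = np.random.default_rng(0)
data = rng.standard_normal((256, 2)) + 5.0    # toy "data" distribution
loss = cfm_loss(lambda xt, t: np.zeros_like(xt), data, rng)  # untrained model
```

Notice there is no ODE simulation anywhere in the loss: the training signal is a simple regression target computed per sample, which is what "simulation-free" means above.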

(Image generated by Stable Diffusion 3)

Having seen the two core technical features of Stable Diffusion 3, you will notice that it is remarkably close to Sora. Both are transformer models (Stable Diffusion previously used a U-Net), both use patches, both bring epoch-making stability and optimization, and their release dates are this close; calling them blood relatives does not seem like a stretch.

But the "brothers" differ in one fundamental way: Sora is closed-source, while Stable Diffusion 3 is open-source. In fact, Midjourney and DALL·E are both closed-source; only Stable Diffusion is open. If you follow open-source AI, you will have noticed that the open-source community has been stuck for quite a while, with no obvious breakthrough, and many people have lost confidence. Stable Diffusion 2 and Stable Diffusion XL only improved the aesthetics of generated images, something Stable Diffusion 1.5 could already achieve. Seeing the revolutionary improvements in Stable Diffusion 3 may rekindle the confidence of many developers in the open-source community.

(Image generated by Stable Diffusion 3)

Here is something spicier. Stability AI's CEO Emad Mostaque said on Twitter that although Stability AI has a full 100x fewer resources in the AI field than some other companies, the Stable Diffusion 3 architecture can already accept content beyond video and images, though he cannot reveal much for now.

Images and video I can understand, but what counts as content "beyond" them? The best guess I can come up with is audio: generating an image from a piece of sound. It is puzzling, but once Stability AI releases its latest research, we will be the first to break it down.

(Image generated by Stable Diffusion 3)



