
Why Does Diffusion Work Better than Auto-Regression?

Jun 03, 2024
generation steps, you would remove a random number of pixels from each input and train the neural network to predict the next corresponding pixel for each input. Additionally, you can provide the number of pixels removed as an extra input to the neural network, so it knows which pixel it is supposed to output. Now this single neural network can be used for all generation steps. In the setup just described, each training image trains the network on only a single generation step for that image. But, ideally, we would like to train it on every generation step of every image;
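The training-example construction described above can be sketched as follows. This is a hypothetical minimal version, assuming images are flattened into a fixed pixel order and masked pixels are zeroed out; the function name and masking scheme are illustrative, not from the video:

```python
import numpy as np

def make_training_example(image, rng):
    """Build one auto-regressive training example.

    Remove a random number of trailing pixels (in a fixed generation
    order) and return: the masked input, the count of removed pixels
    (given to the network so it knows which pixel to output), and the
    first removed pixel as the prediction target.
    """
    pixels = image.reshape(-1).astype(float)
    n = pixels.size
    num_removed = int(rng.integers(1, n + 1))  # remove at least one pixel
    visible = pixels.copy()
    visible[n - num_removed:] = 0.0            # mask out the removed pixels
    target = pixels[n - num_removed]           # the "next" pixel to predict
    return visible, num_removed, target
```

Each call produces a training pair for one randomly chosen generation step of one image.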
That way we get more out of our training data. If you did this naively, you would have to evaluate the neural network once for each generation step, which means far more computation. Fortunately, there are special neural network architectures, known as causal architectures, that allow you to train on all of these generation steps while evaluating the neural network only once. There are causal versions of all popular neural network architectures, such as causal convolutional neural networks and causal transformers. Causal architectures actually give slightly worse predictions, but in practice, auto-regression is almost always done with causal architectures because training is much faster.
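In a causal transformer, this "train on every step at once" property comes from masking the attention so that each position can only see earlier positions. Here is a minimal numpy sketch of that masking, under the usual scaled-dot-product formulation; it is an illustration of the idea, not any particular library's implementation:

```python
import numpy as np

def causal_attention_weights(queries, keys):
    """Attention weights where position i attends only to positions <= i.

    Because no position can see "future" pixels, a single forward pass
    yields a valid next-pixel prediction at every position, i.e. every
    generation step is trained simultaneously.
    """
    t, d = queries.shape
    scores = queries @ keys.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # strictly-upper = future
    scores[future] = -np.inf                            # forbid attending ahead
    scores -= scores.max(axis=1, keepdims=True)         # stable softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)
```

The strictly upper-triangular entries come out exactly zero, so no information leaks backward from later pixels.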
However, the generation process with causal architectures remains exactly the same. For diffusion models, you cannot use causal architectures, so you can only train each data point on a single random generation step. I described the diffusion model as predicting the slightly less noisy image from the previous step. However, it is actually better to predict the original, completely clean image at every step. The reason is that this makes the neural network's job easier: if you have it predict the noisy image of the next step, then the neural network needs to learn to generate images at all the different noise levels.
This means the model wastes some of its capacity learning to produce noisy versions of images. If, instead, you have the neural network always predict the clean image, then the model only needs to learn to generate clean images, which is all we care about. You can then take the predicted clean image and reapply the noising process to it to move on to the next step of the generation process. The catch is that in the first steps of the generation process the model only has pure noise as input, so the original clean image could have been almost anything, and predicting it directly gives you a blurry average again.
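The "reapply the noising process" step can be sketched with the standard DDPM-style noising equation, where a mixing coefficient (often written as alpha-bar) interpolates between the clean image and pure Gaussian noise. This is a generic illustration under that common parameterisation, not necessarily the exact scheme the video has in mind:

```python
import numpy as np

def add_noise(clean, alpha_bar, rng):
    """Mix a clean image with Gaussian noise.

    alpha_bar near 1 -> almost clean; alpha_bar near 0 -> almost pure
    noise. Returns both the noisy image and the noise sample used.
    """
    noise = rng.standard_normal(clean.shape)
    noisy = np.sqrt(alpha_bar) * clean + np.sqrt(1.0 - alpha_bar) * noise
    return noisy, noise
```

During generation, you would apply this to the predicted clean image with the next step's (slightly smaller) noise level, producing the input for the next denoising step.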
To avoid this, we can instead train the neural network to predict the noise that was added to the image. Once we have a predicted value for the noise, we can plug it into the noising equation to recover a prediction of the original clean image. So we are still predicting the original clean image, just indirectly. The advantage of doing it this way is that the model's output is now uncertain in the later stages of the generation process instead, since almost any noise could have been added to a nearly clean image. There, the model outputs the average of many different noise samples, which is still valid noise.
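"Plugging the predicted noise into the equation" amounts to inverting the noising formula for the clean image. Under the same DDPM-style parameterisation (noisy = sqrt(alpha_bar) * clean + sqrt(1 - alpha_bar) * noise), a minimal sketch looks like this:

```python
import numpy as np

def predict_clean(noisy, predicted_noise, alpha_bar):
    """Recover an estimate of the clean image from predicted noise.

    Inverts: noisy = sqrt(alpha_bar) * clean + sqrt(1 - alpha_bar) * noise
    """
    return (noisy - np.sqrt(1.0 - alpha_bar) * predicted_noise) / np.sqrt(alpha_bar)
```

If the noise prediction were exact, this would return the original clean image exactly; in practice it returns the model's indirect estimate of it.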
So far we've just been generating images from scratch, but most image generators actually let you provide a text prompt describing the image you want. The way this works is exactly the same: you simply give the neural network the text as additional input at each step. These models are trained on pairs of images and their corresponding text descriptions, typically extracted from the alt-text tags of images found on the Internet. This ensures that the generated image is something for which the prompt could plausibly have been written as a description. In principle, you can condition generative models on anything, not just text, as long as you can find suitable training data.
For example, here is a generative model conditioned on sketches. Finally, there is a technique that makes conditional diffusion models work better, called classifier-free guidance. During training, the model sometimes receives the text prompt as additional input and sometimes does not. In this way, the same model learns to make predictions both with and without the conditioning prompt. Then, in each step of the denoising process, the model is run twice, once with the prompt and once without it. The prediction without the prompt is subtracted from the prediction with the prompt, which cancels out the details that would be generated anyway without the prompt, leaving only the details that come from the prompt and leading to generations that follow it more closely.
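The run-twice-and-subtract step reduces to a one-line combination of the two predictions. Here is a sketch using the usual classifier-free guidance formula, with a guidance scale controlling how strongly the prompt is followed (the function name is illustrative):

```python
import numpy as np

def cfg_combine(pred_uncond, pred_cond, guidance_scale):
    """Classifier-free guidance combination of two model outputs.

    Equivalent to pred_cond + (guidance_scale - 1) * (pred_cond - pred_uncond):
    scale = 1 reproduces the conditional prediction; larger scales push the
    output further toward the prompt-specific details.
    """
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```

With a scale of 0 you get the unconditional prediction back, and with a scale of 1 the plain conditional prediction; in practice scales above 1 are used to follow the prompt more closely.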
In conclusion, generative AI, like all machine learning, simply fits curves. And that's all for this video. If you liked it, please like and subscribe. And if you have any suggestions for topics you'd like me to cover in a future video, leave a comment below.
