
MIT 6.S191 (2023): Deep Generative Modeling

I'm really excited about this lecture because, as Alexander presented yesterday, right now we are in this tremendous era of generative AI, and today we are going to learn the fundamentals of deep generative modeling: building systems that can not only look for patterns in data, but can also go a step further and generate brand-new data instances based on those learned patterns. This is an incredibly complex and powerful idea, and as I mentioned, it's a particular subset of deep learning that has exploded in recent years, and in this year in particular. To get started and demonstrate how powerful these algorithms are, let me show you these three different faces.
I want you to take a minute to think about which face you think is real. Raise your hand if you think it's face A -- okay, a couple of people. Face B? Many more people. Face C comes in second place. Well, the truth is that you are all wrong: all three faces are fake. These people do not exist. These images were synthesized by deep generative models trained on data of human faces and asked to produce new instances. I think this demo shows the power of these ideas and the power of this notion of generative modeling, so let's be a little more concrete about how we can formalize it.

So far in this course, we've been looking at what we call supervised learning problems: we are given data, and associated with that data is a set of labels, and our goal is to learn a function that maps that data to the labels. Now, this is a course on deep learning, so we have been dealing with functional mappings defined by deep neural networks, but that function really could be anything; neural networks are powerful, but we could use other techniques as well. In contrast, there is another class of problems in machine learning that we refer to as unsupervised learning, where we are given only unlabeled data, and our goal is to build some method that can understand the hidden, underlying structure of that data.
What this allows us to do is gain new insights into the fundamental representation of the data and, as we'll see later, it actually allows us to generate new data instances. This definition of unsupervised learning captures the types of models we're going to talk about today under the umbrella of generative modeling, which is an example of unsupervised learning tied together by one goal: we are given only samples from a training set, and we want to learn a model that represents the distribution of the data the model sees.
Generative modeling takes two general forms: first, density estimation, and second, sample generation. In density estimation, we are given some data examples and our goal is to train a model that learns the underlying probability distribution describing where the data came from. In sample generation, the idea is similar, but the focus is more on generating new instances: our goal is again to learn this model of the underlying probability distribution, but then to use that model to sample from it and generate new instances that are similar to the data we've seen, ideally falling within approximately the same distribution as the real data. In both cases, density estimation and sample generation, the underlying question is the same: our learning task is to build a model whose learned probability distribution is as close as possible to the true data distribution.
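To write that shared goal down compactly (this is my own shorthand, not an equation taken from the slides), we want a model distribution that matches the unknown data distribution:

```latex
\text{learn } \; p_{\text{model}}(x) \;\approx\; p_{\text{data}}(x)
```

Density estimation then amounts to evaluating p_model(x) for a query point x, while sample generation amounts to drawing x_new ~ p_model(x).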
Okay, with this definition and concept of generative modeling in hand, what are some ways we can put generative modeling to work in the real world for high-impact applications? Part of the reason generative models are so powerful is that they have the ability to uncover the underlying features in a data set and encode them in an efficient way. For example, consider the problem of facial detection: we are given a data set with many different faces, and without inspecting this data we may not know what the distribution of faces in this data set is with respect to features we may care about -- head pose, clothing, glasses, skin tone, hair, and so on -- and it may be the case that our training data is very, very biased towards particular features without us realizing it. Using generative models, we can identify the distributions of these underlying features in a completely automatic way, without any labels, to understand which features may be over-represented in the data and which may be under-represented. This is the focus of today's and tomorrow's software labs, which will be part of the software lab competition: developing generative models that can perform this task and using them to discover and diagnose biases that may exist within facial detection models.
Another really powerful example is outlier detection: identifying rare events. Consider the case of self-driving cars. With a self-driving car out in the real world, we want to make sure it can handle all the possible scenarios and edge cases it may encounter -- including extreme cases like a deer walking in front of the car or other rare, unexpected events -- not just the typical straight highway driving it sees most of the time. With generative models, we can use this idea of density estimation to identify rare and anomalous events within the training data, and to flag them as they occur when the model first sees them. Hopefully this paints a picture of what generative modeling is as an underlying concept, along with a couple of different ways we can deploy these ideas for powerful and impactful real-world applications.
In today's lecture we will focus on a broad class of generative models that we call latent variable models, and specifically on two subtypes of latent variable models. First things first: I have introduced this term "latent variable", but I have not yet described what it actually is. A great example -- one of my favorite examples in this entire course -- that gets at the idea of a latent variable is a little story from Plato's Republic known as the myth of the cave. In this myth, a group of prisoners is forced, as part of their punishment, to face a wall. The only things the prisoners can observe are the shadows of objects that pass in front of a fire behind them; they see only the projection of those shadows on the wall of the cave. For the prisoners, those shadows are the only things they can observe. They can measure them, they can give them names, because to them that is their reality -- but they cannot directly see the underlying objects, the true factors, that are casting those shadows. Those objects are like latent variables in machine learning: they are not directly observable, but they are the true underlying features or explanatory factors that create the observed differences in the variables we can see and measure. This is exactly the goal of generative modeling: to find ways to actually learn these hidden features, these underlying latent variables, even when we are only given observations of the observed data. So let's start by discussing a very simple generative model that attempts to do this through the idea of encoding the data input.
The first models we are going to talk about are called autoencoders. To see how an autoencoder works, we'll walk through it step by step, starting with the first step: taking some raw input data and passing it through a series of neural network layers. The output of this first step is what we call a low-dimensional latent space -- an encoded representation of those underlying features -- and our goal in training this model is to predict those features. The reason a model like this is called an encoder, or an autoencoder, is that it maps the data, x, into this vector of latent variables, z. Now, a question to consider: why might we care that the latent variable vector z lives in a low-dimensional space? Does anyone have any ideas? Okay -- the suggestion was that it is more efficient, and yes, that is the core of the answer. The point of having a low-dimensional latent space is that it is a very efficient, compact encoding of the rich, high-dimensional data we start with. As was correctly pointed out, this means we can compress the data into a small feature representation -- a vector -- that captures that richness compactly, without requiring as much memory or storage. So how do we actually train the network to learn this latent variable vector? Since we cannot explicitly observe these latent variables z, we have no training targets for them, and we need to do something smarter.
What the autoencoder does is introduce a way to decode the latent variable vector back into the original data space, trying to reconstruct the original image from that compressed, efficient latent encoding. Once again we can use a series of neural network layers -- convolutional layers, fully connected layers -- but now to map from that lower-dimensional space back up to the dimensionality of the input space. This produces a reconstructed output, which we can denote x-hat, and we train the network by minimizing the distance between the input and the reconstructed output. For example, for an image we can compare the pixel-wise difference between the input data and the reconstructed output, simply subtracting the two images and squaring that difference to capture the pixel-wise divergence between the input and the reconstruction. What I hope you notice and appreciate is that this definition of the loss does not require any labels: the only components of the loss are the original input data x and the reconstructed output x-hat. I have now simplified the diagram by abstracting those individual neural network layers into the encoder and decoder components, and again this idea of not requiring any labels gets back to unsupervised learning: what we've done is learn an encoding -- our latent variables, which we cannot observe directly -- without any explicit labels; all we started with was the raw data itself. It turns out, getting back to that earlier question and answer, that the dimensionality of the latent space has a great impact on the quality of the generated reconstructions and on how compressed that information bottleneck is. Autoencoding is a form of compression: the lower the dimensionality of the latent space, the poorer the quality of our reconstructions, but the higher the dimensionality, the less efficient the encoding. So, to summarize this first part: the idea of an autoencoder is to use this bottlenecked, compressed latent layer to force the network to learn a compact and efficient representation of the data. We do not require any labels -- this is completely unsupervised -- and in this way we can automatically encode information within the data itself to learn this latent space.
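As a concrete illustration, here is a minimal autoencoder sketch in TensorFlow/Keras. This is my own illustrative code, not the course's lab code; the 28x28 input shape, layer sizes, and latent dimensionality are arbitrary choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 2  # dimensionality of the bottleneck -- a design choice

# Encoder: maps the high-dimensional input x down to a compact latent code z.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),            # e.g. 28x28 grayscale images
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(latent_dim),                  # z: the learned latent code
])

# Decoder: maps the latent code z back up to a reconstruction x_hat.
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(28 * 28, activation="sigmoid"),
    layers.Reshape((28, 28)),
])

autoencoder = tf.keras.Sequential([encoder, decoder])

# Reconstruction loss: mean squared pixel-wise difference between x and x_hat.
# Note that no labels are needed -- the input itself serves as the target.
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=64)
```

Training simply feeds each batch in as both the input and the target, which is exactly the label-free setup described above.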
That is what "autoencoding" means: automatically encoding information within the data. Now, this is a pretty simple model, and it turns out that adding one little twist to this idea of autoencoding allows us to generate new examples that are not just reconstructions of the input data itself. This brings us to the concept of variational autoencoders, or VAEs. With the traditional autoencoder we just looked at, if we pay closer attention to the latent layer -- shown here in that salmon-orange color -- that latent layer is just a normal layer in the neural network: it is completely deterministic. What that means is that once we've trained the network and the weights are set, any time we pass in a given input and go back through the decoder from the latent layer, we will get exactly the same reconstruction; the weights aren't changing, it is deterministic. In contrast, variational autoencoders introduce an element of randomness -- a probabilistic twist on this idea of autoencoding. What this will allow us to do is generate new images or new data instances that are similar to the input data but are not forced to be strict reconstructions of it. In practice, with the variational autoencoder we replace that single deterministic latent layer with a stochastic sampling operation: instead of learning the latent variables directly, for each latent variable we define a mean and a standard deviation that capture a probability distribution over that latent variable. What we have done is move from a single latent variable vector z to a vector of means mu and a vector of standard deviations sigma that parameterize probability distributions over those latent variables. This will allow us to sample, using this element of randomness, this element of probability, to obtain a probabilistic representation of the latent space itself. As you can hopefully see, this is very, very similar to the autoencoder itself; we have just added this probabilistic twist, where we can sample in that intermediate space to obtain samples of the latent variables. Now, to go a little deeper into how we actually learn and train this: in defining the VAE, we have removed the purely deterministic nature of the network so that the encoder and decoder are now probabilistic -- the encoder computes a probability distribution of the latent variables z given the input data x, while the decoder computes a probability distribution of the data x given the latent variables z.
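A minimal sketch of what this probabilistic encoder could look like (again my own illustrative Keras code with arbitrary sizes, not the course's): the encoder now outputs two vectors, a mean and a log-variance, rather than a single latent code.

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 2

# Shared encoder trunk: input image -> intermediate features.
inputs = tf.keras.Input(shape=(28, 28))
h = layers.Flatten()(inputs)
h = layers.Dense(256, activation="relu")(h)

# Two heads: one for the means mu, one for the log-variances log(sigma^2).
# Predicting the log-variance keeps sigma positive without extra constraints.
z_mean = layers.Dense(latent_dim, name="z_mean")(h)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(h)

vae_encoder = tf.keras.Model(inputs, [z_mean, z_log_var])
```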
Now, when we get to how we actually optimize and learn the weights of the network, the first step is to define a loss function -- that is the core element of training any neural network. Our loss will be a function of the data and of the neural network weights, just as before, but now we have two components, two terms, that define our VAE loss. First, we have the reconstruction loss, as before, whose goal is to capture the difference between our input data and the reconstructed output. And now, for the VAE, we have introduced a second term to the loss, which we call the regularization term, and we will get into what this regularization term means and what it is doing. To understand it, remember that in all of our neural network training our goal is to optimize the network weights with respect to the data in order to minimize this objective loss; here the weights phi and theta define the encoder and the decoder, respectively. Let's consider these two terms in turn. First, the reconstruction loss: it is very similar to before. You can think of it as an error, a likelihood term, that effectively captures the difference between your inputs and your outputs, and again we can train this in an unsupervised way, without requiring any labels, to force the latent space and the network to learn how to effectively reconstruct the input data. The second term, the regularization term, is where things get a little more interesting, so let's go through it in a bit more detail. Because the encoder is computing a probability distribution over the latent variables, as part of regularization we want to take that inferred latent distribution and constrain it to behave nicely.
The way we do that is by placing what we call a prior on the latent distribution, which is an initial hypothesis or guess about what that latent variable space might look like. This helps the network enforce a latent space that roughly follows that prior distribution, which we denote p(z). That term D in the loss is effectively the regularization term: it captures a distance between our encoding of the latent variables and our prior hypothesis about what the structure of the latent space should look like. Over the course of training, we try to enforce that each of those latent variables adopts a probability distribution similar to that prior. A common choice when training and developing these models is to enforce that the latent variables be roughly standard normal Gaussian distributions, meaning they are centered around a mean of zero and have a standard deviation of one. What this does is encourage the encoder to place the latent variables roughly evenly around the center of the latent space, distributing the encodings smoothly, so that we don't stray too far from that smooth space -- which can happen if the network tries to cheat and memorize the data. By placing this standard normal Gaussian prior on the latent space, we can define a concrete mathematical term that captures the distance, the divergence, between our encoded latent variables and this prior: this is called the KL divergence. When our prior is a standard normal, the KL divergence takes the closed form shown on the slide.
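I am reconstructing that equation from the standard VAE formulation rather than transcribing the slide, so take the exact notation as an assumption. Together with the reconstruction term, the loss over the encoder weights phi and decoder weights theta reads:

```latex
\mathcal{L}(\phi, \theta; x)
  = \underbrace{\lVert x - \hat{x} \rVert^{2}}_{\text{reconstruction loss}}
  \;+\;
  \underbrace{D_{KL}\!\left( q_{\phi}(z \mid x) \,\Vert\, p(z) \right)}_{\text{regularization term}},
\qquad
D_{KL}\!\left( \mathcal{N}(\mu, \sigma^{2}) \,\Vert\, \mathcal{N}(0, 1) \right)
  = -\tfrac{1}{2} \sum_{j} \left( 1 + \log \sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2} \right)
```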
What I really want you to take away is the concept: smoothing things out by capturing the divergence, the difference, between the prior and the latent encoding is all this KL term is trying to do. It's a bit mathematical, I recognize that, but what I want to address next is the intuition behind this regularization operation: why we do this, and why the normal prior in particular works effectively for VAEs. So let's consider what properties we want regularization to achieve for our latent space. The first is continuity: what we mean by continuity is that if there are points that are close together in the latent space, then ideally, after decoding, we should recover two reconstructions that are similar in content -- that really do feel close together. The second key property is completeness: we don't want there to be gaps in the latent space; we want to be able to sample from the latent space and decode in a way that is smooth and meaningful. To be more concrete, let's ask what the consequences could be of not regularizing our latent space well. Without regularization, we may end up with cases where there are points that are close in the latent space but do not lead to similar decodings or reconstructions, and similarly we may have points that do not lead to meaningful reconstructions at all -- they are encoded somewhere, but we cannot decode them into anything sensible. Regularization allows us to realize both properties: points that end up close together in the latent space are reconstructed similarly, and they are reconstructed meaningfully. Continuing with the example on the slide -- these shapes of different colors that we are trying to encode in some lower-dimensional space -- regularization lets us achieve this by minimizing that regularization term. It is not enough to employ the reconstruction loss alone: without regularization, simply encoding and reconstructing does not guarantee continuity and completeness, because we could end up with narrow, spiky latent distributions with discontinuities and disparate means scattered across the latent space. We overcome these problems by regularizing the mean and variance of the encoded latent distributions according to the normal prior. What this does is make the learned distributions of those latent variables effectively overlap in the latent space, because everything is regularized to have, according to this prior, a mean of zero and a standard deviation of one; that centers the means and regularizes the variances for each of those independent latent variable distributions. The net effect of this regularization is that we can achieve continuity and completeness in the latent space: points and distances that are close should correspond to similar reconstructions. So hopefully this gets at some of the intuition behind the VAE, behind the idea of regularization, and behind imposing the structured normal prior on the latent space. With this in hand -- the two components of our loss function, reconstructing the inputs and regularizing the learned latent space to encourage continuity and completeness -- we can now define a forward pass through the network from an input example, decoding and sampling the latent variables to generate new examples.
Our last critical step is how the actual backpropagation training algorithm is defined and how we carry it out. The catch with VAEs is the notion of random sampling that we've introduced by defining probability distributions over each of the latent variables. The problem is that we cannot backpropagate directly through anything that has a sampling element, anything that has randomness in it: backpropagation requires completely deterministic nodes and deterministic layers in order to apply gradient descent successfully. The breakthrough idea that allowed VAEs to be trained completely end-to-end was reparametrization within that sampling layer, and I'll give you the key intuition for how this operation works -- it's actually quite clever. As I said, when a layer contains randomness, we cannot backpropagate through it directly. With reparametrization, we instead redefine how a latent variable is sampled: as a fixed, deterministic mean mu plus a fixed standard deviation vector sigma scaled by a random constant epsilon drawn from a standard normal distribution. The mean and standard deviation are deterministic outputs of the encoder; all of the randomness, all of the sampling, is diverted into that epsilon constant. We then scale the standard deviation by that random constant and add the mean to achieve the sampling operation over the latent variables themselves. The illustration on the slide breaks this down: the completely deterministic steps are shown in blue, and the random sampling step in orange. Originally, the latent variables themselves carried the randomness of the sampling, and that is the problem -- we cannot backpropagate, cannot train, through anything stochastic. What reparametrization allows us to do is redraw this diagram so that the sampling operation is shifted off to the side, into that epsilon constant drawn from the normal distribution.
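A minimal sketch of this reparameterized sampling step (my own code; it plugs in after the two-headed encoder sketched earlier):

```python
import tensorflow as tf

def sample_latent(z_mean, z_log_var):
    """Reparameterization trick: z = mu + sigma * epsilon.

    All of the randomness lives in epsilon ~ N(0, I); mu and sigma are
    deterministic outputs of the encoder, so gradients can flow through
    them during backpropagation.
    """
    epsilon = tf.random.normal(shape=tf.shape(z_mean))
    sigma = tf.exp(0.5 * z_log_var)   # convert log-variance to std deviation
    return z_mean + sigma * epsilon
```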
Now, when we look back at our latent variable z, it is deterministic with respect to that sampling operation, and what this means is that we can backpropagate to update our network weights completely end-to-end without having to deal with randomness or stochasticity directly inside those latent variables. This trick is really powerful, because it made it possible to train VAEs completely end-to-end with the backpropagation algorithm. Okay, at this point we've gone over the core architecture of VAEs, we've introduced the two loss terms, and we've seen how we can train the model end-to-end. Now let's consider what these latent variables are actually capturing. Because we impose this prior distribution, we can sample from the latent space and slowly perturb the value of an individual latent variable while keeping the others fixed; each time we make that adjustment, we run the VAE decoder and reconstruct the output. What you will hopefully see in this example with the face is that an individual latent variable captures something semantically informative, something meaningful: as we perturb the value of that single latent variable, the pose of the face changes, and all of this is driven by adjusting the value of one latent variable and observing how that affects the decoded reconstruction. The network is able to learn these different encoded features, these different latent variables, such that by perturbing their values individually we can interpret what those latent variables mean and what they represent. To make this more concrete, we can even compare multiple latent variables simultaneously against each other. Ideally, we want those latent features to be as independent as possible in order to get the most compact and richest representation and encoding. So here, again with the example of faces, we walk along two axes -- head pose on the x-axis and what appears to be some notion of a smile on the y-axis -- and you can see from the reconstructions that we can perturb these features individually to change the final reconstructed output. Ultimately, with the VAE, our goal is to enforce as much information as possible to be captured in that encoding. This kind of latent perturbation can be sketched in just a few lines of code, shown below.
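Here is one way such a latent traversal could look, under the assumption that `decoder` is a trained, Keras-style VAE decoder like the one sketched earlier: hold all latent dimensions fixed, sweep one of them over a range of values, and decode each point.

```python
import numpy as np

def latent_traversal(decoder, base_z, dim, values):
    """Decode a sweep over a single latent dimension.

    base_z : 1-D array, a starting latent vector (all other dims stay fixed)
    dim    : index of the latent variable to perturb
    values : e.g. np.linspace(-3.0, 3.0, 10)
    Returns one reconstruction per perturbed value.
    """
    outputs = []
    for v in values:
        z = np.array(base_z, dtype=np.float32)   # copy; other dims stay fixed
        z[dim] = v                                # perturb one latent variable
        x_hat = decoder(z[np.newaxis, :])         # decode a batch of size 1
        outputs.append(np.asarray(x_hat)[0])
    return outputs

# e.g.: frames = latent_traversal(decoder, np.zeros(2), dim=0,
#                                 values=np.linspace(-3, 3, 10))
```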
We want these latent features to be independent and ideally disentangled. It turns out there is a very clever and simple way to encourage this independence and disentanglement. Although the math on the slide may look a little intimidating, all it shows are those same two components of the loss -- the reconstruction term and the regularization term -- and the idea of disentangling the latent space that I want you to focus on came out of the concept of beta-VAEs. What beta-VAEs do is introduce a parameter, beta, which is a weighting constant: it controls how strongly the regularization term contributes to the overall VAE loss, and it turns out that by increasing the value of beta you can encourage more disentanglement and a more efficient encoding in which the latent variables are uncorrelated with each other. If you are interested in the mathematics of why a larger beta imposes this disentanglement, there are many papers in the literature with proofs and discussions of why this occurs, and we can point you in those directions. To get a sense of what this actually affects downstream, consider facial reconstruction as a task of interest. With the standard VAE -- that is, a beta of one -- you can hopefully see that the head-pose feature ends up entangled with the smile and the position of the mouth: as the head pose changes, the apparent smile or mouth position changes too. Empirically, by imposing beta values much, much larger than one, we can encourage a greater degree of separation, as written out below.
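Written out under the same notation as before, the only change is the weighting constant beta on the regularization term (beta = 1 recovers the standard VAE, while beta much greater than 1 pushes toward more disentangled latents):

```latex
\mathcal{L}_{\beta\text{-VAE}}(\phi, \theta; x)
  = \lVert x - \hat{x} \rVert^{2}
  \;+\; \beta \, D_{KL}\!\left( q_{\phi}(z \mid x) \,\Vert\, p(z) \right)
```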
With that larger beta, as only the single latent variable for head pose is perturbed, the smile and the position of the mouth remain much more constant in these images compared with the standard VAE. So that is really all of the core math, the core operations, and the core architecture of VAEs that we are going to cover in today's lecture and in this class in general. To close this section, as a final note, I want to remind you of the motivating example I presented at the beginning of this lecture: facial detection. I hope you can now see how this concept of latent variable learning and encoding can be useful for a task like facial detection, where we may want to learn the distributions of the underlying features in the data. You will get hands-on practice in the software labs building variational autoencoders that can automatically discover the underlying features of facial detection data sets and use them to understand the hidden biases that may exist in that data and in those models. And it doesn't stop there: tomorrow we'll have a very interesting guest lecture on robust and trustworthy deep learning that takes this concept a step further, showing how we can use this idea of generative models and latent variable learning not only to discover and diagnose biases, but also to mitigate some of the harmful effects of those biases in neural networks for facial detection and other applications. Okay, so to quickly summarize the key points on VAEs: they compress data into a compact encoded representation; from this representation we can generate reconstructions of the input in a completely unsupervised way;
we can train them end-to-end using the reparametrization trick; we can understand the semantic interpretation of individual latent variables by perturbing their values; and finally, we can sample from the latent space to generate brand-new examples by passing latent samples through the decoder. So far we have been considering this idea of latent variable encoding with density estimation at its core. But what if we now focus only on the quality of the generated samples, and that is the task we care most about? That is why we are going to transition to a new type of generative model called the generative adversarial network, or GAN. With GANs, our goal is really that we care more about how well we generate new instances that are similar to the existing data, which means we want to sample from a potentially very complex distribution that the model is trying to approximate. It can be extremely difficult to learn that distribution directly, because it is complex and high-dimensional, and we want a way to get around that complexity.
What GANs say is: okay, what if we start from something super, super simple -- as simple as you can get, completely random noise? Could we build a neural network architecture that can learn to generate synthetic examples from completely random noise? This is the underlying concept of GANs, where the goal is to train a generator network that learns a transformation from noise to the training data distribution, with the aim of making the generated examples as close to reality as possible. With GANs, the breakthrough idea was to interconnect two neural networks, one being a generator and the other a discriminator, and these two components are effectively at war, competing with each other. Specifically, the goal of the generator network is to look at random noise and produce an imitation of the data that is as close to real as possible; the discriminator then takes the output of the generator, as well as some examples of real data, and tries to learn a classification decision that distinguishes real from fake. In the GAN, these two components compete back and forth: the discriminator is forced to get better and better at distinguishing real from fake, while the generator tries to fool and outpace the discriminator's ability to make that classification. That's the underlying concept, but what I'm really excited about is the following example, which is one of my favorite illustrations in this class and gets at the intuition behind GANs, how they work, and the underlying concept. Let's look at a 1D example: points on a line, that's the data we're working with. The generator starts with random noise and produces some fake data that falls somewhere along this one-dimensional line. The next step is for the discriminator to see these points, and to also see some real data. The discriminator is trained to output a probability that an instance it sees is real or fake. At the beginning, before training, its predictions may not be very good, but over the course of training it will hopefully start to increase the predicted probability for the examples that are real and decrease the probability for the examples that are fake, until eventually the discriminator reaches a point where it achieves a perfect separation, a perfect classification, of real versus fake. At this point the discriminator thinks, okay, I've done my job.
Now we go back to the generator: it sees where the real data lies, and it is forced to start moving the generated fake data closer and closer to the real data. Then we can return to the discriminator, which receives these newly synthesized examples from the generator and repeats the same process -- estimating the probability that any given point is real, learning to increase the probability for the true, real examples and decrease it for the fake points -- adjusting over the course of its training. Finally, we repeat with the generator one last time: the generator keeps moving those fake points closer and closer to the real data, so that the fake data almost follows the distribution of the real data. At this point it becomes very, very hard for the discriminator to distinguish between what is real and what is fake, while the generator keeps trying to create fake data points to fool the discriminator. This is really the key concept, the underlying intuition, behind how the components of a GAN compete with each other, going back and forth between the generator and the discriminator, and in fact this intuitive picture is how the GAN is trained in practice: the generator tries to synthesize new synthetic examples to fool the discriminator, and the discriminator's goal is to take both the fake examples and the real data and try to identify the synthesized instances. What this means is that the training objectives for the generator and the discriminator have to be at odds with each other -- they are adversarial -- and that is what gives rise to the "adversarial" in "generative adversarial network". These adversarial objectives are then brought together to define what it means to reach a stable global optimum, where the generator is able to reproduce the true data distribution and thereby completely fool the discriminator.
Specifically, this can be defined mathematically in terms of a loss objective, and again, while I'm showing the math, we can break down what each of these terms reflects in terms of the core intuitive and conceptual idea that the 1D example hopefully conveyed. First, consider the perspective of the discriminator D: its goal is to maximize the probability that real data is classified as real and that generated, fake data is classified as fake. Here, G(z) is the output of the generator, D(G(z)) is the discriminator's estimate for that generated sample, and D(x), where x is the real data, is its estimate for a real instance; taken together, the discriminator wants to maximize the probability of getting both of these classification decisions right.
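In the standard formulation of this objective (a reconstruction from the usual min-max form, not a transcription of the slide; here D(x) denotes the discriminator's estimated probability that x is real, and the convention on the slide may differ), the adversarial game is:

```latex
\min_{G} \; \max_{D} \;\;
\mathbb{E}_{x \sim p_{\text{data}}}\!\left[ \log D(x) \right]
\;+\;
\mathbb{E}_{z \sim p(z)}\!\left[ \log\!\left( 1 - D\!\left( G(z) \right) \right) \right]
```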
Now, with the generator we have exactly the same terms, but note that the generator can never affect the discriminator's decisions on the real data; the only thing it controls is the data it generates. So the generator's goal is simply to minimize the probability that its generated data is identified as fake. Putting this together defines what it means for the generator to synthesize fake images that, hopefully, fool the discriminator. Ultimately, beyond the mathematics and the particulars of this definition, what I want you to take away from this section on GANs is that we have this dual, competing objective, where the generator tries to synthesize examples that fool the best possible discriminator, and in doing so, the goal is to build a network -- through this adversarial training, this adversarial competition -- that uses the generator to create new data that best mimics the real data distribution: completely new, synthetic instances.
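As a rough illustration of how this adversarial training plays out in code, here is a minimal sketch in TensorFlow. This is my own illustrative code, not the course's: tiny fully connected networks on flattened 784-dimensional samples, a noise dimension of 100, and the commonly used non-saturating binary cross-entropy losses in place of the exact min-max form above.

```python
import tensorflow as tf
from tensorflow.keras import layers

noise_dim = 100

# Generator: random noise z -> synthetic sample (here a flat 784-dim "image").
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(noise_dim,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(784, activation="sigmoid"),
])

# Discriminator: sample -> estimated probability that the sample is real.
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_batch):
    z = tf.random.normal([tf.shape(real_batch)[0], noise_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_batch = generator(z, training=True)
        d_real = discriminator(real_batch, training=True)
        d_fake = discriminator(fake_batch, training=True)
        # Discriminator: label real data 1, generated data 0.
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # Generator: try to make the discriminator output 1 on its fakes.
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss
```

Each call to `train_step` performs one round of the back-and-forth competition: the discriminator is nudged toward separating real from fake, and the generator is nudged toward fooling it.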
What this amounts to in practice is that after the training process you can look exclusively at the generator component and use it to create new data instances. All of this is done by starting from random noise and learning a model that goes from random noise to the real data distribution; effectively, what GANs are doing is learning a function that transforms a distribution of random noise into some target distribution. This mapping allows us to take one particular observation of noise in that noise space and map it to a particular output in our target data space. In turn, if we consider some other random sample of noise and feed it through the generator, it will produce a completely new instance falling somewhere else on the manifold of the true data distribution. In fact, we can interpolate and traverse trajectories in the noise space, which are then mapped to traversals and interpolations in the target data space. This is really cool, because now you can think of a starting point, a target point, and all the steps that take you between those images within the target data distribution; a small sketch of this interpolation idea follows.
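A minimal sketch of that interpolation idea, assuming `generator` and `noise_dim` refer to a trained generator like the one in the previous sketch (hypothetical helper, not part of the course code):

```python
import numpy as np

def interpolate_in_noise_space(generator, z_start, z_end, steps=10):
    """Walk along a straight line in noise space and decode each point.

    Each intermediate z is mapped by the generator to a point on the learned
    data manifold, giving a smooth traversal between two generated outputs.
    """
    outputs = []
    for alpha in np.linspace(0.0, 1.0, steps):
        z = ((1.0 - alpha) * z_start + alpha * z_end).astype(np.float32)
        outputs.append(np.asarray(generator(z[np.newaxis, :]))[0])
    return outputs

# e.g.: z_a, z_b = np.random.randn(noise_dim), np.random.randn(noise_dim)
#       frames = interpolate_in_noise_space(generator, z_a, z_b, steps=16)
```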
The generations progressively add layers throughout the training to then refine the examples generated by the generator and this is the approach that was used to generate those synthetic images of those synthetic faces that I showed at the beginning of this lecture. This idea of ​​using again is iteratively refined to produce higher resolution images. Another way we can extend this concept is to extend the Gan architecture to consider particular tasks and impose additional structure on the network users themselves. One particular idea is to say, "Okay, what if we have a particular label or some factor that we want to condition the generation on?
We call it C." and is supplied to both the generator and the discriminator, what this will allow us to achieve is a paired translation between different types of data, so for example we can have images of a street view and we can have images of the segmentation of that street view and we can build a gan that can directly translate between the street view and the segmentation. Let's make this more concrete by considering some particular examples, so what I just described was going from a targeting label to a street scene. We can also translate between a satellite view and an aerial satellite. image to what is the road map equivalent of that satellite aerial image or a particular annotation or labels of the image of a building to the actual visual realization and the visual facade of that building, we can translate between different lighting conditions from day to night. night, black and white to color contours to a color photo in all these cases and I think that, in particular, the most interesting and striking thing for me is this translation between street view and aerial view and this is used to consider, for example, if you have data from Google Maps, how can you go. between a street view of the map and the aerial image of that, finally, again, extending the same concept of translation bit between one domain to another, the idea is that of completely unpaired translation and this uses a particular Gan architecture called cyclogamma, so in this video I'm showing here that the model takes as input a bunch of images in one domain and it doesn't necessarily have to have a corresponding image in another target domain, but it is trained to try to generate examples in that domain of destination that roughly correspond to the source domain transfers the style from the source to the destination and vice versa, so this example shows the translation of images in the horse domain to the zebra domain.
The concept here is a cyclic dependence: you have two GANs that are connected to each other through a cyclic loss, transforming between one domain and the other. As with all the examples we've seen so far in this lecture, the intuition is this idea of distribution transformation. Normally, with a GAN, you go from noise to some target distribution; with the CycleGAN, you are going from one source distribution, one data manifold, to another target data manifold. This allows us to perform transformations not only on images but also on speech and audio: it turns out you can take sound waves, represent them compactly as a spectrogram image, and use a CycleGAN to translate and transform speech from one person's voice in one domain to another person's voice in another domain.
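For completeness, the cyclic loss being referred to is the cycle-consistency term from the CycleGAN formulation: with G mapping domain X to domain Y and F mapping Y back to X, translating forward and then back should approximately return the original input,

```latex
\mathcal{L}_{\text{cyc}}(G, F)
  = \mathbb{E}_{x \sim p_{\text{data}}(X)}\!\left[ \lVert F(G(x)) - x \rVert_{1} \right]
  + \mathbb{E}_{y \sim p_{\text{data}}(Y)}\!\left[ \lVert G(F(y)) - y \rVert_{1} \right]
```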
These are two independent data distributions that we defined -- maybe you can see where I'm going with this, maybe not -- but in fact that's exactly how we developed the model that synthesized the audio behind Obama's voice that we saw in yesterday's introductory lecture. What we did was train a CycleGAN to take data of Alexander's voice and transform it onto the manifold of Obama's voice, and here we can visualize what the spectrogram waveform looks like for Alexander's voice versus the Obama voice that was synthesized entirely using this CycleGAN approach. "Hi everybody, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT." Right -- basically, what we did was this: Alexander spoke that exact phrase that was played yesterday, we had the trained CycleGAN model, and we could then deploy it on that exact audio to transform it from the domain of Alexander's voice to Obama's voice, generating the synthetic audio that was played for that video clip. Okay, before I accidentally replay it, let me jump to the summary slide. Today, in this lecture, we have covered deep generative models, specifically latent variable models -- autoencoders and variational autoencoders -- where our goal is to learn a low-dimensional latent encoding of the data, as well as generative adversarial networks, where we have these competing generator and discriminator components that are trying to synthesize brand-new examples.
We've talked about these core, foundational generative methods, but it turns out, as I mentioned at the beginning of the lecture, that in this past year in particular we have seen truly tremendous advances in generative modeling, many of which have come not from these two foundational methods we described, but from a newer approach called diffusion modeling. Diffusion models are the driving tools behind the tremendous advances in generative AI that we've seen in the last year. VAEs and GANs learn these transformations, these encodings, but they are largely restricted to generating examples that are similar to the data they have seen before. Diffusion models have the ability to hallucinate, envision, and imagine completely new objects and instances that we as humans may not have seen or even thought of -- parts of the design space that are not covered by the training data. An example is this AI-generated art, which really is art, if you ask me, created using a diffusion model, and I think it raises questions not only about the limits and capabilities of these powerful models, but also about what it means to create new instances.
What are the bounds and limits of these models, and how do they work? How can we think about their advances with respect to human capabilities and human intelligence? I'm very excited that on Thursday, in lecture seven on New Frontiers in Deep Learning, we're going to go really deep into diffusion models. We will talk about their foundations, not only about applications to imaging but also about other fields where we're seeing these models start to make transformative advances, because they are really at the cutting edge -- very much the new frontier of generative AI today. Okay, with that teaser, hopefully setting the stage for lecture seven on Thursday, let me conclude by reminding everyone that we now have about an hour of scheduled open office hours for you to work on your software labs. Please come to us with any questions you may have, as well as to the TAs, who will also be here. Thank you very much.
