AI Learns To Swing Like Spiderman

Jun 23, 2024
I'm sure you've seen this a million times: it's Spider-Man strolling through the city, but this time something is very different. This Spider-Man is not human. He is an AI, and this AI started out knowing nothing about his surroundings. Yet he went from falling flat on his face to swinging through the air at over sixty miles per hour. So how did he accomplish this, and how did he learn to swing with no human assistance? Let's go back to the beginning. The first thing the AI ever sees is a state. The state could be anything: it could be a view, or some numbers, or it could be a picture of a beautiful house.
The point is that the AI receives some kind of sensory information about the environment it is in. It then takes this state and generates an action. The AI has a bunch of inputs it can interact with, and the action describes how they should be configured; as humans, we do something similar with video game controllers. Actions can be discrete or continuous: discrete actions are like pressing buttons, while continuous actions are more like joysticks or levers. For Spider-Man's AI, all actions will be continuous. Finally, the action is applied to the environment, which progresses in time and generates a value called a reward.
The reward is what motivates our AI, and it is awarded like a score in a video game: you get points for doing the right thing and lose points for doing the wrong thing. The AI's job is to figure out how to maximize the reward it gets by choosing the right actions. States, actions and rewards: this small sequence describes how a single frame, or time step, plays out for our AI. In the next time step, the environment will change and a new state will be produced, leading to a new action and reward. This process continues, producing a state, action and reward for each time step. Eventually we may reach a terminal state, which ends the loop; maybe the AI ran out of time, or was successful, or decided to go work at Joe's Pizza. The point is that the loop is broken, the total reward we have received is final, and the AI cannot choose any more actions.
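As a rough sketch, here is what that per-time-step loop might look like in code. The environment, its reward, and the placeholder policy below are all stand-ins I've invented for illustration; they are not the video's actual simulator:

```python
import numpy as np

class ToyEnv:
    """A made-up stand-in for the Spider-Man simulator (not the video's environment)."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return np.zeros(20)                       # the first state: 20 numbers describing the world
    def step(self, action):
        self.t += 1
        next_state = np.random.randn(20)          # new sensory information
        reward = float(-np.sum(action ** 2))      # made-up reward signal
        done = self.t >= 200                      # terminal state: here, simply running out of time
        return next_state, reward, done

def choose_action(state):
    # Placeholder policy: 6 continuous values, like joystick positions.
    return np.random.uniform(-1.0, 1.0, size=6)

env = ToyEnv()
state = env.reset()
total_reward, done = 0.0, False
while not done:                                   # loop until a terminal state ends the episode
    action = choose_action(state)                 # state -> action
    state, reward, done = env.step(action)        # the environment advances one time step
    total_reward += reward                        # the score the AI tries to maximize
```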
This completed sequence of states, actions and rewards forms an episode. By organizing the data this way, we can now select any time step and see its future. The reason we do this is so we can analyze the result of our actions more directly. If we isolate a specific time step, it's tempting to judge its action based on the immediate reward, but if we look into the future we can see that the reward we got was wiped out by a huge punishment. To resolve this, we can add up every future reward, creating a new number called the return.
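In code, picking a time step and summing every reward that comes after it is a one-liner. This is a minimal sketch; the reward list is invented:

```python
# rewards[t] is the reward received at time step t of one finished episode.
rewards = [1.0, 2.0, -10.0, 0.5, 3.0]    # invented numbers for illustration

def undiscounted_return(rewards, t):
    """Sum of this time step's reward and every reward after it."""
    return sum(rewards[t:])

print(undiscounted_return(rewards, 1))   # 2.0 + (-10.0) + 0.5 + 3.0 = -4.5
```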
This significantly improves the judgment of our actions, but there are still some problems. We're not really considering the value of time. Yes, we are looking into the future, but how far should we look? Do we believe that receiving twenty dollars now is the same as receiving twenty dollars in ten years? Unless you are very patient, time probably does have value to us, and there is a good reason for that: the future always carries some kind of uncertainty. This means that any reward received in the future should be worth less, since it might not happen, and uncertainty also increases with time, so our future rewards should be reduced by how much time has passed.
Let's introduce a new term: the discount factor. This is a value between 0 and 1 which represents how fast rewards decay at each time step; typically it is set between 0.9 and 0.99. To use this number, we multiply it with the reward at the first future time step. Since the discount factor is less than one, this reduces the reward slightly. The reward at the next time step is multiplied by the discount factor twice, making it even smaller, then three times for the one after that, then four times, and so on. Finally, we can take our newly reduced rewards and add them together. This creates a new value called the discounted return.
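Here is a minimal sketch of that calculation, assuming a discount factor of 0.99 and reusing the invented rewards from before:

```python
def discounted_return(rewards, t, gamma=0.99):
    """Sum future rewards, shrinking each one more the further away it is."""
    total = 0.0
    for k, r in enumerate(rewards[t:]):
        total += (gamma ** k) * r       # k steps into the future -> multiply by gamma k times
    return total

rewards = [1.0, 2.0, -10.0, 0.5, 3.0]   # same invented rewards as above
print(discounted_return(rewards, 1))    # 2.0 + 0.99*(-10.0) + 0.99**2 * 0.5 + 0.99**3 * 3.0
```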
Note how this is different from the regular return: the discounted return actually takes time into account and gives us a much more complete view of how good our actions are. As for how the actions are chosen, this is where the AI comes into play, or more specifically a neural network. Neural networks are like artificial brains, and their main job is to take an input, process it and make a decision. They achieve this by linking artificial neurons together, allowing information to flow from one place to another. Neural networks can also learn, which is achieved by strengthening or weakening the connections between neurons, changing how information flows through them. We will come back to learning later; the only thing we are really concerned with right now is how to connect this network to the environment. Our network can accept one number for each neuron in its first layer, and since our input will be 20 numbers, we will need 20 neurons. The 20 numbers we feed in are the state I mentioned before, and they contain essential information about the environment: things like Spider-Man's position, the direction of his limbs, or the deadline on his next assignment. As for the output, whatever the network spits out will become the action. The AI will have full control over both arms, so for each arm we will need a pan and a tilt angle, which gives it a 3D direction to aim, but we will also need an additional output for each arm to control the web shooters. All added up, that is six numbers, so we will need six neurons in the last layer.
Yes, there is no leg control, which is unfortunate, but we need this to be as simple as possible so the AI can learn properly. In addition to the input and output neurons, the network will also need some hidden neurons. The hidden neurons go in the middle of the network and are responsible for giving the AI its intelligence; without them, the neural network can still learn, but its capabilities will be very limited. For Spider-Man's AI, 512 hidden neurons should be sufficient. This may seem like a lot, but it's roughly a tenth of the neurons inside a jellyfish, so in reality this may be the dumbest Spider-Man who ever lived. Still, he's a very capable learner, and we're going to take advantage of that now.
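As a sketch of what such a network could look like, here is a 20-input, 512-hidden-neuron, 6-output network in PyTorch. The video only gives the sizes; the library choice, activations, and output squashing are my own assumptions:

```python
import torch
import torch.nn as nn

# Actor network: 20 state numbers in, 6 action numbers out, 512 hidden neurons.
actor = nn.Sequential(
    nn.Linear(20, 512),   # input layer: one neuron per state number
    nn.Tanh(),            # hidden activation (an assumption, not stated in the video)
    nn.Linear(512, 6),    # output layer: pan/tilt per arm + one web-shooter control per arm
    nn.Tanh(),            # squash actions into a bounded range
)

state = torch.zeros(20)   # a dummy state
action = actor(state)     # 6 continuous action values
print(action.shape)       # torch.Size([6])
```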
Everything we've talked about so far describes the foundations of a family of algorithms called deep reinforcement learning. Within this family there are a variety of different methods we could use to make the AI learn, so we will have to choose one. To get the AI learning for Spider-Man, we will use PPO, which stands for Proximal Policy Optimization. Here's a summary of how that algorithm works. Our AI, which we'll now call the actor, will interact with the environment, generating time steps and whole episodes. While the actor is performing, randomness is injected into its actions; the actor still chooses, but the randomness ensures that we explore all our options.
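One common way to inject that randomness into continuous actions is to treat the actor's output as the centre of a Gaussian and sample around it. This is a standard trick rather than something the video spells out, and the noise scale below is an assumption:

```python
import torch
from torch.distributions import Normal

mean_action = torch.zeros(6)             # what the actor "wants" to do (dummy values here)
action_std = torch.full((6,), 0.2)       # how much randomness to add (an assumption)

dist = Normal(mean_action, action_std)   # a bell curve centred on the intended action
action = dist.sample()                   # the noisy action actually sent to the environment
log_prob = dist.log_prob(action).sum()   # how likely the actor was to pick exactly this action
```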
In addition, we will have another AI known as the critic, whose job is to judge how good it thinks our actions are. The critic tries to estimate the discounted return at each time step, so it doesn't just have to figure out how the environment works, it also has to figure out how the actor will react, and the actor is also an AI. Oh, and the critic has the same brain power as the actor, so if you ever thought your job was hard, try being this guy. Once we have enough episodes and time steps, the actor and the critic will see all the data and learn from it. Because of the randomness we added, there will be a variety of different outcomes in each episode: some episodes will be good and others will be bad, just like Game of Thrones. The critic is constantly trying to guess a value that we will only discover in the future, so we can train it by calculating that value after the fact and comparing it to the critic's prediction, which tells us how wrong the critic was. We can then feed all of these error values back into the critic and, through a process called backpropagation, it will be a little less wrong the next time.
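A minimal sketch of that critic update, using the mean-squared error against the observed discounted returns as the thing backpropagation minimizes. The library, layer sizes, learning rate and dummy data are my assumptions:

```python
import torch
import torch.nn as nn

critic = nn.Sequential(                  # critic: state in, one number (predicted return) out
    nn.Linear(20, 512), nn.Tanh(),
    nn.Linear(512, 1),
)
optimizer = torch.optim.Adam(critic.parameters(), lr=3e-4)

states  = torch.randn(64, 20)            # a batch of states from collected episodes (dummy data)
targets = torch.randn(64, 1)             # the discounted returns actually observed (dummy data)

predictions = critic(states)                          # the critic's guesses
loss = nn.functional.mse_loss(predictions, targets)   # how wrong the critic is
optimizer.zero_grad()
loss.backward()                                       # backpropagation
optimizer.step()                                      # nudge the critic to be a little less wrong
```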
As for the actor, we can train it by maximizing the critic's estimate. In simple terms, we are an athlete and we want to make our coach as happy as possible by consistently doing what they want us to do. The critic is constantly trying to predict the discounted return the actor will get, but it cannot predict the randomness we add to the actions. What it ends up settling on is the average of all the possibilities, the value with the smallest error compared to the random outcomes. This slightly changes what it calculates, from the discounted return to something called the value function, which is defined as the discounted return the AI would get if it behaved normally, without randomness.
We can use the value function as a baseline for performance; it essentially describes how well the actor should perform in a given state. So if the actor manages to beat it, the randomness in its actions must have improved them. This also applies the other way around: if the actor's performance falls short of the baseline, the randomness in its actions must have made them worse. We can quantify this idea by subtracting the baseline from the discounted return we measured. This creates a new term called the advantage function, and we can use it as a guide for how to improve the actor's actions.
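In code, that subtraction really is just a subtraction; here is a tiny sketch with made-up numbers:

```python
# discounted_returns[t]: what actually happened after time step t (with randomness).
# values[t]: the critic's baseline, i.e. what "normally" should have happened.
discounted_returns = [4.0, 2.5, -1.0]    # invented numbers
values             = [3.0, 3.0,  0.5]    # invented numbers

advantages = [g - v for g, v in zip(discounted_returns, values)]
print(advantages)   # [1.0, -0.5, -1.5]: positive = better than expected, negative = worse
```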
If the advantage function is positive, then we want to encourage the actions associated with it, and if it is negative, then we want to discourage them. For every time step we have collected, we can calculate the advantage function and then multiply it with a particular fraction, which gives us the error values to feed into the actor. This fraction involves the probability that the actor selects the action, which exists thanks to the added randomness. This part is genuinely confusing and, frankly, a little too complicated for this video; if you're still interested, I've included some links in the description. With all of these values plugged into the actor, and by using backpropagation again, we can make the actor a little better at choosing the right actions.
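In the usual formulation of PPO, that fraction is the ratio between how likely the current actor is to pick the stored action and how likely the old actor was when the data was collected. A hedged sketch of the resulting (not yet clipped) objective, with dummy numbers:

```python
import torch

# Per-time-step data gathered during rollouts (dummy values for illustration):
new_log_prob = torch.tensor([-1.0, -2.0, -0.5])   # log prob of the action under the current actor
old_log_prob = torch.tensor([-1.2, -1.8, -0.5])   # log prob when the action was collected
advantages   = torch.tensor([ 1.0, -0.5,  1.5])   # from the advantage function

ratio = torch.exp(new_log_prob - old_log_prob)    # the "fraction": new probability / old probability
policy_objective = (ratio * advantages).mean()    # maximise this to encourage good actions
policy_loss = -policy_objective                   # optimisers minimise, so flip the sign
```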
Our actor and critic are now a little better. We discard all the episodes we collected and do everything again. In this new training session, the improved actor and critic will produce better episodes, which lead to new experiences we can learn from. We keep repeating this process over and over, improving the actor and the critic a little each time. Eventually, the actor and the critic will reach a point where they excel at their tasks. At this point, training is complete: we can throw away the critic, and our fully trained actor will now master the tasks we assign to it.
There is only one flaw in this process. Unfortunately, neural networks are prone to forgetting things they've learned, which could lead to a disaster similar to Peter Parker's performance problems in Spider-Man 2. To fix this problem, PPO adds an extra step to the actor's learning. Every time the actor goes through a training session, we check how different its behavior has become from the previous version. If today's AI strays too far from its old self, then we restrain it, effectively preventing it from changing any further in that direction. This simple step helps the actor retain the knowledge it gained from previous training sessions, which prevents drops in performance and allows new training sessions to be more effective. And for PPO, that is all there is to it.
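That restraint is usually implemented as a clip on the probability ratio from the previous sketch. Here is a minimal version; the epsilon value is a common default rather than something stated in the video:

```python
import torch

new_log_prob = torch.tensor([-1.0, -2.0, -0.5])   # dummy values, as in the earlier ratio sketch
old_log_prob = torch.tensor([-1.2, -1.8, -0.5])
advantages   = torch.tensor([ 1.0, -0.5,  1.5])
epsilon = 0.2                                      # how far the new actor may stray from the old one

ratio = torch.exp(new_log_prob - old_log_prob)
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)   # restrain large changes

# Take the more pessimistic of the two, so overly large updates get no extra credit.
policy_loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
```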
Now let's see what this AI can do; maybe we'll just swing around. The AI spent a total of 11 hours training and was eventually able to average around 1.2 kilometers per minute. At first, the AI made no progress at all: the webs were fired randomly, with no rhyme or reason to its actions, and it kept hitting the ground over and over again. After a few training sessions, the AI learns that firing webs forward is a great way to start; unfortunately, there is rarely any follow-up, but it's still better than nothing. Approximately thirty minutes later, it started to learn that shooting more than two webs lets it go a little further. It started to make decent progress, but it was still very sloppy. One of the keys to swinging like Spider-Man is consistency; one bad web can easily nullify several good ones.
I can also see the emergence of what I like to call the Desperado worm: occasionally the AI gets stuck on the wall and decides to sacrifice speed to survive, leading to this hilariously slow movement. The AI's progress after that began to slow down; most of the learning that took place afterward was mainly for the sake of consistency, and the remaining training hours were less about exploring and more about fine-tuning. This AI was one of my favorites, now opting for a backflip at the start for more flair and really loving this right wall. For some reason, I mean, it really loved this wall.
Now we're six hours into the training and we can really see the AI's experience starting to shine through. It now uses both walls to make its swings more consistent and much faster, and there is a new sense of calm to this AI in general; it just seems more confident in its actions. And here is the final AI, the one that dominates. It now swings through the city at very high speeds, occasionally exceeding 150 kilometers per hour. It is so good, in fact, that it doesn't even need to look where it's going or use its left hand. It has truly fulfilled its destiny. Now, I don't have another video, so I think I'll just let the AI swing for a bit longer. If you're still here, enjoy. Thank you.
