
Training AI to Play Pokemon with Reinforcement Learning

Apr 30, 2024
Right now we're looking at 20,000 games played by an AI as it explores the world of Pokémon Red. At first it starts out with no knowledge and is only able to press random buttons, but over 5 years of simulated play time it gains many abilities as it progresses. Learning from its experiences, the AI is eventually able to catch Pokémon, evolve them, and defeat a Gym Leader. At one point it even manages to exploit the game's random number generator. But perhaps more fascinating than its successes are the ways in which it fails, which are surprisingly relatable to our own human experiences. It turns out that studying the behavior of an algorithm can actually teach us a lot about ourselves.
In this video I will tell the story of the development of this AI and analyze the strategies it learns. At the end I'll go into some technical details and show you how to download and run the program yourself. Let's start by asking how it works. The AI interacts with the game in a very similar way to a human being: it takes pictures of the screen and chooses which buttons to press, and it optimizes its choices using something called reinforcement learning. With this method we don't have to explicitly tell the AI what buttons to press; we just need to give it high-level feedback on how well it is playing the game. What this means is that if we assign rewards to the goals we want it to complete, the AI can learn on its own through trial and error. So what does this look like?
The AI will start without any knowledge or skills and will only be able to press random buttons, so in order for it to learn anything useful, we need to create a gentle curriculum of rewards that will guide it toward learning the difficult objectives. Perhaps the most basic objective to begin with is to explore the map. We would like a way to reward the AI when it reaches new locations; in general, we would like to encourage curiosity. One way to do this is by keeping a record of every screen the AI has seen while playing the game. We can compare the current screen with all the screens in the record to see if there are close matches. If no match is found, this means the AI has discovered something new, so we give it a reward and add the new screen to the record (sketched below). Rewarding it for unique screens should encourage it to find new parts of the game and seek out new things. Now that we have a goal, let's start the learning process and see how it goes. At this stage the AI essentially presses random buttons just to see what happens. To accumulate experience faster, we will have it play 40 games simultaneously, each for 2 hours.
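
As a rough sketch of the idea (not the project's exact code), a novelty reward of this kind could look like the following; the frame record, the pixel threshold, and the reward value are all illustrative:

```python
import numpy as np

class NoveltyReward:
    """Rewards screens that differ from everything seen so far (illustrative sketch)."""

    def __init__(self, pixel_threshold=1, reward_value=1.0):
        # A frame counts as "new" if it differs from every stored frame
        # by more than `pixel_threshold` pixels.
        self.pixel_threshold = pixel_threshold
        self.reward_value = reward_value
        self.seen_frames = []  # record of every distinct screen observed so far

    def update(self, frame: np.ndarray) -> float:
        """Return a reward if `frame` is sufficiently different from all stored frames."""
        for seen in self.seen_frames:
            num_different_pixels = np.count_nonzero(seen != frame)
            if num_different_pixels <= self.pixel_threshold:
                return 0.0  # a close match exists: nothing new here
        # No close match: this screen is novel, so store it and pay out a reward.
        self.seen_frames.append(frame.copy())
        return self.reward_value
```

Raising pixel_threshold from a few pixels to several hundred is exactly the adjustment described a little later, and a real implementation would also need a faster matching scheme than this linear scan (for example, downsampling or hashing the frames).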
The AI will then review all the games and update itself based on the rewards it earned. If this goes well, we should see incremental improvement, and the whole process can be repeated. After a few training iterations, the AI finds its way out of the starting room noticeably faster than when its behavior was random. For some reason, though, instead of exploring Route 1, it fixates on a particular area of Pallet Town. Why does this happen? It turns out that when you're searching for novelty, it's easy to get distracted. The area it's stuck in happens to contain animated water, grass, and an NPC walking around, and this animation is actually enough to trigger the novelty reward many times. So according to our own goal, just hanging out and admiring the scenery is more rewarding than exploring the rest of the world.
This is a paradox we also find in real life: curiosity leads us to our most important discoveries, but at the same time it makes us vulnerable to distractions and gets us into trouble. As human beings we can reflect on our own sources of intrinsic motivation, but we can't easily change them. For the AI, on the other hand, it is easy for us to change them. Back in our current scenario, a reward is triggered if more than a few pixels differ from previously seen screens, but if we raise this threshold to several hundred pixels, the animations will no longer be enough to trigger the reward. This means the AI will no longer get any satisfaction from seeing them and will only develop an interest in genuinely new locations.
Note that every time we change the rewards, we restart the AI's learning from scratch, because we want the entire process to be reproducible. After this change, the AI starts exploring Route 1 and finally reaches Viridian City. This is great progress, but now there is another problem: during most battles the screen looks more or less the same, so there aren't many exploration rewards to be earned in them. This in turn causes the AI to simply run away from battles, but ultimately it can't progress without fighting. To fix this, let's add an additional reward based on the combined levels of all of its Pokémon. Now, with this new incentive to gain levels, the AI gradually starts winning battles, catches Pokémon, and levels them up. Eventually its Pokémon reach a high enough level to start evolving; it actually tends to cancel evolutions at first before finally deciding they're beneficial. However, it seems to get stuck when a Pokémon fights long enough for its default move to run out. Despite this, after many more training iterations, by version 45 there is substantial improvement. At this point we see an evolved Pidgeotto for the first time, and the AI has also finally figured out what to do when a move runs out, switching to an alternate one. This will prove very important for something that happens later. In version 60, the AI begins to enter Viridian Forest and starts its first trainer battles.
At this point it actually has enough experience to succeed in its first encounter. After this, it slowly begins to figure out how to navigate the forest. Finally, in version 65, it figures out how to get through the forest and makes its way to Pewter City. But something still isn't right: despite making a lot of progress, the AI jumps right into battles, even ones it can't win, and to make matters worse it never visits a Pokémon Center to heal, meaning that when it loses it gets sent back to the beginning of the game. We can try to fix this problem by subtracting reward when it loses a battle, but this doesn't work as we expected: instead of avoiding difficult battles, when it's about to lose, the AI simply refuses to press the button to continue, pausing indefinitely. This technically satisfies the goal, but it isn't what we intended. While studying the effects of this, however, we notice that on very rare occasions something else subtracts a large amount of reward. Once per training run, there is consistently a single game that loses a reward 10 times greater than anything we intended. Replaying that game to the moment just before this happens, we see the AI entering a Pokémon Center and wandering over to a computer in the corner. After logging in and aimlessly pressing buttons for a while, it deposits a Pokémon into the storage system and immediately loses a large amount of reward. This is because the reward is assigned as the sum of the Pokémon's levels, so depositing a level 13 Pokémon means losing 13 levels instantly. This sends such a strong negative signal that it actually causes something of a traumatic experience for the AI. It does not have emotions like humans do, but a single event with an extreme reward value can still leave a lasting impact on its behavior. In this case, losing its Pokémon just once is enough to form a negative association with the entire Pokémon Center, and the AI will avoid it completely in all future games. The root cause is that we never accounted for the unexpected scenario where total levels might decrease.
To fix this, we modify the reward function so that it only gives reward when levels increase (a minimal sketch of this change follows below). This seems to fix the issue: after restarting training, the AI starts making visits to the Pokémon Center. We finally see it starting to challenge Brock in the Pewter City gym. This battle is much more difficult than the others and poses a significant challenge, because the AI's previous experience is now working against it. Up to this point it has had great success using only its primary moves and has learned to rely on them exclusively; now it needs to use something else.
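
A minimal sketch of that fix, contrasted with the naive sum-of-levels reward (names and values are illustrative, not the project's exact code):

```python
class LevelReward:
    """Rewards level gains only, so losing access to a Pokemon (e.g. depositing it
    in the PC) no longer produces a huge negative reward (illustrative sketch)."""

    def __init__(self):
        self.max_level_sum = 0  # highest total party level seen so far

    def update(self, party_levels: list[int]) -> float:
        level_sum = sum(party_levels)
        # Naive version: reward proportional to level_sum. Depositing a level-13
        # Pokemon would then instantly remove 13 levels' worth of reward.
        # Fixed version: only pay out when the running maximum increases.
        gained = max(0, level_sum - self.max_level_sum)
        self.max_level_sum = max(self.max_level_sum, level_sum)
        return float(gained)
```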
This problem may seem trivial, but even humans struggle with the same fundamental issue: our experience and biases help us make decisions and solve problems faster, but they also limit our thinking and hinder our ability to approach a problem from a new angle. To make matters even worse, if the AI loses the battle too many times, it will actually learn to avoid it altogether. But after carefully adjusting our rewards and countless failed attempts, the AI finally gets a lucky break. In one of the games, we see it starting the gym battle with only one usable Pokémon, which has only a fraction of its hit points and moves remaining. Normally this would be a pretty terrible strategy. It doesn't know that a Water-type move will be super effective against Rock Pokémon, but because Tackle is completely exhausted, it sees that it needs to use an alternate move and switches to Bubble. And finally, after over 300 days of simulated play time and 100 iterations of learning, the AI defeats Brock for the first time. This accidental breakthrough helps it learn to choose Bubble as its default move, and in future games it wins this battle much more consistently. Honestly, this exceeds my expectations of what I thought would be possible when I started this project.
It took a lot of experimenting to get here, but I was still amazed every time I checked in on a training run and discovered that the AI had reached a new area. I would dare say it was a very rewarding experience. This seems like a reasonable stopping point, but just out of curiosity, let's see how far the AI will go if we let it continue. After the gym battle, it begins meeting trainers on Route 3 and eventually reaches the Pokémon Center at the entrance to Mount Moon. Here a man will sell you a Magikarp for $500. Magikarp isn't of any help in the short term, so hopefully the AI won't be interested in it. However, purchasing it is a super easy way to gain five levels, so the AI buys it every time; across all games, it buys a total of over 10,000 Magikarp. The AI's behavior may seem silly when its goal has become misaligned from its intended purpose, but the same thing sometimes happens to humans. For the AI to progress in the game, we set it up to find ways to increase its levels; similarly, evolution selected humans to survive as hunter-gatherers, which naturally set us up to find ways to acquire scarce food. When the AI arrives in an area with a cheap but useless Magikarp, it buys the Pokémon every time because this increases its total levels; and as humans have reached our modern age of abundance, we instinctively purchase unhealthy foods because they contain historically scarce nutrients.
Each of these indirect goals arose from very different circumstances, but when their environments changed, both became misaligned and no longer served their original purpose. The AI then begins to explore the cave inside Mount Moon. Up to this point it has battled every wild Pokémon it has encountered, but Magikarp is so ineffective that it eventually learns a special behavior just for this Pokémon: if Magikarp is sent into battle in any situation, it will try to escape no matter what, sometimes even while fighting a trainer. Despite exploring much of the cave, it seems to get stuck in this passage, possibly because the area is too visually uniform to trigger the exploration reward. Even if we extend the games to last more than 2 hours, it is able to evolve a Blastoise and a Pidgeot but never makes it past Mount Moon. We could keep trying to improve our reward function, but instead we'll use this as a stopping point and start analyzing what the AI has learned. This visualization shows how the AI navigates the map. Each arrow indicates the average direction it moved while at that particular location. A fascinating pattern is that it seems to prefer walking counterclockwise along almost all edges of the map: when standing with an edge to the right it prefers to walk up, shown in blue; when there is an edge above it prefers to walk to the left, shown in pink; when there is an edge to the left it prefers to walk down, shown in orange; and when there is an edge below it prefers to walk to the right, shown in green.
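
An arrow map like this can be computed from the logged player positions, roughly as sketched here; the trajectory format is an assumption, not the project's actual logging code:

```python
from collections import defaultdict
import numpy as np

def average_directions(trajectories):
    """Compute the average movement direction at each map tile.

    `trajectories` is assumed to be a list of games, each a list of (x, y)
    global coordinates logged at every step (a hypothetical log format).
    Returns {(x, y): unit vector of the mean step direction taken at that tile}.
    """
    sums = defaultdict(lambda: np.zeros(2))
    counts = defaultdict(int)
    for path in trajectories:
        for (x0, y0), (x1, y1) in zip(path[:-1], path[1:]):
            step = np.array([x1 - x0, y1 - y0], dtype=float)
            if not step.any():
                continue  # ignore steps where the player did not move
            sums[(x0, y0)] += step
            counts[(x0, y0)] += 1
    arrows = {}
    for tile, total in sums.items():
        mean = total / counts[tile]
        norm = np.linalg.norm(mean)
        if norm > 0:
            arrows[tile] = mean / norm  # direction in which to draw the arrow
    return arrows
```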
It's hard to know for sure why it developed this counterclockwise behavior, but one explanation is that this heuristic helps it navigate with limited memory and planning: if you walk around the perimeter of a two-dimensional space, you are guaranteed to pass all entry and exit points, so by choosing a direction and following an edge you can reach every possible junction. The AI doesn't always follow this pattern, but it seems to help it recover after going off track. There are places where its navigation fails, however, and the AI gets physically stuck. For example, at the bottom of Route 22 there is a long area with nothing useful in it. The one-way ledge at the top means there are many ways to enter the area but only one place where it's possible to leave. This acts like a fly trap, and the AI's stochastic movement causes it to get stuck and spend a disproportionate amount of time here.
We can also visualize how the AI's behavior changed over the course of its training. The first training iterations are shown in blue, the middle iterations in green, and the later iterations in red. Here we can clearly see that in the middle of its training the AI was taking one path through Viridian Forest but later switched to a different one. Another interesting behavior that developed mid-training occurs early in the game: for some reason, it started every game with the exact same sequence of button presses.
This was puzzling, particularly because the movement didn't even follow an optimal path. Looking a little further, however, something interesting happens: it throws a Poké Ball immediately on its first encounter and succeeds on the first try. What this made me realize is that although the game contains many random elements, it still behaves deterministically with respect to player input. This is something well known to speedrunners, and it seems the AI has taken advantage of the fact that it starts from the same state every game to reliably catch a Pokémon on its first try. We can also visualize other statistics to understand what happened across all the games the AI played.
On the left we can see all the Pokémon that the AI caught at least once. On the right we can see at what point in training those Pokémon were captured. The height of each region is scaled logarithmically, so Pokémon representing the thickest regions were caught thousands of times, while Pokémon representing the thinnest regions may have been caught only a handful of times. Reflecting on all of this, it's amazing that such identifiable experiences can emerge from an algorithm playing a video game. So much has happened during these tens of thousands of hours that there simply isn't enough time to discover all the interesting stories, let alone document them, so here we conclude the main part of the video. In the next section I'll delve into some technical details, explore strategies for running experiments efficiently, consider future improvements, and go over how to run this program yourself. Although I have tried to avoid it as much as possible until now, this part will inevitably contain a lot more technical terminology.

First of all, the specific reinforcement learning algorithm used to train this AI is called proximal policy optimization (PPO). It is a fairly standard modern reinforcement learning algorithm, and although it was originally designed in the context of games and robotics, it has also been used as the final step in creating large, useful language models.
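
The video doesn't show the training code itself, but a minimal PPO setup in the same spirit might look like this, assuming a hypothetical PokemonRedEnv Gymnasium wrapper around the emulator; the library (stable-baselines3) and the hyperparameters shown are illustrative, not taken from the project:

```python
# Illustrative only: assumes a hypothetical PokemonRedEnv (a gymnasium.Env that
# wraps the emulator, returns screen observations, and computes the rewards
# described above). Library choice and hyperparameters are not from the video.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv


def make_env():
    from pokemon_env import PokemonRedEnv  # hypothetical wrapper module
    # One action every 24 emulator frames, as described later in the video.
    return PokemonRedEnv(rom_path="PokemonRed.gb", frames_per_action=24)


if __name__ == "__main__":
    # 40 copies of the game collect experience in parallel, mirroring the
    # "40 games, 2 hours each" batches described earlier.
    env = make_vec_env(make_env, n_envs=40, vec_env_cls=SubprocVecEnv)
    model = PPO("CnnPolicy", env, n_steps=2048, batch_size=512, verbose=1)
    model.learn(total_timesteps=10_000_000)
    model.save("ppo_pokemon_red")
```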
However, while reinforcement learning sometimes seems magical, it can actually be an incredibly difficult tool to apply in practice. The fundamental challenge of machine learning is getting a program to do something without explicitly telling it how to do it. This means that if your model doesn't behave as expected, you need to figure out how to improve it indirectly, through its learning algorithm or training data. Online reinforcement learning adds an additional layer of indirection on top of this: the training data fed into your model is no longer stationary and under your control, but is itself a product of the model's behavior at an earlier point in time. This feedback loop leads to emergent behavior that may be impossible to predict, so here are some strategies for addressing these challenges without institutional-scale resources. First, simplifying your problem may be necessary to work around the limitations of the tools. As you may have noticed earlier in the video, the AI didn't actually start from the very beginning of the game. Here you can see an older version of the AI starting from the beginning. It has no problem choosing a Pokémon and winning its first battle; the problem arises a little later, when it needs to backtrack from Viridian City to Pallet Town, because the exploration reward gives it no incentive to return to an area it has already visited.
It might be possible to solve this by hacking in special rewards just for this scenario, but I decided that simply sidestepping it was a better use of time. The modified starting point is still very close to the original, and by giving the AI Squirtle by default it has a better chance of success later on. It's important to find a setup that allows you to iterate experiments within a reasonable amount of time and cost. In many cases the bottleneck will be simulating or operating the environment that the AI interacts with. In our case, the environment is Pokémon Red running on the Game Boy emulator PyBoy. This emulator runs at approximately 20 times normal speed on a single modern CPU core. Running many games in parallel on a large server with many cores allows us to collect interactions with the environment at more than 1,000 times normal speed, meaning each learning iteration, with a batch of 40 games each lasting 2 hours, completes in about 6 minutes. If we use a small model as the policy, its inference and training time will be negligible, especially if we use a GPU. This means we can get results from small experiments in minutes to hours, and a full training run will take a few days.
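
As a back-of-the-envelope check on those numbers (rounded, and dependent on hardware):

```python
# Rough throughput estimate using the figures quoted in the video.
games_in_parallel = 40
hours_per_game = 2
emulator_speedup = 20            # ~20x real time on a single CPU core

game_hours_per_iteration = games_in_parallel * hours_per_game   # 80 hours of play
aggregate_speedup = games_in_parallel * emulator_speedup        # ~800-1000x with enough cores
wall_clock_minutes = game_hours_per_iteration / aggregate_speedup * 60
print(f"~{wall_clock_minutes:.0f} minutes per learning iteration")  # ~6 minutes
```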
This can be quite expensive: using the cheapest possible cloud options, a single full training run costs around $50, and all the experiments run in this project combined cost a total of around $1,000. If you are not careful about how you choose and manage these resources, however, it is easy to spend many times more. Next, you will need to think carefully about how the AI interacts with the environment and how your reward function is designed. The decisions I made on this project certainly aren't all optimal, but I'll describe at least some of the considerations that went into them. For example, the AI looks at the screen and chooses an action once every 24 frames. This is enough time for the player to move one grid space in the world, so every time the AI looks at the screen it is always perfectly aligned with one of the grid cells. This makes the exploration reward much more effective because it significantly limits the number of possible screens it can see. The component that makes the decisions is called the policy and is represented by a small convolutional neural network. Notably, it is non-recurrent, which means it has no internal memory of the past.
This was done to improve training stability, convergence speed, and simplicity. So how does it make decisions without memory of the past? Well, first, the three most recent screens are stacked together to create a simple form of short-term memory. Second, some basic information about the game state is encoded as visual status bars; these show hit points, total levels, and exploration progress. A more conventional approach would be to encode these as abstract vectors injected directly into the model. I chose this method because it is interpretable by both humans and machines, which makes it much easier to know what happened when debugging recorded games. Together, these encode enough information for the model to make decisions without any other form of memory.
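
A sketch of how such an observation could be assembled; the resolution and the status-bar layout here are illustrative, not the project's exact values:

```python
from collections import deque
import numpy as np

class ObservationBuilder:
    """Stacks the three most recent screens and appends simple visual status bars
    (hit points, total levels, exploration progress). Shapes are illustrative;
    `screen` is assumed to already be a grayscale array of shape (height, width)."""

    def __init__(self, height=72, width=80, history=3):
        self.height, self.width = height, width
        self.frames = deque(maxlen=history)

    def _status_bars(self, hp_fraction, level_progress, explore_progress):
        # Each quantity in [0, 1] is drawn as a horizontal bar of filled pixels,
        # so the policy can "see" game state without any extra input vector.
        bars = np.zeros((3, self.width), dtype=np.uint8)
        for row, value in enumerate((hp_fraction, level_progress, explore_progress)):
            bars[row, : int(np.clip(value, 0, 1) * self.width)] = 255
        return bars

    def build(self, screen, hp_fraction, level_progress, explore_progress):
        self.frames.append(screen.astype(np.uint8))
        while len(self.frames) < self.frames.maxlen:
            self.frames.append(screen.astype(np.uint8))  # pad at the start of a game
        stacked = np.concatenate(list(self.frames), axis=0)           # (3*H, W)
        bars = self._status_bars(hp_fraction, level_progress, explore_progress)
        return np.concatenate([stacked, bars], axis=0)                # (3*H + 3, W)
```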
Now let's talk a little more about the reward function. Exploration and total levels were discussed earlier in the video; these were by far the largest rewards, but seven were actually used in total, and even more were tested but ultimately not used. There isn't enough time to cover them all in depth, but the general criteria for all of them were that they should broadly encourage playing well rather than focusing on a specific moment, and they should not be easy to fool. All of this information about the state of the game is obtained by reading values from the Game Boy emulator's memory. The game does not have any kind of proper API, but most of its memory is statically allocated, so variables can always be found at the same memory address.
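
For example, reading the party levels from RAM might look like this. The addresses shown are commonly cited RAM locations for Pokémon Red taken from the pret disassembly's memory map, and the read_byte callable stands in for whatever the emulator exposes (recent PyBoy versions expose RAM as pyboy.memory[addr]); treat the specifics as illustrative:

```python
# Illustrative: addresses come from the pret/pokered disassembly's RAM map and
# should be verified against your ROM version.
PARTY_SIZE_ADDR = 0xD163                 # number of Pokemon in the party
PARTY_LEVEL_ADDRS = [0xD18C, 0xD1B8, 0xD1E4, 0xD210, 0xD23C, 0xD268]

def read_party_levels(read_byte):
    """Return the level of each Pokemon currently in the party."""
    count = read_byte(PARTY_SIZE_ADDR)
    return [read_byte(addr) for addr in PARTY_LEVEL_ADDRS[:count]]

def level_sum(read_byte):
    """Total party level, the quantity the level reward is based on."""
    return sum(read_party_levels(read_byte))
```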
The pret project has done an incredible job of reverse engineering the game's source code and mapping out its memory. As a tangent, it's honestly mind-boggling that all the logic, graphics, and audio for this entire game are stored in less than 1 megabyte; that's smaller than a single photo you take with your phone. Anyway, understanding the AI's behavior is essential to reaching its full potential, and one of the best ways to do this is through visualization. So how were the visualizations in this project made? First, key information, such as player coordinates and Pokémon statistics, is recorded at every step of every game. To represent all games on a single map there's a catch: the game itself does not contain the concept of a single map with a global coordinate system. Rather, the world is divided into chunks of 256 × 256 tiles; the game tracks the player's local coordinates within a chunk and which chunk the player is currently on. So after deriving a mapping from the local chunk coordinates to global coordinates, all games can be rendered together.
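
That mapping amounts to a lookup table of per-chunk offsets; the map IDs and offsets below are placeholders to show the shape of the lookup, not the real table:

```python
# Illustrative: each map chunk gets a fixed offset in a shared global grid.
# The real table is built once by lining the maps up against each other
# (e.g. Pallet Town connects north to Route 1). IDs and offsets are placeholders.
MAP_OFFSETS = {
    0x00: (0, 0),    # Pallet Town
    0x0C: (0, -18),  # Route 1
    0x01: (0, -36),  # Viridian City
}

def to_global(map_id, local_x, local_y):
    """Map (chunk id, local coords) -> coordinates on one combined world map."""
    offset_x, offset_y = MAP_OFFSETS[map_id]
    return local_x + offset_x, local_y + offset_y
```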
The renderer itself is a fairly slow program that places sprites at the coordinates the players moved through, choosing the appropriate sprite based on the direction of movement and interpolating between tiles. To render the giant game grid, a Python script is used to generate an FFmpeg command that stitches together thousands of videos, using a fairly powerful server (a rough sketch of generating such a command is shown below). The flow visualization was created from the same player data by aggregating all the moves made on each tile. By combining all of these strategies, it is possible to train a reinforcement learning agent to play a complex game using only modest resources.
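
Generating such a command from Python might look roughly like this; the xstack filter is one way to tile clips into a grid, not necessarily the exact command the project used:

```python
# Illustrative: build and run an ffmpeg command that tiles rendered game clips
# into an N x N grid with the xstack filter.
import subprocess

def _offset(n, size):
    # xstack layout offsets are written as sums of input sizes, e.g. "w0+w0".
    return "+".join([size] * n) if n else "0"

def tile_videos(paths, grid, output="grid.mp4"):
    inputs = []
    for p in paths:
        inputs += ["-i", p]
    labels = "".join(f"[{i}:v]" for i in range(len(paths)))
    layout = "|".join(
        f"{_offset(i % grid, 'w0')}_{_offset(i // grid, 'h0')}" for i in range(len(paths))
    )
    filtergraph = f"{labels}xstack=inputs={len(paths)}:layout={layout}[v]"
    cmd = ["ffmpeg", *inputs, "-filter_complex", filtergraph, "-map", "[v]", output]
    subprocess.run(cmd, check=True)

# Example: tile four clips into a 2 x 2 grid.
# tile_videos(["g0.mp4", "g1.mp4", "g2.mp4", "g3.mp4"], grid=2)
```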
But what else could be done to further improve this process? Let's take a moment to consider how this could become easier, cheaper, and faster in the future. First, as mentioned above, the AI in this project started learning from scratch without any prior experience. In the future it would probably be possible to apply something called transfer learning. This is when a model is pre-trained on a large, broad dataset, which can then be leveraged very efficiently for new tasks. In the past this has revolutionized the fields of computer vision and natural language processing. There has been some interesting early work applying this to RL, but it hasn't fully landed yet,
partly due to the lack of large, diverse datasets for these kinds of tasks. However, it seems it should be feasible to extract a useful world model from a sufficiently large dataset. Here's a quick experiment I did using CLIP for zero-shot classification of game states. It's easy to imagine how this could be used to create reward functions for new environments without special access to their internal state. It's probably only a matter of time before large multimodal models start to have a big impact in this area.
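
A quick version of that kind of zero-shot game-state classification, using the Hugging Face transformers CLIP implementation (the checkpoint and label prompts are just examples, not necessarily what was used in the video):

```python
# Illustrative: zero-shot classification of a game screenshot with CLIP.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("screenshot.png")
labels = [
    "a pokemon battle screen",
    "walking around a town",
    "a dialogue menu",
    "inside a dark cave",
]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

for label, p in zip(labels, probs.tolist()):
    print(f"{p:.2f}  {label}")
```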
A second method of interest is directly learning environment models; a couple of notable works in this area are MuZero and Dreamer. These approaches offer a large improvement in data efficiency by learning a model of the environment itself and optimizing the policy against the learned model. I recently ran some experiments with DreamerV3 and was very impressed with the results. A third method worth mentioning is hierarchical RL, which decouples low-level control from high-level planning. This allows fine movements and long-term strategy to be handled by separate mechanisms. So those are some of the ways this technology could be improved in the future. To conclude, let's see how to run this AI on your own computer.
This is the repository linked in the video description. The first step is to download it; you can do that using git or by downloading the zip. The next step is to legally obtain your Pokémon Red ROM; you can find it using Google. It should be a 1-megabyte file ending in .gb. I already have mine here, so let's move to the root directory of the repository, copy it in, and rename it to PokemonRed.gb. The next step is optional: we can create a conda environment. Name it whatever you want, accept, and then activate it. The step after that is the same whether you created the conda environment or not.
We're just going to install the requirements from the requirements file. When that's done, we'll make sure we're in the baselines directory, and then we can run the pretrained model script, run_pretrained_interactive. It will take a few moments to get started. The game should now open, and the AI will start playing immediately. In this mode it will not do any additional learning and will only play based on the experience it already has. You can actually interact with the game at the same time the AI is playing it, so for example I can interfere with it by using the arrow keys and forcing it into this corner.
This mode is a lot of fun because you can put the AI in different scenarios to see how it handles them. If you want to take over completely, you can edit the agent_enabled text file: if we change it to "no" and save, the AI will stop performing actions and we can take full control of the emulator, and if we edit the file back to "yes", it will reactivate and continue playing. Now, if you want to train the model from scratch, you'll want a lot of CPU cores and memory, but assuming you have them, the script you want to run is run_baseline_parallel. This will start running multiple emulators without a UI, and it will take quite some time;
you may have to wait many hours or even days to see positive results. If you want to change any of the basic settings of the emulator or the games, the config found in any of the run files lets you do this, and other files in the repository let you make changes to the reward function or modify the visualizations. If you have any questions about the code, feel free to open an issue on GitHub. And with that, we're at the end. Thanks for watching. I hope you have gained some insight into reinforcement learning, or maybe even your own psychology.
I truly believe these ideas are as useful for understanding our own behavior as they are for advancing machine learning. If you want to support this work, check out the link in the description. Goodbye for now.
