
Simulation Parallelization | Steam Revolution Game Devlog #9

May 23, 2024
In this video I talk about how I parallelized the simulation code in my Steam Revolution game. I measured cumulative times after running 1 million simulation ticks; this makes the math nice, so that the number of seconds taken equals the number of microseconds per simulation tick. On the map with 214 trains, parallelizing my code took the time from 43.2 seconds to 17.7 seconds on my computer, a 2.4x speedup. Accelerating the simulation through multi-threading is challenging for a few reasons. First, the duration of each simulation step is very short, so the overhead of parallelization is difficult to overcome. Each simulation step is also composed of multiple interdependent stages that further reduce the time per parallel section.
There are six parallel steps per tick, so at 43 microseconds per tick that's an average of about 7 microseconds per step. Second, the simulation is not trivially parallelizable, because the trains interact with each other and with shared resources such as track sections and industries. Third, simulation results must be deterministic and identical between the serial and parallel versions: replaying a simulation should give the same score each time to keep the game fair, and checking the server results requires that the server and client results match. I will give a detailed description of the challenges in making each part of the simulation parallel and the strategies I used to overcome them.

Although I specifically describe my solutions to my problems, they can serve as inspiration for parallelizing your own code. This video can be viewed on its own, but it builds on my previous three videos, so you may want to watch them for more context. In devlog number six I described the optimizations I made to get my single-threaded code to run 17 times faster; this parallelization is a continuation of that optimization effort. In devlog number seven I described how queues work in the game: how tracks are divided into segments and blocks, the mechanics of signals, and an overview of the simulation steps relevant to signals. In devlog number eight I described how I implemented a low-latency parallel-for library for my game that beats OpenMP and PPL. So if you want to pause and look at those first, I'll wait for you here. Okay, welcome back. Before we get into the nitty-gritty details of the implementation, it's probably a good idea to give an overview of what the game is, in case you haven't seen the videos that I mentioned. Steam Revolution is a game about building train routes to transport materials and goods between industries and cities; the goal is to deliver a given amount of goods in a level as quickly and efficiently as possible.
This consists of an iterative design process: build tracks, simulate, adjust, simulate again, and so on, until you are satisfied with your results. Because you iterate frequently, I want the simulation to run as fast as possible so you can quickly check your score and compare your results with the online leaderboard. That's why making the simulation code run as fast as possible is so important to my game. Okay, before we talk about the parallel speedups I achieved, it is important to discuss the computer I am doing my tests on. It is a Ryzen 2600X; the image I show here is actually a 2700X, but a 2600X is exactly that same die with two of the cores disabled. This die consists of two clusters of cores, in my case two clusters of three cores. So when I talk about the speed of one cluster, that is where I am running three threads, one thread per core in a cluster. When I say running on all big cores, that is a total of six threads, one per core. And finally, when I say all threads, that means all logical processors, because each of the cores is presented as two logical processors through hyperthreading. The important thing to know here is that the communication latency between cores within a cluster is much lower than between different clusters, so you would expect the overhead of parallelization to be lower when running in a single cluster.
With that background out of the way, we can talk about the speedups I achieved. I generated times from four different data sets. The first one is a small level with only three trains; it was actually very fast overall, and the single-threaded execution took about half a microsecond per tick. The test with 26 trains was one of the city levels where I had completed the level as optimally as possible, so this is the typical case of a completed level. The test with 151 trains was the world level as far as I had completed it before starting all my optimizations; I stopped because at the time it was running too slowly for me. Unfortunately, that save game no longer exists, it got overwritten, but I then kept filling in the entire world map, and that resulted in 214 trains, so this is not the most trains you could ever have.
It's not an optimal solution, but it is a complete world-map solution. You can see that by the time I got to 214 trains, the simulation time had gone up to 43 microseconds per tick. When running on one cluster of cores, this dropped to about 17.7 microseconds. When I then increased the number of threads to run on both clusters (those are the "all big cores" times) and on all the logical threads of the computer, you can see that the times actually stayed the same or increased slightly, which means that when communicating outside the local cluster, the communication overhead exceeded the benefit of having more threads running.
I can also arrange the graph in a slightly different way. Here I show the speedup of the multithreaded solutions relative to the serial solution. If the value is one, then it runs at the same speed as the serial code, and if it is greater than one, then it runs faster. In the case of three trains, you can see that all the multithreaded solutions actually run significantly slower, at about half to a third of the speed. However, with the largest number of trains we can see that one cluster gets approximately 2.4 times faster, and remember that a cluster for me consists of three cores, which means I am actually getting pretty close to the ideal speedup factor. That speedup stays more or less the same or even decreases as I continue to add threads. What that tells me is that for my problem I should limit myself to a single cluster, because adding more just makes it slower. In fact, this is true even for the entire world map with a lot of trains; I think that with even more trains, using every core might eventually win, but that won't be a very typical case and I don't think it would win by much. Those were the overall times for a simulation tick, but remember that a simulation tick is actually made up of several different parts, so let's go into a little detail about what those parts actually are.
There are 10 steps in total, some of which I was able to parallelize and some of which I couldn't. The first step is to update the track occupancy: which trains are on which parts of the track. Next, I propose a new location for all trains. Then the trains that are waiting at a signal resolve their loading and whether they can proceed through the signal. Based on those results, the trains confirm their movement steps, meaning that if a train's movement was valid, the proposed location becomes the actual location. Collision detection is done between trains to check whether anything hit anything else, and then I add new trains and cars to the trains as they leave the depot. Now that all the train work is done, everything else needs to happen: updating the resource consumption and production of industries, cargo loaders carrying cargo from trains to industries or vice versa, updating the state of the smoke emitted by the trains, and finally verifying and updating the game state to see if the level has been completed.
I have color-coded the steps based on whether I could parallelize them or not: the steps marked in green are parallelized, the ones in red are not. The first five steps are the ones that actually take the longest, and they are the ones that involve updating the trains themselves. I didn't parallelize adding cars and trains; updating the industries I did parallelize; and the remaining steps of updating cargo loaders, removing smoke puffs, and updating the game state were also not parallelized. Without the context of how long the steps actually take, it's hard to say whether that's good or bad, so here's a graph of the times when running the code serially versus in parallel on the larger 214-train test case.
What can clearly be seen is that the parallel times, shown in red, are usually much lower than the serial times, shown in blue. As you'll remember from the previous slide, the first five of these times are the ones I was able to parallelize; adding trains and cars I couldn't parallelize; updating industries I was able to parallelize; and the last three steps I did not parallelize. What you'll notice is that the things I didn't parallelize are also things that ran extremely fast, so the cost of running them serially isn't very high, and in fact that's part of the reason why I didn't parallelize them.
You can also notice that the step that took the most time is also one where I got a pretty good speedup. I think that was to be expected, because the communication overhead is proportionally smaller in that case, but it is also quite good for the actual end results. What I think is also somewhat impressive is that even for the extremely short-duration steps I was able to achieve a speedup. In particular, updating the track occupancy was not completely trivial to parallelize, and it only took 3/4 of a microsecond in the serial code, but I was able to reduce that to half a microsecond in the parallel code. Before going into the details of how I parallelized each step of the simulation,
I want to talk about some generic sources of overhead. The main source of overhead is memory transactions: sending data on cache lines, or changing the state of cache lines. In particular, just starting and stopping a parallel for loop has a lot of overhead associated with it. I made a pretty detailed video about that in devlog number 8, so if you want to learn more, check that out. The idea is that in addition to the cost of starting and stopping jobs, any data written by one thread and then read by another will result in communication of that data over the bus. This means that if you read or write any train state or other simulation state on one thread and then read it from another, that involves some communication overhead. It also means that if you dynamically distribute the work between threads instead of statically partitioning it, then in addition to the overhead of just saying "hey, this job needs to run on that thread", you also get the overhead of different threads reading different random data. If you statically partition the threads over, say, the trains, and you run two different parallel for loops over the set of trains, the same thread will be assigned the same train both times, and you get no data transfer because of that. But if you dynamically partition your work, you have no guarantee that the same thread will receive the same train, so you actually take a double hit of data transfer. So although the idea of distributing work dynamically sounds very attractive, in the sense that with work items of non-uniform size it sounds good for a thread that finishes early to go ahead and take the next work item from the queue, for extremely short-lived workloads the overhead just kills you, and in none of the cases in my simulation could dynamic scheduling come even close to the same speed.
The dynamic approach was always clearly worse. Another source of overhead is when you take an exclusive lock on a piece of data. The main reason this is slow is that acquiring a lock requires communication between threads: even an extremely short-lived lock still has to perform an atomic read-modify-write operation, and that is a source of overhead. If you want more details on how I implement a lock, see my previous video. The other source of overhead is the time actually spent inside the lock. This is effectively a serialization of what would otherwise be parallel code whenever another thread is waiting for the lock while it is owned, so you want to minimize the time spent holding a lock. For me, what this usually means is that if I, say, add to an array that is protected by a lock, I first do everything I can outside the lock, then acquire the lock just to append the data to the array, and then release it. Perhaps less obviously, there are a few more sources of overhead. One of them is that your algorithms need to be modified to support parallel execution; in my case, this involves sorting by train ID and splitting the work into independent tasks, things that would not need to be done when executing serially. Luckily, in my case they don't add much overhead, but depending on your algorithm they can be better or worse. This slightly affects the serial times in my benchmark, because the serial runs execute the same modified code, but it doesn't matter too much; just keep in mind that it has some small effect. There are also some external factors that don't necessarily have to do directly with your program.
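The "do everything outside the lock" pattern can be sketched as follows. The spin lock and the smoke-puff array are stand-ins with hypothetical names; the point is only that the lock is held for the single append and nothing else:

```cpp
#include <atomic>
#include <vector>

// Minimal spin lock of the kind described in the devlog (details assumed).
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

struct SmokePuff { float x, y; };

SpinLock g_puff_lock;
std::vector<SmokePuff> g_puffs;  // shared global array

void emit_puff(float x, float y) {
    SmokePuff p{x, y};       // all preparation happens without the lock held
    g_puff_lock.lock();
    g_puffs.push_back(p);    // the only serialized operation is the append
    g_puff_lock.unlock();
}
```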
One of them is that simply running on multiple threads means the CPU will consume more power and produce more heat, and that may result in the CPU or the operating system throttling the CPU to conserve power, to stay within a budget, or to prevent overheating. This is something that will affect even the serial parts of your program if you have other threads running in a busy-wait; and even if all threads are idle, the CPU can become heat-soaked and then run at a lower clock speed due to hysteresis from having previously consumed a lot of power. Finally, you need to recognize that your program is not the only program running on the computer, so there will be potential competition with other programs; some of them could be the operating system itself or critical services, and some could be a web browser, another game, or who knows what. Okay, that covered the sources of overhead, but I would also like to talk about how I handle errors during the simulation, because that applies generically to multiple different parts of the simulation. The way I deal with this is quite simple: whenever part of the simulation encounters an error, it adds an error marker to an array. The error marker is a simple structure that has the location on the map where the error occurred, a string that stores the title of the error, and a string that contains the text that should appear in the text box. If there are any errors in the array when a simulation step is started, the step will simply return and do nothing, and the effect of this is that the simulation stops. Of course, this is a shared global array of errors, so I put a mutex around any access to that array.
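The error-marker scheme just described might look something like this; the field and function names are my guesses, not the devlog's actual code:

```cpp
#include <mutex>
#include <string>
#include <vector>

// Error marker as described: a map location plus title and body text
// for the error text box. Names are illustrative.
struct ErrorMarker {
    float map_x, map_y;
    std::string title;
    std::string message;
};

std::mutex g_error_mutex;                 // guards the shared global array
std::vector<ErrorMarker> g_errors;

void report_error(float x, float y, std::string title, std::string message) {
    std::lock_guard<std::mutex> guard(g_error_mutex);
    g_errors.push_back({x, y, std::move(title), std::move(message)});
}

// Each simulation step begins with this check: if any error has been
// recorded, the step returns immediately, which halts the simulation.
bool simulation_halted() {
    std::lock_guard<std::mutex> guard(g_error_mutex);
    return !g_errors.empty();
}
```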
Throughout the descriptions of the simulation steps, I'll try to note every time a step might create an error. I won't go into much detail about how it works each time, but know that it is handled as I covered here. Below, I'm going to go into detail about how I parallelized all the different simulation steps, but if you've liked what we've seen so far, go ahead and hit the like button. The first step of the simulation is to update the track occupancy. This actually updates two different things: the segments and the blocks of the track. Segments are sections of the track that are divided by signals, and blocks are collections of segments that have no signal separating them.
Each segment and block contains a spin lock that a thread must acquire before modifying its occupancy list. List modifications are relatively uncommon, so there is no large overhead in acquiring the lock. I update the occupancy lists incrementally, meaning I only modify a list when a train enters or leaves a segment or block. To know whether a train entered or exited, I also need to save a copy of its previous position so I can compute a difference. So, in a parallel for over all the trains: at the front of the train, when it enters a segment or block, I acquire the spin lock and add the train to the segment or block; and at the back of the train, when it leaves a segment or block, I also acquire the spin lock in order to remove that train's entry. This is one of the basic mechanisms through which trains communicate with each other about where they are.
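A minimal sketch of the per-segment occupancy update, assuming a simple spin lock and an ID list (names hypothetical):

```cpp
#include <algorithm>
#include <atomic>
#include <vector>

// Spin lock as described in the devlog; implementation details assumed.
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

struct Segment {
    SpinLock lock;
    std::vector<int> occupants;  // IDs of trains currently on this segment
};

// Called from the parallel for over trains when a train's front enters.
void enter_segment(Segment& seg, int train_id) {
    seg.lock.lock();
    seg.occupants.push_back(train_id);
    seg.lock.unlock();
}

// Called when the train's back leaves the segment.
void leave_segment(Segment& seg, int train_id) {
    seg.lock.lock();
    seg.occupants.erase(
        std::find(seg.occupants.begin(), seg.occupants.end(), train_id));
    seg.lock.unlock();
}
```

Blocks would carry the same structure; since entries and exits are rare relative to ticks, contention on these locks stays low.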
The next step of the simulation is to propose a new location for each train. This step actually has a couple of different subcomponents. The most important thing to understand is that the train state is divided into a proposed location and a current location. This is important because when proposing a location, trains can read the current locations of other trains and only write to their own proposed location; because the data being read and the data being written are different between threads, there is no source of contention or non-determinism. This matters because trains really do read each other's locations, in particular to calculate following distance. When there is a free signal, it is possible for two or more trains to be on the same block and the same track; if one train would collide with another, it doesn't crash into the other train, it just follows it. For that to work, the trains need to know where each other are, and they determine this through the occupancy written in the previous step. Searching for the next train that a train could run into is the expensive part.
It's slow to compute, so trains don't do it on every tick. Instead, a train knows the distance to the next train and that train's maximum travel speed, which provides a lower bound on how many ticks must pass before a collision is possible. That value is written to a counter, and the train doesn't check again until the counter reaches zero. In addition to skipping a lot of work, this skips a lot of reads of data written by other threads. There are also the more mundane updates that only involve a single train: the basic acceleration of the train, and a maximum speed based on the curvature of the track. Each time a train reaches the end of a segment of track, it moves on to the next segment described in its route. If the next segment is in a new block, then the train needs to enter a wait queue to enter the next block (again, blocks are separated by signals). Which wait queue to enter is determined by a pointer stored in the block, and I'll go into more detail about that in the next section. The point, though, is that there are multiple wait queues, and each queue is a shared resource, so modification of a queue is controlled by one spin lock per queue. If entries were added by serial code, the waiting entries would be ordered first by the time they were added and then by train ID. Since the trains are processed by multiple threads, the insertion order might not be the same, so to ensure the same order, the queue is sorted by time and then by train ID.
This means that the results of execution using the parallel code will be deterministic. It is worth noting that it is also possible to generate errors when proposing a location, for example if a train does not have a next route segment. Finally, it is possible for trains to generate sound effects. Each train has an array of sound effects that it generates; these basically consist of a chuff while it moves (chuff chuff chuff) and a whistle every time it stops. Since each train has a separate array, no lock is required to modify these arrays. Sound effects are also only added by the simulation; the simulation does not remove them. Instead, before each frame is rendered, the set of sound effects that were generated is collected and processed. When running in server-check mode or ultra-fast simulation mode, no sound effects are added to the arrays at all.
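The deterministic queue ordering described above (sort by wait time, then by train ID) can be sketched like this; the struct and function names are illustrative:

```cpp
#include <algorithm>
#include <vector>

// A wait entry records when the train started waiting and which train it is.
// Ordering by (tick, train_id) makes the queue independent of the order in
// which parallel threads happened to append their entries.
struct WaitEntry {
    long long tick;
    int train_id;
    bool operator<(const WaitEntry& o) const {
        return tick != o.tick ? tick < o.tick : train_id < o.train_id;
    }
};

// After all threads have appended (each append under the queue's spin lock),
// sorting restores exactly the order serial code would have produced.
void canonicalize(std::vector<WaitEntry>& queue) {
    std::sort(queue.begin(), queue.end());
}
```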
The next step of the simulation is to advance the trains that are waiting at signals. There are two reasons why a train might be waiting at a signal: one is that the block the train wants to enter is already occupied; the other is that the train is loading or unloading cargo at an industry or a storage area. If a train does not need to wait any longer, its proposed location is left unchanged; however, if a train needs to continue waiting for any reason, then its proposed location is marked invalid. At first glance, it seems that resolving the waits between trains requires serial execution.
This is because what a train does depends on how other trains behave. For example, if several trains are waiting to enter a block and one enters, none of the other trains can enter. The same goes for industries: an industry may not have enough resources for more than one train to load, and if multiple trains try to draw from the same limited resource, one of them will not be able to. How do you know which one? The solution I came up with is to recognize that some trains depend on each other, but others don't. So what we can do is determine which blocks are dependent on each other and have all those interdependent blocks share the same queue; the wait entries within any one queue are processed serially, but the queues themselves can be processed in parallel. What we need to do, then, is figure out what the different sources of interdependency are.
I think the most obvious of these is that all the signals entering a block depend on each other, and for this reason, instead of storing a pointer to the queue in the signal, I store the pointer in the block, because all the signals entering a block will always share the same queue. Next, we must consider how blocks are connected to each other, and one of those ways is through chain signals. A chain signal says that to proceed, a train must have access not only to the block adjacent to that signal but also to the block after it, which creates a dependency between blocks. Chain signals can also be chained: if the entrance to the next block contains a chain signal, then it also depends on the block after that, so you can have an arbitrary number of blocks connected by chain signals. Finally, all the signals that share an industry or a storage area are also interdependent. Interestingly, for this to work I found a subtle bug: I need to make sure that trains spanning more than one block are not loaded. Otherwise you could have an arbitrarily long train able to load or unload from an industry as long as some part of it crossed a platform, and this would cause a nightmare where every block is interconnected with every other block. It can also be a bit annoying when you're playing: you thought a train had left a station and was not going to load or unload, but the signal is too close, the last car of the train is still next to a platform, and you load a few more things anyway. The end result is that we can think of the blocks as vertices in a graph and the interdependencies between blocks as edges between those vertices. What we need to do then, given the data structure of this graph, is find the connected components and create one queue per connected component.
I use the union-find algorithm to determine connected components during simulation initialization. In practice, the results are basically what you'd expect: industries generate clusters and tend to be separated by the stretches of track between them. This is all a bit abstract, though, so let's look at an example of how it works. Here I have a picture of an industry with several different tracks, platforms, and signals. I have a previous video on how the signals work, but the basic summary is that the yellow signals are chain signals, the orange signals are free signals, and the blue signals are block signals. Signals can also be unidirectional or bidirectional. In this industry you can see that there are terminal stations and roll-through stations. The first thing I do is mark the blocks with blue vertices to indicate the sections of track that are separated by signals. Then I add the edges that are described by the chain signals, coloring them yellow to match the chain signals. Next, all the blocks that are connected through an industry are connected by edges between them, colored magenta. All of those blocks are actually connected to all of the others, but drawing that many lines would be quite messy, so I draw them as a kind of connection through a virtual block for the industry. This is mainly a shorthand for the illustration, but it is actually a little like how the code works. This shows how we can generate a graph of the connections between blocks, and around this particular industry you can see that there are actually two different large connected components. The stretches of track that run between industries don't actually depend on any other part of the track and act as a buffer; this means that industries tend not to depend on other industries, which in turn means there tend to be many different wait queues that can be processed in parallel. In my large world map there were a total of 621 blocks, which were separated into 182 queues; in my smaller example of the Birmingham level, 71 blocks were divided into 23 queues. That describes how the blocks are grouped into shared queues, but how do I actually process the trains that are waiting in a queue?
If a train is not loading or unloading, it will try to reserve a path. The basic idea is that a train can only claim a path through a block if the block has not yet been reserved. The simplest case is when the train is waiting at a block signal, in which case it just looks at the next block. But if a train is waiting at a chain signal, then it needs to do a depth-first search through the chain signals to find whether there is a free path it can take. If a train finds a clear path, it is added to the block's reservation list.
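The depth-first search through chain signals might look roughly like the following; this is my own hypothetical sketch of the idea (block fields and exit handling are assumptions, not the game's data model):

```cpp
#include <vector>

// Each block knows whether it is occupied or already reserved, whether its
// exit signal chains onward, and which blocks can follow it on the route.
struct RouteBlock {
    bool occupied = false;
    bool reserved = false;
    bool chain_exit = false;         // does the exit signal chain onward?
    std::vector<int> next_blocks;    // candidate successor blocks
};

// Depth-first search: a path is clear if this block is free and, when the
// exit is a chain signal, some successor also has a clear path. A real
// implementation would then reserve every block along the found path.
bool find_clear_path(std::vector<RouteBlock>& blocks, int id) {
    RouteBlock& b = blocks[id];
    if (b.occupied || b.reserved) return false;
    if (!b.chain_exit) return true;  // chain ends here: path is clear
    for (int next : b.next_blocks)
        if (find_clear_path(blocks, next)) return true;
    return false;
}
```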
I will note that the reservation must be a list of trains rather than a single entry, because it is possible for more than one train to be in a block if the trains passed through a free signal; but that same block can also have a block signal leading into it, and the block signal will require all trains to have left before it allows a train through. Situations where you would want both free entry and blocked entry into the same block are probably somewhat rare, but it is possible to create a priority signal this way: if you have a main line, trains can enter it freely, and you could have a side line entering the main line through a block signal, which basically acts as a yield.
I'll also point out here that although it is possible for multiple trains to fit into a block, it is not necessary to have a spin lock around the reservation array for a block, because the trains are guaranteed to be processed serially: by the definition of connected components, all trains that could compete with each other are in the same shared queue and are processed serially. If a train is loading or unloading, then a loader will potentially be created. Loaders and unloaders are stored in a single shared global array; because this is a shared resource, it needs to be controlled by a global lock. In my case,
I'm using a spin lock, but in theory you could use a mutex; for access to a shared global array it doesn't really matter which. It is also possible for the simulation to generate errors at this point, for example if a train cannot reverse at a terminal station or if an invalid route is found. As I mentioned before, if the train needs to continue waiting, then its proposed location is marked as invalid; however, if the train is able to continue, its wait entry is removed from the wait queue. This is an array shared by multiple trains, but because all those trains are guaranteed to be processed within the same queue, I don't need a lock controlling access: the queue can simply be modified freely. Okay, that was the hardest part, and I promise the rest will be easier, so let's take a breather.
If you want more development updates, please subscribe to the channel and ring the bell. Once the waits have been resolved, the next step is to commit the movement of the trains. There are a couple of steps to this, but essentially the main idea is to update the location of the engine with the proposed location; in this context the location is just the distance along the track. Then, once the engine has moved, the cars following the engine also have their positions updated. The carriage positions were not proposed during the propose-location step; they just implicitly follow the engine.
This part only modifies a single train in place, so there is no contention between threads. If the train is moving, at this point it will also consume some fuel and produce puffs of smoke. The smoke puffs are of no importance to the simulation, but they provide some visual interest. Smoke puffs are stored in a global shared array, and access to the array is again protected by a spin lock (this could be a mutex, but it doesn't really matter). If a train runs out of fuel, it can also produce an error, which is again protected by a mutex. This leads us to the most expensive part of committing the train movement, which is updating the attributes of the train.
For most of the simulation, the location is just the distance along the track, but sometimes the actual 3D positions of the carriages are also needed, in particular for collision detection, so the position is treated as an attribute derived from the distance along the track. There are also other attributes, such as slope and curvature, that are required; slope and curvature affect the acceleration and maximum speed of the train. Once the positions have been determined, the collision acceleration structure needs to be updated. The collision acceleration structure is just a grid of tiles, and the trains that are in a tile are added to that tile, with the bounding box of the train expanded by twice the collision radius, allowing a single point query to determine all trains that are potentially collidable. Each train stores the range of tiles it was in before and calculates the new range it is in now; if they are different, it is removed from and then re-added to all relevant tiles. Because tiles are shared between different trains, each tile has a spin lock that protects the array of train IDs stored inside it. Because these arrays are only modified when a train moves between tiles, they are actually not updated that frequently.
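A hypothetical sketch of this tile grid, with one spin lock per tile guarding its list of train IDs (sizes and names are illustrative, not the game's actual code):

```cpp
#include <atomic>
#include <vector>

// Per-tile spin lock, as described in the devlog (details assumed).
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

struct Tile {
    SpinLock lock;
    std::vector<int> train_ids;  // trains whose expanded bbox overlaps this tile
};

struct CollisionGrid {
    int width, height;
    std::vector<Tile> tiles;
    CollisionGrid(int w, int h) : width(w), height(h), tiles(w * h) {}
    Tile& at(int tx, int ty) { return tiles[ty * width + tx]; }

    // Add a train to every tile its expanded bounding box touches; the
    // per-tile lock serializes only trains crossing the same tile.
    void add(int train_id, int tx0, int ty0, int tx1, int ty1) {
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx) {
                Tile& t = at(tx, ty);
                t.lock.lock();
                t.train_ids.push_back(train_id);
                t.lock.unlock();
            }
    }
};
```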
Trains also tend to be spread out reasonably evenly across the map, so trains competing for a single resource isn't a big deal, because they won't necessarily be accessing the same set of tiles. This brings us to collision detection, which is actually surprisingly fast. Collision detection is fun because it doesn't directly affect the simulation; the only function it serves is to produce an error if the trains collide. If you knew that trains couldn't collide, in theory this could all be skipped, but you don't, so here are the details. Each train is processed in parallel, and there is a simple optimization: if a train is not moving, you don't need to do anything about it and it can be skipped.
Also, since the train is moving forward, only the engine can hit something, so I only test one point, which is the source train's engine. In the previous step I built the list of trains that are on each collision tile, so the first thing I do is get the list of trains on the tile that contains the engine, and then for each train on that tile I test whether the engine collides with it. My collision test first consists of a coarse rejection determined only from the bounding box of the other train; if I cannot reject the collision, then I do a detailed test where I determine whether the engine intersects any of the cars of the target train.
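The two-phase test (cheap bounding-box rejection, then a per-car check) might look like the following 2D sketch. The exact shapes tested in the game aren't specified, so this assumes a simple circle per car:

```cpp
#include <vector>

struct AABB { float minX, minY, maxX, maxY; };  // target train's bounding box
struct Car  { float x, y, radius; };            // one car of the target train

inline bool inBox(float x, float y, const AABB& b) {
    return x >= b.minX && x <= b.maxX && y >= b.minY && y <= b.maxY;
}

// Asymmetric test: only the source train's engine point is checked
// against the whole target train.
bool engineHitsTrain(float ex, float ey, const AABB& targetBox,
                     const std::vector<Car>& targetCars) {
    if (!inBox(ex, ey, targetBox)) return false;   // coarse rejection
    for (const Car& c : targetCars) {              // detailed per-car test
        float dx = ex - c.x, dy = ey - c.y;
        if (dx * dx + dy * dy < c.radius * c.radius) return true;
    }
    return false;
}
```

The bounding box would be pre-expanded by twice the collision radius, so the point query alone is enough to find every potential collision.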
I will note that this test is not symmetrical, because I am testing only the engine against the entire other train, which means I have to do n*(n-1) collision tests. If collision detection were symmetrical, you would only need to do half as many tests, but overall I think this is a win because it avoids having to test every car against every other car on the other train. This brings us to the next simulation step, which is also the first step that I did not parallelize: adding cars and trains. This step is already extremely fast, and it is something that does not run during the entire simulation, because there comes a point where all the trains have been added.
Once the trains have been added, I set a flag saying this stage is complete, and then these functions are not called at all in the future. The actual checks to determine whether a train can be added or not are also extremely fast, so performance is simply not a big issue. This brings us now to the last step I parallelized, which is also the first step that is not related to trains: in each simulation tick, industries need to consume resources and generate products. This is also the first case where the code is completely trivially parallelizable; there is no cross-industry interaction and there are no errors that could even be generated. This takes us to the next step, which is to update the loaders.
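Since each industry touches only its own state, the industry update is an embarrassingly parallel loop. A minimal sketch (the `Industry` fields and the plain `std::thread` chunking are my assumptions; the devlog's own job system from the previous video would replace the thread spawning):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical industry: consumes one input, produces one output per tick.
struct Industry {
    int inputStock = 0, outputStock = 0;
    void tick() {
        if (inputStock > 0) { --inputStock; ++outputStock; }
    }
};

// Trivially parallel: split the array into chunks, one per thread.
// No locks are needed because industries never touch each other.
void updateIndustries(std::vector<Industry>& industries, unsigned threads) {
    std::vector<std::thread> pool;
    std::size_t chunk = (industries.size() + threads - 1) / threads;
    for (unsigned t = 0; t < threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(industries.size(), begin + chunk);
        if (begin >= end) break;
        pool.emplace_back([&industries, begin, end] {
            for (std::size_t i = begin; i < end; ++i) industries[i].tick();
        });
    }
    for (auto& th : pool) th.join();
}
```

In practice, spawning threads per tick would dominate at these microsecond scales, which is exactly why a persistent worker pool (as covered in the previous devlog) matters.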
This one is perhaps more interesting for its simplicity than anything else, since most of the time a loader doesn't need any updates: its movement is completely deterministic, so there is no update to position or anything like that; it is completely implicitly defined by its start time and end time. The only properties of a loader that are relevant to the simulation are the start time, the delivery time, the delivery destination, and the end time when the loader is to be removed. These times are specified as integers indicating on which simulation ticks they occur. This code is not parallelized because, first of all, it is extremely fast; most of the time it just loops and does nothing, and even when it needs to do something, the work is trivial.
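A loader defined purely by its tick schedule could be as small as this (field names are illustrative):

```cpp
// Hypothetical sketch: a loader's motion is fully determined by its tick
// schedule, so no per-tick position state needs to be stored or updated.
struct CargoLoader {
    long startTick;      // when the loader begins moving
    long deliveryTick;   // when the destination's resource counter increments
    long endTick;        // when the loader is removed
    int  destination;    // index of the receiving industry or train
};

// The per-tick update only does work on the two interesting ticks.
inline bool deliversOn(const CargoLoader& l, long tick) { return tick == l.deliveryTick; }
inline bool expiresOn(const CargoLoader& l, long tick)  { return tick == l.endTick; }
```

Everything in between those ticks (including the visual position) can be interpolated from `startTick` and `endTick` on demand.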
It's basically just incrementing a resource counter. Also, it would be excessively difficult to parallelize. The cargo delivery itself is actually not that bad, because you would basically just have an atomic increment, but removing finished cargo loaders would be quite slow or potentially difficult. The simplest solution that occurs to me would be to copy the cargo loaders to a completely new array and just not copy the ones that are meant to be removed. Every time a cargo loader is added to the new array there would be an atomic increment, but atomic increments are slow, and that would be orders of magnitude slower than just doing the code serially. This brings us to smoke handling, which is similar to cargo loaders except even simpler. Again, smoke movement is completely deterministic purely from the initial conditions, and there is a tick at which the smoke puffs need to be removed. This is done serially, in fact as a single line of code. And for completeness, the final step of the simulation is to update the state of the game.
This is usually extremely fast and only checks whether the win conditions of a level have been met or not. If the conditions have been met, it will update the unlocked regions in the world, save the best time and solution, and send the score and the solution to the server for verification, but in the vast majority of cases, that is, every tick except one, none of those steps are done. Now that we have gone over all the details, maybe we can look at the section timing graph again and appreciate it a little more. It's kind of impressive that we got a speedup on updating track occupancy, because it was non-trivial and is also extremely fast. When a new location is proposed for a train, the trains are reasonably independent, but they have to share access to signals and potentially search for other trains, albeit infrequently, so we get a decent speedup there. Resolving the waits and cargo loading is probably where we got the least impressive speedup. This is also what was hardest to parallelize, because there is an inherent interdependence between the trains that you need to resolve.
This section took the most explanation and required building a graph of the interdependencies and statically determining the connected components. Because those connected components vary drastically in size, sometimes a single node and sometimes a dozen, the end result is that the workload balance probably isn't the best. There are also some other cases where trains communicate with other trains, for example they have to access some shared arrays for cargo loaders, so this just didn't parallelize as well as some of the other steps. That said, it still sped up and was worth it. Committing the move was by far the slowest step, and it also got a pretty good speedup, so I'm really happy with those results.
Collision detection is also quite fast, with very little interaction between trains. And finally, updating industries was by far the standout winner for speedup percentage. It was the most trivially parallelizable step of the entire simulation, but it didn't take a huge amount of time in the first place, so it didn't actually reduce the total time as much as I'd hoped. All the other steps have been left serial, but they don't take much time in total, so Amdahl's law doesn't hurt me too much here. So, in conclusion: when I started the process of parallelizing my code,
it wasn't clear to me that parallelization was going to be possible. The basic problem is that, first of all, the amount of time spent on a simulation tick is extremely small, and simulation ticks are made up of multiple different steps, so the amount of time a parallel section has is even lower. What this means is that the overhead of simply starting a parallel job is extremely significant, and my initial tests with some available libraries showed that the overhead was too much to get any benefit; the code was significantly slower just from starting parallel work. My previous devlog was a video detailing how I fixed that part of the problem.
The second problem I address in this video is that the simulation is not trivially parallelizable, so it wasn't even clear whether I was going to be able to get a speedup, regardless of any additional cost of starting the work. This is because the trains inherently interact with each other, and that is what makes the game any fun to play: trains follow each other, they can crash into each other, they have to negotiate exclusive ownership of the blocks (even with chain signals), and trains share limited industrial resources. I solved these problems with a couple of different techniques. The first was to break the cyclic dependency of a train updating its position while reading the positions of other trains.
I did this by splitting a train's position into a current location and a proposed location. This required modifications to the simulation, taking care to consider the context in which each position was being used. The next technique was to split a single global wait queue into independent groups of signals using graph analysis. The idea here is that as long as you can prove that blocks are not going to interact with each other, they can be split into separate queues. The last major technique I used was fine-grained locking: a spin lock per resource.
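A per-resource array of tiny spin locks might be sketched like this (class and method names are my own; the game presumably embeds the lock in each segment/block/tile rather than in a separate array):

```cpp
#include <atomic>
#include <cstddef>
#include <memory>

// One tiny spin lock per track segment instead of one global lock:
// two trains only contend when they touch the same segment at once.
class SegmentLocks {
    std::unique_ptr<std::atomic<bool>[]> locks;
    std::size_t n;
public:
    explicit SegmentLocks(std::size_t nSegments)
        : locks(new std::atomic<bool>[nSegments]), n(nSegments) {
        for (std::size_t i = 0; i < n; ++i) locks[i].store(false);
    }
    bool tryLock(std::size_t seg) {
        bool expected = false;  // acquire only if currently unlocked
        return locks[seg].compare_exchange_strong(expected, true,
                                                  std::memory_order_acquire);
    }
    void lock(std::size_t seg) { while (!tryLock(seg)) {} }
    void unlock(std::size_t seg) {
        locks[seg].store(false, std::memory_order_release);
    }
};
```

Because there are many segments and contention is rare, the common case is an uncontended compare-and-swap, which is only a few nanoseconds.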
I had a spin lock per track segment, per block, per wait queue, and per collision tile. The idea here is that if there are many instances of a resource, then even with many trains, two trains trying to access the same resource at the same time will be relatively rare. This means that locking a resource actually has fairly small overhead in almost all cases. Thanks for watching my video; I hope you got something out of it. If you have any questions or comments, feel free to write something below. As I mentioned above, this is also the conclusion of a multi-part video series, so if you found this interesting, feel free to go back and watch some of the earlier videos for more information or context. If you haven't already, feel free to like the video or subscribe to the channel.
