Lecture 24: Simulation & Memory Latency Tolerance - Carnegie Mellon - Comp. Arch. 2015 - Onur Mutlu - Jan 02, 2022
We will talk about memory latency tolerance, but before that I would like to finish the DRAM discussion, and before that, some announcements. The lab 5 results should be available to you. I would prefer this to be a unimodal distribution; it is bimodal right now, so if you are on the low end, why don't you find out what went wrong and fix it? If you're down here, I don't know what happened; you should definitely talk to the TAs. It's too bad if you don't turn it in. Even if you already got ten percent extra credit, I wish you would still try to turn it in to learn the material. I'm really happy that the median is 93.8, which means people are really doing the labs, but I would suggest turning in the lab even if you are late, to learn the important material.
Some reminders about homework. Some of you have already changed to C-level simulation: lab 6 is C-level simulation of data cache and branch prediction. How many of you have started? Well, that's a good fraction, and you're enjoying it? Is it better than RTL level, or do you prefer RTL level? OK, I'll talk about that in a bit. Homework 6 will be due April 10th; we've extended the deadline a bit, and we'll still have a midterm, just as a reminder. I know that you all have a very good attitude toward this course; the goal is really to learn, so the course will continue to move quickly. Keep up, and if you need help, please talk to me. We certainly can't do or debug the labs for you; unfortunately or fortunately, nobody really wants that to happen. But we can give you suggestions, and my goal is to enable you to learn the material. So even if you haven't submitted lab 5 or didn't get a good grade on it, feel free to turn it in after the course is over; you can still learn the material that way, and you never know which principles you will use where and when. That is the best attitude for learning: you just take it all in, it somehow mixes with everything else that happens here, and eventually it affects you. Figuring out what's the best platform to design for 20 years from now is one of the hardest problems there is; you don't know what will happen, and it's not clear we ever will.
OK, with that said, since we're switching to a different simulation mode, I'll talk a little bit about simulation, and I like to call it the field of dreams. You are really exploring a field of dreams that you have, but there is always reality. Any architect is really, in large part, a dreamer, a creator; in the end, you're really creating things, right? And simulation is a key tool of the architect; without simulation, you can't actually survive. In fact, real architects, the ones who design buildings and put things together, do simulation as well. Simulation allows for exploration of many dreams, a reality check on those dreams, and deciding which dream is better.
You have so many dreams to choose from; which one is better? You're going to do more of that with higher-level simulation. Unfortunately, simulation also gives you the ability to trick yourself into dreams. You can do this with any simulation, even RTL-level simulation; I guess it's harder to do with the actual hardware design in the end, but you can fool yourself with any simulator you've ever created if you're not careful. So why are we doing high-level simulation in this course? If you had taken this course about five or six years ago, you wouldn't have done high-level simulation, at least not as much, but we added these labs because it's really important. The problem with low-level or RTL simulation is that it is really intractable for design space exploration.
It takes too long to design and test, especially with a large number of workloads and designs to evaluate. If you want to predict the performance of a good chunk of a workload on a particular design, how long would it take to simulate a billion instructions using the structural Verilog of an entire processor, even a simple processor? A lot of time. And especially if you want to consider a lot of design options, think about all these multiplications: you are simulating each workload for a long time, say to completion; you are doing it with over a thousand workloads, and a thousand may not even be enough; and you want to consider maybe a million different design options. In fact, a million is not an unreasonable number when you're actually building a processor at Intel, AMD, or NVIDIA, because there are so many knobs in the processor you actually build: cache size, associativity, block size, and in addition there are all the algorithms too.
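Those multiplications can be sketched as a back-of-the-envelope calculation. The workload and design-point counts are the lecture's hypothetical scenario; the two simulator speeds are invented placeholders, not measurements:

```python
# Back-of-the-envelope cost of exhaustively simulating the design space.
# Counts follow the lecture's hypothetical scenario; simulator speeds are
# invented placeholders, not measurements.

instructions_per_workload = 1_000_000_000  # a billion instructions, to completion
num_workloads = 1_000                      # over a thousand workloads
num_design_points = 1_000_000              # cache size x associativity x block size x ...

def total_years(sim_speed_ips):
    """Total simulation time, in years, at a given speed in instructions/sec."""
    total_instructions = instructions_per_workload * num_workloads * num_design_points
    return total_instructions / sim_speed_ips / (3600 * 24 * 365)

# Invented speeds: structural Verilog simulation is orders of magnitude
# slower than an abstract C-level model.
print(f"RTL-level  (100 inst/s): {total_years(100):.2e} years")
print(f"high-level (1M inst/s):  {total_years(1_000_000):.2e} years")
```

Even at the faster speed the exhaustive sweep is hopeless, which is why the point of high-level simulation is pruning the space quickly, not sweeping it.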
(I have an error here on the slide.) All these algorithms take time to evaluate and validate: in-order versus out-of-order execution, even something simple like that, and even within out-of-order execution you have so many design options. You really need to design a processor that, at some point, satisfies some metrics based on the design choices you make. So the goal of high-level simulation is to quickly explore these design choices and see their impact on the workloads we're designing the platform for. On top of this you can add other things: you also have the rest of the platform, the interconnect, maybe other processors, and the memory.
There are different goals in simulation, and I would like to spend some time on this because it is very important; probably every time you do engineering or design new things, you'll do simulation. One goal is to explore the design space very quickly and see what you potentially want to implement on the next-generation platform: do we want in-order execution or out-of-order execution? Or to propose the next big idea to advance the state of the art; that's another thing this enables. If you can explore the design space quickly, you can get to these big ideas much faster, right? And the goal here is primarily to see the relative effects of design decisions, right?
Not the absolute effects: it doesn't really matter whether a design decision buys one percent more performance or 10 percent more; you just want to see the relative effect. That's the key goal in design space exploration. The second goal, maybe the opposite end of the continuum, is to exactly match the behavior of an existing system, and by behavior I mean exact implementation behavior, so you can debug and check cycle-level accuracy: you want to be exact at the cycle level, and maybe on top of this propose little tweaks to the design to improve performance or power or some other metric.
This is a totally different goal, very different from exploring the design space very quickly, because here you really want to match an existing system. Let's say you've designed a Pentium Pro and then you build a simulator that exactly matches it. In fact, that's not exactly how it happens, but many people validating processors use a simulator that hopefully matches the behavior exactly, and then they try to fix the bugs or make things a bit better. So the goal of simulator design here is very high accuracy: you really want nanosecond-level, cycle-level precision. Can you get that doing high-level simulation? Not really, because when you synthesize a design, the result changes depending on what you are targeting and the standard cell libraries you use; there's a lot of variation. So can you really get a good idea of the performance with just C-level simulation? At the high level, your goal is to look at the algorithmic effects. For out-of-order execution, for example, you can estimate what the benefits would be if you actually implemented it; the low level provides much more fidelity. Let's say, I don't know, maybe you estimate a 50 percent performance benefit at the high level, with some inaccuracy, and when you implement it, maybe you get 30 percent.
Yeah, how can you see the effects on energy at these levels? Well, it could be at the RTL level, or it could still be at the C level; there are shades of gray here too. But to answer your question: you can estimate energy at all these different levels; the question is how accurate you will be. One thing you could do, for example, if you are doing a C-level simulation rather than an RTL-level simulation, is associate energies with actions: say you count the number of load hits to the cache, and based on some other power model you have a nanojoule number per event, and you multiply those two and sum, and you get the energy spent across the entire workload run. Does that make sense? So you can get all these numbers. And usually you don't magically dream up and implement RTL directly: you first need to figure out, starting from scratch, what you want to implement in a processor, and then you have so many options. In-order versus out-of-order execution is a very good example: why do I want an in-order processor?
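The event-counting approach just described can be sketched in a few lines. The event types and per-event nanojoule values here are invented placeholders, not numbers from any real power model:

```python
# Sketch of event-based energy estimation on top of a C-level simulator:
# count events during simulation, then multiply by per-event energies from a
# separate power model. All event types and values here are invented.

ENERGY_PER_EVENT_NJ = {
    "l1_load_hit": 0.5,    # hypothetical nanojoules per L1 load hit
    "l1_load_miss": 2.0,   # hypothetical
    "dram_access": 15.0,   # hypothetical
}

def total_energy_nj(event_counts):
    """Sum of (count x per-event energy) over all event types."""
    return sum(ENERGY_PER_EVENT_NJ[ev] * n for ev, n in event_counts.items())

# Counts a C-level simulator might report for one workload run:
counts = {"l1_load_hit": 900_000, "l1_load_miss": 100_000, "dram_access": 80_000}
print(f"estimated energy: {total_energy_nj(counts) / 1e9:.5f} J")
```

The accuracy of the estimate is only as good as the per-event numbers, which is exactly the fidelity trade-off the lecture describes.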
Why do I want an out-of-order processor? The industry doesn't design both in RTL and pick one at the end; you need high-level simulation to figure out which one actually satisfies the metrics you want to optimize for the workloads you care about. So you should do this, unless you want to design all of these options in RTL. And you can do this for many metrics: you can model energy, performance, and reliability at all of these different levels; what changes is the achievable level of fidelity. There are intermediate goals as well. Let's say you explore the design space at the high level and find that out-of-order execution provides a huge performance benefit on these workloads, so you go into a little more detail and build a more detailed simulator. Maybe before, you didn't model some things correctly; maybe you didn't accurately model all store-load dependencies, and that's why you got a 50 percent performance benefit. Going one level below, you model the load-store queue more accurately, which is hopefully closer to the implementation.
That way you get a better idea of the performance as you go to the next level. Another goal is to gain confidence in the design decisions made by higher-level design space exploration: you build less abstract models to gain confidence in those decisions, and that's why you need to go down to the lower levels. It really depends on your goal; you can stop at a particular level. If you're designing a processor, at the end, of course, you need to go all the way to the actual hardware design. But if your goal is research, for example, you may start with design space exploration and go down a few levels, yet never get to the RTL level, because your goal is to discover the big ideas at that high level, and maybe someone else will take those ideas and do the implementation. That's what a lot of computer architecture researchers in the field do. Systems researchers, similarly, may not design a processor, because that's a lot of effort and there's a limited amount of time in the world; their goal is to generate big ideas rather than implement all of them.
These goals, in the end, contradict each other. OK, so the way I like to think about simulation is that there are three metrics to evaluate a simulator, and these lead to trade-offs: the speed of the simulator, the flexibility of the simulator, and the accuracy of the simulator. Speed is basically how fast the simulator runs: how many instructions per second, or how many cycles per second, of the target system it can simulate. Flexibility is how quickly you can modify the simulator to evaluate different algorithms and design choices.
That's also important, because if you can't modify the simulator to easily switch between running in-order and out-of-order, then you have a problem, and RTL makes that very difficult; that's one reason we don't have another lab extending your pipelined processor, since it's easier to do at the C level. Accuracy is how close the performance, energy, reliability, or other numbers the simulator generates are to a real design; this is basically your simulation error. You can trade off all these metrics, and their relative importance varies depending on where you are in the design process and what your goal is. Like everything else in the world, the answer to "which one is important?" is: it depends. So let's take a look at how we can trade these things off. The speed and flexibility of the simulator affect how quickly you can make and evaluate design trade-offs, because in the end you are bound by them. The accuracy of the simulator affects how good the design trade-offs you make end up being: if the high-level simulation is completely inaccurate, the things you decide might be false. For example, if you don't model out-of-order execution to at least a certain level of fidelity, you may think you get much more benefit than you actually will, and the benefit may not pay for the implementation cost. That's why this is really important; it's partly science and partly art. How fast you can build your simulator is also affected by accuracy: that's the simulator design time, another constraint. How long will it take to design your simulator?
These things are really important. For example, when Intel was designing the Pentium 4, the simulator was one of the bottlenecks: they wanted a simulator with which they could explore different design trade-offs well, but they didn't have it in time, so they needed really good programmers who know simulation. So if you really know how to do this well, you can also make a lot of money, by the way; not that that's the only goal in life. Flexibility also affects the amount of human effort you need to spend modifying the simulator: it's not just how fast you can get to design trade-offs, but also how much effort you are spending. What usually happens is that the simulator grows, even if it is a high-level simulator, as you add more and more knobs and features, and if you don't keep it modular, it gets very inflexible, very hard to tweak. So it's good to start with a good baseline, but over time, if your simulator evolves in a way that leads to less flexibility, sometimes it's good to start from scratch; in fact, that's what some companies do with their simulators. You can trade off among the three metrics to achieve your design exploration and decision goals: for example, you can lower the level of accuracy a bit to improve speed and flexibility, and that's a key trade-off. Usually you can get at most two of these three; it's very hard to get all of them, so at different levels of simulation you choose at most two out of three.
At the high level, you choose speed and flexibility over accuracy; at the low level, you choose accuracy over speed and flexibility. If you come up with a simulator that is good at everything, let me know; it's going to be really hard. So the key idea of high-level simulation is to raise the level of abstraction of the modeling, giving up some accuracy to enable speed and flexibility, and also to enable rapid design of the simulator itself. There's a big advantage in this, but of course there's always a drawback. The advantage: you can still make the right trade-offs, and you can do it quickly, because all you need is modeling of the key high-level factors; you can skip potential corner-case conditions. Now, what is a corner case and what is not? That is the art of simulation. You have to somehow decide that if, say, misaligned accesses aren't going to happen much, then maybe you don't model them at the very high level, or maybe you model them statistically; this is where creativity comes in. Maybe you assign some probability to unaligned accesses and randomly assume that, with some probability, some accesses take longer than others, instead of modeling everything precisely. If you model at the gate level, you have to model each instruction exactly, whereas at the high level you can say: with some probability this instruction takes five cycles, with some other probability it takes 10 cycles, with some other probability it takes 20 cycles. You can have a really abstract model, and you can choose not to model some things. For example, the load-store queue is a very complex part of the design: at some level of simulation you can choose not to model the case where a load matches four or five different stores in the store queue, because that check takes a long time to model. Again, all you need is to get the relative trends accurately, not exact performance numbers, and if you can achieve this, then the goal of high-level simulation is met.
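The probabilistic latency idea above can be sketched as follows; the distribution is invented purely for illustration:

```python
import random

# Sketch of probabilistic latency modeling: instead of simulating every
# corner case exactly, draw each instruction's latency from a distribution.
# The probabilities and cycle counts are invented examples.

LATENCY_DIST = [   # (probability, cycles)
    (0.80, 5),     # common case
    (0.15, 10),    # e.g., a cache miss
    (0.05, 20),    # e.g., a rare event such as an unaligned access
]

def sample_latency(rng):
    r = rng.random()
    cumulative = 0.0
    for prob, cycles in LATENCY_DIST:
        cumulative += prob
        if r < cumulative:
            return cycles
    return LATENCY_DIST[-1][1]  # guard against floating-point round-off

rng = random.Random(42)  # fixed seed so the run is reproducible
n = 100_000
total_cycles = sum(sample_latency(rng) for _ in range(n))
print(f"average latency: {total_cycles / n:.2f} cycles")
# Expected value: 0.80*5 + 0.15*10 + 0.05*20 = 6.5 cycles
```

Swapping in a different distribution is a one-line change, which is exactly the flexibility a gate-level model gives up.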
The downside is, of course, that any time you give up accuracy, you open yourself up to the possibility of wrong decisions; that's why you have to be careful. How do you make sure you get the relative trends accurately? That's why you do progressive refinement. You don't stop with the high-level simulation: you start with high-level models, you explore the design space, then you create mid-level models that are less abstract, you progressively refine different parts of the system, then you go down and design low-level RTL models with all the detail, and then you do the actual design. By the way, these last two are not the same thing either: RTL also has a simulation error right up until you get the hardware, because the hardware might differ from the RTL; maybe not by much, but a bit. In fact, in some processor designs, the RTL differs from the hardware by three or four percent; that's the simulation error at the RTL level. You don't get the exact cycle count, because people optimize the layout at the bottom level as well, and then the question is how you go and reconcile these two after you build the hardware; that's the verification and validation part. So, as you go through this list from top to bottom, the abstraction level goes down, accuracy hopefully goes up (but not necessarily: if you're not careful, your accuracy can actually degrade, because this is all still simulation), and speed and flexibility go down, because you're adding more and more detail. And you can also go back and correct problems in the higher-level models, which is a very powerful thing: if you go down to the lowest level, you can go back and say, oh, maybe the higher-level model had a mistake in its out-of-order execution modeling, so I'll fix it with what I learned from the lower-level model. That is one benefit a company that has designed many products enjoys.
In fact, such a company has all this information; you could even do machine learning to find out which decisions were good and bad, and design a better simulator the next time you start. OK, so in this course: a good architect is comfortable at all levels of refinement, including the extremes. If you want to make quick design decisions, you would like to be comfortable at the high level; if you want to implement something, you'd like to be really comfortable at the low level. This course is designed to give you a taste of both. You've done, or will do, high-level abstract simulation in labs 6, 7, and 8. By the way, this is not the most abstract simulation model; it's still reasonably accurate at the cycle level, though not perfectly. For example, the DRAM controller you'll design in lab 7 will not match lab 6, where you're assuming some fixed memory latency, because lab 6 is more abstract; and in lab 7 your DRAM controller design won't model a lot of parameters either: refresh will be skipped, for example, and you won't have to deal with write-to-read latencies. And you've also done low-level RTL simulation, and I hope you enjoyed it; if you're not completely done, complete the labs. There's optional reading. There's a lot of reading on simulation, but one article one of my students recently wrote is a very short paper on Ramulator, a fast and extensible DRAM simulator; if you're interested, take a look at it, since we've covered DRAM too. Any questions? I usually give a longer lecture on simulation with more details, but this is more than we had before.
Basically, it's really progressive refinement, and you start with high-level models. In fact, the high-level models could be spreadsheet models, what I call spreadsheet simulators: based on previous designs you have an idea of how long each operation will take, and you feed those numbers into a complex equation that acts as the simulator, taking all these inputs and giving you an estimated cycle count. If it's a performance model, it's based on pure trace analysis plus a spreadsheet that takes the trace statistics, and an equation that gives you the number of cycles a workload will take to run, i.e., the performance of a particular design, right?
You could imagine it as a kind of analytical model; you're abstracting even more. You could consider trace analysis a form of simulation, and then progressively refine it. So your question was basically: would you really try to synthesize this high-level model? You could, if you have enough detail, but that's not necessarily the point of simulation in that case, is it? How do you verify that your high-level idea is really good? I see; that is a difficult question right now. Basically, you are asking: how do you actually validate your idea?
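A spreadsheet simulator of the kind just described might look like this minimal sketch; the trace statistics, design parameters, and the equation itself are hypothetical placeholders a designer would calibrate against past designs:

```python
# Minimal sketch of a "spreadsheet simulator": a closed-form equation over
# aggregate trace statistics instead of a cycle-by-cycle simulation. Every
# number and term below is a hypothetical placeholder.

def estimated_cycles(trace_stats, design):
    """Estimate execution cycles from aggregate trace statistics."""
    base = trace_stats["instructions"] / design["ipc_peak"]   # ideal pipeline
    miss_penalty = trace_stats["cache_misses"] * design["miss_latency"]
    branch_penalty = trace_stats["mispredicts"] * design["flush_cycles"]
    return base + miss_penalty + branch_penalty

trace = {"instructions": 1_000_000, "cache_misses": 20_000, "mispredicts": 10_000}
design = {"ipc_peak": 2.0, "miss_latency": 100, "flush_cycles": 15}
print(f"estimated: {estimated_cycles(trace, design):,.0f} cycles")
```

Changing a design knob (say, miss latency) and re-evaluating the equation is instant, which is why such models are useful for the first pass over a huge design space, despite assuming penalties simply add up.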
In the end, the validation of an idea is implementing that idea in a real system; that is the actual validation. But if you're not in a company, you have other goals: your goal isn't necessarily to ensure that the idea works in a real system, but maybe to generate other ideas. If you are doing good research and this idea spawns other, better ideas, which are finally implemented, then maybe your idea succeeds in that sense; again, it depends on your goal. But in the end, the correct answer is: you can't fully validate without really building it, though maybe someone else will build and validate it. You mean go to a manufacturer, or build it first on an FPGA? Again, it depends on your goal, and an FPGA is not necessarily the real design; if you build it on an FPGA, it's probably somewhere in between. It's still a prototype, and in essence I would consider a simulator a prototype as well. But, as an anecdote, even if you build something on an FPGA, that doesn't mean the company that might take the idea believes you, because it's just an FPGA, right?
You run it at, say, 200 megahertz maximum, while the company will try to design it at six gigahertz, where you run into other issues; but still, hopefully you're closer to the actual design. OK, so hopefully this gives a good idea of simulation. Any other questions? OK, let's move on to memory latency tolerance techniques, but before that we'll finish the DRAM discussion, and we're going to dream some more in that area. And before that, there's another seminar on April 3rd; it's before our class, so again you can go to the seminar and we can walk back and get food in between. Remember the MLP-aware cache paper?
OK, so the talk is about 3D memory system architecture, again very relevant to what we've discussed. Basically, it's about how to use and enable 3D-stacked DRAM technology: you have a controller at the bottom, DRAM stacked on top of the controller, and maybe more DRAM stacked on top of that, which lets the controller and the DRAM cooperate; you basically have multiple dies in the same chip. The talk covers some of the design decisions related to this, and hopefully you'll find many of the things we've covered in this course relevant; for example, how design decisions that are typically made for conventional caches can be detrimental to the performance of DRAM caches because they exacerbate latency. So I think you would learn something from the seminar if you can attend. About the required readings: there are two, and we talked about Bloom filters. The original Bloom paper might be a little hard to decipher, so I suggest you look at these really short sections, each less than a page: section 3.1 of one paper and section 3.3 of the other, for more on Bloom filter designs at the high level. Any questions? That was quick. Who likes attending these seminars, by the way? Good, most of you. Who hates going to seminars? Was there one in particular you hated? Larry Page? OK, that was someone from Google, the one we visited. OK, sure, why not?
You can really like a seminar or hate a seminar; that's perfectly fine. OK, let's talk about the difficulty of DRAM control before we move on to memory latency tolerance. I think it's important because it points out some dreams you can have as an architect. DRAM controllers are really hard to design, because there are many things you have to consider, and they keep increasing. First of all, you have to obey the DRAM timing constraints for correctness, and there are a lot of timing constraints in DRAM. Your required reading next week will be a DDR2 DRAM datasheet; you probably won't like it, but you can find all the DRAM timing constraints there. I don't claim to understand all of them after looking at the datasheets, but a few we have discussed. There is tWTR, for example, the write-to-read latency: a minimum number of cycles you must wait before issuing a read command after a write command is issued. You have this because you need to turn the bus around: the bus drives data one way, and if you need to turn it around to read or write the other way, you have to wait a while for things to settle. There is also tRC, the minimum number of cycles between issuing two consecutive activate commands to the same bank; this is a key latency, and when people measure DRAM latency they talk about tRC most of the time. Imagine more than 50 constraints like these; when you are designing a controller, you must obey them all.
First, you need to figure out whether you can issue a command at all given all these constraints, and second, if you can, which command to choose. You need to keep track of a lot of resources to avoid conflicts; we've discussed this before, and there are a lot of them. You need to handle DRAM refresh on top of this, a critical thing to manage; you need to manage power consumption, which we discussed very briefly in one slide, if you remember; and on top of all this you need to optimize performance and quality of service in the presence of all these limitations. Reordering is not simple either. We've talked about reordering as if it's really easy, but actually reordering things, as with reordering instructions in the processor, is not simple. In-order is simple, but once you start reordering, you need a priority encoder and you have to make sure you obey the timing constraints. Fairness and quality of service also complicate the scheduling problem: it's not just performance you're optimizing, it's all these other things too. If you want to learn about other DRAM timing constraints, definitely look at a datasheet, but datasheets aren't generally tractable, so you can look at some of the papers I mentioned here; this slide lists 14 of the important DRAM timing constraints, at least. There are really interesting timing constraints, like this one: the four-activation window, tFAW, which limits how many activations you can issue within a set number of DRAM cycles. Why? Because if a chip receives too many activations, it overloads the system in terms of power and you run into reliability issues, so the DRAM specs limit the number of activations you can issue within n cycles; here, you can issue at most four activations within a 24-cycle window, basically. Kind of interesting, right?
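The bookkeeping just described, a per-bank tRC check plus the tFAW sliding window, can be sketched as follows; the constraint values are illustrative round numbers, not taken from any real DDR datasheet:

```python
# Minimal sketch of the timing-constraint bookkeeping a DRAM controller must
# do before issuing a command. The constraint values are illustrative, not
# taken from any real DDR datasheet.

T_RC = 24    # min cycles between two ACTIVATEs to the same bank (hypothetical)
T_FAW = 24   # window in which at most four ACTIVATEs may issue (hypothetical)

class TimingChecker:
    def __init__(self):
        self.last_activate = {}     # bank -> cycle of its last ACTIVATE
        self.activate_history = []  # cycles of all past ACTIVATEs

    def can_activate(self, bank, now):
        # Constraint 1: tRC between consecutive ACTIVATEs to the same bank.
        last = self.last_activate.get(bank)
        if last is not None and now - last < T_RC:
            return False
        # Constraint 2: tFAW, at most four ACTIVATEs in any T_FAW-cycle window.
        in_window = [t for t in self.activate_history if now - t < T_FAW]
        return len(in_window) < 4

    def issue_activate(self, bank, now):
        assert self.can_activate(bank, now), "timing violation"
        self.last_activate[bank] = now
        self.activate_history.append(now)

checker = TimingChecker()
for cycle, bank in [(0, 0), (2, 1), (4, 2), (6, 3)]:
    checker.issue_activate(bank, cycle)

print(checker.can_activate(4, 8))    # False: four ACTIVATEs already in window
print(checker.can_activate(0, 20))   # False: tRC for bank 0 not yet elapsed
print(checker.can_activate(4, 24))   # True: the window has slid past cycle 0
```

A real controller checks dozens of such constraints every cycle for every candidate command, which is exactly why the scheduling logic gets so hard.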
Hopefully you're reading the background sections of these two papers for the trade-offs. And this is getting a lot harder in the future. Why? Because we have all these heterogeneous agents sharing the DRAM controllers: multiple CPUs, multiple GPUs, and there is interference between all these agents. And maybe you have multiple different types of memory: some embedded DRAM here, some 3D-stacked DRAM, and some commodity DRAM, and you need to ensure the timing constraints for all of them. They all have different timing constraints, so you can't design a single controller that works for all of them; you really need to design different controllers. And we'll see there are emerging memory technologies coming out that require yet other controllers with slightly different characteristics. And there are many goals you want to optimize for at the same time: performance, fairness, quality of service, power efficiency, and so on. So the reality is that it is difficult to satisfy all these constraints while maximizing performance, quality of service, and energy efficiency. So the dreamer would like to dream: wouldn't it be nice if the DRAM controller automatically found a good scheduling policy?
Let's just stick with the scheduling policy: wouldn't it be nice if the DRAM controller did something like this? Can anyone guess how you could do it? Machine learning, yes; I figured someone would say it. I use this to give you an idea of an architect's dream and how you can evaluate it; if you tell me that you have to design RTL to evaluate this, that's a bad idea. You could, but it would take a long time just to see the high-level potential benefits of the idea. So the idea is to use machine learning, because DRAM controllers are hard to design: it's hard for human designers to devise a policy that adapts well to different workloads under different system conditions. The system is also dynamic: you have all these workloads coming in and going out with different requirements, and you're designing a DRAM controller that has to work well under a lot of conditions, right?
We've seen a lot of policies in the past, and although we've tried to improve their adaptability, they're still based on human designers' intuition and design decisions. Wouldn't it be nice if we could design a memory controller that adapts its scheduling decisions to workload behavior and system conditions using machine learning? That way the human designer maybe just specifies: oh, these are some of the important system parameters you need to consider to create a good online scheduling policy. Sounds like a dream; it is a dream, actually, since no DRAM controller does this today. But I think as systems get more complex and harder to design, approaches like this will be very helpful. Let me give you an example. In the past, disk latency was a big problem: it takes a long time to get data from disk into memory. So the operating system used to prefetch, and this prefetching was based on a simple access-pattern prediction of which page to access next, sequentially. We're actually going to talk about processor-level prefetching in the next lecture. But if you know a lot more about the system, you can design a policy based on machine learning that can prefetch much more intelligently. For example, over time, as you run the operating system, it collects a lot of information and says: oh, I found out that at 3 a.m.
the owner of this machine checks his email, so it brings the email application into memory just before that time. That's a machine learning algorithm. So this was a dream, but, I don't know about Apple, it's already in many Windows products: Windows actually has such a prefetcher that operates using simple machine learning principles, maybe not very sophisticated, and I bet some of the other operating systems do things like that too. So it was a dream at some point, but now it's a reality. Maybe this could become a reality for memory controllers as well: as the system gets more complex and you need to take many more things into account to make design decisions, you may need to resort to techniques like this. So it's good to know what multiple areas are working on. The observation here is that reinforcement learning actually maps very well to the memory controller.
How many of you know about reinforcement learning? How many of you know about Pavlov's dogs? Pavlov conditioned his dogs so that they could associate the sound of a bell with food arriving; that was at least one of the things he did. So how did he do it? He basically rang the bell and then fed the dogs, and after he did this maybe three or four times, the dogs were quick enough to figure out that after the bell the food would come. What is happening here is that you are reinforcing the relationship between a bell and food, or between a state plus an action and a reward. The action is really the bell ringing, and the reward is the food: as you continue to receive food each time the action occurs, your learning is reinforced. OK, and later, if the bell rings and you don't actually get food, you will start to forget, because your reinforcement is reduced. So that's reinforcement learning, and the same principle has actually been applied to systems; for example, people have designed ways to automatically fly helicopters using reinforcement learning.
Initially the helicopter crashes in a few places, but eventually it learns to fly on its own. True, it may not use the best algorithm initially, but over time it can learn to fly automatically. Memory control actually has similar behavior: basically, you can think of the memory controller as a reinforcement learning agent that can dynamically and continuously learn and employ the best scheduling policy. So here is a high-level view of what reinforcement learning looks like. Basically you have an agent, say a dog, interacting with the environment: it performs an action, or some action happens, and then it gets a reward and also observes a state. Given a state, you can actually train cats and dogs this way to do something; there are really great stories in psychology, but we won't go into that. For example, B. F. Skinner trained many pigeons to deliver messages in World War II. You have these pigeons; maybe we can talk about that later. Basically they are given some state, and if the pigeons do some action, deliver the message, they get a big reward; if they don't deliver the message, I don't know, they get caged or something like that. Although Skinner was a big believer in positive reinforcement, not negative reinforcement: he believed negative reinforcement is a bad idea, no punishments, you always do positive reinforcement, and that way you learn better; that's what he thought. So basically you either get a reward or you don't get a reward. Given a state, if you do some action, you get a reward, and over time you can associate the state and action pairs with the rewards you will get; you can actually do this over the long or the short term. And if you look at the memory scheduler, it looks like this agent: given a system state, if you issue a command you get some data bus utilization, some reward, and over time you can figure out what that utilization was and attribute it to state and action pairs. And in general, over time,
if you read some of the papers to figure out how to do this, you can learn how to choose actions to maximize the reward in the long run; you can actually weight the rewards so that you maximize long-term performance. That's the idea: a DRAM controller that works on these principles can dynamically adapt its memory scheduling policy through interaction with the system at run time. It can associate system states and actions, or commands, with long-term reward values: each action taken in a given state leads to a learned reward, because you can actually measure what your reward is. It can issue the command with the highest estimated long-term reward value in each state, because it knows its state, and it can continually update the reward values for different state-action pairs based on system feedback. And you're making all of these decisions very, very quickly; you're actually making billions and billions of decisions over the course of seconds on the processor, especially if you have a lot of memory controllers as well. Does it make sense? You may have a lot of questions, and I'm not going to cover everything; this is just to give you the high-level idea, so that maybe you can apply it elsewhere. What are the states?
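Before getting into the states and rewards, here is the core learning loop in miniature: a minimal Q-learning sketch of how a controller could associate state-action pairs with long-term reward values. The state names, command set, and constants here are hypothetical illustrations, not the actual design from the paper.

```python
# Hypothetical sketch: a Q-learning update for a toy memory scheduler.
ALPHA = 0.1   # learning rate: how quickly new feedback overwrites old estimates
GAMMA = 0.95  # discount factor: how much long-term future reward counts

q_values = {}  # maps (state, command) pairs to estimated long-term reward

def q_update(state, command, reward, next_state, legal_commands):
    """One reinforcement-learning step: nudge the estimated long-term
    reward of (state, command) toward reward + discounted best future."""
    best_future = max(
        (q_values.get((next_state, c), 0.0) for c in legal_commands),
        default=0.0,
    )
    old = q_values.get((state, command), 0.0)
    q_values[(state, command)] = old + ALPHA * (reward + GAMMA * best_future - old)

# Example: in state "row_open", issuing "read" used the data bus (reward 1).
q_update("row_open", "read", 1.0, "row_open",
         ["read", "write", "precharge", "nop"])
```

The discount factor GAMMA is what makes this a long-term optimization rather than a greedy one: with GAMMA near zero the controller would behave like the short-term row-hit-first policies discussed later.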
Well, there are many potential things that could be part of the state. What are the actions, and what are the rewards? The reward function is actually a difficulty: it really depends on what you're trying to optimize. You could potentially say, I get a reward of plus one for scheduling read and write commands, and zero at all other times, because all I care about is actually reading and writing. This is good if the goal is to maximize data bus utilization, but it doesn't take fairness into account at all; it only cares about maximizing issued read and write commands. And over time, maybe you learn policies that are really good at maximizing this reward.
What are the attributes of the state? Well, they could be many things, but if you look at that paper, they include the number of reads, writes, and load misses, the number of writes pending, whether the request at the head of the reorder buffer is waiting for the referenced row, the relative order of requests in the reorder buffer, and so on. You could actually imagine many things, and the selection of the state attributes that get fed into the reinforcement learning algorithm is a problem in itself. How do you solve it? You're putting the complexity elsewhere, but hopefully this is a simpler problem than finding a good online policy directly. And the actions, well, the actions could be many, many things; basically they are the DRAM commands in the end, and you could potentially also issue a no-op. Why? Well, if that maximizes the payoff in the long run. I will give an example.
Well, actually, for example, sometimes you might want to leave the data bus unused for a while by issuing a no-op. Why? Let's say you have a row that is open, and you have a request in another queue to the same bank. Maybe you don't want to close that row right away, because a cycle later another request will come to the same row. If you wait for a cycle, if you don't issue a precharge and don't get rid of that row, you can exploit the row buffer locality, service that request quickly, and improve system performance. But the existing DRAM controllers don't do this; they don't have this foresight. If you will, they're a bit myopic: they don't look into the future, whereas a learning algorithm like this can potentially handle that case and say: oh, in the past, when I was in this state, if I issued a no-op it was actually better. And how can you really figure out that issuing a no-op is worthwhile? Well, this is actually one of the cool things in machine learning: sometimes you want to explore the space. How do you explore the space? Sometimes you issue legal commands at random, with some low probability. You say: oh, my state-action table tells me that this command would maximize my reward, but I'm going to pick another, random command, just to scan the space a little bit. That way you can discover new policies that you might not have figured out easily. That's the nice thing about machine learning: it can actually scan the space a lot better. It's very different from a fixed policy like the ones we have discussed, for example thread cluster memory scheduling or the row-hit-first policy. OK, so what is the performance benefit?
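Before looking at the performance numbers, here is what that random-exploration trick looks like in code. It is usually called epsilon-greedy selection; this is a minimal sketch with hypothetical states, commands, and reward values.

```python
import random

EPSILON = 0.05  # small probability of trying a random legal command

def choose_command(state, legal_commands, q_values, rng=random):
    """Epsilon-greedy selection: usually issue the command with the highest
    estimated long-term reward, but occasionally explore a random legal one.
    This is how a no-op can be 'discovered' as useful in some states."""
    if rng.random() < EPSILON:
        return rng.choice(legal_commands)  # explore
    return max(legal_commands, key=lambda c: q_values.get((state, c), 0.0))

# Hypothetical learned values: "read" currently looks best in this state.
q = {("row_open", "read"): 2.0, ("row_open", "nop"): 0.5}
cmd = choose_command("row_open", ["read", "nop", "precharge"], q,
                     rng=random.Random(0))
```

With a seeded generator the first draw falls in the greedy branch, so `cmd` is the highest-valued command; over many calls, roughly 5% of decisions would be random explorations instead.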
So, as a dreamer, you have to simulate, and this is a simulation using a system-level simulator, a C-level simulator similar to what you're going to build, but more detailed. And this shows the performance. These are some multithreaded applications, scientific applications, I think on a quad-core model with a single memory controller. This is the performance of the row-hit-first scheduling policy, FR-FCFS, and this is the performance of an FCFS policy that schedules requests in order, so there is a big difference in performance, true? And if you do reinforcement learning, that's the performance improvement: it's about 20 percent, let's say, at least with this model. And this is the optimistic policy with no constraints; this is the ideal upper bound, if you want. If you could design an ideal memory controller, which this actually isn't: the only constraints here are the data bus constraints, so it's even more optimistic than an ideal controller, because you're assuming everything is actually a row buffer hit. So it looks like there's some benefit to this, about 20 percent, and the benefits can be higher when you have different types of buses. OK, so what are the pros and cons? There are always pros and cons.
One pro is that it optimizes for the long-term goal. If you think about it, first-ready first-come-first-serve, or row-hit-first, is very short-term: you want to maximize the row buffer hit rate, but that is not necessarily performance, as I just discussed. Hopefully it also reduces the burden on the designer to find a good scheduling policy, and this is one of the advantages of machine learning, if you can apply it successfully: the designer specifies only which system variables might be useful for the algorithm to consider, that is, the system state, and what goal to optimize, that is, the reward function, but not how to optimize it; hopefully the underlying algorithm optimizes it. OK, it has downsides too. Actually, a big downside is that it's a black box for the designer, and a designer is much less likely to implement what he cannot easily reason about.
There is no easy reasoning about such a policy, and that is always the downside of any machine learning policy: it's very hard to reason about and debug what's going on. Debugging is a bit easier than reasoning about it. There is also the question of how you specify different reward functions that can achieve different goals, which again is true for any machine learning application: if you have a single goal, you may be able to specify the reward function easily, and data bus utilization is an example, but what about fairness? What about quality of service? How do you implement those correctly? And the complexity of the algorithm can be very high. Any questions? Sounds interesting? Well, yes, it is a memory controller. How would you implement it? If you are doing it in the memory controller, you have dedicated hardware, that is correct; but if you are doing, let's say, a machine-learning-powered prefetcher in your OS, then it's in software. You can still use the same algorithms, but your implementations will vary. There's also the question of which machine learning algorithm to choose: if you've taken a machine learning class, you've probably seen a lot of algorithms, a lot of approaches to machine learning, and not all of them are equally applicable here. For example, let's take branch prediction; that's actually another place where you can apply machine learning, and we saw it, right, the perceptron. A perceptron is a very simple neural network, which is really a machine learning agent if you want, and you can apply it to branch prediction. There you immediately know whether your decision is correct or not: if you're optimizing for branch prediction accuracy, after the branch is resolved you know whether you did the right thing, so you can immediately feed that back to the algorithm. That can be harder to do if you are optimizing for long-term performance in memory scheduling, because in branch prediction the
problem is a bit easier in the sense that you get quick feedback about the correctness of your decision, whereas here, what is the correct decision? The fact that you used your data bus immediately may not be the best, because it may turn out that, because you used your data bus so well, you are actually denying service to someone else in the system. OK, so you may need a different algorithm for this. I won't go into any more machine learning, but this is something I would definitely suggest.
I have already suggested many courses to you; machine learning is also a good course to take to advance as system designers, and you are lucky to be at CMU, actually, because we have a machine learning department here; we may be the only university with a machine learning department. Yes? [Student asks whether the controller proactively alters its scheduling policy at run time, or whether you can train it offline in simulation.] That's a great question, a really great question, and when we explored this idea we really looked at that question. It's much better to do dynamic updates to the policy, but basically what you're suggesting is: why don't you do the machine learning offline, find a really good policy, and just implement that fixed policy.
You don't know how it works; maybe it's a black box, but you could do that. We tried that in simulation, and it turned out that dynamically altering the policy is much better, because then you are also adapting to system conditions that change dynamically. But doing what you suggest still turned out to be better than FR-FCFS, for example, the row-hit-first scheduler. OK, maybe this is a good place to take a break; it's an early break for you guys, but maybe we can recharge for four minutes and then come back with a different topic. We have a lot to cover, and we'll start with memory latency tolerance. These are some readings for you: these two are required, these are optional. So, latency tolerance: we've talked about this when we talked about out-of-order execution. The main benefit of out-of-order execution is tolerating latencies; if every operation took a single cycle, out-of-order execution wouldn't give you any benefit, right?
In other words, the out-of-order processor tolerates the latency of multi-cycle operations by executing independent instructions concurrently. That's a very good way to tolerate latency: while you're waiting for something, do something else. That something else could be independent instructions, executed out of order, or it could be instructions from some other thread, as in multithreading or fine-grained multithreading or hyperthreading. Out-of-order execution does this by buffering instructions in the reservation stations and the reorder buffer, if you remember. And remember we talked about the instruction window: these are the hardware resources needed to buffer all decoded but not-yet-retired, not-yet-committed instructions. The reorder buffer basically stores all those instructions, but you need resources throughout the processor, such as physical register file entries, reservation stations, and load/store queue entries, for those instructions that are decoded but not yet retired. And I think I have asked you this question: what if an instruction takes 500 cycles to execute?
How big is the instruction window you need to continue decoding? If your issue width is 4 and the instruction takes 500 cycles, you need an instruction window of 2000 entries to keep decoding without stalling. Does that make sense? OK, we've already discussed this at some point, or you can go back to that
lecture. And we've also talked about how many cycles of latency out-of-order execution can tolerate. The problem with any kind of execution is that while a long-latency instruction is not complete, it blocks instruction retirement. If you have an in-order processor it's actually much harsher: even a dependence blocks progress in the pipeline. And this is true because we need to keep exceptions precise; that's another way of looking at it. If you didn't need to keep exceptions precise, you could retire everything out of order and you wouldn't need to block retirement, but there's a good reason we have precise exceptions, as we've discussed. The incoming instructions fill the instruction window, which consists of the reorder buffer, reservation stations, load/store queues, and register file entries, and once the window is full, the processor cannot put new instructions into the window. This is called a full-window stall: you have a full-window stall, and you can't get any more instructions into the machine. This kind of full-window stall happens much earlier on an in-order processor, if you want to think in order, but let's think about high performance right now. A full-window stall prevents the processor from making progress in the execution of the program. So let me give you an example. Let's say you have an instruction window of eight entries, and you have this load that misses in the cache.
It takes hundreds of cycles, and then a branch depends on it, and you predict that branch, and you get all these independent instructions that can be executed without waiting for the load and the branch, but they can't be retired: you need to preserve their results in machine structures. As a result, your window fills up, and the more recent instructions can't be executed because there is no space in the instruction window. And one of those could be a load that also misses in the cache, but you can't even bring it into the machine. So if you had one more entry in the instruction window, you could have brought this second cache-missing load into the machine, executed it, started the miss it causes, and serviced that miss in parallel with the miss generated by the first load. But just because your instruction window was only eight entries, you couldn't do it, so you lose a lot of performance because of that, and the processor needs to stall until the miss is serviced. Then, once this first load is done, it creates space in the instruction window so that it can bring in the next load, only to stall again soon, because that load takes a miss of hundreds of cycles too. And long-latency cache misses are responsible for most full-window stalls.
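The window-sizing arithmetic from a moment ago (a 4-wide machine facing a 500-cycle miss needs a 2000-entry window to avoid this kind of stall) is just issue width times latency; a one-function sketch:

```python
def window_entries_needed(issue_width, latency_cycles):
    """Entries the instruction window must hold so the front end can keep
    decoding at full rate while the oldest instruction waits out its latency.
    Each cycle of waiting, issue_width new instructions enter the window."""
    return issue_width * latency_cycles

# The lecture's example: 4-wide issue, 500-cycle cache miss.
entries = window_entries_needed(4, 500)
```

The same formula shows why the problem keeps getting worse: window size must grow linearly with memory latency just to stand still.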
I'll give you some data. This is actually based on the model in a paper that is required reading; it's a circa-2002 Pentium 4-type processor model, and this is data averaged across many memory-intensive workloads, a subset of the workloads Intel used to design their processors at the time, with some simulator. Again, this is a simulation, it's a field of dreams, but it gives you a lot of information; you can analyze the data properly. This is, averaged across all those workloads, the fraction of execution time spent stalling: basically, you examine each cycle and ask, in this cycle, is the processor experiencing a full-window stall or not? In most cycles, the processor is experiencing a full-window stall. Sounds bad, right? They are designing these processors, and they are waiting for data most of the time. Well, that is still true. And it turns out, if you analyze this in the simulator, you realize that most of the full-window stalls are due to L2 cache misses in this particular model. So why is this happening?
It is happening because the applications need data, and the memory latency is long, and it's not easy to reduce it. In a later
lecture, in fact, we'll talk about methods to reduce DRAM latency, but these actually come at a cost, hopefully not too much; we'll see this in later lectures, because it's really important. And even if you reduce memory latency, it's still long; that's the unfortunate fact. So, just to give you some numbers, I like to quote the Xbox 360 number on memory latency, which is more than 600 cycles, for example, and its frequency is not even that high. It's just because memory is very far away and memory is large, and it takes a long time to access it, because there is a fundamental capacity-latency tradeoff: as the capacity of memory increases, your latency increases, and contention for memory also increases latencies. If you have a lot of things contending for memory, remember the load-latency curve I gave you in the last lecture: as the load on the system increases, latencies start to skyrocket. That's very critical, and that's true for memory too. OK, so how do we tolerate stalls due to memory? You can do it by reducing the memory latency, or, given a memory latency, you can tolerate the effect of a stall when it happens. Out-of-order execution, in a sense, tolerates the effect of a stall by executing independent things. There are four fundamental techniques for accomplishing this, and in fact we've seen all four, except we haven't gone into detail on one of them:
caching, prefetching, multithreading, and out-of-order execution; you can think of dataflow here too, but it could be a different model. And actually a lot of techniques have been developed to make these four fundamental techniques more effective at tolerating memory latency. Let me cover this briefly. Caching we have already covered, and modern processors use all of these techniques, by the way: if you look at a high-performance processor, at least, it uses all of them. I don't know of a processor that doesn't use caching nowadays, but that's also a tradeoff: if all you're doing is random access, caching is a bad idea. OK, so caching is widely used, simple, and effective, but inefficient and passive. For example, cold misses:
as I've discussed, if you have a cold miss, caching can't eliminate it, and not all applications or phases exhibit temporal or spatial locality. So maybe it's good to look at some other techniques, like prefetching, which we'll look at especially in the next lecture, though we'll look at a special form of prefetching in this lecture. All of these ideas, by the way, were introduced in the 1960s, and progressively people have gotten a lot better at how to use them and implement them. Prefetching works well for regular memory access patterns, as we'll see: if you're accessing addresses A, A+4, A+8, you can easily predict the next one. But prefetching irregular access patterns is difficult: if you're accessing memory randomly, it can be hard to predict what to prefetch next. Multithreading is one of my favorite topics; as we know, it works well if you have multiple threads that you can somehow supply. GPUs are heavily multithreaded, as we've seen: a warp consists of multiple threads doing the same thing, operating on different portions of the data, and a GPU also has many warps, basically many, many concurrent threads. This is good for improving overall system throughput, but what if you only have one thread?
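As a quick aside on the regular-pattern case above: the A, A+4, A+8 prediction can be captured by a very small stride detector. This is a toy sketch of the principle, not any real prefetcher design.

```python
def detect_stride(addresses):
    """Return the constant stride if the address sequence is regular,
    else None. Irregular (random) patterns defeat this kind of detector,
    which is exactly the limitation discussed in the lecture."""
    if len(addresses) < 2:
        return None
    stride = addresses[1] - addresses[0]
    for prev, cur in zip(addresses, addresses[1:]):
        if cur - prev != stride:
            return None
    return stride

def next_prefetch(addresses):
    """Predict the next address to prefetch for a regular stream."""
    stride = detect_stride(addresses)
    return None if stride is None else addresses[-1] + stride

# Regular stream: 0x100, 0x104, 0x108 -> confidently prefetch 0x10c.
# A random stream yields None: nothing safe to prefetch.
```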
This just doesn't work. With fine-grained multithreading, as we've discussed, if you only have a single thread, you lose performance significantly. But out-of-order execution, which, as we've discussed before, is the restricted form of dataflow, is the best way we know of for exploiting irregular parallelism today. If you want to discover dependencies in irregular programs, dataflow and out-of-order execution are the best techniques, and as a result you can tolerate these irregular cache misses for which prefetching doesn't work well, which caching by definition doesn't handle (they are cache misses, after all), and which can't be tolerated through multithreading. The problem with out-of-order execution is just what we've seen before: you have a limited instruction window size, and if you want to tolerate really long-latency misses, you need to significantly increase the size of your instruction window. So in this lecture we'll talk about a specialized prefetching method that alleviates this problem in out-of-order execution, which we call runahead execution, and this will give you an idea of this field of dreams I talked about earlier. This was a dream where you actually do the design and simulate it, and hopefully someone else takes it and actually implements it, and others have done that. So I'll go through this real quick. You saw the problem: basically, with this load, if we had one more entry in the instruction window, we could have parallelized these two really long-latency misses. The question is how.
Do you really need that extra entry? Let me actually simulate the field of dreams again. You have this reality: you are stalled most of the time. How would you find out the potential benefit of increasing the window size? In simulation, you increase your window size; actually, it's easy if you write your simulator modularly. If you increase your window size, it looks like this: basically, this is the average performance of all these applications on a machine which has a 2048-entry window, where the size of the reorder buffer is 2048 and the other resources are scaled to correspond to this 2048-entry window. If you look at this, the performance gain is significant: the performance improvement is about 33 percent here, which is nice, and it doesn't stall as much anymore. Does that make sense?
So the question is how we approach this without building this huge window. Well, let me restate the problem. Basically, you need these large instruction windows if you want to tolerate today's main memory latencies using out-of-order execution, and as main memory latency increases, the instruction window size must also increase to fully tolerate that latency. The problem is that building this big window is a challenging task, especially if you would like to achieve low power and energy consumption, because building a big window requires more buffers and more reservation stations, and the tag matching logic gets more complex, if you remember what we've discussed.
The load/store queue becomes more complex if you want a short cycle time; again, remember the out-of-order execution lectures. All of these costs increase because you are increasing your window size, and the complexity of design and verification actually skyrockets as the size of these structures increases. So efficiently scaling the instruction window size is actually one of the main research problems in out-of-order execution and dataflow today: how do you get the benefits of a large window with a small, or simpler, one? In fact, if you look at the progression of Intel and AMD processors, they have already done this to a large extent; they've figured out a lot of ways to make out-of-order structures much more efficient over time. There are a lot of different tricks they use to get the most out of small window resources; we're not going to go over them, but we'll talk about some other ideas. So yes, how do you achieve the benefits of a large window with a small or simpler one? One example we've discussed was Bloom filters: if you use Bloom filters to remove some lookups in the load/store queue, that improves efficiency, but it doesn't give you more latency tolerance; it only improves window efficiency. This type of trick has been employed a lot in modern designs. OK, so how do you efficiently tolerate memory latency with the out-of-order execution machinery and a small instruction window?
If you remember, we talked about memory-level parallelism. I'll go through this real quick, because that's what we're going to try to achieve. If you remember the picture I showed with the eight-entry instruction window, the second missing load can't enter the instruction window, so you can't parallelize the misses. That may actually be the most important instruction, because it will take 500 cycles, yet you keep the window busy with instructions that take only one cycle to execute, the green instructions I showed you earlier. It would make much more sense to somehow create space in the window so that you can get that next instruction in, initiate the cache miss associated with it, and service it in parallel with the original miss. So basically, a better thing for performance is to make all the misses as parallel as possible: if you assume you keep the same number of misses in a program, instead of servicing them in isolation, service them in parallel. This enables latency tolerance, because you overlap the latencies of the long-latency operations in your execution sequence. Then the key question is how to actually generate these multiple misses. Out-of-order execution has the ability to generate them, but it is limited by the size of the instruction window. So we'll talk about runahead execution. Runahead is basically based on this observation.
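The benefit of overlapping misses is easy to quantify with a back-of-the-envelope model. The numbers below are illustrative, not from the paper: two isolated misses serialize their latencies, while two fully parallel misses pay roughly one latency in total.

```python
def stall_cycles(miss_latency, num_misses, overlapped):
    """Rough stall-cycle model for memory-level parallelism.
    Isolated misses serialize: each one stalls the window in turn.
    Fully overlapped misses are serviced together: roughly one latency."""
    if overlapped:
        return miss_latency            # misses serviced in parallel
    return miss_latency * num_misses   # misses serviced one after another

# Two 500-cycle misses: 1000 stall cycles serialized vs ~500 overlapped,
# so exposing the second miss early roughly halves the stall time.
serialized = stall_cycles(500, 2, overlapped=False)
parallel = stall_cycles(500, 2, overlapped=True)
```

Real machines land between the two extremes (misses rarely start on exactly the same cycle), but the model shows why generating misses in parallel is the whole point of the techniques that follow.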
We want to increase the size of the window, but we don't want to pay the cost. So can we somehow get the memory-level parallelism benefits of a large instruction window without actually increasing its size? Runahead may be the answer; basically, that's why we're talking about it. The idea is this: when the oldest instruction in the window is a long-latency cache miss, you can checkpoint the architectural state and enter a special, purely speculative processing mode called runahead mode. In runahead mode, you speculatively pre-execute instructions as in normal execution, except that you don't update the architectural state, and the purpose of this pre-execution is to generate prefetches, to generate those cache misses. Remember that instruction, that poor instruction that couldn't get into the instruction window? OK.
Once you checkpoint the architectural state at that point, you can start removing instructions from the window, and you can create space in the window so that the next instruction can arrive, the one that will cause a cache miss, and you can start servicing that miss early. And of course, some instructions can't be executed correctly, because you entered this mode precisely because the oldest instruction is a cache miss; instructions that depend on the miss are marked as invalid and discarded, because you don't have the data for them. Anything you don't have the data for, you just drop. And when the original miss that caused entry into this mode returns, you stop the speculative execution mode, the pre-execution mode, restore the checkpoint, and resume normal execution, going back to that instruction and executing it non-speculatively. Makes sense? OK, let me give you a pictorial view of it. Ideally, what you want is a perfect cache: you never stall. Well, we are dreamers; we will try to get as close to that as we can. But what you get today, with a small window, is this: you miss in the cache, and shortly after, your window fills up, and as a result you stall for a long time. Then, when this cache miss is satisfied, you can start retiring instructions from your window and you can compute, but after some point you get another cache miss, and shortly after that your window fills up and you start stalling again. So what runahead does is this: when you get the long-latency cache miss, instead of stalling, shortly after the miss becomes the oldest instruction in the window, you checkpoint the architectural state and enter the speculative runahead mode, where you remove instructions from the window and keep fetching and processing new instructions. And because you are no longer limited by the size of the instruction window, you can actually get this second load into the machine as well, and you
can actually execute it, assuming that load 2 is independent of load 1.
Executing this second load generates a cache miss, and you can start servicing that miss in parallel with the miss that caused entry into speculative execution mode. Once the original miss is done, you flush the pipeline, restore the checkpoint, fetch load 1 again, and re-execute everything; hopefully the second miss is being serviced while all this happens. Once you re-execute load 2, it now hits in the cache, because its miss was prefetched: instead of stalling, you did the speculative execution that triggered that miss and started servicing it. Now you save cycles, because the processor doesn't stall for the second miss. Basically, instead of stalling for the first long-latency miss, we checkpointed the architectural state and kept processing instructions, and that let us reach an independent instruction that triggers a cache miss, service that miss, and overlap its latency with the first miss and with some computation. By the time you actually re-execute that instruction, it hits. Makes sense? Yes, go ahead. [Student: do you just keep extending runahead mode, continuing until you find no more misses?] These are all design choices; that's a great question. Basically, now you're dreaming: what do you do?
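As a rough sketch of the cycle accounting above (my own toy model, with purely illustrative numbers, not figures from the lecture):

```python
# Toy model of overlapping two independent long-latency misses.
MISS_LATENCY = 300   # assumed cycles to service one memory access
COMPUTE = 50         # assumed independent work between the two loads

# Small window, no runahead: the window fills shortly after each miss,
# so the two miss latencies are serialized.
baseline_cycles = MISS_LATENCY + COMPUTE + MISS_LATENCY

# Runahead: load 2's miss is triggered during load 1's stall, so its
# latency overlaps load 1's; after the checkpoint restore and
# re-execution, load 2 hits in the cache.
runahead_cycles = MISS_LATENCY + COMPUTE

print(baseline_cycles - runahead_cycles)  # 300: the second miss is hidden
```

Here the overlap is complete because COMPUTE is much smaller than MISS_LATENCY; with less overlap, only part of the second miss latency would be hidden.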
Do you really stop here, or do you keep going? What I've shown here is that runahead mode ends when the miss that caused entry into runahead mode comes back. That's one design decision, but you could make a different one: maybe if I go a little further, I'll find more misses, right? And those could be really useful. Okay. [Student: are you actually executing the instructions?] That's another big question. The easiest thing is to correctly execute everything that comes your way; if you're only chasing loads, then how do you skip the rest?
How do you figure out which instructions to skip? Think of it this way: when you enter this mode, the program keeps executing as if nothing happened, except for a few values. The value that was being loaded by load 1 is marked as invalid; the processor doesn't know that value, so it is invalid in the register file. Let me draw this. [Draws on the board.] You have this instruction window. You get a load, let's say it's loading into R1; then a branch that uses R1 as an input; then an add, a multiply, dot dot dot, and your window is full. This is the oldest instruction, this is the youngest. Normally the oldest would retire, but this load, you realize, is a long-latency cache miss.
The retirement register alias table is the easiest way to checkpoint, or you can just copy the entire register file; that's another possibility. You also checkpoint the program counter of this load, so you can re-execute it, because you don't have its value. The goal is to remove this instruction from the window. What you do is mark R1, the destination register you will use later in runahead mode, as invalid. Basically, the register file has validity bits saying whether I actually have the result in R1, and invalid means no, I don't have it. So, to enter runahead mode, you take the destination register of the load and mark it as invalid. Now consider this branch instruction: its source value is invalid. If the branch predictor predicted it correctly, that's great; but if it was mispredicted, you'll never find out, because you can't resolve it. Either way, you remove the branch from the window as well. Now you have created two free entries in your window, so that poor load that could not get into the window can now enter, and you can execute it. Why can you execute it? Assuming it doesn't depend on R1, and assuming all the instructions it depends on are independent of R1, you can compute its address, you can trust that computation, and it issues the address it would have issued anyway, because it has nothing to do with R1. Makes sense? Basically, we're making room for these independent loads and instructions, and in the meantime we're looking ahead at the instructions in the program, just like normal execution, figuring out what happens. What if you have an instruction that depends
on R1? Let's say you get an instruction here that reads R1, maybe R2 = 2 × R1. When you execute this instruction, you find out that R1 is invalid, so you mark R2 as invalid too, because you don't know its correct result. Basically, this bit records the fact that a register depends on some miss whose result you don't have yet; that's what invalid means. That way you can distinguish valid results from invalid ones. Valid results are results you can trust: you can use them to precompute addresses, and you can use them for branch resolution. Stores can pseudo-retire, but you don't want to update memory with their data. Again, this is a purely speculative execution mode; it doesn't update any architectural state, which is why we checkpointed the architectural state to begin with. We'll look at how this works in a bit more detail, but let me tell you about the benefits first. If you're pre-executing loads and stores in this mode, and again it's purely speculative so you don't update any software-visible state, then the instructions that are independent of L2 misses generate very accurate data prefetches for both regular and irregular access patterns, because you are actually executing the program. The instructions on the predicted program path also prefetch into the instruction or trace cache and the L2. We'll talk about hardware prefetchers later, but the branch predictor and similar tables also get trained using future access information, right?
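The invalid-bit propagation just described might be sketched like this (a minimal model of my own; the register count, instruction encoding, and helper names are invented for illustration):

```python
# Each architectural register carries an extra invalid bit in runahead mode.
NUM_REGS = 8
regs = [0] * NUM_REGS
invalid = [False] * NUM_REGS

def runahead_execute(dst, srcs, compute):
    """Execute one instruction in runahead mode.

    If any source register is invalid (it depends on an unresolved
    cache miss), the destination becomes invalid too; otherwise the
    result is computed and can be trusted for address calculation
    and branch resolution."""
    if any(invalid[s] for s in srcs):
        invalid[dst] = True          # propagate the miss dependence
    else:
        invalid[dst] = False
        regs[dst] = compute(*(regs[s] for s in srcs))

invalid[1] = True                    # load into R1 missed: R1 is invalid
runahead_execute(2, [1], lambda a: 2 * a)        # R2 = 2 * R1 -> invalid
regs[5], regs[6] = 10, 20
runahead_execute(4, [5, 6], lambda a, b: a + b)  # independent of R1 -> valid
print(invalid[2], invalid[4], regs[4])           # True False 30
```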
That could be a good thing or a bad thing, actually. Let's take a look at these mechanisms. First of all, you have to somehow enter runahead mode, and you have to checkpoint the architectural registers; I already told you how to do that. You also need to checkpoint the program counter, so you can go back to it. And remember when we talked about branch prediction: you may want to checkpoint the global history register as well, because what you're really doing is executing the program speculatively forward and then going back to where you were. If you update the global history register during runahead and then go back and keep using the updated values, you've really messed up your global history register: you updated it with things that are, in a sense, from the future. Okay, let's take a look at instruction processing in runahead mode; this will answer a lot of your questions. It's basically the same as normal instruction processing. Think of out-of-order execution: we're doing exactly the same thing. For example, in out-of-order execution, if a data value isn't ready, dependents wait while independent instructions go ahead. There are two differences. One, it's purely speculative: the software-visible register and memory state are not updated. Two, instructions that depend on L2 misses are specially identified and handled. What does that mean? You remove them from the window, because that opens space in the instruction window; you might fetch something that, hopefully, doesn't depend on this cache miss, and that will be useful. That's the first thing. The second thing is that your results are not all reliable. This load into R1 here, well, actually it didn't complete.
It didn't load R1, because the value isn't available, so don't trust the results that depend on it. Well, actually, you could do something else: you could predict those results. If you have a good value predictor, you could say, oh, I'm going to predict the value and use that. And that's okay, because this is purely speculative; it won't hurt correctness. It won't raise an exception either: in runahead mode you never declare exceptions, because an exception is handled only if it's raised by an instruction that is actually supposed to execute, and here we're only doing this for prefetching. So, basically, two types of results are produced in this mode. Invalid means the result depends on an L2 miss, and these invalid results are marked with invalid bits in the register file and the store buffer. Basically, your register file needs to be augmented with these invalid bits.
The store buffer needs to be augmented too, because a store can write invalid data to an address, and a dependent load should not trust data forwarded from a store that stored invalid data. Okay. And invalid values are not used for prefetching or branch resolution. Again, this is a design choice; remember, this is a speculative processing mode, so you could say, oh, I'm going to predict the value and use these results, and if my prediction is reasonably accurate, maybe you trigger more cache misses that are correct. So how do you actually remove instructions from the window?
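The store-buffer side of this might look roughly as follows (my own sketch; the structures and names are invented, and the real fallback path to the data cache is omitted):

```python
# Store buffer entries carry an invalid bit, just like the register file.
store_buffer = {}   # addr -> (data, invalid)

def runahead_store(addr, data, invalid):
    # A runahead store never updates real memory, only the store buffer.
    store_buffer[addr] = (data, invalid)

def runahead_load(addr, cache_read):
    """Forward from the store buffer if the address is present.

    A load that forwards invalid data becomes invalid itself, so it is
    not used for prefetching or branch resolution."""
    if addr in store_buffer:
        return store_buffer[addr]
    return (cache_read(addr), False)   # normal speculative read

runahead_store(0xA0, None, True)       # store of a miss-dependent value
data, is_invalid = runahead_load(0xA0, lambda a: 0)
print(is_invalid)  # True: the dependent load must be marked invalid
```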
Retirement and exit from the window are also a bit different in this mode. Instructions leave the window via pseudo-retirement. I call it pseudo-retirement because it's not actually retirement: it does not update the architectural state, only the internal microarchitectural state, since at this point the real architectural state is the checkpoint. It does not update memory either. An invalid instruction is removed from the window immediately. A valid instruction, one that does not depend on a cache miss, is removed when it completes execution. And a valid instruction may become invalid after execution, because a load can itself miss in the cache. Once an instruction is pseudo-retired, it frees its allocated resources, and this allows subsequent instructions to be processed; that's how you remove instructions from the window. And if you pseudo-retire stores, you don't want to update memory, because updating memory changes the architectural state.
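The pseudo-retirement rule can be sketched as a toy loop (mine, not the lecture's; the window contents are invented):

```python
from collections import deque

# Each window entry: (name, is_invalid, cycles_left_to_complete).
window = deque([
    ("load_r1", True,  0),   # depends on the miss: invalid
    ("branch",  True,  0),   # invalid source: invalid
    ("add",     False, 1),   # independent, still executing
    ("mul",     False, 0),   # independent, already complete
])

def pseudo_retire_oldest():
    """Remove the oldest instruction if the pseudo-retirement rule allows.

    Invalid instructions leave immediately; valid ones leave only once
    they complete execution. Nothing updates architectural state."""
    name, is_invalid, cycles_left = window[0]
    if is_invalid or cycles_left == 0:
        window.popleft()
        return name
    return None   # oldest valid instruction has not completed yet

print(pseudo_retire_oldest())   # load_r1 (invalid: removed immediately)
print(pseudo_retire_oldest())   # branch  (invalid: removed immediately)
print(pseudo_retire_oldest())   # None    (add is valid but not done)
```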
Some other processor could consume that value; bad idea. But you might want to communicate store data to dependent loads in the same thread, so that hopefully you can get further along in runahead mode. You could do this any number of ways, but one way is to have a helper structure that enables data communication through memory during runahead mode. You don't want to update real memory, but say you have a store executing in runahead mode that stores some data to address A, and then a load that loads from address A. How do they communicate? You can do it through the store buffer, but what if the store has already left the store buffer? You are running far ahead, and the store buffer is bounded too; it's part of your instruction window resources. If the store exits the store buffer, you put the data into a small cache-like structure, indexed by address A, and when the load comes along, it checks not only the store buffer but also this structure. If it misses in the store buffer, it gets the data from this cache-like structure, called the runahead cache, whose sole purpose is data communication between stores and loads executing in runahead mode. That's the idea, basically. And again, because it's all purely speculative, it doesn't need to be always correct: it can be small, and it doesn't need to be as complex as a load-store queue, because it doesn't need to be correct. [Student: some instructions are independent, so when you restore your checkpoint, ideally you wouldn't re-execute them.] Yes, that's ideal, but then the question is how you keep their results around. Maybe you don't have them anymore.
Later on, remember, your physical registers are a limited resource too: every time you pseudo-retire or retire an instruction, you deallocate registers, and someone else reuses them. So you may not have all the values in the register file. But it's a very good point: you're doing all this work executing independent instructions, and when you come back, we re-execute everything. If we had saved the results somewhere, maybe we could avoid re-executing them, but that complicates the machine a bit. [Student question.] No, actually, that's a great question: you could also do this for any long-latency instruction. It depends. You could do it on in-order machines: if you have an in-order machine where even an L1 cache miss takes a long time, you could trigger it there. It's not specific to L2 misses. Okay, let's take a look at branch handling. Handling of branches is exactly the same as in normal mode, except that some data is not available: you have some invalid branches, which cannot be resolved if mispredicted. This is not a problem if the branch was predicted correctly; it stays on the correct path. The downside is that a mispredicted invalid branch causes the processor to stay on the wrong program path until the end of runahead, and you won't even know it during runahead execution, because you don't have the result available. Valid branches, however, are resolved, and they initiate misprediction recovery, same as normal execution. So invalid branches cannot be resolved, and this could cause problems: you're on the wrong path, and you may be prefetching the wrong things. Or you may be prefetching useful data on the wrong path, because the wrong path may reference data that
will be needed on the correct path. And in most cases this is actually true: many analyses of programs have shown that it is better to go down the wrong path and execute than not to, if you have to go down the wrong path, of course. Does that make sense? And could you determine this in simulation?
That's another beauty of simulation. You might ask: do I benefit from executing on the wrong path or not? If the machine has to go down the wrong path anyway, is it better to execute there, such that it triggers some cache misses that could later be useful on the correct path, than not to? How could you model this? Again, you are the architect: you design a simulator where you know exactly whether or not you are on the wrong path while you are simulating. In one run of the simulator you enable a knob that says, while I'm on the wrong path, I'm going to service all the cache misses that happen; in another run, the knob says, while I'm on the wrong path, I'm not going to service any of the cache misses. Does that make sense? That's the beauty of simulation. [Student: how does the simulator know?] Now you're getting into the internals of simulators, as an architect.
Okay, think back to earlier versions of your labs, the lab where you designed a functional simulator. Can you combine it with a timing simulator? A functional simulator tells you very quickly whether you're on the wrong path or the right path, because it actually executes the program. You can basically decouple the functional simulation from the timing simulation: the functional simulator runs first, figures out everything that happens for all the instructions, and stores it, and then the timing simulation is driven from that, because the timing simulation basically tells you how long each instruction takes. With the functional simulator you can easily figure out: I'm executing this branch, and it should be taken. Then you query the branch predictor you're trying to evaluate, and the stored prediction tells you it predicted not-taken at that moment.
Then you know the branch predictor you implemented is wrong, and since this is software, you can take note of that: oh, I'm actually on the wrong path. In the timing simulation you can model everything there, and you have a knob that says, while I'm on the wrong path, I'm not going to generate the cache misses. That's actually how many architectural simulators are designed nowadays: functional simulation first, so you have full information about what the program does, and then the timing on top. So how do you simulate perfect branch prediction, for example? Exactly the same way: you do the functional simulation, and instead of a branch predictor you always predict correctly, and you look at the resulting timing. That's how you can analyze a lot of the design decisions you make. That's a great question. Okay, we still have some time, so: this is the design of the runahead processor.
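A very reduced sketch of this functional-first structure (my own; the trace, the predictor, and the knob are all invented for illustration):

```python
# Functional trace, resolved ahead of time: (pc, branch_actually_taken).
trace = [(0x10, True), (0x20, False), (0x30, True)]

def always_taken(pc):
    """A deliberately weak predictor under study."""
    return True

def timing_sim(predictor, perfect=False, service_wrong_path_misses=True):
    """Drive the timing model from the functional trace.

    Because the functional outcome is known up front, the timing model
    knows immediately when the predictor puts it on the wrong path."""
    wrong_path_events = 0
    for pc, actual in trace:
        predicted = actual if perfect else predictor(pc)
        if predicted != actual:
            wrong_path_events += 1
            if service_wrong_path_misses:
                pass   # knob: model wrong-path prefetch effects here
    return wrong_path_events

print(timing_sim(always_taken))                # 1 misprediction observed
print(timing_sim(always_taken, perfect=True))  # 0: perfect-prediction knob
```

The `perfect=True` knob is exactly the "perfect branch prediction" experiment: the predictor is bypassed and the functional outcome is used directly.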
I'm not going to go through this; you have seen this picture before. But what do you need for this? You need checkpointing capability: basically a copy of the retirement register alias table, taken at the time the load is at the head of the reorder buffer, and you need invalid bits in the register file. You also need some other things, like the program counter, the global history register, and the return address stack, because you really want the context at the time this load would retire. And this is the runahead cache we talked about; its purpose is just memory communication during runahead mode, and it's not an architectural structure. Let's take a look at the pros and cons. The pros: this turns out to be very accurate for prefetching, because it actually follows the path of the program; instead of stalling, it just keeps going, except it does so speculatively. As a result you get very accurate prefetches. I call them prefetches because you will later re-execute the instructions: the first pre-execution, if you will, is a prefetch for the re-execution. It's easy to implement, because most of the hardware is already there; you're reusing the out-of-order machinery to do this prefetching. And we'll see, in the next lecture or two, other prefetching mechanisms based on pre-execution. You don't need to understand all of this now, but what we're really doing here is using the main thread's own context, whereas other execution-based prefetching mechanisms can spawn a thread on some other core, or on another thread context of a multithreaded processor, that does the prefetching for the main thread. In that case you need to construct that helper thread properly; here, in a sense, when you miss in the cache you are automatically
using the main thread of the program to prefetch for itself. Okay. There are a number of drawbacks and limitations.
If we had time, I'd ask you about these one by one, but I'll go through them quickly now. Extra hardware: the retirement register alias table already gives you a good checkpoint; otherwise, how would you build that checkpoint? I'll let you think about it. It's not that easy, because you normally don't have a checkpoint associated with a load. You can usually associate a checkpoint with branches, assuming you employ checkpointing as we've discussed, but when you fetch a load you don't take a checkpoint for it. If you wait until the load becomes the oldest instruction, the architectural state at that point is your checkpoint. That's true for any instruction, by the way: if you want the architectural state of the processor at any instruction, wait until that instruction becomes the oldest. The downsides: clearly, this executes more instructions. If you take a cache miss, it runs through the instruction stream once, then you come back and run it again; you could do this many times if you keep taking cache misses. You're also limited by branch prediction accuracy: if your branch predictor is not
accurate, you're in trouble. Remember, in runahead you're not resolving branches that depend on cache misses. But this is actually true for any large instruction window: if you're building a large instruction window, you'd better back it with a good branch predictor, because in the end you're fetching a lot of instructions, and if your branches aren't predicted correctly, you won't get much benefit. Another drawback: you cannot prefetch dependent cache misses. If the next cache miss depends on the previous one, you're out of luck, because you cannot even execute that load. But you can think about how to fix this problem: maybe you can predict the values or addresses, even approximately, such that you still fetch a useful cache line. Okay. Finally, the effectiveness is limited by the memory-level parallelism actually available in the program: you can enter runahead mode, keep processing, and find nothing. There may be no independent cache misses, and if the program does not have independent misses close to each other, this does not work. But it turns out that in many programs, when you miss in the cache, you're likely to get other cache misses nearby. People have looked at programs a lot and realized that misses are clustered: you get a burst of cache misses, then you compute, then you get more cache misses. It makes sense, right?
You're fetching a lot of data, operating on it, and then fetching more data, especially if you do good blocking, for example, in regular programs; if you block well, that's exactly how it behaves. Okay. And the prefetch distance is limited by the memory latency: the prefetch distance, in this case, is how far you can go in runahead mode, and you're limited by how long the memory latency is. There are some really good papers outlining how this was done in Sun's Rock, and there are several papers on IBM's POWER6 as well. But let's take a look at the performance of something like this.
I'd like to show some simulation results here. These are all simulations on the same processor model I showed you earlier, and these are again the sets of workloads Intel used to design processors at the time: SPEC benchmarks, some web workloads, multimedia, productivity, server, and workstation workloads. I think Verilog simulation is one of them, and car crash simulation, for example, which turns out to consume a lot of memory. You can basically simulate multiple different processor configurations to see the benefit of this. This bar is a processor with no prefetcher, which is really a terrible baseline. This one adds a stream prefetcher, which we'll talk about next lecture: basically, if you access A, A+1, A+2, A+3, A+4, it can prefetch that well, and it's actually smarter than that; it can capture many streaming patterns. So this shows you that adding a prefetcher gives you a lot of performance on average across all workloads, and this is in terms of micro-operations per cycle. And this is the performance you get if you only add runahead execution: it's better than the prefetcher alone. But the biggest benefit comes when you combine the prefetcher and runahead, and this 22 percent is on top of the baseline with the prefetcher.
Anyway, you can take a look at the paper. So how does runahead compare to large windows? This is the baseline: a 128-entry instruction window with the prefetcher. This is if you increase the window size, and this is if you triple it. With runahead added to the 128-entry window, you get most of the benefits of a triple-sized window. But if you look at the results, sometimes the big window is better and sometimes runahead is better, and hopefully you can figure out why. Let me ask you some questions. Which one can better tolerate FP operation latencies? Runahead can't do much with short latencies, at least the way I described it: if you have a 20-cycle floating-point instruction latency, the out-of-order window can tolerate that very well, because the window size is good enough for it. I showed you a 128-entry instruction window here; it's all a matter of how big your window is and how much latency it can tolerate. Runahead doesn't help: if you have a dependence chain through your floating-point operations, you never enter runahead mode, but a large instruction window can tolerate those latencies much better. So in that case a larger instruction window is a better idea than runahead. Which one leads to less wasted execution? I hope this is clear: a large window does not lead to as much wasted execution as runahead, because it doesn't go back and re-execute. But a large window does lead to some wasted execution. Why?
Because branch prediction accuracy is not good enough, right? In this respect a small window is actually more efficient, because you depend less on your branch prediction accuracy: with a limited window, you stop fetching before you get far past a mispredicted branch, whereas with a large window you fetch everything you can, and if there's a mispredicted branch you can't yet recover from, you execute many wrong-path instructions until you recover. So you waste a lot more execution than with a small window. Okay, you can imagine other questions here; it's a pros-and-cons analysis. Let's take a look at a couple more things before we're done today. Again, we can simulate in-order versus out-of-order: you have a simulator, you can configure it to execute in order, and adding runahead on top of that gives you some performance; it turns out to be pretty significant, about 40 percent. And you can configure the simulator to model out-of-order execution. Out-of-order actually buys you a lot compared to in-order; I haven't calculated it here, but it looks pretty big, so out-of-order execution gives much higher performance than in-order execution. But on top of that, if you add runahead, you still get something. And if you look at the relative benefits of runahead on in-order versus out-of-order processors, it's much more beneficial on in-order processors. Why? Because the in-order processor can't tolerate much latency. That's not exactly true: you can actually tweak it, and it is tweaked here, so that it doesn't stall on a lot of things, but it still has to stall on actual dependences. Because it can't tolerate much latency, adding another latency tolerance mechanism gives you
much more performance there, while out-of-order can already tolerate some latency; as a result, adding runahead doesn't give you as much performance as it does on top of in-order execution. Okay, and all of this is still in simulation. But this was actually built: Sun, which unfortunately doesn't exist anymore (it's Oracle now), took this idea into the Rock processor and implemented it; they called it scout threads at the time. This is one of the slides they had that I really like, because it shows design trade-offs that you can't get from RTL simulation. This is on commercial workloads, on their processor design: L2 cache size on the x-axis, normalized IPC on the y-axis. This curve is in-order execution without scouting, and this one adds the hardware scout, the runahead mechanism, on top. With a 512-kilobyte L2 cache, you get about a 40 percent performance improvement by adding runahead execution, which is actually very similar to the result we got. I was very happy when I saw it.
Someone else validated our simulation results with a totally different ISA and a totally different model. And you can actually trade things off: if you have this kind of latency tolerance mechanism, you can design a one-megabyte cache with runahead, and for the workloads they tested that gives the same performance as an eight-megabyte cache without runahead. You can actually save a lot of area, and at different design points you can save different amounts of area. It's really interesting: these are the kinds of trade-offs you can explore with a high-level simulator that you can't really explore in a low-level simulator. Does that make sense?
I think this is a very good place to stop. Any questions? Well, I'll see you, when is it, Wednesday.