
4. Assembly Language & Computer Architecture

May 11, 2024
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

Today we're going to talk about assembly language and computer architecture. It's interesting that nowadays most software courses don't bother to talk about these things, and the reason is that, as much as possible, people writing software have been insulated from performance considerations. But if you want to write fast code, you need to know what's going on underneath so you can exploit the strengths of the architecture, and the best interface we have to that is assembly language. So that's what we're going to talk about today.

When you take a particular piece of code like fib here and compile it, you run it through clang, as I'm sure you're familiar with at this point, and what it produces is binary machine language, which is what the computer is programmed in by the hardware. The machine looks at the bits as instructions, as opposed to data, and executes them; that's what happens when we run the program. This process is not one step. It's actually four stages: preprocessing, compiling — sorry for the redundancy, that's kind of a bad naming conflict, but that's what they call it — assembling, and linking. I want to take us through those stages. The first thing that happens is the preprocessing stage, and you can invoke it with clang manually.
You can say, for example, clang -E, and that will run the preprocessor and nothing else, and you can take a look at the output there and see how all your macros and such got expanded before the compilation proper happens. Then you compile it, and that produces assembly code. Assembly is a mnemonic structure of the machine code that makes it more human-readable than the machine code itself, and once again you can produce the assembly yourself with clang -S. Then, penultimately, you can assemble that assembly-language code to produce an object file, and since we like to have separate compilation — you don't have to compile everything as one big monolithic chunk — there's usually a linking stage to produce the final executable. For that we use ld; we're actually using the gold linker, but ld is the command that invokes it. So let's go through each of those steps and see what's going on. First, the preprocessing is really straightforward — it's just a textual substitution — so I'm not going to go through that.
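As a minimal sketch of those four stages — assuming a source file named fib.c with a simple recursive fib, which may differ from the exact code used in lecture — the manual invocations look something like this:

```c
// fib.c -- a hedged reconstruction of the running example; the course's actual code may differ.
#include <stdint.h>
#include <stdio.h>

int64_t fib(int64_t n) {
    return (n < 2) ? n : fib(n - 1) + fib(n - 2);
}

int main(void) {
    printf("%lld\n", (long long)fib(40));
    return 0;
}

// The four stages, driven manually through clang (file names are illustrative):
//   clang -E fib.c -o fib.i     // preprocess only: expand #include and macros
//   clang -S fib.i -o fib.s     // compile: produce x86-64 assembly
//   clang -c fib.s -o fib.o     // assemble: produce a binary object file
//   clang fib.o -o fib          // link (clang invokes ld or gold) to build the executable
```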
The next stage goes from the source code to assembly code. When we compile with -S we get this symbolic representation, and it looks something like this: we have some labels on the side, we have some operations, we can have some directives, and then we have a bunch of gibberish, which won't look like quite so much gibberish after you've played with it a bit, but it looks like gibberish to begin with. From there we assemble that assembly code, and that produces the binary. Once again you can invoke it by simply running clang; it recognizes that it doesn't have a C file or a C++ file — it says, oh goodness, I have an assembly-language file — and it produces the binary.
The other thing that turns out to be the case is that assembly and machine code are actually very similar in structure. Things like the opcodes — the things that are here in blue or purple, whatever that color is — correspond to specific bit patterns over here in the machine code, and these, the addresses and registers being operated on, the operands, correspond to other bit fields there. It's not exactly one-to-one, but it's pretty close to one-to-one, compared with C, where if you look at the binary it's very different. So one of the things you can do is, if you have the machine code — and especially if the machine code was produced with so-called debug symbols, that is, it was compiled with -g — you can use a program called objdump, which will produce a disassembly of the machine code. It tells you the human-readable mnemonic assembly code for the binary, and that's really useful, especially if you're trying to figure out what's going on.

So let's see why we bother looking at the assembly. Why would you want to look at the assembly of your program? Does anyone have some ideas? Yes — you can see whether certain optimizations were made or not. Other reasons? Okay, here are some. First, the assembly reveals what the compiler did and did not do, because you can see exactly what assembly is going to run as machine code. The second reason, which happens more often than you'd think, is: hey, guess what, the compiler is a piece of software, and it has bugs. So your code isn't working right — oh gosh, what's going wrong — maybe the compiler made an error. We have certainly found that, especially when you start using some of the less frequently used features of a compiler, you may discover that they're actually not that well tested. It may be, for example, that the bug only shows up when compiling with -O3, but if you compile with -O0 or -O1 everything works fine; then you say, gee, somewhere in the optimizations they got an optimization wrong. One of the first principles of optimization is do it right, and the second is make it fast, and sometimes the compiler doesn't get that right. It's also the case that sometimes you can't write code that produces the assembly you want, and in that case you can write the assembly by hand.
Now, it used to be, many, many years ago, that a lot of software was written in assembly. In fact, in my first job after college I spent about half my time programming in assembly language. It's not as bad as you might think, but it's certainly easier to have high-level languages, that's for sure — you can do much more, much faster. And the last reason is reverse engineering: you can figure out what a program does when you only have access to its binary. For example, with the matrix-multiplication example I gave on the first day, we had the general outer structure, but the inner loop couldn't match the Intel Math Kernel Library code. So what did we do? We didn't have their source, so we looked to see what they were doing, and said, oh, that's what they're doing — and then we were able to do it ourselves without having to get the source from them. We reverse-engineered what they were doing. So those are all good reasons.

Now, in this class we have some expectations. One thing is that assembly is complicated, and you don't need to memorize the manual — in fact, the manual is over a thousand pages long. But here's what we do expect of you: you should understand how a compiler implements various C linguistic constructs with x86 instructions, which is what we'll see in the next lecture, and you should be able to read x86 assembly language with the help of an architecture manual. On a quiz, for example, we would give you snippets or explain what opcodes are being used in case they're not obvious, but you need to understand enough to see what's really happening.
You should also understand the high-level performance implications of common assembly patterns — what it means, performance-wise, to do things a particular way. Some of them are pretty obvious: vector operations tend to be faster than doing the same thing with a bunch of scalar operations, for example. If you do write assembly, normally what we do is use compiler intrinsics — so-called built-ins — that let you use the assembly-language instructions from C, and you should be able to do that after we've covered this material. Writing assembly from scratch, if the situation demands it at some point in the future, is something we won't do in this class, but we hope you'd be in a position to do it afterwards — proficient enough that it wouldn't be impossible, that you could do it with a reasonable amount of effort.

So here's the rest of the lecture. First I'll talk about the x86-64 instruction set architecture, which is the one we're using on the cloud machines for this class; then I'll talk about floating-point and vector hardware; and then I'll do an overview of computer architecture. Now, all of this I'm doing from a software point of view — this is software performance engineering — and the reason we're doing it is so you can write code that better matches the hardware. To do that well, you could try to deal with things purely at a high level, but my experience is that if you really want to understand something, you want to understand it at the level you need and then one level below that. It's not that you'll necessarily use that lower level, but it gives you insight into why the layer above is the way it is and what's really going on. So that's what we're going to do: we're going to take a dive one level deeper than what you'll probably need to know in this class, so you have a solid basis of understanding. Does that make sense? That's part of my learning philosophy: go one level deeper, and then you can come back up.

So let's look at the ISA, which covers the syntax and semantics of assembly. There are four important concepts in an instruction set architecture: the notion of registers, the notion of instructions, the data types, and the memory addressing modes, and we'll go through them more or less one by one. Let's start with the registers. Registers are where the processor stores things, and there are a whole bunch of x86 registers — so many that you don't need to know most of them. The ones that are important are these.
First of all, there are the general-purpose registers, which are typically 64 bits wide, and there are a bunch of them. There's a so-called flags register, called RFLAGS, which keeps track of things like whether there was an overflow, whether the last arithmetic operation resulted in a zero, whether there was a carry out of the word, and so on. The next one is the instruction pointer: assembly language is organized as a sequence of instructions, and the hardware marches linearly through that sequence, one after the other, unless it encounters a conditional jump or an unconditional jump, in which case it branches to whatever location it's given — but for the most part it just runs straight through memory. Then there are some registers that were added quite late in the game, namely the SSE registers and the AVX registers, and these are vector registers. The xmm registers came in when they first did vectorization, and they're 128 bits. Then for AVX there are the ymm registers, which are 256 bits, and on the most recent processors — which we're not using this term — there's another level of AVX, AVX-512, that gives you 512-bit registers. We may touch it for the final game project, since it gives a bit more power, but for most of what you'll do you'll stick with the basics available on the AWS instances you've been using.

Now, x86-64 didn't start out as x86-64. It started out as x86, which was used for particular machines — the 8086 — that had a 16-bit word. Very short. How many things can you index with a 16-bit word? About 65,000 — 65,536 — that's how many bytes you can address, since this is byte addressing. So that's 64K bytes you can address. How could they build machines with that? Well, the answer is that that's how much memory was on the machine — you didn't have gigabytes. As Moore's law marched along and machines got more and more memory, the words had to be made wider in order to index it. Why not just make the word wide to begin with? Because if you build things wider than you need, it's too expensive: you can't get a memory that's big enough to need it, and if you built a wider word — say a 32-bit word — your processor would cost twice as much as the next guy's processor. So what they did was keep the word at the common size for as long as that matched the common memory size; then they had some growing pains and went to 32 bits, and from there they had more growing pains and went to 64.
Word size and address size are really two separate things, and in fact they did some strange things along the way: when they made the registers longer, they had the wider registers alias the old ones, so the lower bits of a register are exactly the same register as the shorter one that sits in that part.

Now, for the memory addressing modes. Direct memory addressing says use a particular memory location: you give a hexadecimal value, and the instruction will use the value stored at that place in memory. But to specify an address on a 64-bit machine takes 64 bits — eight bytes — whereas, for example, with a move-immediate a small constant like 172 fits in a byte. That takes a lot less storage, plus it comes directly from the instruction stream, and I avoid an access to memory, which is very expensive. How many cycles does it take if the value you're fetching from memory is not in cache, or in a register? How long does that typically take on machines these days? Yes — a couple hundred. A couple hundred cycles or more to fetch something from main memory. It's very slow — or rather, it's not that memory is slow, it's that the processors are very fast. And clearly, if you can keep things in registers, most register accesses take a single cycle. So we want to move things close to the processor, operate on them, push them back out, and while we're pulling other things out of memory we want other work to be going on; all the hardware is organized to make that kind of thing happen.

Now, of course, we do spend a lot of time fetching things from memory, and that's one of the reasons we use caching. Caching is really important, and we're going to spend a lot of time on how to get the most out of your caches. There's also indirect addressing, where instead of giving a location directly you say, go to some register, and the address you want is the value stored in that register. For example, register-indirect addressing here says move the contents of the memory location whose address is in %rax into %rdi — so if %rax held 172, I would take whatever is at location 172 and put it into %rdi. Register-indexed addressing says do the same thing, but add an offset while you're at it — so once again, if %rax was 172 and the offset is 172, this particular instruction would go to location 344 and fetch the value there. And then there's instruction-pointer-relative addressing, where instead of indexing off a general-purpose register you index off the instruction pointer. That usually happens in the code itself — for example, you can jump to a location some number of instructions forward or back from where you are. Usually you'll see it used with control flow, but sometimes the compiler puts some data into the instruction stream, and then the code can index off the instruction pointer to get at those values without having to dedicate another register to holding their address.
Okay, that usually happens in the code where the modern code is, for example, you can jump to where it is in the code and get instructions. Okay, you can jump down. number of instructions in the code, usually you will see that using with control is lonely because you are talking about things, but sometimes they put some data in the instruction stream and then they can index the instruction pointer to get those values ​​without having than append another register now the most general way is the base index scale shift direction Wow, okay, this is a move that has a constant plus three terms, okay, and this is the most complicated instruction supported, the mode refers to the address, whatever the the base is okay, so the base is a general purpose register in this case RDI and then add the index multiplied by the scale so that the scale is one two four eight, it is okay and then an offset which is that number in front, okay and this gives you very smooth indexing of things from a base pointer, so you'll often see this type of access when you access stack memory, okay because all you can say here is the base of my framework on the stack and now for anything I want to add, I'm going to increase it by a certain amount.
Okay, so once again, you'll get familiar with these using the manual. You don't need to memorize all of this, but you do need to understand that there are a lot of these fancy addressing modes. The jump instruction takes a label as its operand, which identifies a location in the code. Labels can be symbols — in other words, it can be a symbol you want to jump to, which might be the beginning of a function or a label generated at the top of a loop or whatever. They can be exact addresses — go to this specific place in the code. Or they can be relative addresses — jump to some place indexed off the instruction pointer, like I mentioned. And then an indirect jump takes as its operand an indirect address — there's a typo on the slide; it just takes one operand, which is an indirect address — so you can basically say, jump to wherever that register says, using whatever addressing mode you want.
Well, that's kind of an overview of the assembly language. Now let's take a look at some idioms. The xor opcode, xor A, B, computes the bitwise XOR of A and B; we saw XOR used as a trick in the bit-hacks lecture. So what does xor %rax, %rax do? Yes — it zeroes the register. Why? Because it's taking the XOR of %rax with itself, which is zero no matter what %rax held. So every time you see that and wonder, hey, what are they doing? — they're zeroing out the register, and that's actually faster and easier than loading a constant zero with a move-immediate. It saves bytes, because it ends up being a very short instruction — I don't remember exactly how many bytes that instruction is, but it's short.
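Here's a small hedged illustration in C (the output varies with compiler and flags), showing where this idiom typically comes from:

```c
#include <stdint.h>

// Returns zero.  With clang or gcc at -O1 and above on x86-64, the body is
// typically compiled to the idiom
//     xorl  %eax, %eax     // XOR the register with itself; writing the low
//                          // 32 bits zero-extends, clearing all of %rax
//     retq
// rather than a longer move-immediate of the constant 0.
uint64_t zero(void) {
    return 0;
}
```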
Here's another one. The test opcode, test A, B, computes the bitwise AND of A and B and discards the result, keeping only the effect on the flags register. So what is the test instruction good for? Look at these two examples. What does the first one do? It takes the bitwise AND of %rcx with itself and then jumps if the zero flag is set. The AND of %rcx with itself is zero exactly when %rcx holds the value 0, so it will jump to that label exactly when %rcx is zero; in all other cases the zero flag won't be set, and it falls through. So that's another idiom. What about the second one? That one is a conditional move — both of these are basically checking whether a register is zero and then doing something depending on the result. Those are just idioms you have to look at to figure out how they accomplish something in particular.
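As a hedged C-level illustration (whether a conditional move is actually used depends on the compiler and flags; the registers shown are the usual argument registers):

```c
#include <stddef.h>
#include <stdint.h>

// A null-pointer check like this typically compiles to the test idiom:
//     testq %rdi, %rdi     // AND the pointer with itself, setting only flags
//     je    <label>        // jump if it was zero, i.e. NULL
int64_t first_or_zero(const int64_t *p) {
    if (p == NULL) return 0;
    return *p;
}

// A simple select like this is often compiled branchlessly with a
// conditional move (cmovl / cmovge; the exact form varies) instead of a jump.
int64_t min64(int64_t a, int64_t b) {
    return (a < b) ? a : b;
}
```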
Well, here's another one. The ISA can include several no-op instructions, including nop, nop A — a no-op that takes an argument — and data16, which reserves two bytes for a no-op. So here's an assembly line that we found in some of our code: data16 data16 data16 nopw %cs:0x0(%rax,%rax,1). The nopw takes this argument with all this address-calculation stuff in it. So what do you think this is doing? What's the effect of this? By the way, all of these are no-ops. So what's the effect? Nothing at all — it doesn't even set the flags; it basically does nothing except advance the instruction pointer. So why did the compiler generate assembly with these idioms? Mainly to optimize alignment — for example, so that a function starts at the beginning of a cache line, and in fact there's a directive for that if you want all your functions to start on a cache-line boundary. You want the code, if it reaches that point, to just proceed through the no-ops and continue on through memory doing nothing. So mostly it's optimizing the layout of the instruction memory, and you'll see those things.
You just have to realize, oh, that's the compiler generating some no-ops. So that's our little excursion into x86 assembly language. Now I want to dive into floating-point and vector hardware, which will be the main part of the rest of the lecture, and then if there's any time at the end I have a bunch of other slides on how branch prediction works and a variety of other machine-level things. If we don't get to them, no problem — you can take a look at the slides, and there's also the architecture manual.

So, floating point. Scalar floating-point operations are accessed through a couple of different instruction sets. The floating-point story is interesting, because originally the 8086 didn't have a floating-point unit: floating point was done in software. Then they made a companion chip that did floating point, and then they started integrating it as miniaturization took hold. The SSE and AVX instructions do single- and double-precision scalar floating point — that is, floats and doubles. The x87 instructions — from the 8087, the coprocessor that was attached to the 8086, which is where they come from — support single-, double-, and extended-precision scalar floating-point arithmetic, including float, double, and long double, so you can actually get a wider result out of a multiplication if you use the x87 instruction set. They also include vector instructions, so you can multiply or add vectors there as well — there are all these places on the chip where you can decide to do one thing or another.
Compilers generally prefer the SSE instructions over the x87 instructions, because they're simpler to compile for and to optimize, and the SSE opcodes are similar to the normal x86 opcodes. They use the xmm registers and floating-point types, so you'll see things like movsd and so on, where the suffix tells you the data type — in this case it says it's a double-precision floating-point value, that is, a double. Once again they're using a two-letter suffix: the first letter says whether it's s for a scalar operation or p for a packed — that is, vector — operation, and the second letter says whether it's single or double precision.
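Here's a small hedged C example (compiler output varies; the mnemonics in the comments are the typical ones clang emits on x86-64 with SSE):

```c
// The suffix decodes as (s)calar|(p)acked followed by (s)ingle|(d)ouble.
double add_double(double a, double b) {
    return a + b;      // typically: addsd %xmm1, %xmm0   (scalar, double precision)
}

float add_float(float a, float b) {
    return a + b;      // typically: addss %xmm1, %xmm0   (scalar, single precision)
}
```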
So when you see one of these operations, you can decode it: oh, this is operating on a 64-bit value, or a 32-bit floating-point value, or on a vector of those values. Now, what about those vectors? When you start using packed representations — when you start using vectors — you have to understand a little bit about the vector units that are on these machines. The way a vector unit works is that the processor issues an instruction, and it issues it to all of the vector units. So, for example, a typical setup might have a vector unit with four lanes — each one is often called a lane — and k is the vector width. When the instruction is issued, it's issued to all the vector lanes, and each one operates on its own part of the vector register. The register can be thought of as one very wide register divided into several words, and when I say add two vectors together, it adds the k corresponding words together and stores them back into another vector register; in the example I just gave, k was four. The lanes each contain integer or floating-point arithmetic hardware, but the important thing is that they all operate in lockstep — it's not that one is going to do one thing and another is going to do something else; they all have to do exactly the same thing. The basic idea here is that for the price of one instruction, I can command a whole bunch of operations.
In general, vector instructions operate elementwise: you take the i-th element of one vector and operate on it with the i-th element of another vector, and all lanes perform exactly the same operation. Depending on the architecture, operands may need to be aligned — that is, the vectors must start at a multiple of the vector length in memory. On other architectures the vectors can sit unaligned in memory, and there's usually a performance difference between the two. If a machine supports only aligned vector operations and the compiler can't determine that the operands are aligned, your code will end up being executed in a scalar fashion. If the machine does support unaligned vector operations, it's generally still slower when the data isn't aligned than when it is, though some machines now have good performance on both — so it really depends on the machine. There are also architectures that support cross-lane operations, such as inserting or extracting subsets of vector elements, permutations, shuffles, scatter/gather, and that sort of thing.

x86 supports several vector instruction sets. As I mentioned, there's SSE, there's AVX, there's AVX2, and now there's AVX-512, sometimes called AVX3, which is not available on the Haswell machines we'll be using. In general we'll use AVX and AVX2, which extend the SSE instruction set by using wider registers and operating on more data per instruction: SSE uses the 128-bit xmm registers and two-operand instructions, whereas AVX can use the 256-bit ymm registers and can also take three operands, not just two — so you can say add A and B and store the result into C, instead of add A into B, clobbering B. Most of the opcodes are similar to the traditional ones, with minor differences. An SSE instruction basically looks like the traditional name — add, in this case — and then you can say do a packed add, a vector operation on packed data, and the prefix v says it's AVX. So if you see the v, go to the part of the manual about AVX; if you see the p saying it's packed data and there's no v, go to the SSE part.
If you see the peas saying it's packaged data, go to SSE if you don't have it. The V ok and the prefix P distinguish an introvert or an instruction you understood me. I tried to think about why P when distinguishing an integer is like a P is not a good mnemonic for an integer, okay? So also they do this aliasing trick again where ymm records are actually aliases of xmm records, okay so you can use both operations but you have to be careful what's happening so they just extended them and now of course with the avx-512 they made another extension to 512 bits, okay that's vectors. so you can use them explicitly, the compiler will vectorize them for you and this week's assignment will take you through some vectorization exercises.
It's actually a lot of fun. We're going over it at the staff meeting and it's a lot of fun. I think it is a very fun exercise. He introduced it last year for O, I hadn't or maybe two years ago, but either way it's fun for my definition of fun, which I hope is your definition of fun. Okay, now I want to talk generally about computer architecture and I'm not going to go through all of these slides like I say, but I want to start with them and give you an idea of ​​other things that happen in the processor that you should know about, so probably in six double- or for you.
I talked about a five-stage processor so no one remembers, okay. Five-stage processor there is an instruction fetch there is an instruction decode there is an execution then there is a memory addressing and then you write the values ​​back and this is done like a pipeline so you can do all of this in one thing, but then you would have a long clock cycle and you can only do one thing at a time, instead they stack them together so here's a diagram of blocks of the five-stage processor, we read the instructions from memory in the instruction fetch cycle and then decode them, basically, it takes a Look at what is the opcode, what are the addressing modes, etc., and find out what it actually has to do, then it performs the ALU operations and then reads and writes the data memory and then writes the results to registers, which is usually a common way. that these things go through a five-stage processor, by the way, this is very simplified, okay, if you can take six, eight, three, if you want to learn the truth, okay, I'll tell you, I won't tell you anything but The lies pious are fine for this lecture now, if you look at Intel Haswell, the machine we are using actually has between 14 and 19 pipeline stages, the 14 to 19 reflect the fact that there are different paths through it that take different amounts of time.
I also think it reflects a little bit that no one has published Intel's internal material, so maybe we're not sure if it's 14 to 19, but somewhere in that range is fine, but I think it's actually due to the different time periods that I will explain. What I want to do is that you have already seen the five-stage process. I want to talk about the difference between that and a modern processor by looking at various design features that we already talked about in vector or hardware. Next I want to talk about superscalar processing. of order execution and branch prediction is okay a little bit and when it's out of order I'll skip a bunch of that because it has to do with scoring which is really interesting and fun but also time consuming but it's really interesting and fun. that's what you learn in 6/8 2/3, so historically there are two ways that people make processors go faster by exploiting parallelism and by exploiting locality, ok, and parallelism there are instructions, well, we already did word level parallelism in bit tricks, but there is also instruction level parallelism, called multicore ILP vectorization and for locality, the main thing used there is caching.
I would also say the fact that it has a design with registers that also reflects locality because the way the processor wants to do things is to retrieve things from memory, it doesn't want to operate them in memory, that's very expensive, it wants to retrieve things in memory. memory, have enough there to be able to do some calculations, do a bunch of calculations and then put them back, okay, so this In the lecture, we're talking about ILP and vectorization, so let me talk about instruction level parallelism, so when you have, say, a five-stage process, you want to find opportunities to execute multiple instructions simultaneously, so that one instruction does one instruction. fish, then it does its decoding, so it takes five cycles for this instruction to complete, so ideally what you would like is that you can start instruction two in cycle two, instruction three in cycle three, and so on. successively and have five instructions once you enter the steady state have five instructions running all the time, that would be ideal, okay, where each one takes only one thing, so that was actually pretty good and that would improve performance despite that it could take a long time to complete an instruction. you can have many instructions in the pipeline at some point, that's fine, so each pipeline executes a different structure;
In practice, however, this is not what happens. In practice you find there are what are called pipeline stalls: when it comes time to execute an instruction, for some correctness reason it cannot be executed — it has to wait, and that's a stall in the pipeline. That's what you want to try to avoid, and the compiler tries to generate code that avoids stalls. Why do stalls happen? They happen because of what are called hazards. There are actually two notions of hazard — this one is a dependence hazard; the other is a race-condition hazard — but people call them both hazards, just as they call the second stage of compilation "compiling". It's like they ran out of words.

There are three types of hazards that can prevent an instruction from executing. First of all, there's what's called a structural hazard: two instructions try to use the same functional unit at the same time. If, for example, there's only one floating-point multiplier and two instructions try to use it at the same time, one of them has to wait. In a modern processor there are a lot of each of those units, but if you have k functional units and k+1 instructions want to use them, you're out of luck — one of them will have to wait. The second is a data hazard: this is when an instruction depends on the result of a previous instruction in the pipeline. One instruction is computing a value that it's going to put into, say, %rcx, and another instruction that comes later has to read %rcx, so it has to wait until that value has been written there before it can read it. That's a data hazard. And a control hazard is where you decide you need to make a jump and you can't execute the next instruction, because you don't know which way the jump is going to go. If you have a conditional jump, well, what's the next instruction after that jump? I don't know — so I have to wait to execute it; I can't just go ahead, because I don't know what the previous instruction decided.
Of these, we're going to talk mainly about data hazards. An instruction i can create a data hazard for a later instruction j because of a dependence between them. The first kind is called a true dependence, or read-after-write dependence: in this example, one instruction adds something into %rax, and the next instruction wants to read %rax, so the second instruction can't proceed — it has to stall — until the result of the first one is known. Another kind is called an anti-dependence, or write-after-read dependence: here I want to write into a register, but I have to wait until the previous instruction has read the old value, because otherwise I'd clobber that value before it gets read. And the final one is an output dependence, or write-after-write dependence, where two instructions are both trying to move something into %rax. Why would two instructions want to write to the same location, when one of the results is just going to get overwritten? Yes — maybe because the first one also wants to set some flags, in addition to moving its result to that location; that's one reason it might do this. And there's another reason — what's the other reason? I'm drawing a blank; there are two reasons and I didn't put them in my notes, but it's a good quiz question anyway. Oh, I remember the other reason: there may be aliasing going on — maybe an intervening instruction uses one of the values through an alias, as part of its result or whatever — so there can still be a dependence.
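To connect this to performance at the C level, here's a small hedged illustration (not from the lecture): a single accumulator creates a chain of read-after-write dependences, while splitting the work into two accumulators exposes more instruction-level parallelism. Whether and how much this helps depends on the compiler and the machine.

```c
#include <stddef.h>

// Every addition must wait for the previous one: a read-after-write (true)
// dependence on `sum` in every iteration.
double sum_serial(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

// Two independent dependence chains can be in flight at once, which a
// pipelined, superscalar machine can overlap.  (Assumes n is even;
// floating-point rounding may differ slightly from the serial version.)
double sum_two_chains(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    return s0 + s1;
}
```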
Anyway, some arithmetic operations are complex to implement in hardware and have long latencies. Here are some sample opcodes and roughly how many cycles of latency they take — they differ: integer division, for example, is actually variable-latency, integer multiplication takes about three times as long as most other integer operations, and floating-point multiplication is around five cycles. Then there's FMA. What's FMA? Fused multiply-add — that's where you do a multiply and an add as a single operation. And why do we care about fused multiply-add? Not for memory addressing — this is a floating-point multiply and add — it's because of linear algebra: when you do matrix multiplication, you're doing dot products, multiplies and adds, and that's where you do a lot of those.
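As a hedged C sketch (whether the compiler actually emits a vfmadd instruction depends on flags such as -mfma and -ffp-contract, and on the target machine):

```c
#include <math.h>

// A dot product is a chain of multiply-adds.  On Haswell and later, with FMA
// enabled, each a[i]*b[i] + acc can be issued as one fused multiply-add
// (a vfmadd instruction) instead of a separate multiply and add.
double dot(const double *a, const double *b, long n) {
    double acc = 0.0;
    for (long i = 0; i < n; i++)
        acc = fma(a[i], b[i], acc);   // fma() from <math.h>: a*b + c with a single rounding
    return acc;
}
```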
So how does the hardware accommodate these complex operations? The strategy that a lot of hardware tends to use is to have separate functional units for the complex operations, such as floating-point arithmetic; in fact, there can even be separate registers, such as the xmm registers, that only work with floating point. So you have your basic five-stage pipeline, and then you have another pipeline off to the side that may take multiple cycles for its operations and may be pipelined to a different depth. You separate out these functional units, and they may be fully pipelined, partially pipelined, or not pipelined at all. So now there are a bunch of different functional units and different paths through the processor's data path; on Haswell, the integer, floating-point, and vector units are spread across eight different ports, which form, in effect, the entrance to the execution hardware.

Things get very complicated, so let's go back to our simple diagram. Say we have all these additional functional units: how can I now exploit more parallelism at the instruction level? Right now we can start only one operation at a time — what could we do to get more parallelism out of the hardware we have? What do you think the computer architects did? Yes — even simpler than that, but that's implicit in what you're saying: you can just issue multiple instructions per cycle. Instead of issuing one per cycle, as in the basic pipelined processor, fetch several that will use different parts of the pipeline, because they won't interfere with each other — keep everything busy. That's basically what's called a superscalar processor: it doesn't execute one instruction at a time, it executes multiple instructions at a time. In effect, it breaks each instruction up into simpler operations called micro-ops, and it can issue multiple micro-ops per cycle to the rest of the pipeline; the fetch and decode stages also implement optimizations on micro-op processing, including special cases for common patterns — for example, if it sees the xor of %rax with %rax, it knows it's zeroing the register and can handle that specially. These are the machines you're using, and that's why I said earlier that if you save one add instruction, it probably won't make any difference on a current processor, because there's probably an idle adder sitting around anyway. If you look at the Haswell diagram, you can see there are actually a whole bunch of ALUs capable of doing an add — they're all over the map.

Now, so far we've still insisted that the processor issue instructions in order, and that's the next step: how do you arrange things so you can free yourself from the tyranny of executing one instruction after another? The first idea is a strategy called bypassing. Say you have an instruction that writes a value into %rax, and the next instruction reads %rax. Why bother waiting for the value to be stored into the register file and then pulled back out for the second instruction? Instead, let's have a bypass — a special circuit that identifies that kind of situation and forwards the result directly to the next instruction, without it having to go into the register file and come back out.
Let's take a big code example given the amount of time I'm going to do is basically say you can go. Go through and find out what the read after write dependencies are and the write after read dependency is there everywhere and what you can do is if you look at what are the dependencies that I made and that I just showed. through you can find out, oh, there are all these things, each one right now has to wait for the previous one before it can start and, but there are some, for example, this one, the first one is just an issue order, no can start the second B if it's so you can't start the second until you've started the first, it's fine that the first stage has finished, but the other thing here is that there is a data dependency between the second and third if statement you look at the second and third instructions.
We're both using xmm, so we're handicapped, so one of the questions is: why not do it a little better by looking at this as a graph and figuring out what's the best way to traverse the graph? of tricks you can do there, I'll go over them here real quick, okay, and you can take a look at this, okay, you may find that some of these dependencies are not a real dependency and, as long as you're willing to run things out of order and keep track of that, that's perfectly fine, if it doesn't really depend on them, then just go ahead and run it and then you can move things forward and then the other trick you can use is what's called renaming the registry if you have a destination to be read from sorry if it has a if it has a if I want to read from something, but if I want to write to something, but I have to wait for something else to be read, they are fine, they write after the read dependency , then what I can do is just rename the record so I have something to write to that is the same thing. and there's a very complex mechanism called a marker that does that so anyway you can take a look at all these tricks and then the last thing I want so this is the part I was going to skip and in fact I don't .
You don't need to know the details of that part — renaming and reordering — but it's there if you're interested. The last thing I want to mention is branch prediction. When you have a branch, the outcome can create a hazard, because the result is known too late. In that case, what the hardware does is called speculative execution, which you've probably heard of. Basically, it guesses the outcome of the branch and executes ahead — if the branch is assumed taken, it goes ahead and executes as if it were taken; otherwise it executes straight through. If you're right, everything is fine; if you're wrong, it costs you, because you have to undo that speculative computation, and the effect is like a stall. You don't want that to happen: a mispredicted branch costs something like 15 to 20 cycles. Most machines use a branch predictor to decide whether a branch will be taken or not.
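As a hedged illustration of why that misprediction cost matters (not an example from the lecture), a data-dependent branch that the predictor cannot learn tends to make a loop measurably slower than the same loop running on predictable, e.g. sorted, data:

```c
// Sums the elements greater than 128.  On random data the branch in the loop
// is taken essentially at random, so the predictor is wrong roughly half the
// time, and each miss costs on the order of 15-20 cycles; on sorted data the
// same branch becomes highly predictable and the loop typically runs
// noticeably faster.  (A branchless form such as sum += (a[i] > 128) * a[i]
// sidesteps the issue; compilers sometimes do this for you with a cmov.)
long sum_big(const unsigned char *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (a[i] > 128)
            sum += a[i];
    }
    return sum;
}
```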
There are some slides here on how a branch predictor decides whether a branch is likely to be taken or not, and you can take a look at those on your own. Sorry to rush a little at the end — I knew I wasn't going to get through all of this, but it's in the notes and on the slides when we post them, and it really is interesting stuff. Again, remember that I'm covering this at a level below what you actually need, but it's really helpful to understand that layer so you have a deep understanding of why certain software optimizations work and others don't. Okay — good luck finishing your project.
