
An Intro to GPU Architecture and Programming Models | Tim Warburton, Virginia Tech

Jun 09, 2021
I was here yesterday for the end of the OpenMP session, so I hope you have some energy today, because that was a marathon and in some ways it reflects the complexity of modern OpenMP, which can do almost anything you ask it to do. I'm going to talk about something a lot more focused and specialized, aimed at graphics processing units. I'll give you a little story about myself so you have an idea of where I'm coming from, which is a little different from some of the previous speakers and some of the speakers you will see. I am an applied mathematician by training; my PhD advisor was actually an engineer, so as a graduate student in applied mathematics I worked for an engineer, and over time I have become more and more applied and more computational. I will give you the perspective of a computational scientist, not that of someone from a vendor. I am not a representative of any vendor, and I am not trying to sell you anything.
I have no personal interest in any product. I'm not going to tell you that, for commercial reasons, OpenMP is better than OpenACC. I'm not going to tell you that NVIDIA or CUDA is better than anything else. I'm simply going to give you the relative merits, advantages, and disadvantages of using these different programming models. I'm also going to try to debunk some myths about GPUs. You may have heard some propaganda, and I'm going to try to clear some of it up right away: that your codes will be a hundred times faster, that the OpenACC pixies will visit your code, you will run it, and it will magically be a hundred times faster. Frankly, that's not going to happen.
I'm going to delve into some of the history and give you a historical perspective on how GPUs have evolved over the last few years, and talk briefly about where they're going and why they're important. Frankly, your laptops, your desktops, your servers, your large clusters, your supercomputers already have very competent central processing units; on-node performance is now reaching a teraflop without the GPU. So why do we care about the GPU? What is it that keeps us looking at this type of device? I'm going to give you a brief introduction to NVIDIA's single-vendor solution for GPU programming, CUDA, because it's the easiest way to get started with GPUs and actually get your hands dirty, so you can understand a little more about the architecture and how you should interact with it to get the best performance. Then we'll talk about best practices for GPU programming.
I have some hands-on exercises and I hope they work. GPU programming is a bit complicated — not very complicated, but it is detail oriented — so I'm afraid the practice sessions are quite prescriptive, because if I just say "go and code this in CUDA" you probably won't get very far. So I have examples that will guide you through GPU programming. I've included short slides on CUDA optimization — optimizing code specifically for GPUs — but I'm not going to cover those; I kept them for completeness, partly because, with each generation of NVIDIA and AMD GPUs, the details of how to program them become less and less important. As the hardware matures and the compilers mature, it becomes easier to get good performance. However, if you are very serious, you need to go and look at the details of the architecture and really map your algorithms very carefully onto the particular architecture, looking at the memory hierarchy and exploiting every resource that is available. Then I'm going to take a little detour and talk about portability. If you choose to program in CUDA, you are beholden to NVIDIA's future: if they are bought, or go out of business, suddenly your programming solution is no longer available. Generally speaking, if you have a competent computational science code — a modeling code, a simulation code, an academic code for example — it costs single-digit millions to write: one, two, three million dollars if you add up all the time and effort that goes into these codes. If you look at a larger platform, we're talking about at least ten to twenty million dollars to write one of these codes, so you're making a pretty big bet on NVIDIA remaining viable as an independent company that can keep providing the CUDA solution. So I'll talk about some ways you can hedge your bets and write code that can target multiple platforms — CPU and GPU — so you're not as vulnerable to changes in the market and changes in NVIDIA's direction. And I have a fairly light but fun example; perhaps we are all familiar with the venerable Flying Spaghetti Monster.
Yes, so what we'll do is write a very simple lattice Boltzmann flow code that will generate flow simulations. What we need you to do is find an image file in PNG format with a white background — whatever you want, keep it family friendly — and we'll drop it into a portable flow solver and you can generate movies like this. So during the break, please get hold of some PNG file (it has to be PNG), and we can generate movies like this and switch between different computation modes: it is able to compute using CUDA, OpenCL, or OpenMP — your choice, whatever is available on the system. And of course it goes without saying that if you have any questions, please ask, just shout.
I'll do whatever I need to do to get your attention. So, I promised to tell you a little about myself and my research. I'm a mathematician, as I mentioned, so I spend an inordinate amount of my time worrying about functional analysis, approximation theory, and finite element discretizations. I spend a fair amount of time thinking about numerical methods, and then thinking about the theory needed to prove something about them, but I don't stop there either: I go from the description of the numerical methods and the physical models to implementing them, and now, because of the power available in these accelerators, these GPUs, everything we do has to target this type of architecture — we'll talk about that later — and then we are very concerned about scalability.
My team is part of an exascale co-design center that is a marriage between Livermore and Argonne, and my team at Virginia Tech is one of the handful of university partners. We really care about targeting next-generation leadership facilities, so we care about everything from basic approximation theory to scalability, and these are not really separate concerns: every part of the stack has to be tuned so that we can target these large-scale parallel devices at these large leadership facilities. Some of the applications we've looked at include electromagnetics and flow modeling. This one is interesting: we worked with some researchers at MD Anderson Cancer Center on modeling laser-based thermal therapy to kill cancer, which is really quite daunting.
In fact, I have slides — I should probably give trigger warnings before showing them, but I haven't included them here. Basically, they are photographs of patients being placed in an MRI. The surgeon will drill a hole in the patient's head, literally using a hand drill, insert a fiber optic catheter into the patient's brain, and shine a laser through it. It's basically like an Easy-Bake Oven: the tumor is baked until it dies. We incorporate quantitative models into that process so we can decide what power the laser should use, how long it should be on, and where the laser should be placed.
It seems like a very simple question. This team at MD Anderson took commercially available finite element codes, set the model up correctly, assembled the system, and started solving, and it took 12 hours to compute one instance for a patient — one instance of one therapy for one patient. But if you're going to do a design optimization for the therapy, you need hundreds of instances of different potential treatments, so there's no way you can do it that way, especially since you want to do it live: you're getting live data from the MRI and you really want a feedback loop to decide online how to optimize this treatment. So we achieved that by using GPUs.
Using GPUs — we're talking about a workstation that could sit in the booth next to the treatment room — we reduced this to 17 seconds. It's not just speeding up the finite element solvers; a lot goes into that speedup, but it was key to use the most powerful workstations we could get. We've also done some more classic things: tsunami and flow modeling, impact modeling, gas kinetics modeling, all sorts of things. So I'm a little different from some of the previous speakers, because my goal is this: I'm actually not very interested in the minutiae of spin locks and other esoteric things.
I can do what I need with OpenMP or MPI. I'm just interested in making sure my finite element modeling tools run as fast as possible on the biggest cluster we can get our hands on. The slides should be available to you; there is a GitHub repository that you can access, and there is another repository that we will need later — we'll come back to that. Okay, so, reality check. Myth number one: how many people have heard that GPUs will make your code 100 times faster, or have seen results that claim the GPU code is 100 times faster than the CPU code?
Have you seen this? Yes, it's prevalent. The best I saw was a claim of a thousand times faster — I'm sure they can edit that out later, but that's okay. I chose the two most expensive GPU and CPU I could find: a very serious P100 Pascal-class GPU from NVIDIA and a very expensive server chip from Intel. You can see their floating point specifications: the thing on the left can get ten teraflops in single precision, about half that in double precision. The CPU is actually a monstrous thing too, with a respectable number of gigaflops, but that doesn't matter — none of it matters for most applications.
I've talked to several people around here, and it doesn't matter for most of the applications people in this room are working on. What matters for most applications is the available memory bandwidth between the processor and its memory pool. The GPU can reach seven hundred and thirty-two gigabytes per second — I've seen different numbers depending on which website you look at — and the CPU can reach about eighty-five gigabytes per second. So it would be unreasonable to expect more than basically a factor of nine difference between GPU and CPU performance. You can forget about all the other impressive numbers they give you; just look at the relative rates at which you can stream data through these processing units per unit time. So now you have to recalibrate your expectations: you write your code — what should you expect? Nine times faster.
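For reference, the back-of-the-envelope bound he is using, taking the peak bandwidth figures quoted above at face value:

\[ \frac{732\ \text{GB/s (GPU)}}{85\ \text{GB/s (CPU)}} \approx 8.6 \approx 9 \]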
Is it worth the effort required to convert your code to the GPU to make it nine times faster? What might happen is that you do that port and the code ends up a hundred times faster — but that tells you something. What does it tell you? That your CPU code wasn't very good to begin with. That's the real essence of why people report thousand-times speedups: it doesn't even mean the GPU code was that good. You could have really bad CPU code and a passable-to-mediocre GPU implementation, and the ratio could still look great, right? But if you have properly tuned CPU code and you write properly tuned GPU code, you'll see maybe a factor of nine at best.
Now, these are peak bandwidths, and it is usually very difficult to reach peak bandwidth; you will get some fraction of it. On the P100 we have a hard time getting much more than 50 to 60 percent of peak bandwidth; on the CPU we can get close to peak — there are reasons for that to do with the cache hierarchy. So at 60% of 732 versus 85, more realistically we can expect about a factor of 5 difference between your CPU code and your GPU code. So adjust all your expectations: you really shouldn't expect a hundred-times speedup; it will actually be closer to 5. The next myth: GPUs are expensive. Well, they are if you buy pro-class, server-class GPUs — that's what's on the left, $5,000 at least. For what's on the right, though, I rigged the game a bit: I found the most expensive Intel CPU I could, which is a $9,000 part. If you take one of their slightly cheaper models with less on-board cache and compare it against one of the slightly cheaper GPUs, like a $1,500 card, then the pricing is very carefully calibrated so that the dollars you pay for the GPU are about the same as you would pay for equivalent CPU performance. So a device will be four or five times faster, but it could cost you four to five times more. So the question is: what are we doing here? Why not just use more CPUs? Well, thinking about GPUs has some advantages, and we'll talk about them. Myth number three: CPUs and GPUs are very different. I have die shots here, one of the P100 and one of that monstrous Intel CPU, and they look like flyover country when you go over in a plane and see the fields. But if we look at the real architecture, they're really not that different. The GPU has 56 of what NVIDIA's marketing speak calls streaming multiprocessors, by which they mean cores, and each core has — I think it's two or four — SIMD groups, each thirty-two wide; they call it SIMT, which you can simply think of as SIMD units, so it's basically 32 vector lanes wide. (They told me I should use the pointer so this gets captured — I was getting a stern look from up there.) The Intel chip at hand has 24 cores with 256-bit-wide AVX2 instructions, so the CPU also has wide vector instructions — not as wide, but in general the architecture is really not that different: fifty-six cores on the left, 24 cores on the right; four 32-wide vector groups, which multiplied by 32 bits is a wider vector unit — roughly 1024 bits versus 256 bits — but unit for unit they're really not that different. There are some differences, though. Myth number four: OpenACC is magic. You've had the OpenACC lecture now, and you've had the hands-on, and essentially the suggestion is that you can add directives to your code and it will magically, completely exploit the GPU.
My actual experience and observation with OpenACC is, first of all, that the compilers are not mature yet. My favorite compiler bug is "the compiler experienced a catastrophic error," which is very helpful — you really don't know what to do after that. So I'm not a big fan of OpenACC, but not for the reasons Tim talked about yesterday; he was describing a territorial dispute, the OpenACC committee versus the OpenMP committee. I'm talking more about programmability: I have to have a directive to move data in one part of the code, and a directive that makes the operations I want happen on the GPU in a different part of the code, and I have to coordinate those things across a massive code base, and I have to remember that those directives are mere suggestions and the compiler can choose to ignore them at any time. I am a fan of control, and it doesn't give me enough control. The best OpenACC codes usually come from re-engineering existing CUDA codes, where you take a CUDA code that already has the kind of offload structure we'll talk about later and simply change the syntax — awkwardly programming CUDA, but using OpenACC directives. Those are the codes that have been shown to have reasonable performance. There are some exceptions; one thing was kind of interesting: when I taught OpenACC to some college students it was a bit of a disaster due to compiler problems, but there was one case where OpenACC did a better job than my native CUDA programming would, and that was a reduction, because you're basically saying "do a reduction" via a directive.
OpenACC can then reach into its bag of tricks, which includes prepackaged reduction operations, and produce an efficient reduction that I couldn't match, because the people behind OpenACC at PGI had done a better job on that implementation. Myth number 4.1: CUDA is magic. Even if you do your best to write your CUDA code, there is no guarantee that you will achieve maximum performance. It can take a lot of psychoanalysis of the CUDA compilers and the architecture to get the best performance, so it takes more than three hours to master GPUs. We will discuss some of the basics.
There are many web resources, and nothing beats practice. This adds some background to what Nicolai talked about, so let's talk about that. Any questions so far? And remember, this is all opinion — my opinion; your mileage may vary based on your personal experience. I do have real numbers — a ridiculously high number of flops required per byte — but I'll get to that; that's a good question. Anything else? So let's talk about this at the 10,000-foot level. The CPU: what was the main driver of CPU design originally? It was the supremacy of a single thread.
The thread must be uninterrupted and run with the highest possible performance. This comes from the days when you had single-core, single-thread machines, so the goal was to make individual threads very fast: reduce latency through large caches, use prediction and speculation for the instruction stream, and protect against branches in the instruction stream. If you look at an abstract representation of a CPU architecture, not only do you have an instruction fetch and decode unit, you also have a good amount of silicon dedicated to out-of-order control logic and branch prediction logic, a lot of silicon dedicated to memory prefetching and caching, and maybe a limited number of arithmetic logic units that actually execute floating point. It was all about making sure one thread didn't stall and achieved high performance.
The original GPU application, rendering games, was basically for teenagers: as many polygons as fast as possible at the highest frame rate. It is a very competitive industry — you had ATI and NVIDIA competing in the GPU space. The rendering process is, in a sense, very simple: for every pixel you can see, we have to decide on a red-green-blue intensity for that little block on the screen, and it is hugely parallel, because the red, green, and blue intensities for this pixel here versus that pixel there can be chosen independently. So on a thousand-by-thousand screen you have the possibility of performing a million operations simultaneously. If you get close to the screen you can obviously see the pixels whose colors we need to choose; we must do it at 60 frames per second, and we must do it on a modern screen with 4K resolution. If we take 4K times 2K, that's about 8 million pixels, at 60 frames per second — that's a lot, it's like half a billion pixel decisions you have to make every second, so you need some serious parallel processing behind the scenes. Here's a more modern example from Fallout 4: this screen is a rendered set of triangles, each triangle has an image — a texture — associated with it, and when we as viewers need to decide what color a pixel is going to be, there's some projective geometry, some rotations, some interpolation required. So there's a huge amount of floating point arithmetic that needs to be done to decide how those pixels should be colored.
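For reference, the arithmetic behind that "half a billion" figure, assuming a 3840 x 2160 display:

\[ 3840 \times 2160 \times 60\ \text{frames/s} \approx 5 \times 10^{8}\ \text{pixel decisions per second} \]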
So what are the design goals? Throughput matters; individual threads don't. You can think of a CPU as the embodiment of capitalism — it's all about the individual, a single consumer who can keep consuming without interruption. The GPU is more like communism, where the collective is what matters: all the pixels need to be colored 60 times per second. So throughput matters, individual threads don't, and it doesn't matter in which order you color those pixels, as long as they are all colored within one-sixtieth of a second — that's the game, you're done. They recognized that you have to deal with memory latency, and you can hide it with bulk
parallelism, so we oversubscribe the processing elements of the GPU because we have more threads than processors. In essence, one of the best things NVIDIA did was let the programmer deal with the raw storage hierarchy. The early GPUs were not programming-friendly chips: they had very little cache, and the cache they did have was managed by the programmer. When you program the CPU you don't actually interact directly with the caches; on the GPU they were originally exposed, so you decided when to fetch data into the cache — you did that management yourself as a programmer. And they didn't fall into the same trap as AMD and Intel did with CPU clock speed: they didn't keep increasing the clock frequency of their chips just to get better performance, because you run into thermal issues. Instead they made them more parallel rather than faster. So here is my first example of a GPU.
I like this one because almost the entire area of this chip is processing: those are all vector units calculating those pixel intensities. Even the cache does arithmetic: when you ask for a texture value and the sample point falls between the image pixels of a triangle, it will interpolate — do that arithmetic — in the texture cache. So everything you see here is a worker; it's the ideal communist state: all workers, very little management, and everyone striving towards the same goal. Next to it we have an early floor plan of a vintage single-core CPU from 2008, and almost everything here is data management, cache and instruction management; the only thing really doing the work is the floating point units at the bottom left. In terms of HPC workloads, the only worker is that oversubscribed set of floating point units at the bottom left. I like to think of this as the university system: you have the president, the provost, the vice chancellor, the deans, the associate deans, the department heads, the faculty — and who does the work? The graduate students. So how did they get there?
They stripped out the management. In the GPU core they got rid of the out-of-order instruction logic, the branch predictors, and the prefetch units; they got rid of the big cache; they reduced it to just an instruction unit, the arithmetic logic units doing the arithmetic, and some registers to hold that data. Then, since we have to do this on a bunch of pixels, we have a bunch of these lightweight cores. And of course, the GPU rendering work on this pixel here and that pixel there uses the same sequence of operations — it's all the same projective geometry — so instead of having, say, 16 independent instruction streams, let's unify those instruction streams. In this figure I only have 8, because that's what fit in my diagram: one instruction stream, one instruction fetch and decode unit, feeding 8 arithmetic logic units, which share some execution context — registers, for example. Now, because we have to deal with memory hierarchies, you'll end up having threads stall — instruction streams stall — when you try to feed the ALUs, because you can pump operations through pretty quickly. So
we keep multiple contexts, which means large register files: you can have the data sets for several threads resident on the core at once. And because it's a lightweight core, you can then duplicate those cores and have multiple cores per GPU. It's more like a CPU than you might think — we'll get to the main difference — but essentially every operation on the GPU core is a parallel operation. On a CPU core the compiler has to decide, and it really has to be persuaded, that an operation is not just a serial operation but one that can be vectorized. In the GPU kernel, every operation is vectorized.
There is no alternative: there is only one instruction stream per core, and everything is pushed through the vector units. So that's an abstract model; here is a real one, a Maxwell, which is not the current GPU but the previous generation, and you can see the structure I sketched is not too far off. Each of these things is a core, and I have an enlargement of one of them: this is the core, and here is a blow-up of it. NVIDIA would like you to think that each of the arithmetic logic units in that core is a core — that's marketing speak, because then they can say they have 3,000 cores in a GPU, which sounds a lot better than Intel's claim of 16 cores. What they actually have, in this case, is four SIMD vector units, so it's like a core with four sub-cores, okay?
To answer the earlier question: in this case we have 16 cores, each with four SIMD groups of 32 lanes, the memory system can stream 56 gigafloats per second, and the peak is 4.6 teraflops. So if you want to hit peak performance, take 4.6 teraflops and divide it by 56 gigafloats per second, and that tells you how many operations you have to do per float loaded or stored — and that's about 80 or so. I can't do that arithmetic in my head — I'm a very slow single-core processor — but it basically tells you that if you want to hit those peak numbers, you're going to have to do a lot of arithmetic every time you load or store a float, which you generally can't. On the other hand, this thing can stream 56 gigafloats per second, which is a lot more than most CPUs can stream, and since our operations are mostly bandwidth limited, we care about that peak streaming rate, not the flop rate.
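For reference, the arithmetic-intensity calculation he is gesturing at, using the numbers quoted for this part:

\[ \frac{4.6\ \text{Tflop/s}}{56\ \text{Gfloat/s}} \approx 82\ \text{flops per float loaded or stored} \]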
So let me emphasize the fundamental difference between a CPU and a GPU. I said they are not very different, but there is one striking difference. This is a diagram of the execution unit inside an Intel core. It has many different ports, and the point is that it has both vector operations and scalar operations, and you have to persuade the compiler, by hook or by crook, that an operation should be a vector operation. The operation has to satisfy some criteria — basically, that the data you are using is aligned in memory in a certain way, which is a pretty big restriction — before it is eligible to be handled as a vector operation, because there are several ways the compiler could perform it. The compiler will usually be very cautious, so you really have to force it, through intrinsics, through directives, or even at the assembler level; you have to be really firm with the compiler to persuade it to emit a vector operation. On the GPU, one of the reasons it might be easier to get a speedup is that you don't have to be so forceful with the compiler, because by default all operations are vector operations — you have no choice; there are no scalar processing units in the GPU core. The other difference is the number of registers available to the core: the CPU core doesn't have many, on the order of 100, while the GPU core has thousands. Now, 16,000 registers seems like a lot for a core, but divide it by the number of arithmetic logic units and you'll find the number of registers per lane is not that big. And there's another issue: because we have to hide the latency of the GPU core's communication with GPU memory, we want not just, in this case, 128 resident threads, but a multiple of that resident, so we can context-switch between threads that have their data and threads that are still waiting for data. So 16K sounds like a lot, but it's not as much as we'd like. Just to summarize: the GPU has multiple cores; each core has one or more 32-wide SIMD vector units; the SIMD units execute a single instruction stream; it has a shared memory pool, which I haven't mentioned yet — like a scratchpad that all the threads resident on a single core can access; it shares a register file privately among all of its arithmetic logic units; and it is capable of fast context switching, in single-digit cycles, swapping stalled threads out for
other threads that are ready. Because it is essentially a vector processor at its core, branching involves serialization, so there are some disadvantages to this basic model where everything is a vector operation. Well, that's the architecture. Any questions about that? Yes — do you see any convergence? They are heading towards the same point, absolutely, and we are converging; the number of fabs able to produce chips of successive generations is decreasing, so eventually we'll be left with just a couple of fabs, and convergence is natural. And of course you can get these hybrid CPU-GPUs that have CPU cores and GPU cores, so they can even converge on a single chip: sometimes you want a CPU, which is very good for scalar operations, and sometimes you want a GPU, which is very good for vector operations.
So if your application needs it, that's a good option. Any other questions? Good. I'm not losing you, am I? I've given you the basics. You can see the convergence of these architectures; I just highlighted a couple of the main differences — these are the biggest differences and why we see some of the biggest performance differences: memory bandwidth, automatic vectorization versus vector operations being the basic operation, and a greater number of registers per core. How did we get here? I showed you a pretty mature GPU. If we go back just a couple of generations to the Fermi core, it was a simpler core with 32 arithmetic logic units.
This is one of my favorite GPUs — oddly enough, you do develop a favorite architecture after working with them for a while. Then they changed: with Kepler they went from 32 floating point units to 192 floating point units per core, and there is a reason for that. NVIDIA's main business is selling GPUs to teenagers, and the Fermi-class GPU was designed more for general-purpose computing — somehow NVIDIA research managed to steer the architecture decisions for Fermi so that it was actually better balanced for compute. The pendulum swung the other way with the Kepler class, where they wanted just a large number of floating point units so the graphics performance was better. That was not the best card for compute, basically because it had 192 floating point units per core but the number of available registers and the shared memory did not increase proportionally, so they simply over-provisioned the floating point.
With Maxwell, the pendulum swings back the other way a little: you don't have as many floating point units, and I believe shared memory relative to the number of floating point units increased. And now here we are today with Pascal, soon to be Volta, where there are so many floating point units you can barely tell them apart — we're at three thousand floating point units, which even from my perspective as a computational scientist who mostly does memory-bound work is too many. But what's most interesting, and why the Pascal generation is a step change in what NVIDIA is producing, is the ratio between single precision cores and double precision cores. If you look, these orange boxes are the 64-bit floating point
ALUs, so there is a two-to-one ratio, which is the best ratio we have had so far on an NVIDIA architecture. For hard-core computation we care about double precision, so you can forget the single precision units are there if you want, and the balance between the number of double precision floating point units and the available bandwidth on this device — almost a terabyte of data streamed per second — is actually a pretty good ratio. This thing can produce about five teraflops of double precision throughput while streaming about a terabyte of data per second, so that ratio is actually improving, going back to the earlier point about the number of operations you need to do per load or store. So it is expensive, but it's a really good processor. It's also the first generation of GPU to reach 16 nanometers, so it's really interesting.
Yes, you can buy a consumer version for $600, unlike the professional version which costs $6,000, and the main difference is that it doesn't have as many double precision arithmetic logic units enabled, so you get much lower double precision performance relative to single precision. If you can get away with single precision, you can buy the consumer device; it is a very, very powerful device. So we've seen real evolution just in the time I've been working on these: I started working on these GPUs around 2007-2008, as soon as CUDA was invented, and they went from fairly modest cores that were eight lanes wide up to one hundred and ninety-two and back. So I think it's fair to say that this Pascal-class GPU is probably the first GPU that has the potential to make a significant difference in many applications — and looking at the project pages, and thinking about a lot of the projects you're working on in this room, they would probably map very well onto this new Pascal architecture.
Well, any questions about that? Now you see the progression. What has stabilized in that progression is the SIMD width: it is 32 vector lanes wide. They started at 8 wide; now it's 32 wide, and that has been the case for several generations, so in terms of tuning a code we haven't had to make major changes in how our codes are written for these devices in years. How do we program them? That's the next question. So we're going to talk about CUDA — Compute Unified Device Architecture — which is one of the worst acronyms, and it comes with weaving terminology, like warp and texture. I hate the terminology because I can never keep it straight, and the analogy is also very bad. My wife actually has her own loom, a small portable loom, and she threads the warp — that's one direction of the fabric — and then to make the orthogonal direction you pass a shuttle with a thread attached through the loom, and you just repeat that back and forth until you've created the two-dimensional fabric.
What's the problem with thinking of this as a good analogy for parallel computing? The weft is an inherently serial process: as you pass the cross thread, you are literally visiting the warp threads in order, so it's about the worst analogy you could pick for parallel computing, because it's inherently serial. Okay, so I'm going to touch on thread arrays and thread blocks and talk about vector parallelism; the one piece of terminology I am going to keep from all of that is the thread. So, CUDA came out in 2007. You can Google CUDA and you'll find tons of stuff online; when it started there really wasn't much, but there was a lot of engagement from the developers — they ran a lot of forums.
Over the years the community grew, until NVIDIA practically abandoned the CUDA developer community when they realized they could go to independent software vendors instead of working directly with the community. Now they are trying to court people through OpenACC, but I expect they will leave that community too as soon as they focus on something else. NVIDIA has the attention span of a gadfly — today it really is artificial intelligence, before it was DNA, before that something else; they change their focus all the time, so be very careful how much you invest in this. Still, it's a good way to think about how to program these GPUs. So this is a typical discrete GPU: it's basically a computer on a board — the processor sits on the board it interfaces through, and it has its own memory. I emphasize this because it's very important to have the right mental model of what a discrete GPU is: when we want to allocate an array on the device, we have to go through the CUDA
API. Instead of doing a malloc, we're going to say cudaMalloc, because we want to make sure we've allocated bytes of storage on the device, which literally means reserving bytes in the board's memory. Next step: we copy data from the host — assuming you need some input data for your algorithm, we copy it over the PCI Express interface, or with the latest GPUs over the NVLink interface. Then we queue a task for the device, and I am very careful to say that we are not going to "run" a kernel; we are literally going to enqueue a task on the device, because it is an asynchronous computing device with its own instruction stream separate from what we call the host. The host is going through its instruction stream, the device is going through its instruction stream, and when we launch a kernel on the device we are saying: please, whenever it is convenient in the future, run this kernel. Then, once that is done, we copy the data from the device back to the host using cudaMemcpy. So those are the three things we have to figure out — cudaMalloc, cudaMemcpy, and how to launch a kernel — plus how to write the kernel for the device. That makes sense: three things, that's all we need to know. Somehow people think this is difficult, but it's not; it's pretty simple — just remember those three things: cudaMalloc, cudaMemcpy, and launch a kernel. Now we have to make some decisions. We have this massively parallel device with 16 cores, and each core has 128 floating point units.
We have to decide how to divide our tasks between the cores and between the vector units within those cores. That's where we really have to invest some mental energy. First we decide how to break our overall job into tasks that can be mapped onto this parallel architecture: say you have three independent matrix-vector multiplications, then you have three separate tasks. Now, within each matrix-vector operation, you have to split it into independent subtasks, so you could imagine one block of rows times the vector as a task that goes to one core, and the next block of rows times the vector goes to another core. So we have to think about partitioning into parallel tasks — it's not that different from MPI or even OpenMP. Once we've done that block partitioning, we then decide how to process each block times the vector, distributing those operations across the SIMD units within the core. That's the hierarchy we have to think in: divide at the scale of independent, fairly large operations, then take each of those and subdivide it into separate operations that, at the finest scale, become small, lightweight operations that don't require many registers — because remember, the register file is big, but it's not big relative to the amount of processing. There are more details, and it's a lot to take in, so here is the first part of the CUDA code.
Hopefully you haven't seen this before, because otherwise you'll be very, very bored — I know there are some CUDA people here, and I apologize. Just to reiterate the sequence of events that almost all CUDA programs follow: we allocate some space for an array on the device, and because it's outside of host memory we use a specific call, cudaMalloc, which is a CUDA API call. The nice thing is that we can use regular pointers; we do a cudaMalloc, and this argument says take the address of that pointer, so the call can change the pointer, and this one says how many bytes you want.
That's not too difficult. You've seen malloc — and I apologize to the Fortran people who hate me right now — if you've used a malloc or a calloc, you can understand what this does. Let's skip this part, which is the actual kernel launch, and I'll come back to it. I should have said up front what we are doing: we're writing a simple kernel that fills the entries of an array allocated on the device, and then we copy the array back to the host. In this case the host has to allocate some space to receive the data, and then I use a CUDA memory copy — it's simple, and the analogy is to the normal standard library memory copy: you say I want to copy from the device to the host, this many bytes, and you tell CUDA which direction the copy goes. Finally, once the data is on the host, we're ready to print it. That makes sense. So what I left out is the part where we invoke a kernel — launch a kernel on the device. It has a name, we have to say something about how we want to split the problem, and we give it some input arguments, so it looks like a C function, right?
It's a function with some extra baggage — it looks a bit like a template, but it's not strictly one — and you'll notice this is not standard C, so we have to use the NVIDIA CUDA compiler to process and compile this code, because it understands what this syntax means. What we're saying is: I want to queue up this problem so that on each core I queue 512 threads — roughly speaking, that's what it says — and then I choose enough thread blocks to have one thread per entry in this array. I wish I had picked an easier number, but take three thousand seven hundred and eighty-nine, divide it by 512, round up, and that's how many thread blocks I request. So I'll have enough threads that each thread can set one entry in that array. It's just a block partition of a loop, and we'll see
what I mean by that in a moment. Okay, to reiterate: we use cudaMalloc to allocate space on the device, and we specify how many threads we want. Yes — can you do other things on the CPU and then get notified when the GPU is done? That's a fantastic question, and I already half answered it: what did I say earlier? It's asynchronous; it's a separate stream of execution. We are requesting that the kernel be launched, and the call returns without any guarantee that the kernel has finished. Question: in that example, how does the main code know that the kernel is done when you do the copy back?
Because — and I didn't tell you this — the cudaMemcpy is blocking. We are actually queuing a memory copy request, adding it after the kernel, so the kernel has to complete before the copy starts. Yes, that makes sense; I try not to include all the details at once, but that's very cleverly noted. There's basically a queue on the device, and in this case we are using the default queue: we have enqueued the kernel operation, and after that we have enqueued the copy operation. It is possible to overlap those things, but let's not get ahead of ourselves, okay?
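The slide code itself isn't reproduced in this transcript, so here is a minimal sketch of the host-side sequence he just walked through — cudaMalloc, an asynchronous kernel launch, and a blocking cudaMemcpy back. The array length, kernel name, and variable names are placeholders of mine, not the course code:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// kernel: each thread sets one entry of the array (discussed in detail below)
__global__ void fillKernel(int N, float *x){
  int n = threadIdx.x + blockIdx.x*blockDim.x;  // rebuild the linear index
  if(n < N) x[n] = n;                           // guard against over-provisioned threads
}

int main(){
  int N = 3789;                                 // deliberately not a multiple of 512
  size_t bytes = N*sizeof(float);

  float *d_x;                                   // device pointer
  cudaMalloc(&d_x, bytes);                      // reserve bytes in the board's memory

  int T = 512;                                  // threads per block
  int B = (N + T - 1)/T;                        // enough blocks to cover N (round up)
  fillKernel<<<B, T>>>(N, d_x);                 // enqueue the kernel (asynchronous)

  float *h_x = (float*) malloc(bytes);          // host buffer to receive the data
  cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost); // blocking: runs after the kernel

  printf("h_x[%d] = %f\n", N-1, h_x[N-1]);

  free(h_x);
  cudaFree(d_x);
  return 0;
}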
We queue that task, and when the launch returns we don't know what the state of the device is — it may not have done anything yet — but by queuing this copy we're putting it in the device's queue, and the kernel has to finish before the copy happens. Now, before we jump directly into the syntax of how we write the kernel, let's start with a simple kernel written in serial C. Here is a simple example — and somehow I have managed to forget the integer type for n, so that should read int n — and I'm just going to loop from 0 to n minus 1 and set each entry equal to the loop index. Pretty easy. Now we're going to do a little bit of loop tiling and split this up; in this case I have n equal to 20.
I'm going to partition it into blocks of 4, so we'll have five blocks of four iterations each. This matters because I have to figure out how to take that long vector and split the task into subtasks that can each be sent, whole, to one core — because we have these two levels of parallelism: cores, and threads running in vector units. So I have to do it manually. If I were writing the code, I would say the block loop runs from zero to the grid dimension minus one. Someone should complain at this point. What is the complaint?
Where is the grid dimension defined? It's defined at launch, where you say how many blocks you want — that's what defines it. So there is a sort of code separation issue: the launch decides the outer bound of that block loop, and the launch also determines the bound of the inner loop, so it looks like those bounds came from nowhere. When you launch the kernel, those upper bounds are defined for you. Just to summarize: setting up this grid dimension defines the upper bound for the block loop — the number of blocks — and the block dimension defines the upper bound for the inner loop. That makes sense. That's what I've been teaching in GPU programming since, I don't know, 2008 literally, and that's what bothers people: you specify the loop bounds outside the kernel and they magically appear inside the kernel, and people don't like it when things magically appear in the code, okay?
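In plain C, the serial loop and the tiled version he's describing look roughly like this; the value n = 20 and the block size of 4 come from the slide, while gridDim and blockDim here are ordinary variables standing in for the values supplied at launch:

// serial version: set x[n] = n
for(int n = 0; n < N; ++n){
  x[n] = n;
}

// tiled version: outer loop over blocks, inner loop over threads in a block
int blockDim = 4;                            // threads per block
int gridDim  = (N + blockDim - 1)/blockDim;  // number of blocks, rounded up
for(int b = 0; b < gridDim; ++b){            // in CUDA this loop becomes the thread blocks
  for(int t = 0; t < blockDim; ++t){         // and this loop becomes the threads in a block
    int n = t + b*blockDim;                  // rebuild the linear index
    if(n < N) x[n] = n;                      // guard: the tiling may over-provision threads
  }
}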
So we are tiling that loop. Now, since we have tiled it, we need to rebuild the full linear index from the thread number within the block and the block number, because you have to account for: I am in this block, I am this thread within this block, so I know where I am in the order of things. The map I chose is not the only map — I just need a one-to-one map. This is the task I want to do, iterating over n, and I could have chosen any map between the threads and blocks and the index: a crazy map between the threads in a block and the index is fine, as long as it is one-to-one, meaning no two threads point to the same n. So we have had to think — I am literally building the parallelism in here: a block partition and then a sub-partition of each block. Okay, so we're still working serially, but now — another idiosyncrasy of CUDA is that it gives you the thread index, the t and the b from my previous code. The keywords you need are: threadIdx is the index of the thread within the block.
It's basically analogous to using the MPI rank, except it's a multi-dimensional rank, and note that we're not defining these; they are given to us — variables that appear out of nowhere, which again takes some getting used to. So where there was a loop over the thread index, it's really just a change of variable names. Okay — oh yes, there's a good question: look down here, I have an if statement. Where did the if statement come from? Why do I have an if statement? This is what happens if your array length — say it's a large prime number — does not tile evenly. We are always on guard, because we may have to over-provision, creating more threads than there are entries in the array. So if you look at CUDA codes you'll see these checks everywhere, and now you know why there's a check there. Okay, it's a little messy, but it's still serial code — yet almost all CUDA code has this structure. Now, what is going to happen is that each iteration of the inner loop will be assigned to a thread.
Each iteration of the outer loop will be assigned to a block of threads, so we don't actually need to keep those for loops at all. What we do is say what each thread will do, based on its thread rank and its block rank; the only code we really need is this part here. That makes sense: we remove the loop structure and replace it with threads, one thread doing each iteration of the loop structure, so we have one thread per iteration. I'll come back to that in a minute, but we basically start with this thing that looks like a simple tiled serial loop, and then we reduce it to just the inner body of that tiled loop, because that's the only piece that matters to a thread — what it does. It doesn't need those for loops, because we're not iterating on an index.
In the traditional sense, we have tiled the index space with threads. So if I have a vector that is 1024 long and I use 32 threads per block, I will need 32 blocks; then I launch 1024 threads, and each thread is responsible for updating one value in the array. Okay, so to program CUDA properly you have to design the partition of the loops in your problem, and then just code what happens inside the innermost part of that loop, using these intrinsic variables — blockDim, blockIdx, threadIdx — that appear out of nowhere.
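Put together, the kernel body he's describing is just the inner part of the tiled loop. A minimal sketch, with the kernel and variable names as my own placeholders:

__global__ void fillKernel(int N, float *x){
  // intrinsic variables supplied at launch: threadIdx, blockIdx, blockDim
  int t = threadIdx.x;          // rank of this thread within its block
  int b = blockIdx.x;           // rank of this block within the grid
  int n = t + b*blockDim.x;     // rebuild the linear index

  if(n < N){                    // guard against over-provisioned threads
    x[n] = n;                   // the one action each thread performs
  }
}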
It's a little different from MPI: in MPI, when we want our rank, we explicitly ask for it, and then based on our rank we decide what to do. Here we just use these thread indices and block indices, which tell us what our rank is. Okay, that makes sense. I'll pause a little — yes, sorry? The question is whether the block partition could be devised by the compiler: look at everything we have to decide when we launch — the number of threads and blocks — and it depends on the hardware, maybe on register pressure, so couldn't the compiler decide it for us? It may not be easy to decide; it could be a very difficult question to answer, which would mean scanning all possible thread block sizes. It's complicated because the number of registers depends on how you partition the task: if you partition into a small number of threads per thread block, you may need a lot of registers or a few — it depends. So what you asked is actually a deep question: how do you tune a GPU code to get the best performance? One thing is that you have to pay attention to how those thread blocks are mapped to the vector units, so a thread block size that is a multiple of 32 is usually a good start, because what happens when you assign this task to the GPU core is that it splits into groups of 32. I have 512 threads in my block, but it splits into groups of 32, so 512 divided by 32 is — what is it — 16, and
what happens is that the instructions for those 512 threads are split into 16 groups of 32 behind the scenes for you. So what would be a very bad choice for block size — 33? Actually, that's not the worst option. What is the worst option? One — right, that's right. One is a very bad choice for the block size: we have 32-wide units, and if you put a single thread in a thread block, that thread will be very lonely, because it occupies one lane of a 32-wide vector unit. It's like the roads into Houston: they used to be four lanes wide, then six, and eventually there will be 16-lane roads going into Houston — and the analogy is, you send one car down one lane. What
you want is 16 cars, or multiples of 16 cars, so you can fill all 16 lanes — which is usually how traffic into Houston looks. Now, the difference in Houston is that those 16 cars going down the 16-lane road travel at different rates; in the GPU model, all 16 cars go side by side, down the road at the same rate. But the worst case, as was pointed out, is when you have only one car: that means one lane of your highway is occupied, which, as a commuter, would be ideal — you have all this space around you — but in terms of throughput it's the worst possible use of the road. Does that make sense?
I dwell on this a little because it is the basic concept: if you understand it, you have 50 percent of what you need for GPU programming. Again, we just tile a loop; the loop tiling maps a linear index space into a partitioned index space; and the variables that appear out of nowhere are the thread rank within the block and the block rank within the grid. So it's a grid of blocks, and inside the blocks you have the threads, and to get at them you use these intrinsic variables — threadIdx and friends. They have 3D indices because the GPU is designed to render images of pixels; when rendering those images, naturally
the rank of a thread is a multi-dimensional rank. In earlier versions of CUDA they gave you an x and a y — pixel-style coordinates — for threads, but now we have three dimensions, so you can have a three-dimensional block index and a three-dimensional thread index. There are some limits on the dimensions of the block grid and the thread block, but that's too much detail to take in right now. Okay, so that's the kernel. There's some additional syntax I didn't dwell on: the kernel is marked in the source file with underscores — they like underscores — underscore underscore global underscore underscore. That says this is going to be kernel code.
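For reference, a hedged sketch of what the multi-dimensional launch he mentions can look like; the sizes and names here are arbitrary examples of mine, not values from the talk:

// a 2D grid of 2D thread blocks, e.g. one thread per pixel of a W x H image
__global__ void shadePixels(int W, int H, float *rgb){
  int i = threadIdx.x + blockIdx.x*blockDim.x;   // pixel column
  int j = threadIdx.y + blockIdx.y*blockDim.y;   // pixel row
  if(i < W && j < H) rgb[i + j*W] = 1.0f;        // placeholder "shading"
}

void launchShade(int W, int H, float *d_rgb){
  dim3 threadsPerBlock(16, 16);                  // 2D thread block (z defaults to 1)
  dim3 blocks((W+15)/16, (H+15)/16);             // 2D grid, rounded up
  shadePixels<<<blocks, threadsPerBlock>>>(W, H, d_rgb);
}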
We use the intrinsic variables to determine our index in the problem, and this is the actual action each thread takes — which seems like a ridiculously small task for a thread, but in a sense the sweet spot for a GPU kernel is one that doesn't require a lot of registers, which means a small number of operations, because you're not just setting the register variables for this one thread; behind the scenes you're setting register variables for a great many threads. I think I've covered it, so it would be good to do a hands-on right now. The first thing you need to do is SSH to Cooley — you should be able to use Cooley this week — and get onto the login node. We've generously provisioned one node each on Cooley, so you SSH in and then use qsub.
I think the qsub command shown is the right way to get a compute node. So: first, clone the repository with the examples; second, get a compute node; third, find the source code for the simple example; then compile it and run it. You'll see we use nvcc, the NVIDIA CUDA compiler — it isn't strictly necessary, but for this example we will use it — and then run it as if it were a normal executable. I'll walk around; if you have any questions or problems, please raise your hand. That wasn't difficult — well, it wasn't difficult because I gave you the code — but if you look at the code, it's logical in the sense that as soon as you realize you have to divide a problem into blocks and then subdivide the blocks, everything else flows. How about we try an exercise in the next 57 seconds before the break: instead of using that index where I said n equals t plus b times the block dimension, why don't you try reversing what each block does, instead of going from zero upward? Does everyone see what I mean?
So here we have created a fairly arbitrary map between the thread rank and block rank and our index space. You can choose to reverse it just inside the thread block, or you can reverse it over the entire grid — you choose how you would like to do it — but this is just to elaborate the point that we are deciding how to map from the thread index and the block index to the index space. So what I want you to do is modify the CUDA code to change the sequence, the map, between the thread index and the array index.
Does that make sense? So try it. I'm asked to repeat it: we've created a simple map so that thread 0 of block 0 is mapped to index 0, thread 1 of block 0 is mapped to index 1. I'm saying you could reverse that, so thread 0 of block 0 could map to the last entry in the array. You could try that, or if you're feeling particularly brave you can just reverse the order within the thread block. Okay, just be careful to get the if statement right — you have to make sure that your index is still a valid index; especially if you reverse the ranks, you have to check that the index is greater than or equal to 0. Do that during the break.
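If you want a starting point for the exercise, here is a minimal sketch of one possible answer — reversing the map over the whole grid. Again, the names are illustrative:

__global__ void addVectorsReversed(int N, const float *a, const float *b, float *c) {
  int n = threadIdx.x + blockIdx.x * blockDim.x;
  // map thread 0 of block 0 to the LAST entry of the array instead of the first
  int m = N - 1 - n;
  // the guard now has to keep m inside [0, N)
  if (m >= 0) {
    c[m] = a[m] + b[m];
  }
}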
Very good question — it was to recommend books — and I don't recommend any particular book because they are of varying quality. There are several CUDA books out there now; this might be one of the more widely used ones, so if you need a text you might consider using or looking at this one. Well, could everyone get it to work? This exercise is not necessarily deep, although all I wanted you to do was change the index formula: how do you go from a thread rank and a block rank to a linear index into an array? Obviously it's not a unique map, and therein lies a challenge — even for this very simple problem, how do you decide how to map a thread in a block to a piece of work?
It's not a one-size-fits-all choice. Sometimes the hardware gives you some clues about what to do. Something you probably didn't want to do is have the odd threads do one part of the array and the even threads do a completely different part — you could make a really bad decision there, because behind the scenes, when we come to optimize the CUDA code, we have to take great care that threads in a block access some sort of nearby chunk of an array. Another bad choice would be to have more than one thread trying to write to the same entry of the array, because then you'll get one of these race conditions.
To give you a slightly bigger example, here is the Poisson problem — have we seen this before in the examples so far? Solving the elliptic Laplace/Poisson problem using finite differences. Yes, we have seen this, okay, so I'll skip the description of the elliptic problem. We are on a grid, we have a stencil — a finite difference stencil for those derivatives — so we are performing an update on a two-dimensional lattice: we update the central node with a formula based on that stencil around the central node. The beautiful thing about this problem is that I can calculate the update for this node independently of the update for that node and independently of the update for any other node, as long as I'm writing into a new array. Mathematically it's a Jacobi iteration, which says that my new solution at each node depends on the four neighbors and the node itself, right. We have to decide when to stop iterating, so I will say use an RMS measure to decide if consecutive iterations are close enough.
Then I can stop iterating. Okay, that about sums it up: it's very simple, keep running this update formula and decide when to stop based on that measure. So what does the serial kernel look like? It says loop over an index space i,j over the rows and columns of that lattice, and use the update formula to compute the new guess for the solution at each node. What does the CUDA version look like? We are allowed to use two-dimensional grids of thread blocks, and within those thread blocks the threads can be two-dimensional.
So I can create a two-dimensional tiling of my finite difference grid, and then within those tiles I can perform that update formula. Here I have a two-dimensional block index; I index through the x and y components of the block index and the thread index, I check whether it's a legal column and a legal row, then I can reconstruct the index into the array and perform the update formula. That's it — a very simple kernel, probably with not very good performance, that performs the Jacobi iteration to solve the Poisson problem.
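As a sketch of what that looks like, here is a minimal 2D Jacobi update kernel in the spirit of the one described. The grid layout (an NX by NY interior stored row-major with a one-node boundary halo, and rhs holding h squared times the forcing) and all the names are assumptions for illustration, not the slide code:

__global__ void jacobi(int NX, int NY, const float *rhs, const float *u, float *unew) {
  // two-dimensional block and thread indices -> grid node (i, j)
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int j = threadIdx.y + blockIdx.y * blockDim.y;

  if (i < NX && j < NY) {
    // shift by one in each direction to skip the boundary halo
    int id = (j + 1) * (NX + 2) + (i + 1);
    // Jacobi update: average of the four neighbours plus the (scaled) right-hand side
    unew[id] = 0.25f * (u[id - 1] + u[id + 1]
                      + u[id - (NX + 2)] + u[id + (NX + 2)]
                      + rhs[id]);
  }
}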
Well, all I've done is create a map from my Cartesian grid of nodes: I split it into blocks, and then in each block I assigned threads that cover each of the nodes within that block. So that's the serial code and there's the parallel version. You see, I've lost those double for loops, because the for loops are implicit in the thread grid. Makes sense? It's exactly the same thought process I went through when splitting up the vector addition, the simple example I had before. But now there is a more complicated problem: I want to decide when to stop my Jacobi iteration, so I need to calculate the RMS of the difference between two consecutive iterations, and that is a reduction.
I have to sum the squares of the differences over all the nodes before taking the square root, so it's a pain, because I have this massively parallel processing unit that's great when all the threads do something separate and different — they all do their own task independently — but now I have to get them to coordinate. So what do you think I'm going to do? Right, I've given you the clue already: I'm going to do a block reduction. I'm going to split that sum up block by block — it's the same principle I used before when splitting the for loop in the simple example, but now I'm going to reduce — so at the thread block level I'm going to do a partial reduction of that residual vector. I'm going to divide the residual vector into chunks of, say, 512 and reduce those 512 values to one, and I have many thread blocks.
I'll end up with one value per thread block, which I can then eventually copy to the host and complete the reduction on the host, or I could just run the reduction process again with fewer thread blocks and keep reducing until I'm down to one value. Okay, so this is not rocket science. What I'm going to do is get all the threads in my thread block to load a value, and I'm going to use a shared memory array — which I haven't talked about yet; there is shared memory on every core. Remember that all the threads in a thread block are resident on one core together, and they all have access to a shared memory space. So what I'm going to do is have all the threads load a value into a shared memory array, and then I'm going to be ruthless and kill half the threads, and then I'm going to say that each of the remaining half does an addition in the reduction, and then what am I going to do?
Kill half again and get a quarter of the threads to each do an addition. And then what am I going to do? Kill half again — this is ruthless, I will literally keep killing threads until there is one thread left standing that does the final sum. I have the picture here: in this example I have eight threads in a block, each of which has loaded a value. I eliminate the top four; the bottom four each do a sum — in fact, in this case I take this thread, it does this value plus this value and puts the result back into shared memory — and then I have the partial sums.
I have an intermediate result. I'm going to kill half of these threads and then repeat the process until there is one thread left standing. This is nothing more than a binary tree reduction at the level of a thread block. Okay, so it's a collaborative effort; it's not necessarily a highly efficient process, but it's one of the better options we have. So in pseudocode: each thread determines who it is; we say the number alive is the number of threads in the thread block; then each thread loads a value from the global array into the shared memory array; and then, as long as there is more than one active thread, we synchronize the threads. We have to do that because we need to make sure that, once we have loaded data into shared memory, all the data is actually in shared memory before we can continue — it may be that some group of 32 threads has finished and is ready to do the next stage of the reduction, but other groups of 32 have not completed their loads, so we need to make sure they are all on the same page. So we introduce a barrier and synchronize the threads in the thread block before retiring half of them and reusing the shared memory, and we keep doing that, and in the last step one thread is alive and it writes out one value for the thread block. So the reduction ratio: if we had 1024 threads in the thread block, we take a vector of length n and reduce its length by a factor of 1024 in one pass, which is a pretty good reduction ratio, and we can keep calling this until we reach a single value, or we can simply download the partial result to the host using a CUDA memory copy and complete the reduction on the host — at some point it makes sense to download to the host anyway. In the actual code,
the new syntax is that we declare this block as a shared memory array using the __shared__ keyword, and that just means it's an array that all threads in a thread block have access to. Okay, so each thread figures out where it sits in the linear array and loads a value. What am I doing here? Oh — it initializes to zero, so that all the values in the shared memory array are initialized, and then, if the thread has a legal index, it loads its value into shared memory, and at that point we can start the tree reduction.
We make sure to synchronize before anyone does anything; after we've killed half the threads we do that next phase of the reduction and keep repeating until there is only one live thread. When we are done, that one live thread writes the result and we are ready to start again. Makes sense? That's a very quick overview of a very simple reduction. There's something like a seven-step program where you take this very simple reduction and tune it repeatedly, and you can get close to the streaming bandwidth with this kernel, but we don't need to go through all those optimizations in this introduction. So: create a shared memory array accessible to all threads in the thread block, load the values into shared memory, make sure you have synchronized, kill half the threads, and keep shrinking until only one value remains.
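For reference, here is a minimal sketch of the block-level tree reduction just described, assuming 512 threads per block; the kernel and array names are illustrative rather than the hands-on code:

#define BDIM 512

__global__ void partialReduce(int N, const float *x, float *blockSums) {
  // shared memory visible to all threads in this thread block
  __shared__ float s[BDIM];

  int t = threadIdx.x;
  int n = t + blockIdx.x * blockDim.x;

  // every thread loads one value (zero if it falls off the end of the array)
  s[t] = (n < N) ? x[n] : 0.f;

  // binary tree: halve the number of live threads on each sweep
  for (int alive = BDIM / 2; alive > 0; alive /= 2) {
    __syncthreads();          // make sure all loads / partial sums are visible
    if (t < alive) {
      s[t] += s[t + alive];   // each surviving thread does one addition
    }
  }

  // one thread left standing writes this block's partial sum
  if (t == 0) {
    blockSums[blockIdx.x] = s[0];
  }
}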
I don't think we have time for this, but during Tim's presentation yesterday I thought his Mandelbrot example was a pretty good one, so while he was doing it I coded up an example where you take some skeleton code, apply the CUDA treatment to it, and turn it into a CUDA implementation of that same Mandelbrot example. But I want to skip that because I think we'll have a more interesting example later. Okay, I'm going to skip this — it's in the notes — but there are things you should know about shared memory usage, resource usage, and so on.
That's fine, but I think it's more important that we spend some time talking about portability. Everything I've talked about so far is CUDA, and CUDA is a single-vendor solution: a proprietary programming model and a proprietary toolchain, wholly owned by NVIDIA. Let me tell you a sad story. This was my laptop; it lasted a year and a half; it had a good NVIDIA GPU; it died. That's what you get when a MacBook Pro laptop dies — you get sad. Sad. Where is my hard drive? Unfortunately, that's the last Apple MacBook that I know of that had an NVIDIA GPU, so now what do I do?
I'm not going to use a Windows laptop, but later Apple laptops have other GPUs, so in fact NVIDIA is not the only vendor that has an accelerator or GPU solution. These slides are a bit dated, but you can get an Intel CPU that may also have an integrated GPU, you can get an AMD accelerated processing unit that has GPU cores alongside the CPU cores, there is the Intel Xeon Phi, and of course NVIDIA has its own thing, AMD has its own GPUs, and so on. And there are several different ways to program each of these devices. If you want to program in OpenMP, you can target the Intel solutions and the AMD solutions with OpenMP; OpenACC will give you some of the others; CUDA will give you the NVIDIA GPU, and at one point there was an x86 solution —
I don't know if they still do that. But OpenCL, the Open Computing Language, can program all of these things; there's a dashed line to the field-programmable gate array because the OpenCL support there covers a lot of the functionality but not all of it, for various reasons. Okay, so let's talk about the Open Computing Language. Tim mentioned yesterday that he was actually on the standards committee for OpenCL; I felt like I should complain to him, but he seemed a little torn, so I didn't want to pile on. I'm not sure I've ever seen anything good come out of a committee, but OpenCL is definitely the product of a committee — it's almost a miracle that it actually happened — and it was largely driven by Apple. Apple saw the success of CUDA and the potential of CUDA, but they didn't want to be beholden to NVIDIA — and as you can see, they don't have NVIDIA chips in their laptops now, for example — so they wanted multiple vendors that they could play off against each other, and they didn't want to get locked into NVIDIA's CUDA solution. So they basically pushed for the formation of the OpenCL standards committee and somehow brought together NVIDIA and AMD as active participants in designing an open version of CUDA: OpenCL.
They are not the only participants, but they are the main players that matter for this discussion of OpenCL. It differs from CUDA in one important respect: you can use it to program GPUs and CPUs, but that also adds some complexity, which we'll get to in a moment. Just to give you some background — I'm sure this isn't very readable, it's just a timeline — some of the early work on GPU programming was the Brook GPU programming project, an academic project whose PhD students harnessed some of the power of GPUs. NVIDIA hired some of the top PhDs from that group and they created CUDA around 2007; a little later AMD hired some of the other people from that group and launched AMD's own stream computing effort, and then everything starts to take off a little bit.
In mid-2008 Apple pushed to form this OpenCL working group, and between May 2008 and October 2008 the standards committee was formed and produced the OpenCL standard — about five months. Okay, so ask yourself this: how does a standards committee, a body of diverse stakeholders from all these different companies, produce a specification in five months? What did they do? They borrowed — yeah, they borrowed. They borrowed CUDA, and we'll see how deeply they borrowed in a second. So OpenCL is very closely related to CUDA and the terminology is very similar: it has kernels and host programs; a thread in CUDA becomes a work-item, a slightly more generic term, because a work-item may not actually be a thread — if you're on some exotic platform like an FPGA it could be something other than a sequence of instructions as you would normally understand a thread, it could be a sequence of gates that performs a sequence of operations — but the terminology is that the thread becomes the work-item, the thread block becomes the work-group, and the grid of blocks becomes an NDRange, an n-dimensional range. Each work-item that is created in OpenCL gets an index, but instead of using those intrinsic variables, the thread index and the block index, you use get_local_id, which gets the index of a work-item within its work-group, and get_global_id, which gives you the global index across all work-items. I usually ask the question: who prefers intrinsic variables and who prefers the more API-style approach? So who prefers this? Who prefers this? Yeah — no one likes either alternative, that's fine.
Are you starting to see that OpenCL is really more or less a grammatically challenged version of CUDA? The syntax changes slightly and the grammar changes slightly: gridDim becomes get_num_groups, blockIdx becomes get_group_id, blockDim becomes get_local_size, and so on; you can get a global size, you can work everything out through the API. For the kernels, the annotation and keyword for a kernel function, instead of __global__, becomes __kernel. Okay, so they change the keywords, they change the way you find the thread rank, the block rank, the dimensions of the block, but the philosophy, the approach, the model is the same, and the memory model is the same.
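As a sketch of what that keyword translation looks like in practice, here is the earlier vector-add kernel rewritten as an OpenCL kernel; the names are illustrative, not the slide code:

__kernel void addVectors(const int N,
                         __global const float *a,
                         __global const float *b,
                         __global float *c) {
  // get_global_id(0) plays the role of threadIdx.x + blockIdx.x * blockDim.x
  int n = get_global_id(0);
  if (n < N) {
    c[n] = a[n] + b[n];
  }
}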
On the left we have the CUDA memory model: the thread blocks access global memory, constant memory, texture memory — and we haven't talked about the different kinds of cached global memory — and on the right we have the OpenCL side, which is basically synonymous. But there is a problem, and you should immediately ask why doesn't everyone program in OpenCL, and the answer is the learning curve. I'm English: when we go to a restaurant we have a one-page menu, it's just chips and beans; you can order any order you want, chips and beans or beans and chips, and that's what you get. In this country they give you a book, right?
You can browse through it and ask what the specials are — you know, there are different types, though basically everything is chicken or fish, or beef in Texas — and there is this huge menu of options. That is the fundamental problem for OpenCL, because it is not targeting just one vendor's solution, an NVIDIA GPU: you could have a system that has multiple CPUs and multiple GPUs, and you have to accommodate the OpenCL implementation of each of them. So we have to make some decisions, and I don't really want to dwell on the API, but basically the problem is that the first time you run a program on an OpenCL device you have to choose which platform — that is, which vendor's OpenCL implementation; it could be Intel, it could be AMD, it could be NVIDIA. Once you have the platform you want to know what devices are available, and you have to choose a device; in CUDA there is a default device, in OpenCL there is no default-device option, because that wasn't considered a well-chosen idea. Once you have the device you need to create a context as a manager for the work you will do on that device, and once you have the context you have to create a command queue on that device. So if we look at the code: in CUDA we need to specify some header files; in OpenCL it's the same, but Apple, being Apple, has a different header file from the rest of the OpenCL implementations.
On the device side, in the CUDA implementation of the main file you don't have to do anything — you're given the default device. In OpenCL you have to get the platform IDs and choose a platform, get a list of the device IDs and choose a device, then you have to create a context on the device, and then you have to create a queue on the device. And now we have to deal with the program — the kernel that you are going to run on the device. If you think about it, we are already in combinatorial hell, because we have multiple platforms times the number of devices and so on — you have all these options — so the typical solution is to compile the kernel executables at runtime. We start from the source code: we load the source code, compile the program from the source code, make sure we don't get an error reported when we build that program, that kernel, at runtime — because, you know, we don't all program correctly the first time — and then from that program we create a kernel object. Okay, none of
that is necessary in CUDA, because the NVIDIA CUDA compiler does that compilation for you ahead of time. We still have more to do: now we allocate the arrays. In CUDA we would use cudaMalloc and cudaMemcpy; in OpenCL we need to use clCreateBuffer. And this is where it gets horrible, because every time you run an OpenCL kernel you have to attach the arguments to the kernel — well, at least you have to do it once — so the kernel knows what its arguments are, because you're not literally calling a function, you're launching a kernel object that can be enqueued once you've attached the arguments to it. Then you can enqueue it, and if you want to wait for it to finish, you can wait for it to finish. So cudaMalloc and clCreateBuffer are more or less equivalent, but when we get to launching the kernels, we need to set the arguments on the kernel before we call the function that actually enqueues that kernel. That's why OpenCL is not the default choice for most programmers. Okay, the good news is that the kernel code is more or less the same: on the left the thread index and block index are used, on the right you can use get_global_id. But OpenCL has a bit of a bad reputation, and it's mainly because of what we've just seen, though there are other problems too.
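To give a feel for how much host-side boilerplate that sequence adds up to, here is a heavily trimmed sketch — no error checking, first platform and first device taken blindly, and the kernel source carried as a string so it can be compiled at runtime. Treat it as an illustration rather than the hands-on code:

#include <CL/cl.h>   /* #include <OpenCL/opencl.h> on Apple */

const char *src =
  "__kernel void addVectors(int N, __global const float *a,              \n"
  "                         __global const float *b, __global float *c) {\n"
  "  int n = get_global_id(0);                                           \n"
  "  if (n < N) c[n] = a[n] + b[n];                                      \n"
  "}                                                                     \n";

void run(int N, const float *h_a, const float *h_b, float *h_c) {
  cl_platform_id platform;  cl_device_id device;
  clGetPlatformIDs(1, &platform, NULL);                                /* 1. pick a platform */
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);      /* 2. pick a device   */
  cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);/* 3. context         */
  cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);     /* 4. command queue   */

  /* 5. build the kernel from source at runtime */
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
  clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
  cl_kernel k = clCreateKernel(prog, "addVectors", NULL);

  /* 6. device buffers (the analogue of cudaMalloc + cudaMemcpy) */
  size_t bytes = N * sizeof(float);
  cl_mem d_a = clCreateBuffer(ctx, CL_MEM_COPY_HOST_PTR, bytes, (void*)h_a, NULL);
  cl_mem d_b = clCreateBuffer(ctx, CL_MEM_COPY_HOST_PTR, bytes, (void*)h_b, NULL);
  cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,    bytes, NULL,       NULL);

  /* 7. attach the arguments, then enqueue the kernel */
  clSetKernelArg(k, 0, sizeof(int),    &N);
  clSetKernelArg(k, 1, sizeof(cl_mem), &d_a);
  clSetKernelArg(k, 2, sizeof(cl_mem), &d_b);
  clSetKernelArg(k, 3, sizeof(cl_mem), &d_c);
  size_t local = 256, global = ((N + local - 1) / local) * local;
  clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);

  /* 8. copy the result back and wait for everything to finish */
  clEnqueueReadBuffer(q, d_c, CL_TRUE, 0, bytes, h_c, 0, NULL, NULL);
  clFinish(q);
}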
CUDA came first — it's a bit like Betamax versus VHS, I don't know if you remember that fuss, but basically timing was everything. CUDA had a better time to market and better execution in the market, so there is a richer set of literature on CUDA: there are more books, more libraries, more user engagement, more programmers, so you can find forums and everything else. The solutions or platforms offered by Intel, Apple, and NVIDIA have their own specific, annoying differences — some are proprietary, and the CPU implementations are crippled or may be crippled — so there are some difficulties, but there are techniques to make it work and become truly portable. Adding runtime compilation makes it a little more complicated, but on the other hand I can now specialize at runtime without templates: since you are compiling the kernel you execute at runtime, I can hard-code all sorts of loop bounds that can't be hard-coded in CUDA, because those bounds may only be known at runtime. And it's more or less vendor independent, so if you're writing twenty-million-dollar code, there are three or four different vendors offering solutions for OpenCL. So here is the serial kernel implementation, here is the CUDA implementation, and the OpenCL version is a slightly modified version of that where we just change the keywords and the way we get the ranks from the API. Okay — at the kernel level it's hard to argue that there is anything wrong with OpenCL; at the API level
I would say a lot of things are wrong with OpenCL, because there are really no defaults. If OpenCL had sensible defaults, many problems would have disappeared; that is the fundamental obstacle. The tree reductions are the same; the syntax is slightly different: shared memory becomes local memory, and instead of using __syncthreads we use a barrier, a local memory fence. So, I have said that there are some disadvantages to OpenCL: we rely on the good nature of the companies involved in the OpenCL standard to provide good implementations of OpenCL — he is nodding over there. For a long time NVIDIA publicly stated that OpenCL was going to lag behind and was a second-class citizen compared to CUDA; that has changed a little bit, but in essence they are still emphasizing the use of CUDA on NVIDIA platforms, and they rarely mention OpenACC or OpenCL. So what to do? We have seen that there are already several options. Some of the portability options say: okay, you have the CUDA code, so let's start with the CUDA code and do code translation — let's mutate that CUDA code into OpenCL. Interestingly, there is a Virginia Tech project called CU2CL which will accept NVIDIA CUDA code
and emit translated OpenCL code — seems like a good solution, right? It will translate the host code containing calls to the CUDA API into OpenCL API calls, and take the kernel code and mutate it into OpenCL kernels. There is a different solution — I'm not sure if it's really taken off yet — AMD has a solution called HIP, and there's a tool called hipify which effectively does the same thing: it converts CUDA code into HIP code that you can target at AMD platforms, and apparently at NVIDIA as well. Then another way is to go to a lower level, where you take the compiled CUDA code and send it through Ocelot, an academic project, which can then target multiple different devices; and PGI at some point had an x86 compiler for CUDA.
I'm not going to talk in detail about OpenMP or OpenACC, but they offer different ways to program these devices. In terms of maturity — and this is just my opinion — the AMD portability solutions are not as well developed as OpenMP and OpenACC for accelerators, which in turn are not as well developed as OpenCL and CUDA; and in terms of ease of use, CUDA is definitely the most mature and the easiest to use, but the least portable. So you have that balancing act of whether you really want maturity or portability. If we take a step back, we have a question, and it's a very, very serious question: if you're going to use one of these large-scale leadership facilities, will you use MPI for distributed parallelism? That's pretty much already decided.
The battle was fought in the 90s and MPI won. We're still fighting a battle over what the threading solution will be for these massively parallel processors. So you have all these possible combinations: MPI for the distributed part, and then for threads on the node some people use MPI again, or OpenMP, or pthreads, or CUDA, OpenCL, OpenACC, Threading Building Blocks — which is the actual solution we should use? Should we write code for each of these different approaches? No, because you take the cost of the code you're writing and multiply it by the number of things you're going to target, so if you have a million-dollar code and you're going to write four different versions, that's going to cost you four million dollars. This is what I do, so this is my ten-minute sales pitch for my thing.
I told you I wasn't going to sell anything, but I lied. I had a PhD student who worked at a minerals company in Houston, and one of his tasks was to take some CUDA code and port it to OpenCL. This is an incredibly smart guy, and he burned through the summer doing that, and he came back to me a little angry: why are we wasting time doing that? So I said we'll put an end to that: we're going to write an abstraction layer that automates the process of what you did in your summer internship, so instead of writing directly in CUDA or writing
directly in OpenCL, we'll write to an abstraction layer and let it do the translation behind the scenes at runtime, as efficiently as possible. This is what we have: it's called OCCA. It has APIs for Julia, Python, C++, C — I'm not sure about Fortran; I think it might still work, but we're not really Fortran friendly, so it's there but I'm not sure it's really mature. Behind the API we have a parser that can accept CUDA code, OpenCL code, or our own custom kernel language; we push them through an intermediate representation that is portable across the different backend platforms, where they are naturally mapped to the appropriate hardware on the backend.
Okay, so your involvement would be to write host code in your favorite language and write a kernel either in OpenCL, CUDA, or what we call OKL, our own kernel language. What does it give you? A single unified library for programming heterogeneous compute devices. It is flexible at runtime: you write the code once and then at runtime you say, okay, I want to run on this CPU with OpenMP, or I want to run on this GPU with OpenCL or CUDA — you choose. It's a very simple API, and in the background we have a lot of technology that supports this: we have a caching system that caches the binaries for the kernels you are going to run on a device, so we don't rebuild them every time you run the program — if you have already targeted that device, on that platform, with that threading model, it will load the program from the cache. It is very lightweight, and we made some decisions with the kernel language to try to make it a little more friendly.
It doesn't do anything magical. It doesn't auto-parallelize; it doesn't solve that loop partitioning problem for you — those parallel partitioning decisions you must make yourself. It still isn't automatic: it still helps to know something about the architecture; it doesn't tell you how to lay out your data in memory; it doesn't magically decide how work can be distributed across MPI tasks — you have to. But it uses the vendor compilers, so we don't have a very deep compiler stack to support, and that's what makes it portable. We don't have time to go through all of this, but let's include it in the final demo.
There are other solutions. You will hear from Carter Edwards — I think he's going to talk about Kokkos, which is kind of arrays on steroids and has some of the same abilities, including multiple backends. There's another project, RAJA, out of Livermore, which again is a high-level portable threading programming model, and then you have the lower-level approaches, which are code translation — so OCCA sits in the middle of the options. So this is what a kernel prototype looks like: we have a kernel keyword — we actually argued for about six hours about whether we needed this keyword or not, but we kept it — and we kept the for loops, so instead of discarding them and simply using the thread index and block index,
we've explicitly required the programmer to write those for loops so we know where they are, but they have some extra keywords at the end: we've added a clause to the for loop that says, basically, an outer loop is a loop over thread blocks and an inner loop is a loop over threads. So each iteration of the outer loops is assigned to a core, and each iteration of the inner loops is assigned to a thread running on that core — it's the same threading model as CUDA and OpenCL, and it's very similar to the threading model for OpenMP, because what we do there is leave the inner loops intact and serial and then just put a parallel for on the outer loops. This gives us all kinds of nice code generation; we can have multiple outer loop blocks.
I'm just trying to leave out some of the less important details so you don't need to see them, but what happens behind the scenes is that we generate an intermediate representation, and that can be run with CUDA, OpenCL, or OpenMP. Okay, here is an example — hopefully you can read this. This is the host code, so we do things on the host: we allocate three arrays; we have a device object, a kernel object, and some memory objects; we say, with a string, that I want to use OpenCL mode and then choose which platform and which device; I allocate some space on that device; I copy the data from the host; I build the kernel to run on that device; then I launch that kernel, and it looks like a function call because we have our own parser working behind the scenes — we figure out which thread block size you want and which grid size you want, so we do all that detective work and you don't have to specify how many threads you want in each block and how many blocks you want; we'll figure that out for you — and finally we copy the data back.
Okay, now the kernel is just source code, and in the kernel source code you manually tile that array — we also have a tile keyword you can use, but this way is easier for conveying general usage. You say loop over blocks, and then inside I can loop over the block, from block to block plus ten, and there we go, we have the kernel implemented. So it's very simple: basically you need to know three keywords and you can do a lot of things inside OCCA, and at runtime you decide whether this will actually run with OpenMP, OpenCL, or CUDA. You can do this in C, with a Julia interface, a Python interface, a MATLAB interface — oh, well, I guess they killed the MATLAB one; no one was using it, so we dropped that. Right, so we have the serial implementation of the Jacobi iteration, the CUDA one, and then the OpenCL implementation, and we have the intermediate representation that is generated from this OCCA kernel language, and you can look: we just split the loop over the nodes in the x direction and the loop over the nodes in the y direction, and you can see it looks more like ordinary code to me, you know?
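To make that concrete, here is a small sketch in the spirit of the slides, written against the current public OCCA OKL syntax (@kernel, @outer, @inner attributes) and the C++ host API; the exact syntax shown on the talk's slides may differ slightly, so treat the details — file names, block size of 256, the mode string — as illustrative assumptions:

// addVectors.okl — outer loop -> thread blocks, inner loop -> threads in a block
@kernel void addVectors(const int N, const float *a, const float *b, float *c) {
  for (int block = 0; block < N; block += 256; @outer) {
    for (int n = block; n < block + 256; ++n; @inner) {
      if (n < N) c[n] = a[n] + b[n];
    }
  }
}

// host.cpp — pick the backend at runtime with a string
#include <occa.hpp>
#include <vector>

int main() {
  int N = 1 << 20;
  std::vector<float> a(N, 1.f), b(N, 2.f), c(N, 0.f);

  // could equally be "{mode: 'OpenMP'}" or "{mode: 'OpenCL', platform_id: 0, device_id: 0}"
  occa::device device("{mode: 'CUDA', device_id: 0}");

  occa::memory o_a = device.malloc<float>(N, a.data());
  occa::memory o_b = device.malloc<float>(N, b.data());
  occa::memory o_c = device.malloc<float>(N);

  // compile the OKL kernel for the chosen backend (cached after the first run)
  occa::kernel addVectors = device.buildKernel("addVectors.okl", "addVectors");

  addVectors(N, o_a, o_b, o_c);   // looks like an ordinary function call
  o_c.copyTo(c.data());           // copy the result back to the host
  return 0;
}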
And we have some interesting technology behind the scenes that I won't dwell on — there are a lot of knobs you can turn here — but it all happens magically behind the scenes: all that device selection, platform selection, everything is done without you worrying about it. So here is our last hands-on exercise; I have about ten minutes, I hope we can do it. Did the build work for you? Okay, so I want you to create a flow simulation like this using a PNG file of your choosing. Yes — a question about host languages: to what extent are you restricted to certain functions; for example, with Python, can I just use my standard Python functions and have it parallelize them?
No — there is a different project, called Loo.py, that will do that, or you could use PyCUDA or PyOpenCL; but here we expect you to provide the kernel in the kernel language, and it's just the host API that is Python. Yeah, sorry — but there is something called Loo.py, by Andreas Klöckner, that will do what you're talking about. Okay, this is the fun part — that's why I rushed a little bit through portability — I just want you to run this fluid simulation code; it's implemented in OCCA and you'll be able to run it in the different modes.
You can see how simple it is to switch between OpenCL, CUDA, OpenMP, or whatever you want to use. What it's going to do is take your PNG file and form a domain out of it: once it reads in the PNG file, it sets up a lattice where you have fluid nodes and wall nodes; at the fluid nodes it does collision and streaming calculations, so it's a lattice Boltzmann code, basically a discrete version of the Boltzmann equation, and every hundred time steps it writes out a PNG file.
So you'll get a bunch of snapshots of the solution data, and then you can use ffmpeg to take those PNG files and create an mp4 video file. Okay, so let's get competitive: see if you can create the best animation, and I'll tell you what — you send me the animation, I'll create a directory, and everyone can vote on them. It won't be today; we'll vote via email and then pick the best animation. You can take your time over this, even if you don't finish it today. For the best animation I will send you a box of fine English sweets — made in the Middle East — as a prize. Okay, so there is a prize at stake here.
The winner will need to send me an address and I'll post it to you. So how do we do this? It depends on the OCCA library, so there are instructions for building the library, and then you build the OCCA lbm code — those instructions are in the second box. It takes a file in PNG format, on which there are some restrictions — not all PNGs are created equal — so if you get segmentation faults we'll need to diagnose what your problem is. Save the PNG into your directory and run the lbm code with those arguments.
The last argument is a threshold: it decides which pixels we treat as flow volume. The code uses the LBM update formula, takes time steps, and works out what the flow looks like from left to right; where the image is white we treat it as flow volume, where it is black we treat it as a wall. As it computes, it creates a sequence of image files, images/<something>.png, and the command up here will stitch them together into an mp4 movie. When you have it, just use Globus or whatever you're using to transfer files to grab that mp4, load it on your laptop, and play it in your browser or whatever you use to watch video files. Does that make sense? Remember, there's no money at stake here — there are sweets at stake. So, has everyone got it? Yeah — a question: where does ArrayFire fit in the ecosystem?
Right — okay, if I remember correctly, those are the pro-level AMD GPUs you might be thinking of, or you're thinking of the company called ArrayFire, which has its own library. They've been around for a while; it's a kind of high-level interface, more or less at the library level. I haven't looked at that product in a long time — about five years — so I can't remember all the details. Are there other questions while you're making your movies? Okay, so I'll put up the first set of instructions. I once set a more complicated version of this as a final assignment in a GPU programming class.
In the parallel programming class I teach at Virginia Tech, the students had to take the serial version of the LBM code and create a parallel version, which could be MPI, OpenMP, or CUDA. And what do students do with an assignment? They wait until the last second. So I set the assignment so that generating the flow movie takes a certain number of hours with the stock code. Nobody wanted to do the MPI implementation, of course, because MPI is a bit more unwieldy, but those who delayed the longest had the choice of OpenMP or CUDA, and if they really pushed the deadline they had to implement it in CUDA — so I was using some social engineering on the students: you could let it take ten hours to run with OpenMP, but if you ran it with CUDA you got it down to one hour. I hope I've given you an idea of GPU programming; I hope you have a sense that it's not that complicated, and if you are more adventurous and willing to use an abstraction layer, there are several solutions that abstract away programming directly in one of the threading models. I ended by showing you OCCA, which is a very simple and straightforward interface that you can use to code once and deploy on virtually any platform.
You'll still have to tune your code for the specific platform you're going to run on, but that's a separate question. We have time for questions. Yes — the question is: if you have a CUDA kernel you've written and you use OCCA to produce a CUDA kernel, is the performance the same? Yes, because it goes back to CUDA and everything we add disappears when it hits the compiler, so we keep the performance. If you take a CUDA kernel and create an OCCA kernel, you can typically get performance back to within 1%, because we may have a slightly different launch cost.
Yes, yes — you can ignore those warnings; they're only warnings from my PNG library; before, it used to fail silently. You have to export LD_LIBRARY_PATH to include the local library directory, otherwise you won't be able to load the shared library file — what you get otherwise is either nothing or a dusty old version. Yes, I should have added that instruction: it's LD_LIBRARY_PATH equals LD_LIBRARY_PATH, colon, then the path to the OCCA lib directory. Okay — they're being really understanding, but they're giving me the timeout signal, so I will end there. Thank you very much.
