
Turbocharged: Writing High-Performance C# and .NET Code - Steve Gordon

Mar 04, 2024
Well, I make that time — thank you very much for coming. It's great to see so many people in an afternoon session that is a deep dive into high-performance code, so clearly no one is asleep yet and you still have some capacity to take these things in. Welcome. My name is Steve; I'm a Microsoft MVP and a Pluralsight author. My day job is as a senior developer for a company called Madgex in Brighton, where we primarily make software-as-a-service job board technology. I also run a meetup group called .NET South East, which is a monthly event, so if you're in the Brighton area, check us out and hopefully you can come and join us. Today we'll talk about how to write high-performance C# code, and I want to highlight this bit.ly link here — bitly/hiperfnet — which is where you'll find all the slides and all the code that I'm going to show you, so don't feel like you have to photograph every slide.
If you go there, you'll have it all as we go along, so make sure you get a photo of that and you can follow up later. If you want to get in touch after the talk, I'm @stevejgordon on Twitter, and my blog, stevejgordon.co.uk, has a lot of the content I'm going to cover today in much more detail, so one way or another you'll be able to delve into the things that don't quite make sense today — hopefully those blog posts will guide you through the rest.

I'm going to try to fit a lot into 60 minutes, so to streamline things we'll skip this part. I want to start by talking about the aspects of performance, because I think it's really important, before we start talking about how we optimize code, to understand what we're trying to achieve. For me, this boils down to three main areas. The first is raw execution time. This can be the execution time of an entire process within your application, but very often it's a measurement of how long a particular method, or even part of the code within a method, takes to run — because as a general rule, the faster we can make our code run, the more we can do with a particular device. Very closely related to that is throughput, and this is a measure that's actually easier to capture in production, because it's something you can track over time, and it's quite affected at runtime. It's a measure of how much work you can do: in a web application this could be requests per second, for example; in a queue-processing worker service you might be measuring how much of that queue you can actually process in a minute or an hour. As I say, you want to track this in production because you can see the trends over time, and you can alert on this data: if some of those numbers go down and you're handling fewer requests per second than you were expecting, that could be a sign that something recently deployed has affected your code. Related pretty closely to both of those is memory allocations. If you've watched people like Ben Adams on Twitter, or David Fowler on the Microsoft team, you'll see them talking quite frequently about trying to allocate less in their code. The reason this is important is that although allocation itself is a very quick operation — creating a new object isn't really problematic — we pay for it at some point later. Anything we allocate on the heap will have to be garbage collected at some point, and that introduces potentially short pauses in our application while that takes effect. In high-performance situations this can have an impact on the overall performance of your code: if you're giving CPU time to the garbage collector, that's time you could be spending on your own code. So these are some of the key areas we'll look at, and we'll see how we can potentially optimize some of them in today's code.
It's important to remember that performance is largely contextual. Much of what I'm going to show you today is not relevant to your daily work — you don't want to apply these techniques to everything you do. But there will be maybe ten percent of it, when you're writing something that's at scale or handles the kind of high volumes we're looking at, where higher-performance code could save you time and money on scaling that service. So keep in mind that this is all contextual, and you need to think about whether, in your given situation, it's appropriate to go as deep as what I'm going to show you today.
It's also important to remember that with performance there tends to be a trade-off with the readability of your code, so think about what's more important. Is the code something that has to be easy to maintain, changes regularly, and has many developers working on it? Then you probably want it to be reasonably readable so you can maintain it and build on it. But if the code is a microservice, designed to do one thing, which you write once and hopefully don't modify too often, then that's the point at which you might say: actually, if we introduce some of this high-performance code, we accept that it will make it harder to read; we don't expect to change it too often, and the benefit of scaling that service less, or being able to do more with a single instance, is really worth the time and effort. So just keep in mind that there are trade-offs in these things. When we start to look at how we're going to improve our code, we're going to go through what I've called the optimization cycle. It's a pretty simple loop, and we start by measuring. It's really important in performance optimization not to make any assumptions about what you're doing: even things you've done in the past that improved the code somewhere may not have the same effect somewhere else, so measuring your code is really vital to validate that what you're doing has a positive impact.
The first step, before doing anything, is to measure and find out what the current state of the application is. Maybe start with a top-level, profiling-type approach to understand what the hot paths are in your application. Then, once you know which code is executed frequently and called most often, you can start to focus on code-level benchmarking, and from there actually look to optimize those parts of the code and improve them. Once you have those measurements and you know where you're starting from, you can begin some optimization. In this phase it's important not to get carried away and make tons of changes at once. It's quite tempting, when you're doing this, to go and apply a span here and something else there — but all those changes at once mean you don't really know whether each of the changes you're making has had a positive impact overall. You may have improved execution time or reduced memory allocations, but is it the best you could have achieved? The best way to know is to change one small thing and then measure again; at that point you can validate your previous assumption, make sure you've had a positive impact, and then try the next thing. It really is just a simple cycle that you follow, and depending on what your goals are for that application, you can go round it once or twice, or you can continue until you really feel you've squeezed every ounce of performance out of the code you're trying to improve. So, there are several tools available to measure application performance.
I'm not going to go into them too much, but I want you to be aware of them as a starting point. In terms of profiling, Visual Studio has some pretty good tools built in these days — I think David Fowler was showing this off the other day. There are diagnostic tools built into Visual Studio that you can run while you're simply debugging your code, and even there you can do things like see what your memory traffic looks like, take actual snapshots of the CPU or memory, and analyse them as your code runs. Be aware that when you're in debug mode your code isn't the most optimized compiled version — it's built for debugging — so it will be slightly below the performance you'd see in a release build, but it's a good indicator to start getting an idea of which bits of your code are allocating and the kinds of volumes of memory you're talking about. Once you've done that in debugging, you'll want to do some profiling, hopefully under fairly realistic load. This might be in some sort of replica environment where you can replay a series of requests that mimic what you've seen in your real production system, or in some situations you may be confident enough to take some profiles from your production application, and you can use the Visual Studio tools to do that. David Fowler also showed PerfView — a pretty low-level, quite difficult tool to understand, but very powerful, with lots of options there. JetBrains have dotTrace and dotMemory, which sit nicely
in the middle ground: they offer a little more power in some areas and have a better user interface, which makes them a little easier to work with. Occasionally — and this is not something you'll have to do all the time — you might want to look at the IL code, the intermediate language that your C# or F# code is compiled into when you build your application. From what you see there, you can sometimes get a hint of where you're doing boxing operations, or that you have a lot of calls to virtual methods, or just the raw number of instructions can be an indicator that maybe you can optimize that code in some way. As I say, it's not something you'll do all the time, but it's another tool in your chest that you can use.
It's also important to remember that you want to monitor these things in production and have real metrics around them, especially if you're going to try to change a lot around performance. Sometimes in your test environments you may see positive gains; you want to make sure you're actually achieving them in production, and that you're monitoring over time so you can alert on changes in those environments. So these are all important tools, and it's worth working out where they fit into your development processes. But the one I'm going to spend more time on today, because it's more about direct code improvements and actual code-level performance optimization, is BenchmarkDotNet. This is an open-source library you can bring in — it's very well maintained — and what it's about is high-precision measurement of your code, giving you very good, accurate, comparative results on memory allocations and execution time. You can usually get really basic results by putting a timer around some code, but you'd actually have to run it many times to get a realistic figure, especially for snippets that run in nanoseconds, because getting an accurate measurement of that is very difficult. BenchmarkDotNet does this in a very scientific way: it does many tens of thousands of iterations of your benchmark so it can rule out outliers, gives you a statistical average of the data that's collected, and also measures its own overhead to ensure that the measurement process itself doesn't affect the results you're seeing. It's a very scientific way of collecting the data, and it's the tool Microsoft now uses within CoreFX — the framework underneath .NET Core — and within ASP.NET Core, to make sure they understand how that code base is performing. So it's a very powerful tool, and you can also include it in your CI/CD process if you want: use it while you improve your code, but also have benchmarks run in builds so you can see whether you're introducing regressions into any areas of your code that you've identified as critical paths you don't want to break. So let's take a look. This is like a 'hello world' of benchmarking, so we'll start with this. Basically, what you do is create a console application and bring in the BenchmarkDotNet libraries — a bit like creating a unit test project — and then we just have a very simple Main method here that calls the benchmark runner. There are several ways to run benchmarks.
The easiest way here is to just tell it to run, giving it the class that contains the benchmarks you're interested in, and below I have my benchmark class. Right now I have this attribute at the top, which is MemoryDiagnoser — this is where BenchmarkDotNet gives us control over the type of information that's collected and reported. What I'm saying here is that, in addition to the execution time, I also want some details about the memory allocations when running the benchmarks inside this class. Then I have some setup code; this code is outside of our benchmark.
I'm not measuring the heap allocations or the cost of setting up these objects — these are just the objects I'm going to operate on. What I'm interested in testing is what it takes to run the method that gets the last name. So this method here is marked with the Benchmark attribute, to identify it as a benchmark I want to run. We have some code inside it, and this could be code that points at another project you reference, but it could also just be arbitrary lines of code you want to test in isolation. So we have our benchmark, and we'll run it now. What we need to do is make sure we're in a release build — BenchmarkDotNet won't let you run benchmarks against debug code, so it will give you a warning.
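For reference, here's a minimal sketch of the shape being described — the Employee type and the benchmark names are hypothetical stand-ins, not the exact code from the talk:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Employee
{
    private readonly string _fullName;
    public Employee(string fullName) => _fullName = fullName;
    public string GetLastName()
        => _fullName.Substring(_fullName.LastIndexOf(' ') + 1);
}

[MemoryDiagnoser] // report allocations and GC counts as well as timings
public class LastNameBenchmarks
{
    private Employee _employee;

    [GlobalSetup] // runs once, outside the measured code
    public void Setup() => _employee = new Employee("Steve Gordon");

    [Benchmark]
    public string GetLastName() => _employee.GetLastName();
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<LastNameBenchmarks>();
}
```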
It's pretty obvious if you've done it wrong, and then it just runs the code inside that console app. It runs for a while and spits out a lot of information about all the warm-up phases and the overhead measurements, and then finally all the actual iterations of the benchmarks you're running. When you get the results, you have a little summary: you'll see what kind of platform this was running on, and then for each benchmark within that class we have a row. Here we can see 163 nanoseconds for this method. What do we know at this stage? Is that good? Is that bad? It sounds pretty quick, I guess, but really at this point we've just established our baseline for the execution of this code. Now, because we included the MemoryDiagnoser attribute,
we get this additional memory information. Here we can first see information about the garbage collections that would occur. There are several different generations that your objects can live in within managed memory, and this essentially gives us a view of how long those objects lived within the code base while those benchmarks were running. We can see that everything is in the Gen 0 column, so all the objects that were created were short-lived — an indicator that there are no large allocations and nothing lives for a long period of time. And we can see that this measure takes a little thought to interpret, because it's per 1,000 operations: at roughly 0.037 per 1,000 operations, we'd have to call that method about 26,000 times before we'd have introduced enough GC pressure for a Gen 0 collection to actually be triggered. So we can see it's reasonably efficient — we're not putting much load on the GC with what we're doing — and we can validate the actual allocation here as 160 bytes.
Now, it's important to remember that these measurements can be scaled differently — they can be in kilobytes or megabytes depending on what you're measuring — and not all the results will be in the same unit, so don't just assume bytes. And that's now the point where we could go and start optimizing our code a little. So now I want to bring in the features you're probably all here to hear about and see, and the first of them is Span<T>. It was released in the last few years and generated quite a stir.
I'm always interested in how many people have heard of Span<T> already — most of the room, good. How many people are using it in a production application? One, two, three, four, five — I can count them across the whole room. That's quite indicative, and quite expected; I see this practically everywhere I ask the question, and I can understand why. In a sense Microsoft has advertised this quite a bit, because it gave them the ability to really optimize CoreFX and ASP.NET Core, so they're very excited that it makes the framework and the platform we build against faster. But they've always warned in those posts: this is great — but you probably shouldn't use it. They do that because it can lead you into a kind of more twisted code world, and they don't want everyone to choose this
just because it looks cool. I upset them a little by saying that I think it can be used more than they advertise. I think there are more use cases for things like Span<T>: in the prototyping I've been doing it hasn't made the code much more complicated, and I've had a reasonably good benefit from it, and we'll see some real-world prototyping I've done in a moment so you can make your own judgment about that. It was introduced in .NET Core 2.1, and it's also available for .NET Framework as an additional library package you can bring in. That version is referred to as 'slow span', which is a slightly misleading term — it's a bit slower than the span in .NET Core simply because in .NET Core they were able to make changes to the runtime to allow the feature to be as optimized as possible. Both are fast, both are perfectly usable, and you can use span in both places if you need it. What a span gives you is a read/write view over some contiguous block of memory — which is a very wordy sentence. Essentially, if you think of something like an array on the heap: it's a contiguous block of memory, allocated on the heap in one nice block, and we can get a span view over that. That doesn't sound very useful, because we could always view it as an array anyway, but the interesting thing about Span<T> is that it can point at and view memory on the heap, but also on the stack, and also unmanaged memory, and you work with all of them through this one consistent span API.
You don't need to worry about where the memory lives; it will do the right thing for you in terms of memory and type safety, making sure you're not doing anything dangerous and not letting references leak out of the scope in which they're valid. So you can work over things like arrays and strings on the heap, but, as I say, you have that stack option — we'll see that in use in one of the demos — or unmanaged memory, if that's where you're working. There's almost no overhead in using this; it has the same kind of functionality as an array, in that you can iterate over the data in the span.
I should say that you can also modify it — you can take a particular index and adjust it if you want — and those kinds of operations on a span work very much like they do on an array, with practically no general overhead. The most popular operation you'll do once you have a span — and this is where you start to see its power — is the slice operation. This is where we have a view over some data and we're going to narrow that view. Here we start with an array of nine integers, which is a really strange number to use; it should have
been ten. But I have my array, and I can call AsSpan on it, and that returns my span representation. Now, span is a very low-level, lightweight type: it's a value type, guaranteed to live only on the stack, so there's no allocation or cost in doing this. Internally it's essentially just a pointer and a length over the memory you're looking at — it's a little more complex than that, but it's very lightweight, so there's very little cost. Once we have our view over this array, we can slice it, and all we do with a slice is give a starting position and, optionally, a length that we want. This returns us another span, so now we have two spans looking at the same block of memory.
The important thing here is that this starts to give us the ability to parse memory we already have a view over without copying it. We're not copying data, so there's no overhead, no real cost in doing this, and you can see here that index 0 of my second span essentially points at where index 2 is in the first span. Both spans can now operate on that data, modify it, and view it differently. The analogy I like to give for this is photography, as I do a little photography.
I have a DSLR camera with a telephoto lens; many people take photos with their camera phones. If you take a nice wide-angle photo of a landscape and you see an object in the middle of that landscape that interests you, you have a couple of options. You can walk towards it — which might be a mile across the field and over a stile — and eventually you get close and, great, you've taken your photo, with a fair amount of effort involved. Or, in a pretty much constant-cost operation, you just zoom in with the camera lens, and you have a different view onto the same scene you were looking at previously. That is essentially the slice. And because it's a constant-cost, constant-time operation — no allocations involved, no memory copying involved — it doesn't matter whether that array has nine elements or nine million elements: the cost of slicing it, to any length or portion, is exactly the same every time we do it.
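A minimal sketch of the nine-integer example being described — slicing creates a second view over the same memory, with no copying and no heap allocation:

```csharp
using System;

int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

Span<int> whole = numbers.AsSpan();   // view over the entire array
Span<int> slice = whole.Slice(2, 4);  // view starting at index 2, length 4

slice[0] = 42;                        // writes through to the underlying array
Console.WriteLine(numbers[2]);        // 42 — both views share the same memory
```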
So now I've shown you that — and I've probably melted your brain a little with span, because it takes a while to sink in — let's look at a slightly trivialized example before moving on to more realistic code in a moment. Imagine the product owner gave us this requirement: we need a method that takes an array and returns a quarter of its elements, starting from the middle. I'm sure everyone has had a requirement like this thrown at them at work. So that's the requirement given to someone in our business at some point, and possibly this is how they went at it: they could just use the Skip and Take expressions in LINQ to provide the data we were asked for, and they think, great, we've delivered it. Some time later, the product manager comes back and says: we want to speed this up, we want it very fast, it's going to save us some money — and now we need to optimize that code. So the first thing I'm going to do is set up some benchmarks. There's some additional code here; this is our setup, which is a little more complex than the hello world example. It doesn't have to be like this, but there are some good reasons to do it. The first thing is this Params attribute on the Size property: this configures BenchmarkDotNet to run each benchmark three times, once for each value, with a different Size value on that property each time. The reason we might want to do this is that it's very easy to get drawn into testing a particular best-case scenario that you expect — we always expect this array to have maybe a thousand elements — but there are often edge cases in the data we receive, and the results of your improvements can differ at larger or smaller sizes, so let's test three possibilities. Then I need some way to set up the data for this benchmark, so I have a normal method marked with the GlobalSetup attribute. This runs once, before the benchmarks are actually run, and it isn't measured — it doesn't affect the execution time or the allocations. What we're really doing in it is creating an array of the appropriate size for us to operate on. Now we can actually write our benchmarks, and this is the code our original developer had written.
All I've done here is add a Baseline attribute — in this case Baseline = true — which says: this is my starting point, the code I'm coming from, and I want to compare everything else against it. We can run that benchmark, and this time we get three results because of the three different parameters we applied. We can see 154 nanoseconds and 224 bytes allocated in the first run, with the original size of 100 for the array, and as we go up, as you'd expect — because we're working over larger and larger arrays — we see higher and higher allocations and a longer time for this code to run. So this is our starting point, and now we're armed with enough information to make a change and see if we can improve. At this stage maybe we think: well, actually, I've heard that LINQ expressions can have some overhead; maybe we write this code more manually — create the new array we're going to copy some data into — and see if it performs better. So we've had a theory, and we've made a change, quite a small one, that we're now going to measure. In these results we can see that this first hundred-element result is really great: we have a much faster execution time, and it looks like we've saved almost 50% on the bytes allocated. At this point, if we were only testing with a hundred elements, we'd feel pretty good about the results. But because we added the additional sizes, we can see that with a thousand elements the gain in terms of bytes allocated is pretty negligible: the difference in each case is actually 96 bytes, which is the cost of the LINQ expression being compiled and executed, but otherwise the allocation cost is pretty much the same. It's obviously faster to execute, so this is a benefit if we're looking at raw execution time, but perhaps we also care about memory. So finally we're going to introduce span, and, as you can see, we're going to be able to use the slice operation to give us the same set of data.
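The three versions being compared look roughly like this — a sketch of the idea under the stated requirement, not the exact benchmark code from the talk:

```csharp
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class QuarterFromMiddleBenchmarks
{
    [Params(100, 1_000, 10_000)] // run every benchmark at three input sizes
    public int Size { get; set; }

    private int[] _data;

    [GlobalSetup] // creating the source array is not measured
    public void Setup() => _data = Enumerable.Range(0, Size).ToArray();

    [Benchmark(Baseline = true)] // the original LINQ implementation
    public int[] WithLinq() => _data.Skip(Size / 2).Take(Size / 4).ToArray();

    [Benchmark] // manual copy: faster, but still allocates the result array
    public int[] WithManualCopy()
    {
        var result = new int[Size / 4];
        Array.Copy(_data, Size / 2, result, 0, result.Length);
        return result;
    }

    [Benchmark] // span slice: a new view over existing memory, no allocation
    public Span<int> WithSpan() => _data.AsSpan(Size / 2, Size / 4);
}
```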
This is a bit of an unfair test, potentially, depending on what the caller can accept: the others return an array, and here we return a Span<int>, so we're assuming the caller can work with a span and still use it optimally. If we had to turn it back into an array at some point, we'd probably gain less, but let's see what the span gives us, assuming the caller can work with it and stay on an optimized code path. When we run our tests again, with the span we're now under a nanosecond, and we have no allocations — so that's pretty good; we feel pretty good about that. If we look at 1,000 elements we're still below a nanosecond and still allocating nothing, and at 10,000 it's the same set of results: a constant-time,
constant-cost operation. Regardless of the fact that the initial array we're testing with gets larger and larger, on my particular machine this slicing operation takes about a nanosecond — on other machines it will vary slightly — but that's a pretty good set of results. You can already see from this trivial example that span can be pretty powerful if you're parsing data and just want a different view over it: you get that for essentially no cost. And we can do this with strings too. Strings are essentially just an array of characters under the hood, so we can call AsSpan on a string literal or a reference to a string, but what we get back is slightly different this time: we get a read-only span.
The reason for this, obviously, is that strings are immutable, and if we were given a read/write view over the data a string occupies, we could do some really nasty things to people who trust that strings are immutable. So we can only retrieve a read-only span, and now we have a view that we can read and potentially pass around. In this scenario, maybe I find the index of the last space character and then slice to get my last name. In this simple example there's not much gain to it, but if you're parsing a large number of tab-separated files, for example — and we'll have an example — you'll see that the benefits start to accumulate quite significantly when you only want parts of that data.
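A small sketch of that last-name idea — nothing is allocated until we actually need a new string:

```csharp
using System;

ReadOnlySpan<char> name = "Steve Gordon".AsSpan(); // read-only view over the string

int lastSpace = name.LastIndexOf(' ');
ReadOnlySpan<char> lastName = name.Slice(lastSpace + 1);

// No intermediate strings so far; ToString() materialises one only
// when we genuinely need it.
Console.WriteLine(lastName.ToString()); // "Gordon"
```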
Now, there are some important limitations to Span<T> that are really important to understand, so you know where and when you can't apply it. The first of these: it's a stack-only value type — I hinted at this before. What does that mean? Well, it's defined as what's called a ref struct, using a new keyword that was introduced in C# 7.2, and what that says is: this is a value type that can never be stored on the heap. Ordinarily, value types can end up in heap memory at some point in their life, but because a span could point at memory allocated on the stack, that would be really dangerous — the span could outlive the stack frame, sitting on the heap and pointing at memory that is no longer what it was originally looking at. Because this is built into the feature, we're given a memory-safety guarantee: we can only use the span in contexts that are valid for the memory it points to. That's why it can't be boxed — which makes sense, since we're not allowed to put it on the heap.
One of the changes this might force on you: you're often tempted to add a span as a field in a class or something similar, but because a field is part of a class on the heap, we run into the same problem. The bigger and more significant one, which I see people hitting, is that we can't use a span inside an async method — basically, we can't have it as an argument or as a local variable inside an async method — and much of our current code is asynchronous. In a moment I'll show you how to work around that, because I think for a lot of you it would otherwise be a showstopper. And the last thing is that, for a very similar reason, it can't be captured inside lambda expressions. The reason for those last two
may not be immediately obvious, because in the code it doesn't look like you're breaking the rules. But when compiled, an async method ends up as a state machine, and in the lambda case you can end up with a generated class for the closure, which means we'd be breaking the earlier rules by hoisting the span into a field on one of those generated classes. So those are the limitations we're up against. Fortunately, Microsoft also introduced Memory<T>, a very similar type to Span<T> with slightly smaller performance gains, which can live on the heap if necessary. This is the critical difference: it can't point at stack memory, but it does mean we have the guarantee that we can use it in some of the places where we can't use a span. It's defined as a readonly struct — not as a ref struct — and that's the key: the ref struct keywords are what ensure a type will only ever live on the stack.
It's a little slower if you're slicing it — it has a slice operation, but it's a little slower than what slicing spans will give you — so depending on what you're looking for, it might be perfectly fine to accept that kind of difference; if you're looking for maximum performance, though, you'll always want to get back to a span as soon as you can. Fortunately, that's very easy: Memory<T> exposes a Span property you can ask for, and that represents the same area of memory you're looking at with the Memory<T>. You'll combine these two together. So here we're trying to do something that makes the compiler very sad:
we're trying to pass a span into this method, which is asynchronous. The compiler checks this, sees the risk, and identifies that it's not a legal operation, so at this point our code won't compile. We can change the code to accept, in this case, a Memory<byte>, and now the compiler is happy again. The way we work with this is that we can slice, as I say, that memory, and it looks very similar to what we were doing with span. But we want every gram of performance, so we want to use a span operation for the real work. The way to structure this, usually, is to create a non-asynchronous method at the point where you're actually working with the data — because once you have the data in memory, there are no actual asynchronous methods left to call — and that method can take a span.
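The overall shape of that pattern is sketched below, with hypothetical method names — hold Memory<T> across the await, then drop to a span in a synchronous helper:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class Processor
{
    // Span<T> can't be a local in an async method, but Memory<T> can,
    // so the async method holds Memory<byte> across the await...
    public static async Task ProcessAsync(Stream stream)
    {
        Memory<byte> buffer = new byte[1024];
        int bytesRead = await stream.ReadAsync(buffer);

        // ...and we only drop down to a span inside a synchronous method.
        Parse(buffer.Slice(0, bytesRead).Span);
    }

    private static void Parse(ReadOnlySpan<byte> data)
    {
        // span-based parsing happens here, with no further allocations
    }
}
```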
If instead you try to prepare that span as a local variable inside the async method, you break the compiler again — it won't let you do it — so the way we fix it is to make sure we slice and pass the span directly into the method call here, and the compiler is happy with that: it can figure it out and do the right thing under the hood for us. This is typically what you'll see in asynchronous code: hold the data in Memory<T> where you need it, and then drop back to a span in the method that actually does the final optimized work. So — I've talked for quite a while, half an hour — now let's look at some code quickly. The scenario here is that I was looking to build a prototype to optimize some code we had in production. That piece of code reads from Amazon SQS, which is Amazon's queue service, takes the message, and stores it in S3, which is essentially Amazon's object store, their blob store. To do that, we need to give it a file name — an object key — and I want to use some of the properties from the original JSON structure to build it. Today the code deserializes the JSON to get the properties to construct the key, and then stores the object. So if I go over to Visual Studio — which I desperately hope works this time — this is the original code. I'll show you the original code, then the new code, and then we'll look at the benchmarks. I promise
I did it the right way — I benchmarked as I went along — but it's a little better for the reveal if I show you the numbers later. So here I have a method, and it takes in this event context, which in this case has only five properties; in our real scenario it has many dozens of properties, but we only cared about a few of them. What this original code did — and it had already been slightly optimized — was take the event context and calculate how many of the elements we'll actually use to construct the object key: basically, if there's no date, we use four elements.
If there is a date, we also use the date in the key. We allocate a string array of the appropriate size for those elements, and then we fill the parts of that array with each of the properties we're interested in. Within this GetPart method, what it does is say: if the value is null or empty, we just put in an 'unknown' part, and that unknown part is just a constant string we use when we don't actually have the data in the property we expect. If we do have the data, we remove the spaces inside the RemoveSpaces method down here — which is as simple as: if there's a space in the string, string.Replace it, quite reasonably. And then the final check is whether it's valid as an object key.
There are some rules about only having letters and numbers in the object key, so we have a regex that checks that for us, and that's essentially the code: we just fill in the parts of that array, optionally calling ToString on the date to format it if we have one, and at the end we use string.Join to combine all those elements into a final string. We want it lowercase for consistency, so we call ToLower, and then we have our string. It all seems reasonable. So now I'll show you the new one.
I'm going to scroll through it just to give you an idea — to be fair, this is a fair bit more code, which goes to the point about performance versus readability. As you can see at the top here — don't worry about it too much — I have some ugly-looking code where I'm defining things in such a way that I can guarantee I avoid a memory copy; I'm just forcing as much performance as I can out of this, and you don't need to understand exactly why for the purposes of this demo. The key thing I want to talk about is how I'm going to use span for the optimization here. Basically, I'm going to create an array of characters that I can use as a temporary build area for the file name I want to generate, and I'm going to gradually fill it with the elements as I go. What I'm doing first is pre-calculating the length, and the reason I do that on this line here is that down here I'm going to do something a little clever: if the key is less than 256 characters long — which on the happy path I normally expect — I'm actually going to use stackalloc to allocate the temporary character array on the stack. Normally that would mean using the unsafe keyword and writing unsafe code, because working on the stack is quite dangerous: you have to make sure you don't allocate too much, or you'll end up with a stack overflow exception — and then on stackoverflow.com,
trying to figure out why. But with span, if you're only working with small chunks of memory — and here the under-256-character guarantee makes me pretty happy — that's fine: I'm saying use the stack, but I can also say, if it's too big, use a normal heap-allocated character array instead, and I can treat them both as a Span<T>. This is where that transformation into a single type that deals with any memory location is quite convenient. So what I can do is construct the parts of the string in a similar way to before: what I pass here is my span, which is the output working area, and, by reference, the integer that tracks my position, so I can update my position as I go and slice to the appropriate point in the character array as I build it up. I'm basically doing what I did before, but a little more manually. If the value has zero length or is just whitespace, we use the unknown part, and you can see here what we're doing: we're basically copying it into that span, starting at a slice that begins at the current output position — at the beginning that's zero. Then what do we do next?
We update that position so we know how many characters we've written, so that the next time we take a slice we start from there and can continue writing into it. If we have data, we do the is-valid check, and we can simplify that — no regular expressions here, just a char-is-letter-or-digit check. And we finally end up down here, hopefully, where we use the MemoryExtensions ToLowerInvariant method, which means: copy this input string into this span area and lower-case it at the same time, letting the runtime do its optimizations. That's essentially what we're doing.
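A simplified sketch of that technique — stackalloc for the common small case, a heap array as the fallback, both surfaced through one Span<char>; the part and lengths here are made up for illustration:

```csharp
using System;

int requiredLength = 64; // in the real code this is pre-calculated from the parts

Span<char> buffer = requiredLength <= 256
    ? stackalloc char[256]        // lives on the stack: no heap allocation, no unsafe
    : new char[requiredLength];   // rare large case: ordinary heap array

int position = 0;

// Copy a part into the working area and lower-case it in the same pass,
// then advance our position so the next slice starts after it.
ReadOnlySpan<char> part = "Some-Part";
position += part.ToLowerInvariant(buffer.Slice(position));

// One final allocation: the finished key string.
string key = new string(buffer.Slice(0, position));
Console.WriteLine(key); // "some-part"
```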
Removing spaces is a pretty simple case of seeing whether there's a space there, using IndexOf on the span, and if there is, we can take the actual index where that space is and just replace the character — we can modify it through the indexer too. So that's the code. It's all on GitHub if you want to dive into more detail, but I'm going to go back to the slides and take a look at the results here. These are the results of my tests: the original approach was just over a thousand nanoseconds, and we're down to around 440 — roughly two and a half times faster. On its own that's not a huge improvement, and depending on our goals it wouldn't be particularly compelling, but what we were really after was allocations.
We originally had a theory that there could be a lot of allocations, and you can see the original code allocated because it used things like string.Join and then ToLower — because strings are immutable, that means we keep creating new strings for each of those intermediate phases of building that final key. We had allocations of over a thousand bytes there; in the new code, we have 192 bytes, and those 192 bytes are actually the length of the final string we allocate at the end of the method. So we've managed to build that string with no overhead, and that's a pretty significant improvement for us. Shown on its own a second ago, this doesn't seem very impressive, but expand it to what this service actually does — it sends 18 million messages per day through this flow — and that's about 17 GB of daily allocations we've avoided simply by modifying this particular method in the code, and about 2,700 Gen 0 garbage collections we've avoided creating the need for. At scale, this starts to pay a kind of compound interest on the investment you're making.
Moving on a bit, there are quite a few more features I want to touch on briefly. The first of them is ArrayPool<T>, which is useful in scenarios where you regularly need a short-lived buffer. You'll see this a lot if you're working with streams, for example: you need that buffer space to do some work and then get rid of it when you're done. With ArrayPool, instead of allocating, we go to this thing that holds a pool of arrays for us — they essentially live forever within our application — and the pool just gives us an existing one every time we need it. It's pretty powerful. It's in the System.Buffers namespace, and it's a generic type, so you can have a pool of arrays of whatever type you want. Usually what you'll call is the Shared property, to get an instance that's preconfigured with common pool sizes we might want to use. The reason it's a good idea to use this is partly that it takes care of figuring out how to actually implement a pool, but also that this pool is shared with the runtime: if we have a pool of int arrays and the runtime also uses the same pool of int arrays, there are probably already arrays in that pool that we can share and reuse. To get an array we call Rent and give it a length — say, I want a thousand bytes — and it will return us an array, with one important, really critical point.
That is: you will probably get an array that's actually larger than the one you asked for. That sounds a little strange, but it makes sense if you think about how the pool has to work: it has to keep a set of bucket sizes of arrays available so it can give you realistic reuse of those arrays. If the pool held every possible array size under the sun, you'd realistically only ever use each array once, and all we'd have done is give a lot of objects a much longer life without any real gain. So when you work with pooled arrays, you need to keep track of how much data you've written, so that when you then iterate that array, for example, you only go up to the length you're interested in. When you're done, you return it. That's important — otherwise you've just created an object and discarded it; it will eventually get GC'd, but all you've added is the overhead of a bunch of arrays with no real benefit, so you'll be worse off if you don't return it.
We can pass clearArray: true to the Return method — this is another important point. When you return an array to the pool, it is not cleared by default. That's a performance optimization — it saves the small cost of zeroing that memory — but it has confused some people, and it can introduce a potential security risk if you return arrays containing data you really don't want read later, so you can opt in to clearing with that flag. Either way, you should always remember that any array you receive from the pool may already contain data, so again, you don't want to just iterate the entire array, because you could be reading data at the end that you never wrote there in the first place. This complicates working with the array slightly, but as we'll see in the example, it's not very complicated to deal with.
Here I have a short-lived buffer that I want to pass to another method, and we have an allocation on the heap, so what I'm going to do is change that to use the shared array pool. I'll rent a thousand-byte array — the chances are I'll actually get 1,024 bytes back, or maybe 2,048, depending on what's available in the pool; it could be even bigger — but I have my buffer, and then I can work with it. Now, as I say, we need to make sure we put this back, so the usual recommendation is the try/finally pattern here, to make sure that whatever happens, we return the array to the pool once we're done with it.
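A sketch of that rent/return pattern — FillBuffer and Process are hypothetical stand-ins for whatever produces and consumes the data:

```csharp
using System;
using System.Buffers;

byte[] buffer = ArrayPool<byte>.Shared.Rent(1000); // likely 1024 bytes back
try
{
    int written = FillBuffer(buffer);       // hypothetical producer
    Process(buffer.AsSpan(0, written));     // only touch what we actually wrote
}
finally
{
    // Always return it, or the pooling gains nothing. clearArray defaults to
    // false; pass true if the contents shouldn't be visible to later renters.
    ArrayPool<byte>.Shared.Return(buffer);
}

static int FillBuffer(byte[] b) { b[0] = 42; return 1; }
static void Process(ReadOnlySpan<byte> data) { /* work with the data */ }
```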
We'll see the pool in action again in a moment, but I also want to introduce System.IO.Pipelines. This is the brainchild of David Fowler and the ASP.NET team: it was originally created to improve the performance of Kestrel. Kestrel does a lot of I/O — reading network requests from a socket and then sending them through the Kestrel web server into MVC or Razor Pages on the ASP.NET side — and that's a pretty heavy operation, because there were several handoffs between different streams through the different layers of the application. So they were looking for a way to optimize that code, and they found they could actually improve on working with streams by about 2x.
Now, technically you could write that code yourself — that's all they've done — but they've packaged it inside this type because it's actually very, very difficult to get right. The important thing about the way pipelines work, and how it differs from a traditional stream-based approach, is that with streams you allocate the buffers and manage them yourself; the pipeline does all of that for you. It will give you some memory to work with, you just read or write it, and you don't need to worry about buffers. It also uses the array pool internally, to avoid heap allocations for those buffers. And there are two ends to a pipe, which seems quite logical: we have a writer and a reader. So what does this look like?
Well, we have a pipe, so first let's work with the pipe writer. When we have the pipe writer, we can call GetMemory, and this gives us a Memory<byte> in this example. Why Memory? Because normally we're inside an asynchronous method when we're working through this — we're going to perform some I/O within the same method, so there's the possibility of asynchronous calls being made — so we get a Memory of bytes. We'd then do a little work with the memory we've been given to put some data into it — maybe like how I built the character array earlier for my object key, for example — and when we're done, we just call Advance with the number of bytes we've written, and then, when we're ready, we can flush. Flushing is the operation that tells the reading end that there's some data it might want to look at now. On the read end we have a pipe reader, and we can await the asynchronous ReadAsync method. It's a non-blocking operation by default, which is actually quite powerful: it won't do anything until there's something in that pipe that we haven't seen yet. We get back a ReadResult.
On the result is a buffer, and interestingly, that buffer is of type ReadOnlySequence<byte> — not a Memory. The reason is that the pipeline usually doesn't know in advance how much data it's going to be given: if you're streaming data from a network socket, you may get a small amount of data or a lot. So the pipeline has to be able to create buffers for you as more and more data arrives. As those buffers fill up, what it essentially does is create more Memory<T> from the array pool internally, and when it hands the data back to you, it gives you a ReadOnlySequence<byte> struct that essentially represents a linked list of those buffers, so you can work with them as one logical piece of memory. That's how you operate on the data you get out of the pipe at the other end. Again, this is easier to see in practice.
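A minimal sketch of both ends of a pipe as just described — FillWithData is a made-up stand-in for whatever produces your bytes:

```csharp
using System;
using System.Buffers;
using System.IO.Pipelines;
using System.Threading.Tasks;

var pipe = new Pipe();

async Task WriteAsync()
{
    Memory<byte> memory = pipe.Writer.GetMemory(512); // pipe supplies the buffer
    int written = FillWithData(memory.Span);
    pipe.Writer.Advance(written);       // commit the bytes we wrote
    await pipe.Writer.FlushAsync();     // signal the reader that data is ready
    await pipe.Writer.CompleteAsync();
}

async Task ReadAsync()
{
    ReadResult result = await pipe.Reader.ReadAsync(); // waits until data exists
    ReadOnlySequence<byte> buffer = result.Buffer;     // possibly several segments
    // ... parse the sequence, then report how far we consumed:
    pipe.Reader.AdvanceTo(buffer.End);
    await pipe.Reader.CompleteAsync();
}

static int FillWithData(Span<byte> destination)
{
    destination[0] = (byte)'!';
    return 1;
}

await Task.WhenAll(WriteAsync(), ReadAsync());
```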
First we unzip it, which is kind of important for this example, but then what we're doing is parsing that data and getting basically just three parts. three particular properties of each row and storing them in elasticsearch, so let's see the code for this and again I will do the before and after type of code and then we will see the actual improvement, so this is the code above. Again, I'll go over it quickly, so we're essentially reading in this case from a file just because I didn't want to use s free in my example, but we're also retrieving a history stream and decompressing it. and finally what the original of this code was doing was unpacking the entire stream into an array to create one long string representing all the data in that file and the reason they were doing this is that they are using this lie. we call small CSV stacks, of which at least at the time I was sold it was only accepted to work with strings, these files had about ten thousand rows each, so if you start to think how much memory there could be if we created an array of bytes representing the file. and then we were able to create a string for it, this raised some warning flags and we actually had some memory leak issues with this service which led me to look at it initially, so what I did instead was regret what I did. they made.
we were using this little CSV parser because it has a pretty nice fluent syntax for saying 'I want to read this and map it to these properties', so the reason they used the library was readability and convenience — but I had some concerns. So, the new code. Again, I'll show it for comparison; it's longer again, but hopefully we can follow it. We start with the decompression stream as before — we have to use a stream in the early stages because there isn't currently a pipeline implementation that can decompress, although that may well come in the framework — but once we have a stream, we can just create a pipe reader on top of that stream and start working with, essentially, the other end of the pipeline, with the framework taking care of writing into it. We do the asynchronous read, as we saw in the example, we get the buffer, and finally we call a parse-lines method, passing in the buffer. What that does is new: we have a SequenceReader, which is the type that allows us to work with a sequence in a very convenient way — because that ReadOnlySequence type is a little complicated to read directly — and the SequenceReader
makes it pretty simple. It has a method, TryReadTo, where I can say: look for, in this case, the byte that represents a newline character. If I have a complete line, I know I can pass it on to be parsed. If I don't have a complete line, we stop there and wait until more data comes through the pipeline. When we have at least one complete line, we try to parse it, and when parsing a line we now drop into a span-based approach. All I'm looking for in this case is, within that span, the index of the tab byte, counting my position by using that tab character as a way to identify where I am. When I'm at tab count one, I know I want to read that field, so all we do is slice from — in this case — the start position to where that tab is, and now we have some data that we can immediately turn into a string, because we're actually going to index it into Elasticsearch, and we only do that for the items we want. We also break out early — we only process up to the eleventh tab, because after that we don't care about the rest of the line — so we've done some optimization to not even parse data we don't care about.
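A sketch of that line-splitting loop under the assumptions described — SequenceReader<byte> walking the ReadOnlySequence from the pipe, one complete line at a time:

```csharp
using System;
using System.Buffers;

public static class LineSplitter
{
    private const byte NewLine = (byte)'\n';
    private const byte Tab = (byte)'\t';

    public static void ParseLines(ReadOnlySequence<byte> buffer)
    {
        var reader = new SequenceReader<byte>(buffer);

        // TryReadTo hands back everything up to the delimiter and consumes it.
        while (reader.TryReadTo(out ReadOnlySpan<byte> line, NewLine))
        {
            ParseLine(line); // span-based parsing of a single row
        }
        // Anything left over is an incomplete line; in the pipeline version
        // we leave it in the buffer and wait for more data before parsing.
    }

    private static void ParseLine(ReadOnlySpan<byte> line)
    {
        int tab = line.IndexOf(Tab); // locate the next tab-separated field
        ReadOnlySpan<byte> firstField = tab >= 0 ? line.Slice(0, tab) : line;
        // ... slice out only the columns we care about, and stop early
    }
}
```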
As for the performance gains, let's look at the results. I ran this test over 75 files, each with about 10,000 rows, because that was representative of a load I saw quite frequently in the service. The original took about eight and a half seconds, and we're down to around a second in the optimized code. That's pretty good — not hugely exciting; this is a background processing service that could probably live with eight and a half seconds — but the allocations were a really revealing story: 7 GB of allocations reading those files in the original version, because of that byte array we create and then that in-memory string copy. There's a lot of copying, a lot of short-lived allocations. We have 242 megabytes in the new version, which is a fairly significant drop — roughly thirty times less.
What was really interesting is that 203 megabytes of that is the strings I'm producing to send the data to Elasticsearch, which means we're under 40 megabytes of overhead to get this job done now — pretty significant. And I can actually optimize a lot further, because Elasticsearch accepts bytes, so I could avoid those string allocations in my code in the future. So there were quite a lot of wins we were able to get with the pipelines approach in that example. Finally, I want to touch on the JSON APIs that came in .NET Core 3.0 — System.Text.Json, which we heard quite a lot about when it arrived; a good number of you will have read the blog posts. So this came in the box.
When Microsoft announced it, it caused a bit of heat on Twitter. When they first said what they were doing, everyone started stomping on Microsoft, saying: right, you're going after the open-source community once again — we have Newtonsoft.Json, we like it, why are you building your own now? Fortunately, to help alleviate that, James Newton-King, who wrote Newtonsoft.Json, came along and wrote a blog post explaining why he thinks it's a good idea. Now, to be fair, he's employed by Microsoft these days, so in some ways they pay him to agree with them, but hopefully his points are fair and he's representing his own point of view. The reasons for this were twofold. One of the problems is that ASP.NET Core relies on JSON parsing quite heavily — to model-bind data coming in as JSON, and to serialize data in web APIs — and this means ASP.NET Core needs to pin to a particular version of Newtonsoft.Json that they know works. That in turn means that if you want to use a different version in an ASP.NET Core application, you often can't, which was a bit annoying for people who wanted the latest Json.NET features but were tied to whatever ASP.NET Core shipped with at the time of release. The other issue is performance. There are ways to optimize it — obviously with things like Span<T> — but to retrofit that into Newtonsoft.
Json, which is installed in so many millions of codebases, would have introduced breaking changes to the API surface, and that would have been a nightmare for anyone moving to the next version. So instead of breaking everyone in the world, they decided it probably made sense to build something new in the box that ASP.NET Core can rely on; optionally you can use it too, or you can carry on using Newtonsoft.Json. There are three layers in the way they've structured this. At the bottom, the lowest level, the one that gives us the most performance, we have Utf8JsonReader and Utf8JsonWriter.
This is where it gets really involved; we'll look at some code for it in a moment, but this level lets you get the most bang for your buck if you're really chasing performance, because you work directly at the token level. A slightly less verbose and still fairly efficient way to read data is the middle layer, JsonDocument, which is essentially a read-only view that in most cases may not even allocate dynamic memory while it's working; it basically gives you a JSON document object model you can walk through if you want. Then at the highest level we have a normal serializer and deserializer, which are optimized; they're faster than Newtonsoft.Json, but they have lower performance than working at the low level yourself, because they have to be a bit more generic, a bit more general-purpose, for whatever JSON structure they're given. The serializer shipped with a fairly limited feature set, just enough for ASP.NET Core's serialization needs, primarily; that's improving in new versions of .NET Core, but bear in mind that (a) it has limited features and (b) it's quite strict about the JSON it accepts.
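As a quick orientation to those three layers, here's a small sketch; the IndexResponse type and the payload shape are hypothetical, used only to show where each layer sits:

```csharp
using System;
using System.Text.Json;

public class IndexResponse
{
    public int Took { get; set; }
    public bool Errors { get; set; }
}

public static class JsonLayers
{
    public static void Demo(string json)
    {
        // Highest layer: JsonSerializer, the general-purpose (de)serializer.
        IndexResponse typed = JsonSerializer.Deserialize<IndexResponse>(json,
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true });

        // Middle layer: JsonDocument, a read-only view over the JSON that
        // pools its internal buffers rather than building a full object graph.
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            bool errors = doc.RootElement.GetProperty("errors").GetBoolean();
            Console.WriteLine($"errors: {errors}, took: {typed.Took}");
        }

        // Lowest layer: Utf8JsonReader / Utf8JsonWriter, forward-only over
        // UTF-8 bytes; a fuller example of the reader appears further below.
    }
}
```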
Newtonsoft.Json is pretty good at figuring out what your JSON probably means even when it isn't completely valid; System.Text.Json will usually throw an exception if it finds invalid JSON. This is easier to see in practice, so here I'm going to do an example where I index something into Elasticsearch. It doesn't matter if you don't know what Elasticsearch is; it has a bulk operation, which is essentially a REST API I can call to pass it some data to store in the index.
What I need to do when I receive the response is deserialize it, to identify whether the operation completed successfully overall and, if it didn't, which of the individual items I indexed failed, because what we want to know is which ones we need to retry. This is actually a difficult problem to solve in a really efficient way with a regular JSON serializer, so let's look at the original code. The nice thing about this code is that even at this ridiculous font size I can fit the entire method on the screen, so that's good: we're basically just putting a JsonTextReader over the response stream and deserializing it to our type. Now, inside that type,
the data you get back from the Elasticsearch API is basically how long it took, a true or false for whether there were any errors, and then the collection of all the items, which is mostly the metadata for the request we made. From that we can say: if the top-level errors property reports no errors, then it's a quick return, a success of true and an empty array of failed items, because there's nothing to report. If there are errors, then we have to go through the items to find the ones that carry a failure status code, and then we have the IDs of the ones that failed.
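A sketch of what that original approach plausibly looks like; the type names are hypothetical and only loosely modelled on the Elasticsearch bulk response, so treat this as an illustration rather than the talk's production code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Newtonsoft.Json;

public class BulkResponse
{
    public int Took { get; set; }
    public bool Errors { get; set; }
    public List<BulkItem> Items { get; set; }
}

public class BulkItem
{
    public BulkItemIndex Index { get; set; }
}

public class BulkItemIndex
{
    [JsonProperty("_id")]
    public string Id { get; set; }

    public int Status { get; set; }
}

public static class OriginalParser
{
    public static string[] GetFailedIds(Stream responseStream)
    {
        using var streamReader = new StreamReader(responseStream);
        using var jsonReader = new JsonTextReader(streamReader);

        // Deserialize the entire response into an object graph, allocating
        // every item even when nothing has actually failed.
        var response = new JsonSerializer().Deserialize<BulkResponse>(jsonReader);

        if (!response.Errors)
            return Array.Empty<string>();

        // Collect the IDs of items that came back with a failure status code.
        return response.Items
            .Where(i => i.Index.Status >= 300)
            .Select(i => i.Index.Id)
            .ToArray();
    }
}
```

The whole object graph, every item, gets allocated even in the common case where nothing failed, which is exactly the cost the rewrite below avoids.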
So, on to the new approach. You can see by the scroll bar that it's a little bit bigger, and again you really have to make a decision about whether this is worth it in your scenario. There are some slightly horrible things at the top we can skip over, but note that here we're using the ArrayPool, which is fantastic. In this case, instead of using pipes I'm working with the stream directly, so I create the buffer for reading from that stream, but I rent it from the array pool, which gives us the flexibility to at least not allocate a new temporary buffer for every call. I'll skip some of the detail, but ultimately what I do is read from the stream into my buffer and then try to parse out the errors, and I pass a lot of things by reference here because I'm keeping track of my state as I come back into this method; I may not have the entire JSON blob at once.
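The rent-and-return pattern described there looks roughly like this; the buffer size and method shape are assumptions for the sketch:

```csharp
using System;
using System.Buffers;
using System.IO;

public static class PooledReader
{
    public static void Process(Stream stream)
    {
        // Rent the read buffer from the shared pool instead of allocating
        // a fresh byte[] on every call.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
        try
        {
            int bytesRead;
            while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Parse buffer.AsSpan(0, bytesRead) here, carrying any
                // partially-consumed state across iterations by reference.
            }
        }
        finally
        {
            // Always hand the buffer back, even if parsing throws.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

Renting from ArrayPool&lt;byte&gt;.Shared means the same buffers get recycled across calls instead of becoming short-lived garbage.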
It probably won't, so to look for the errors I use the Utf8JsonReader. Its constructor also takes an isFinalBlock boolean and some state, because this is a reader that accepts the possibility that you're working with streams, or data arriving in chunks: you can keep calling it with the last state you had and carry on parsing further and further into the JSON. Ultimately what we call is Read on that reader, and once we've done that we get the token we're on. We're at a low level now: we're looking at the opening brace of an object, or we're looking at the beginning of an array, but we have these tokens we can work with. All this fairly horrible switch/break code is matching those tokens against the known structure I know I'm getting: where is the errors property, what is its true or false value, and potentially, if I need them, the items. The interesting thing I can do here is that if I find the errors property, and then on my next loop I find false as the value of that property, I can exit here and essentially short-circuit the whole process: if the errors property says there are no errors in that document, stop.
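A simplified sketch of that short-circuiting loop; for brevity it assumes the whole response is already in a single buffer, whereas the real code carries a JsonReaderState across chunks and only sets isFinalBlock on the last one:

```csharp
using System;
using System.Text.Json;

public static class BulkErrorsParser
{
    // Returns true when the top-level "errors" property is false, without
    // reading any further into the document.
    public static bool IsSuccess(ReadOnlySpan<byte> utf8Json)
    {
        var reader = new Utf8JsonReader(utf8Json, isFinalBlock: true, state: default);

        bool onErrorsProperty = false;
        while (reader.Read())
        {
            switch (reader.TokenType)
            {
                case JsonTokenType.PropertyName:
                    onErrorsProperty = reader.ValueTextEquals("errors");
                    break;

                case JsonTokenType.False when onErrorsProperty:
                    return true;  // no errors: short-circuit, skip the items

                case JsonTokenType.True when onErrorsProperty:
                    return false; // errors present: caller must scan the items
            }
        }

        return false;
    }
}
```

In the common success case this returns after a handful of tokens, never touching the items array at all.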
I don't need to go through the rest of the JSON at all; that property sits within the first 20 or so bytes of the JSON, so I can optimize this code to avoid parsing anything I don't care about. Only in the failure case do I have to carry on, identify the failures and find all the relevant IDs for the messages that failed. You can take a look at that code in your spare time if you're bored on a Sunday night, but otherwise I'll move on to the results, which are a bit more fun. In the failure-response scenario I'm slightly frustrated with the numbers, because I'm sure there's more performance to be had, and that's the risk you take with this kind of work: once you start, every allocation you see nags at you. So even though we've gone down from 114 KB to 16 KB in the failure scenario, that 16 KB feels like a failure; but really this is a pretty good gain, and the failed response is very unlikely in our case anyway, as we don't expect things to fail in the indexing process, so the successful response is more interesting.
The success response in the optimized flow is much faster, because we read the first 20 bytes, see that there are no errors, and just stop doing any more work, allocating only a frustrating 80 bytes that really should be zero; I need to spend some time in that code to figure out where those 80 bytes are coming from. But it's good enough as a prototype, and we've shown that in this 99% case we're saving a good amount of allocations on what will be a pretty heavy and common operation in this particular scenario. So, wrapping up as we come to the end, the thing you might be thinking is: great,
I want to do some of this, but how do I get the business to let me do it? Well, you can be crafty and just do it, and that's what I've done here: all of these prototypes are little lunchtime projects I've built here and there where I've seen things that might be interesting. To take this forward, what I suggest is that you look for one of those kinds of opportunities, like I did, where I saw large allocations. In my case we had a service that was actually failing because of memory allocation, so that was a clue that we should probably do something; but even then, just by casually looking at that code, you can see that the array allocation and the string allocation were big heap allocations. So look for something like that, where you can say: yes, that's genuinely bad. Then use the scientific method of actually measuring first, optimizing a small piece of code, and measuring again, to give you numbers that really let you quantify what you've accomplished. And don't go to the product owner and say:
"I've removed 1,000 bytes from this particular method and now it's only 400 nanoseconds", because they don't give a damn about anything you're saying at that level. Convert it to a monetary value: try to establish for them what the value of this is, and give them some kind of cost-benefit ratio, so they'll give you the work time to move this forward. Here's an example we had. We have a service that's an ingest process, one of the things handling those 18 million messages a day, so a pretty typical high-volume workload. What I did was all these individual prototypes, plus a few I haven't had time to show you, attempting to find the various parts of that flow I could potentially optimize. After some perhaps slightly optimistically biased estimates, I've said I think I can remove 50% of the allocations within this little microservice and probably double its throughput, based on the runtime changes I'm seeing for the various phases. Then I went and looked at how much scaling this service has to do within our container cluster, to find that I can probably get rid of enough scaling to save a VM in our cluster on a daily basis, which in our situation is around $1,700 a year for that one service change. Now, that might not be enough to justify a developer spending their time on it, but some of those processes, particularly in microservices, if you're manipulating a lot of data, you will
be doing the same thing over and over again across many services, and if we can lift and replace similar parsing logic for those particular requirements, we can multiply this saving pretty quickly across the hundreds of services we might be talking about. So you can start showing this value, and then you can see whether you can get the business to accept this idea of optimizing the code where, contextually, it's worth doing: if there's a good cost saving and the code isn't changing a lot, this might be good for you. So, in summary, because this has been a lot of material for one afternoon:
remember what I said at the beginning: everything you've seen is really mainly for advanced situations. This isn't about coming back to the office on Monday and saying, wow, I'm changing everything, and spraying spans everywhere. It's really tempting, and when you start it's hard to stop, but use it sparingly, be careful where you're applying it, and measure everything. Assuming rather than measuring can be really dangerous: you can make changes that actually degrade performance, or have a knock-on effect somewhere else that you couldn't even foresee. Be scientific about it. BenchmarkDotNet is a fantastic tool and quite easy
to get started with initially, and it gives you the kind of precise numbers you can work with. Focus on the hot paths; don't waste all your time optimizing something called once in a blue moon. It may be great that you got it to zero allocations, but it isn't actually going to change much over time in terms of the overall profile of that service. Don't copy memory where you can slice it: work with the Span APIs when you're parsing string or byte data, and use those array pools; that's a pretty quick and easy change, actually. You can make fairly small code changes and it's still pretty readable code when you're done.
It's not as bad as some of those demos I showed you. Pipelines, in general, are great when you have a lot of I/O, so keep that in mind; and there are the new APIs for JSON, so if you're doing the kind of JSON parsing I've shown, you can make some really efficient API calls and parse the responses you get with those APIs. A book that I recommend: Pro .NET Memory Management. Konrad is here at the conference; he gave a talk yesterday about zero GC. A fantastic book; it's that thick, so if you carry it with you, not only do you get stronger, you're protected at night: if someone comes to attack you, you swing your bag and knock them out. It's really worth having, and it's where I learned a lot of what I've shown here about how .NET manages memory and how to optimize around it.
I'll leave the link on the slide there again. If you want to get in touch later, follow me; I tweet about this kind of thing, ASP.NET Core and .NET in general, or you can check out my blog, which is dedicated to these technical topics. Otherwise, thank you very much; time's up, thank you.
