
Linux Performance Tools, Brendan Gregg, part 1 of 2

Mar 27, 2020
Well, thanks John. In the next 90 minutes I'll cover Linux performance tools. I am very excited about this tutorial, and what I will help you do is, for any application — it doesn't matter if it's a database or a website, any application running on a server — I'm going to show you how to start analyzing performance, how to go through a sequence of meaningful analysis, and give it an end point. That's the hardest part of performance analysis: where do I start? There are so many tools, there are so many metrics. One thing that's great about Linux is that it's generally the same Linux you run on different parts of your stack, and once you understand the operating system you have a way to approach performance regardless of what part of the environment you're looking at, because it's common. It also doesn't matter if you're going to use one of the monitoring tools instead:
There are so many great monitoring tools these days, and they're based on the same metrics we use in Linux, so being able to understand operating system performance is a critical skill, and it works for any application. I'm going to review a selection of performance tools to show you what can be done, with guidance about how to do it. I have a lot of content in this talk, and a big part of it is giving you exposure, so that later on, when the need arises, you know that something is possible — which matters more than knowing exactly how to do it. If you don't know that something is possible, and your company really needs that analysis to determine the root cause of a problem, you will never go looking for it. So a big part of this tutorial is sitting down and getting exposure to all the different things we can do, and you'll remember it later.
I'll be going through this tutorial at high speed. The slides are online. The videos are online. You can look them up later, study them, and learn the parts when the need arises. In this tutorial I'll discuss goals and do some live demos. I work at Netflix — it's an amazing company to work for, we have a huge AWS EC2 cloud — and one thing that really excites me about Netflix is our culture, our culture of freedom and responsibility. It's something that is often discussed in various forums about the cultures of different companies, especially in the US, and while Netflix is innovating and doing incredible things in terms of the cloud and in terms of our CDN that does the streaming, I'm also excited about what we're doing culturally within the company. It makes a lot of sense: it has given me the freedom to address areas of Linux performance that I felt were missing and improve them, publish tools, and work with other great people at Netflix on coming up with better tools. So I get excited not only about what we're doing but also the technologies we're using and the culture. In this tutorial I'm going to go over methodologies, which show you how to approach performance issues; tools — many tools, different types of tools — and then two big categories of tools, profiling and tracing; I'll apply the methodologies throughout these slides, and I'll do some live demonstrations.
I've listed objectives so that these can be your takeaways as I go through this content and also when you review it later — if you look at the slides online, what are the kinds of things I really want you to learn? This matters because it's something you get when you're guided by someone, instead of just doing an internet search on a topic on your own and drowning in information. When someone guides you, what's really helpful is that they tell you what you don't need to know, because when you look something up, or open a book and try to do your own homework, it can be off-putting — especially with Linux performance tools — just how much there is. If someone is tutoring you, they can say: don't worry about this, don't worry about that, you need a little of this, and you need to know this very well. That is a big part of this tutorial: it's what I've selected to fit in ninety minutes, and what I've selected as objectives that will help you navigate the giant field that is Linux performance. So: be able to recognize the streetlight anti-method, perform the workload characterization method, perform the USE method, and so on.
The first thing I want to start with is "my system is slow", and I'd like to do a demo, so I'm going to jump out of the slides. I've used this model to teach before — in consulting, and when I used to teach performance classes — and I've tried different ways of helping other people understand how to do this. One of the most effective is to create mock failures that people solve for themselves. I usually do this in, say, a five-day class, and people can work through them. I've created some of these bugs, and what you see here is a binary program that I can run that simulates some known problem.
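For example, a couple of trivial mock problems might be nothing more than the following — these are just illustrative sketches of the idea, not the actual lab binaries from the class:

    # mock-cpu: a hypothetical mock problem that burns one CPU in user time
    while :; do :; done

    # mock-io: a hypothetical mock problem that hammers the disks with small direct writes
    dd if=/dev/zero of=/tmp/mockfile bs=8k count=100000 oflag=direct

Write something like this in whatever language your shop uses, give it to someone without telling them what it does, and have them diagnose it with the tools.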
A big problem people have when learning performance tools is that you try to learn them when your company is on fire, your boss is yelling at you, and your customers are running around unhappy — and that's when you're actually trying to learn all these tools. It's like when people learn to ski: some people say you go to the black diamond slopes, and if you make it to the bottom without dying then you've learned to ski. That works for some people, but many people find it helpful to take lessons and master one skill at a time, and that's what creating little mock bugs can help with. There's live open source material out there, but the source code itself is actually not that interesting; it's more useful to write the mock problems yourself. If you're a Java company, a PHP company, if you use Node.js — whatever the language — write some very short mock problems and just practice solving them, so you have a known problem before you point the tools at it. So here is one of those problems that I wrote, lab002, and let's say the customer says that performance on this system is slow: "I am not satisfied with the performance."
What tools can we run? Many people like to start with top — remember, I'm running this thing called lab002. In top, do I see it? No, I don't see it in the command list, so I guess it's probably not consuming CPU. Someone tell me what the next tool is — run iotop? Oh yeah, I'd need to install that; I'll get to iotop later, that's cool. I tend to like going down to the lower-level interfaces to solve the same problem: I can run iostat and get much of what iotop would give me, and then some.
iostat shows me that the disks are basically idle, so that's interesting. Someone said netstat — I never run it without switches; I have no idea what plain netstat even does, so the next one will be netstat -s, and there are tons of metrics in netstat -s. Yeah — oh, you want sockets? Fine, sockets, okay. What other tools can I run? Remember, the problem is that the system is running slow. I just ran top and it showed nothing.
I ran iostat and it didn't show anything either. What's that? Sorry — dstat, another tool; I guess you'll find out which tools I don't use and which tools aren't installed, because it's not on here. dstat is fine, but I tend to get that information from sar, since I want to see what's happening at the interface level and sar has a lot of good stuff. I know some people prefer dstat and I'm not going to judge, that's fine. Someone said another command — vmstat, that's excellent, and it shows a pretty inactive system. Or strace — an strace of what? An strace of the process, okay, that's good. What's that? Check the resolvers.
I can check the resolvers, and that would be interesting — I could see whether DNS is slow, and in a sense we were already debugging that problem without meaning to, because logging in to the system was getting slow, and I know that smells like DNS, since I'm familiar with 30-second timeouts. So strace — it's interesting; I don't like using strace normally because it slows things down, but strace shows we're sitting waiting in accept(). Yeah — I mean, we cheated a bit: it didn't even show up in top, so I had to know other commands.
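As an aside, a rough sketch of checking the resolver-timeout theory from a shell — the hostnames here are only placeholders, not part of the original demo:

    cat /etc/resolv.conf                   # which nameservers and timeout options are configured
    time getent hosts example.com          # time a lookup through the system resolver
    time getent hosts nonexistent.invalid  # a failing lookup exposes long retry timeouts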
Someone asked: what do they mean by "the system is slow"? Quantify it — give me a measurement; is it latency? Where do I start — here, at the command line? This is an interesting one, because the program doesn't do anything: all it does is call accept() and then wait there. I've used this when teaching performance classes, and before students know the methodologies, people can get stuck on this for hours and hours, looking around trying to find a starting point. One of the good things about performance methodologies is that you can learn to exonerate a system — run through a methodology in, say, five minutes — and then go back to the client and say: I've checked the system resources, they're good.
I don't think it's a system problem — how about you look in your own code? That's something this lab helps teach people, because you can chase your tail for hours trying to think of which tools you haven't tried yet, staring at complicated diagrams and finding nothing — versus having a methodology that has an endpoint. Then it's: well, I've looked at these things, I've gone through my checklist, I don't think this is a problem with the system; it goes back to you.
In fact, getting the customer to help is its own methodology, where you ask the customer questions to narrow down the problem before you start running tools, and try to solve it that way. There are dozens of performance tools for Linux, in packages like sysstat, procps and so on, and the methodologies provide guidance on choosing and using them effectively: they give you the starting point, the process, and the end point. I'll start with the anti-methodologies — the lack of a deliberate methodology. The streetlight anti-method is named after a parable about a drunk who looks for his keys under a streetlight at night. A police officer meets the drunk and asks, "What are you doing?" The drunk says, "I've lost my keys." The officer asks, "Did you lose them under the streetlight?" and the drunk says, "No, but this is where the light is best." This is an anti-method because it's something I see people tend to do: let's start with the tools that are familiar to me — let's run top, let's run iostat, let's run dstat, and let's go back to top again because those tools didn't answer it. It's a little haphazard; sometimes it works, and that's fine, but many times it's inefficient. Another methodology I've actually named is the drunkard's anti-method.
The streetlight anti-method is an observational methodology — I'm reaching for familiar tools to observe the system. The drunkard's anti-method is an experimental methodology: I'm changing things randomly until the problem goes away. It is actually useful, at a price. I named this several years ago at a previous job: I went home one night and came back the next morning, and my coworkers said they'd had a performance issue while I was gone; they didn't get in touch, but they'd fixed it — using the drunkard's anti-method — and I immediately understood how they'd arrived at these strange tunable settings: they just guessed until the problem went away. So it was really helpful to give it a name.
In terms of communication, there's also the blame-someone-else anti-method, and it's something you can see going back a long way; it always seems to come up. You know: maybe it's the network — can you ask the network team if they're dropping packets or something? Well, maybe it's the firewall; maybe it's the crypto cards, and that's a separate team, those guys. And when it comes back to you — aha, no, it'll be something else, it'll be the database, ask them. The actual methodologies we're going to follow include the problem statement method, the workload characterization method, the USE method, and so on. The problem statement method isn't specific to Linux tools, but it's worth including in this slide deck because it's really useful; and as someone suggested when I was working on that demo problem, I would ask the client how the performance degradation can be expressed — in terms of latency or run time — because if you ask that, it gives you a much more specific starting point for analysis. This is something that actually came up when I was at Sun Microsystems: the support staff would check the basics every time a performance ticket was raised. What makes you think there is a performance problem? Has the system ever performed well?
I've had issues in the past where people file a ticket and say: my database performance is terrible, you have to fix it, it's atrocious. And it's like — oh my God, what changed? I look around the system and I don't see that anything has changed; the system has always been like this. And of course they tell you later: but it has always been like that. You gave me the impression that this was an emergency, and it turns out it's been going on for six months and no one fixed it. So I always remember to ask: has it ever performed well?
Was it like this for six months, or did it just start? What changed recently? And other questions like that — an easy checklist solves a lot of problems right at the ticket level, so if you're in JIRA or whatever, you can genuinely resolve performance problems there before you go to the command line or run any performance tools. The workload characterization method is the first real methodology where I use metrics and tools, and it's where I want to see the workload applied to the system, not the resulting performance. It's a big problem when issues come up, and naturally it comes up at Netflix too.
I see this everywhere: when there's a problem, is it because we have a massive amount of load right now, or is it that, for the same load, the server is misbehaving? Answering that question can be tricky. Workload characterization asks: who is causing the load; why the load is being issued — as in the code path or stack trace, the reason; what the load is, with attributes like I/O per second, the type of database query, whether it's reads or writes or metadata; and how the load changes over time. The workload characterization method solves a lot of problems because, without realizing it, someone is accidentally running a test in production that should have been pointed at the test servers, or you've just released season four of House of Cards and suddenly there's a huge amount of load, or maybe you launched it earlier than planned by mistake and there's a large amount of load.
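A rough sketch of the kinds of commands I'd reach for to characterize an applied workload on a Linux box — a sketch, not a complete recipe, and archive paths vary by distro:

    pidstat 1             # who: which processes are applying CPU load, second by second
    iostat -xz 1          # what: disk IOPS, read vs write mix, I/O sizes
    sar -n DEV 1          # what: network packets and throughput per interface
    sar -n TCP,ETCP 1     # what: new TCP connections (active/passive) and retransmits
    sar -q -f /var/log/sa/sa10   # how it changes over time: read the sysstat archive for the 10th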
Workload characterization solves a lot of problems and is also something that performance monitoring tools are generally good at, so, as I said, there are a ton of performance monitoring products out there, and these are the reasons I use them and the metrics I analyze in them. The USE method is a different methodology: take a functional or block diagram of the system and then, for each resource, check three metrics — just utilization, saturation, and errors. The nice thing about this is that for the hardware of a system there may only be a dozen or so components.
You can do this for software systems too, where you pick, say, a dozen key components in your environment; you can do it for your cloud environment and pick the core components there. You'll end up with only 48 or so metrics you need to check — it's not like dumping every statistic and getting a huge page of stuff — so it reduces the scope, and narrowing down is useful. Another thing that's really helpful is that it encourages you to look at where your current tools fall short: it raises the questions before the answers. Very often with monitoring tools and metrics we start with the answers and then try to figure out which questions they belong to, so the USE method is great at covering what you're not looking at. As you can imagine, for a hardware system:
I've drawn some hardware components here: the CPUs, which I'd check with top; the disks, which I checked with iostat; the network ports, which I checked with netstat and sar — but there's a bunch of components I didn't check. I didn't check a lot of things: I didn't check for disk errors, I didn't check what's happening on the I/O bridge, on the I/O bus, or the expander interconnects. Any component in your data path can hurt performance; any component in your data path can become saturated — that's where there's more work than it can do in a given period of time, work queues form, and that adds latency.
Any component in your data path can generate errors, and errors — the red areas — are really good to look for because they're usually easy to interpret. That's the USE method. Now, that's a hardware picture, but you can draw a picture of the internals of your database, your front-end application, your full cloud environment, and do the same process: I have a cache here, Redis here, Cassandra here, and I'm just trying to find those three high-level metrics — utilization, saturation, and errors. It's an exercise in posing the questions, and quite often you'll find there are things you're not monitoring that you probably should be. I've done this exercise for Linux in the past and put an online checklist on my home page.
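As a rough first pass — a sketch of the idea, not the full checklist from the home page, and column names vary a little between versions — the hardware USE metrics map onto commands roughly like this:

    mpstat -P ALL 1      # CPU utilization, per CPU
    vmstat 1             # CPU saturation: run-queue length "r" greater than CPU count
    free -m              # memory utilization: free, buffers, cached
    vmstat 1             # memory saturation: si/so columns show swapping
    dmesg | tail         # memory and disk errors: OOM kills, I/O errors
    iostat -xz 1         # disk utilization (%util) and saturation (queue length, await)
    sar -n DEV 1         # network interface utilization: throughput versus line rate
    sar -n EDEV 1        # network errors and drops
    netstat -s           # protocol-level saturation hints, such as TCP retransmits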
The full checklist is very long and actually quite tedious to complete. It's something where I would really like to see performance monitoring companies step up and build a wizard that gives us these metrics at the click of a button. It's something we're working on internally at Netflix, where we have multiple open source projects: one for cloud-wide monitoring called Atlas, and one for instance monitoring called Vector — both are open source — and I've been reviewing Vector to make sure we can do as much of the USE method as possible from that tool. It is just a methodology: it isn't going to solve everything, it solves a particular class of problems. Another methodology that gets a little complicated, but which I wanted to mention, is the idea of doing off-CPU analysis. There are a number of problems at the device level that the USE method finds easily — okay, my disks are busy, my network interface is busy, my bus is busy — but there are many much more complicated problems that involve, say, lock contention, or I'm stuck waiting on a condition variable, or I'm stuck somewhere in the network stack, and there are a lot of these. I might be blocked because I'm swapping or doing anonymous paging.
I might be blocked because the kernel involuntarily context-switched me off the CPU and I had to wait my turn on the run queue. Off-CPU analysis is great because it gives you one way to attack all of them: instrument when my application leaves the CPU. It's a gateway to all of those different problem types — if I can instrument when I leave the CPU and why I left the CPU, I can attack all of those problems — so it's a really effective methodology for a lot of the more mysterious issues out there.
The CPU profile method: I wrote this up as a methodology because it's a useful thing to do — take a CPU profile, display it as a flame graph (which I'll get to in a moment), and then understand all the software that's running on CPU above, say, 1%. That's a big help when you're analyzing. By the way, flame graphs exist for all kinds of languages nowadays. A big problem we have with performance is that there is so much of everything; even if you narrow it down to one database — Cassandra, or MySQL — look at what could be enabled and all the features. There are a lot of them, and you can run the status tools and dump all the metrics, which is laborious and time-consuming. Very often these databases and complicated applications have features that aren't even turned on, so we don't need to study the cache activity at that level, or the asynchronous log-writing tasks, or the exact persistence model, or whatever — those things aren't enabled in our environment. It just becomes a complicated exercise to look at the database and try to understand everything; instead, profile what is actually running on the CPU.
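Roughly, a system-wide CPU profile and flame graph can be produced along these lines — this assumes the FlameGraph scripts (stackcollapse-perf.pl and flamegraph.pl) have been cloned locally, and the flags are one reasonable choice rather than the only one:

    perf record -F 99 -a -g -- sleep 30             # sample all CPUs at 99 Hertz for 30 seconds, with stack traces
    perf script > out.perf                          # dump the samples as text
    ./stackcollapse-perf.pl out.perf > out.folded   # fold the stacks (FlameGraph repo script)
    ./flamegraph.pl out.folded > cpu-flame.svg      # render the interactive SVG flame graph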
The nice thing about profiling is that it reveals what is actually active: code can't run without being on CPU — gotcha, I know you're running — and that can help narrow down which parts of my database really matter. They will show up; they may show up small, because they don't use as much CPU as other parts, but it's a way to limit what to study. The RTFM method: I wrote this up as a methodology just to make sure we're good at it — how do you come to understand tools and performance metrics? Remember the man pages; books — O'Reilly has many good ones, of course; search the web; ask your coworkers; read previous talks and slides; you may have support channels to help you understand things. But the last two are really important. We live in a time when we expect things to be open source, and I spend plenty of my time reading the source code of the Linux kernel, or the source of the Java JVM, or whatever, just to really understand what the metrics mean. And also experimenting.
If I don't really understand a metric, what if I write a small program that produces a known amount of it? That should illuminate the metric. Those are two really useful ways to understand tools and performance metrics: reading the source code and doing small experiments. So those are several methodologies; the next section is tools, and I want to show how the USE method and workload characterization can be applied with them. A big part of this, though, is exposure to all the different observability tools, so you know what can be done. I'm on the command line — not that you have to live on the command line, because you may watch your servers from high-end performance monitoring tools — but those tools often get their metrics from the same place, /proc, from the kernel. So if you understand the standard tools like vmstat, iostat and top, you will understand a lot of the performance monitoring products, because there's so much overlap.
It's like: I've seen those metrics before, I know where they came from. I've categorized performance tools into four types: observability, benchmarking, tuning, and static performance tuning. Observability tools are generally safe — you're just observing activity. Benchmarking tools are useful for running experiments or load tests and seeing how the system responds. Tuning is where I'm changing things, and static performance tuning I'll explain when we get to it. So, first, the observability tools and how all of these components can be measured. If you're in an environment where you don't have a functional diagram, it's really useful to start by drawing one, because it gives you an effective visual checklist of what I need to understand, at minimum: what's in my data path.
I'll go through some of these quickly — it's more for exposure. There are a lot of basic observability tools and different ways to dig out CPU load — I shouldn't say CPU: system load averages used to be about CPU. One of the ways is just the uptime command, or top; you'll even see these graphed by performance monitoring tools. There are three numbers representing the load averages over 1, 5 and 15 minutes. These numbers are actually exponentially damped moving sums, so they aren't exactly 1, 5 and 15 minutes; in fact at the one-minute mark only something like 61 percent of the value has gone into that metric, so they're damped. Having three numbers helps you get a sense of how the system is changing, and the idea is old — the notion of load averages goes back to at least the 1960s, before Unix — and in the days of text displays without graphics, three numbers were a simple way to show the passage of time. If the 1-minute average is greater than the 5-minute and greater than the 15-minute, you know things are getting busier; if the 1-minute is lower, you know things are calming down. The load average is supposed to represent how much load is applied to the system — or it used to — and I explain it this way because the documentation can be confusing. It used to be just about CPU: how many processes or tasks are running on CPU right now, or queued waiting their turn to run. Because of that, if your load average was higher than the number of CPUs you had — say a load average of 5 on a 4-CPU system — you generally had more CPU work than the CPUs could deliver.
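A quick sketch of reading them at the shell — the numbers shown in the comment are only illustrative:

    uptime
    #  10:14:01 up 12 days,  3:05,  2 users,  load average: 5.02, 3.10, 1.50
    nproc    # number of CPUs, e.g. 4: a 1-minute average above this suggests CPU demand exceeds capacity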
If your load average was 2 on a 4-CPU system, you should have headroom. That changed: at some point Linux incorporated tasks in the uninterruptible state into the load averages, which confuses and complicates things, because now the load incorporates more than just CPU load. I guess I understand the rationale — it wants to include tasks blocked on disk or other resources, not just CPUs — but it has made interpreting the numbers even more difficult. Anyway, you shouldn't spend more than five minutes looking at load averages — sorry, five seconds. I just look at them for five seconds, to see whether the system is totally idle, at 0.01, or generally under a lot of load; you have to use other tools to understand it, things like top or htop. top has a great summary of the system at the top of its output, and then per-process or per-task information; I've highlighted a couple of columns, such as %CPU and the command.
It's an excellent summary of the entire system and resolves countless CPU problems. %CPU is summed across all CPUs, which is why Java can be at four hundred and seventeen percent. One problem with most top implementations is that they sample when they refresh the screen, but you can have short-lived processes that start and end between screen updates, so your load average can be very high while your %CPU column is mostly zero, and you're left wondering why the load average is high. Or you can see from the CPU summary — user time, system time, I/O wait, idle, and so on — that the CPUs are busy doing work, but you can't identify the responsible processes. One of the reasons is short-lived processes that the sampling misses; atop actually solves that, because atop uses a kernel interface (process accounting) to catch them.
Another reason may be kernel tasks — kernel work, which you can also see in the summaries. top itself can burn some CPU to gather its statistics — these days not so much, but on older systems, when everyone jumped on a problem and ran top at the same time, you'd look at the output and top would be the top process. It's like: the problem we were debugging, plus seven sysadmins all running top — so let's all stop and have one person run it. Then there's htop: htop has some visualizations and is more customizable. ps is a much more basic command-line tool, and here I've run the ASCII-art forest version of ps, which shows the parent-child process relationships. ps is nice in that you can pull out many more columns than you normally get from top, although you should be able to configure those in htop too.
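A sketch of a few of these invocations — the flag choices here are mine, not prescribed in the talk:

    top                      # interactive; %CPU is summed across CPUs
    top -b -n 1 | head -20   # batch mode, handy for capturing a snapshot from a script
    htop                     # if installed: more visual, customizable columns
    ps -ef --forest          # ASCII-art parent/child process tree
    ps -eo pid,ppid,pcpu,pmem,comm --sort=-pcpu | head   # custom columns, top CPU consumers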
Maybe for some reason you need ps for shell scripts and ad-hoc work. Beyond that, /proc contains a ton of interesting metrics beyond what you normally see in the limited displays of these tools. vmstat — virtual memory statistics, and more — has been around for a while; it came from BSD, and it gives you a system-wide summary, one line per interval. I've highlighted the columns my eyes tend to focus on: there's "r" for the run-queue length, where a number indicates tasks running on CPU plus tasks that are runnable and waiting their turn. Memory is broken into free, buffers, and the page cache. The last columns are quite interesting: the CPU summary is divided into user time — what my applications are doing, for example the JVM runtime or the Node.js runtime — and system time, which is the kernel working on behalf of the application: it may be processing system calls, or handling interrupts and other kernel tasks. iostat came from early Unix, and I really like how it has been cleaned up.
Over the years it has kept a really nice design, and Linux has improved it. I like it because I can apply multiple methodologies to it. The one thing I don't like is that it generates output more than 80 characters wide, which is a sin when it comes to Unix and breaks my slides and books — but that's the game these days, we have wide terminals. Workload characterization: I can apply that methodology and see reads per second, writes per second, read and write throughput, and simply check whether the workload applied to the disks is crazy. If I'm doing ten thousand IOPS against spinning rust, that tells me my application is sending too much load, and that may be the big clue before we even start looking at the resulting performance.
The resulting performance columns are also excellent: we have the average queue size and the wait time in milliseconds — the total time, on average, you're waiting for the block device or disk I/O — and then the service time, which is calculated to exclude queueing, and the utilization percentage. That last column is really a busy percentage: during an interval, how many milliseconds out of each second was the device doing work? So I can use it for the USE method — utilization, saturation, and errors — I can use the latencies if I'm doing latency analysis, and of course I can do workload characterization.
It's a good tool. I have recommended it to people in the past when they've asked me: they're instrumenting a new subsystem in an operating system and want to know what statistics to provide — copy iostat, because I can do a lot of methodology directly from it. It's a good set of statistics if you have no idea how to instrument something. It could be a new application you're writing, but you can work from the same model: give me a high-level breakdown of the applied workload — it could be client logins and client activity broken down along different dimensions — and then give me some high-level idea of the resulting performance, including how often the application was busy. It would only be perfect if it had another column for errors, so I could do the USE method completely.
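A sketch of the two commands and the columns just discussed — one-second intervals, with -x for extended disk statistics and -z to hide idle devices; exact column names vary slightly between sysstat versions:

    vmstat 1        # "r" run-queue length, free/buff/cache memory, us/sy/id/wa CPU breakdown
    iostat -xz 1    # r/s, w/s, rkB/s, wkB/s (applied workload); await, queue size, %util (resulting performance)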
mpstat is another basic tool: multi-processor statistics, or per-CPU statistics. I usually run it because I want to see the CPU balance and whether there are any hot CPUs, which can be an application issue. free is another of the basics; its numbers also appear in other tools, such as the header of top, and it shows buffers — the block device I/O cache — and cached, the virtual page cache.
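A sketch of those two — again, the flag choices are mine:

    mpstat -P ALL 1    # per-CPU utilization: spot a single hot CPU or an imbalance
    free -m            # memory in megabytes: free, buffers (block I/O cache), cached (page cache)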
So let's use some of these tools for another problem. This time I'm going to run lab005, and this time the client has said: the latency of my application is much higher — can you debug it? We could go review the workload code, but the problem statement method is really the first place to start, so you ask the customer what makes them think there's a problem, and they tell you: I measure an average latency — a mean latency — and it has increased; my latency distribution has shifted, and it's affecting customers. You go through the problem statement method and there appears to be a legitimate problem. The USE method is a good way to get a starting point and an end point; I like using it, and it takes maybe five minutes. There are a lot of metrics I won't be looking at, because there aren't easy tools for things like the buses — that's something we should fix — but at least I can go over the basics of the USE method quickly and then go back to the client.
For the USE method I just want utilization, saturation, and errors, where available, for the hardware resources, so let's start with top. I have my CPU summary in the third line: 4% user time, 38% system time — so the kernel is doing a lot of work, which will be interesting to debug — and a lot of idle time and I/O wait, so it doesn't look maxed out on the CPUs, and I'm not sure there's evidence of CPU saturation. There's my lab005 process consuming 12% CPU, so I don't have 100% CPU utilization.
I can also see it here: moderate CPU utilization, especially in system time, but with a lot of idle time, and I'd also expect a bit of I/O wait in the "wa" column. I can break it down per CPU too — it's not really a big multiprocessor system, I'm just running a VM, but I check anyway to see whether I'm maxing out a single CPU. The CPUs look fine; the system time would be interesting to understand, but we're not maxed out. We still have free memory — there seems to be plenty — so in terms of utilization and saturation we're not aggressively swapping and we're not fully utilized on memory. Next, the disks: they're quite busy — disk utilization is at 78% — and I'm just walking through the USE method, looking for those metrics first.
I can see the workload applied: we're doing 4,000 write operations a second and 19 megabytes per second — this is a slow machine, a slower virtual machine. The response times are actually fine, but it's a decent load and the disks are busy, so that's probably a bigger clue than the CPUs. When you max out CPUs, Unix-like time-sharing systems can generally be graceful about it, because the kernel understands priorities and can pull threads off CPU very quickly if necessary and run other threads. The same is not true of disks: when you max out rotating disks,
it's difficult to issue I/O at a higher priority and have it jump ahead of I/O that's already in flight to the disk heads — it's somewhat different in our SSD days, but you can still reach the point where you've queued a lot of work on the physical device and the kernel has less control over what runs in what order. With disks it varies, but you typically start to see performance issues beyond sixty percent utilization, and definitely beyond eighty percent. So the disks — I would most likely stop there and say: well, the disks look very busy, you should check why your application is doing more writes to the file system or to the disks. But for the USE method I want to finish the checklist, so I check the other resources too: network I/O — my other device — and it looks idle, and that's it, I'm done with the USE method.
I go back to the customer and tell them the disks are quite busy; the CPUs look moderately busy in system time — kernel time — which may well be related to the disks. That may be enough information for the client to say: ah, that's right, we changed a property setting in the database, so now it flushes writes all the time; we'll go and revert that change. So I quickly reached an endpoint, handed it back to the client, and moved on.
Without an endpoint you might spend far too much time digging into this: next you'd be profiling CPU time, then doing disk I/O latency distributions, etc., etc., which is all very useful but possibly overkill. So those were some of the basic commands; here is a system diagram with the basic commands illustrated, and you can see there's a lot of white space — I haven't covered much yet — but we can still do some methodologies with just those. Now the intermediate tools: strace, tcpdump, netstat and the like; I'll go through them quickly. strace is the system call tracer.
Here I've run it with options so you can see timestamps and the time spent in each system call — I should update that slide. It was originally written using ptrace, and most implementations still use ptrace; there has been work in Linux on perf trace, which uses the more efficient perf_events framework to make a version of strace, but most strace implementations today still use ptrace and can slow down the target by over a hundred times, which is something we want to be very cautious about. It's nice to see the system calls that are happening, because you can understand the workload being applied to the kernel, but the way strace operates is brutal: it sets breakpoints.
The way it currently works — and I need to prefix this, because there is work to fix it in the form of perf's trace subcommand — but the way it works today, and probably will for many months to come, is that it stops the target when it enters a system call and resumes it when it continues, and that context switching, the freezing and unfreezing, can slow applications down a lot. I'm very careful about running it in production. tcpdump is somewhat similar, but instead of instrumenting system calls we're capturing packets, and there's a whole universe of analysis tools and GUIs people like to use to explore network packets. Packet sniffers are usually useful: you can review packet sequences with timestamps and observe what happens on the wire; usually you dump to a capture file and then analyze it.
I like to go to the kernel and use kernel-level summarization to answer some of those questions, because it's getting harder and harder to use tcpdump in faster environments. If you have a 10 Gigabit Ethernet interface and you're sending on the order of a gigabyte per second, and the performance issue I want you to work on is a latency outlier that happens once every 30 minutes, then 30 minutes of capture at that rate means you've created a massive performance problem of your own: your tcpdump file is terabytes, and you have to analyze it on the server, or scp it down to your laptop, which will take hours and hours. So it's getting harder: packet capture doesn't scale in modern environments — you have a scalability problem. It will still be used, and there will still be many use cases where it's great and helps you solve problems, but keep in mind that we have to build other techniques, and that's what we're doing with in-kernel tracing and summaries: answering some of the questions we used to run tcpdump for.
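A sketch of invocations for both tools, keeping their overhead in mind — the flag choices are mine, and the PID and interface are placeholders; check the cost before attaching strace to anything hot in production:

    strace -tttT -p <PID>                      # attach to a process: microsecond timestamps (-ttt) and per-call durations (-T)
    strace -c -p <PID>                         # summary mode: counts and total time per system call
    tcpdump -i eth0 -c 1000 -w /tmp/out.pcap   # capture a bounded number of packets to a file
    tcpdump -nr /tmp/out.pcap | head           # read the capture back without name resolution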
netstat is a multi-tool with many subcommands, like netstat -i, which I ran before, and netstat -s to see what's happening at the TCP level. nicstat is a tool I wrote a long time ago and ported to Linux so I could review network interface usage; you can see what I based it on — it's modeled on iostat — so I get the applied workload and then the resulting throughput, utilization and saturation, which means I can do the USE method with it as well. It's in a repository somewhere, so you can add it to your systems.
pidstat is one of my favorite intermediate tools. It shows per-process (and per-thread) breakdowns, so you can see the user time and system time for each process, and that gives you a clue about what type of analysis you want to do next: if it's mainly user time, you know it's in the application code; if it's mainly system time, you might be looking at system calls or devices being touched, device usage, and so on. pidstat -d gives you the block device I/O each process is doing, and since it's reading kernel counters it's a cheap way to get that.
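A sketch — one-second intervals; -t and -d are standard pidstat flags:

    pidstat 1        # per-process %usr and %system, once per second
    pidstat -t 1     # include per-thread breakdowns
    pidstat -d 1     # per-process disk I/O: kilobytes read and written per second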
I've included swapon in my intermediate list: check swap device usage, if you have swap configured, when you may have run out of memory. lsof is more of a debugging tool, but there are some performance problems it can help solve — you might have too many file descriptors open and be hitting limits — and it also helps you understand an environment by showing who is connected to whom: I run lsof and see that this application on this port is connected to this other host on that port.
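A sketch of both — output formats differ a little across versions, and the PID is a placeholder:

    swapon -s                             # configured swap devices and how much is in use (or: swapon --show)
    lsof -nP -iTCP -sTCP:ESTABLISHED      # established TCP connections: who is talking to whom
    lsof -p <PID> | wc -l                 # rough count of open file descriptors for one process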
sar, the system activity reporter — the Linux version is actually very nice. It has two modes: an archive mode, where a crontab entry collects, say, ten-minute interval summaries; or you can run it live. Here I've run it with TCP statistics (-n TCP) and then the network interface statistics (-n DEV), so I can see — I actually ran that before — the interface information; and active and passive are TCP connections: active is outgoing connections, passive is incoming, and retransmits are there as well. So it's really useful, and it's also well designed: the units are in the column names, there's a good naming scheme, and the selection of statistics is pretty good too — in fact it's so good that I've made a diagram showing the options for sar, so you can see what they expose, and it's not bad at all. It would be really nice if there were a syscall interface — at least a count of syscalls per second in sar, ideally broken out into forks and execs — but it's actually pretty good: you can solve a lot of problems with sar alone.
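A sketch of live sar usage and of reading back the archives — archive paths and collection intervals depend on how the distro packages sysstat:

    sar -n TCP,ETCP 1            # TCP: active/passive connections per second, plus retransmits (ETCP)
    sar -n DEV 1                 # per-interface packets and kilobytes per second
    sar -q 1                     # run-queue length and load averages
    sar -q -f /var/log/sa/sa10   # read the archived stats for the 10th of the month (path varies by distro)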
dstat is similar — I could make a similar diagram for dstat, and another for collectl; there are several tools in this space, and some even expose extra statistics where supported. So you might be using one of these tools to measure all of it — the specific tool is not that important; what's important is having a way to measure everything. In our cloud environment at Netflix, as I mentioned earlier, we developed Atlas for cloud-wide monitoring and Vector for instance-level monitoring, because it's important for us to have a way to reach these metrics. When you have functional diagrams like this — and one of the takeaways from this talk is to print a slide like this — you can ask: using the tool set we have and all the monitoring tools we normally run, how do we monitor all of these components?
I've put a scorecard in there so you can rate your current environment. Now another demo: an application is taking forever. The client says: my application is taking forever to run, it's very slow. Okay — we've covered a bunch of new tools now, so let me start with a whole-system overview: vmstat 1. I'm starting with the USE method, and I see a lot of user time and system time. mpstat tells me the same. Now I can run pidstat: lab003 has a lot of user time and a lot of system time. Is the system time because it's talking to the disks? No. Continuing the USE method: are we talking to the network devices?
No — so that's good. Then what is that system time? Where is the system time going? I can tell I have a chunk of system time, but the disks are idle and my network interface is idle. Am I swapping? — which I'd also check with the USE method — no, I don't see swapping. What else can I run to find out where the system time is going? Of what I've just covered in the intermediate tools: strace — although I said I don't like tracing with strace, because it slows the target down; there are better tools, but I haven't covered them yet.
Sometimes getting the answer is worth briefly slowing things down, so I can run strace. The only thing is, I don't want to leave it running, because it will fill the screen and slow the application, so I'll grab the first hundred lines. I wish strace had an option for this — it would be nice to give it a count, like tcpdump's -c, so it exits after capturing so many events; even better if it were changed to not use ptrace and to use perf_events instead, so the overhead was really small. But let's take 100 lines: so, it's very busy; strace puts a timestamp in front of each call, it's calling read() on file descriptor 3, and there's how many bytes it's requesting.
It's actually requesting 0 bytes. The application is stuck in a loop requesting 0 bytes over and over again. I've seen this in production — not at my current employer, but in the past — where the application developer used a variable for the byte size to read and, through a logic error, that variable ended up set to zero, so it sits in a loop trying to read a database file zero bytes at a time, and that loop never ends. I just wanted to show this: it's a methodology, even though I don't like tracing with strace and it may have slowed the target a lot.
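A sketch of the kind of invocation used here — capping the output with head, since strace itself has no capture-count option, plus the summary mode for characterizing the syscall workload; the PID is a placeholder:

    strace -ttt -p <PID> 2>&1 | head -100   # grab the first hundred lines (strace writes to stderr)
    strace -c -p <PID>                      # run briefly, then Ctrl-C for a per-syscall count and time summary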
It answered the question for the client, and the kind of answer is a characterization of the workload: the workload being applied is silly — it's reading zero bytes at a time. Quite often I've seen the variant of reading one byte at a time, or 16 bytes, or something small, and the fix is: go to eight kilobytes, go to 128 kilobytes, and it will run much faster. So this was workload characterization, this time of the system calls. I should mention that strace on Linux has, in the past, sometimes left applications frozen: you run strace, then terminate it with Ctrl-C, and the application is left in a stopped state. So use it with caution, and test it in the lab before using it in production.
Always be careful with strace: it is enormously invasive.
