Linux Performance Troubleshooting Demos

Jun 09, 2021

Here is a cheat sheet reference guide for initial troubleshooting of Linux performance issues.

Let's get straight to the point. In this first demo, a customer comes to you and tells you that your application's latency has increased. What could be happening? Let's start by running top. The top command provides an overview of the system. If we look here, we can see how the system is using the CPU. In this case, 0.4 percent of CPU time goes to user space, while 38.5 percent goes to the system. Also note that 39.2 percent of the time is spent idle. That is a large amount of idle CPU, which means we are not hitting CPU saturation. This mildly suggests that the problem may not be related to the CPU, but we will have to do more checks to be sure.

Now let's run vmstat 1. We run this with an argument of 1, which means it prints one-second summaries.

This is another command that provides a summary of key server statistics. The first row of output is different because it shows averages since boot rather than a one-second summary, so we can generally ignore it. Let's review each of the important columns. The r column gives the number of processes that are currently running on a CPU or waiting for a turn. If the r value is greater than the number of CPUs, then the CPU is saturated. In short, large r values would be a cause for concern. In this case the values stay in a small range, so there is nothing out of place here. vmstat can also be used to get an overview of memory.

Looking at the memory columns, the free, buffer, and cache values are each comfortably above zero. Also note that the swap values are zero. Swap refers to memory that has been moved out to disk. Swap is generally very slow, so it is normally used as a last resort. If the swap values are ever non-zero, we know we are running short of memory. They are zero in all of these results, which suggests we probably have enough memory in our system.

Now let's look at the CPU columns. They represent user time, system time, idle time, I/O wait time, and stolen time. The results here mirror the top output from earlier: system time is high while user time is low, so it is unlikely that our CPU is saturated. However, notice that I/O wait stays relatively constant at around 17%. A constant degree of I/O wait points to a disk bottleneck: the CPUs are idle because tasks are blocked waiting for disk. But first, let's make sure that our CPU values are valid.
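As a rough sketch of the vmstat checks described above (run queue versus CPU count, swap activity, and I/O wait), something like the following could be used. The sample output is made up to loosely resemble the demo, the function names are hypothetical, and the 10% I/O wait threshold is an assumption chosen for illustration, not a value from the video.

```python
# Illustrative vmstat output; the numbers are made up to resemble the demo.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 812340  40212 611520    0    0   120    45  210  480  3  2 94  1  0
 1  1      0 811900  40212 611744    0    0  9120   310  950 2100  1 39 43 17  0
"""


def parse_vmstat(text):
    """Turn vmstat text into one dict per data row, keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[1].split()  # the second line holds the column names
    return [dict(zip(header, map(int, ln.split()))) for ln in lines[2:]]


def findings(row, ncpu, wa_threshold=10):
    """Apply the three checks from the walkthrough to a single row.

    wa_threshold is an assumed cutoff for "noticeable" I/O wait.
    """
    found = []
    if row["r"] > ncpu:                 # more runnable tasks than CPUs
        found.append("cpu_saturated")
    if row["si"] > 0 or row["so"] > 0:  # any swap-in or swap-out activity
        found.append("swapping")
    if row["wa"] >= wa_threshold:       # CPUs idle while tasks wait on I/O
        found.append("iowait")
    return found


rows = parse_vmstat(SAMPLE)[1:]         # drop the since-boot first row
for row in rows:
    print(findings(row, ncpu=1))
```

With the sample above, only the I/O wait check fires, which matches the reasoning in the demo: the run queue and swap look fine, and the constant I/O wait is what points toward the disks.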

mpstat is another command that checks the CPUs. It is convenient because it prints a time breakdown per CPU, which can be used to check for an imbalance. A single hot CPU can be evidence that a single-threaded application is causing problems. In this case all of our processes are running on a single core anyway, so the point is moot. Let's check disk utilization next. iostat is a great tool for understanding how disks are performing. The most important statistic here is the utilization percentage. Looking at this column, we can see a disk utilization of 78%.
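The per-CPU imbalance check mentioned for mpstat can be sketched roughly as follows. This is only an illustration: the function name and the 90% and 25% cutoffs are assumptions, and the input is a plain list of per-CPU busy percentages rather than real mpstat output.

```python
from statistics import mean


def single_hot_cpu(busy_pct, hot=90.0, quiet=25.0):
    """Return the index of the one busy CPU if all the others are quiet, else None.

    busy_pct holds one busy percentage (100 minus idle) per CPU.
    The hot and quiet cutoffs are assumed values for illustration.
    """
    if len(busy_pct) < 2:
        return None  # an imbalance check is moot on a single core
    hot_cpus = [i for i, pct in enumerate(busy_pct) if pct >= hot]
    if len(hot_cpus) != 1:
        return None  # either nothing is hot, or the load is spread out
    others = [pct for i, pct in enumerate(busy_pct) if i != hot_cpus[0]]
    return hot_cpus[0] if mean(others) <= quiet else None


print(single_hot_cpu([3.0, 98.5, 2.0, 4.5]))  # one hot CPU among quiet ones
print(single_hot_cpu([61.0]))                 # single core, as in this demo
```

On a single-core machine like the one in this demo, the function returns None, which is the same "the point is moot" conclusion reached above.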

To interpret this percentage, understand that values above 60 percent generally lead to poor performance. This is different from CPUs: systems can generally run well even with maxed-out CPUs, because the kernel understands priority and can preempt threads off the CPU very quickly if it needs to run other threads. This is not as true for disks. It is harder to issue an I/O with a higher priority if the disk is already busy doing something else. This is less relevant for SSDs. Either way, this high utilization was the value we were looking for: the disks are the most likely culprit for why our application has more latency.

To finish, let's check network I/O. This command checks the network I/O. The network is barely being used here, so there is nothing to report. In short, we can point to the disks as the most likely factor behind the performance issue in this demo.

So we discovered that the problem in our application was related to disk I/O. That was the bottleneck. We got there by seeing that our CPUs were mostly busy with system time, which allowed us to narrow down the possibilities.
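To make the utilization percentage a little more concrete, here is a minimal sketch of how a figure like 78% can be derived and compared with the 60 percent rule of thumb. It assumes the usual approach of taking two readings of the kernel's "time spent doing I/Os" counter for a device (reported in milliseconds in /proc/diskstats) and dividing the difference by the sampling interval. The function names and the counter values are hypothetical.

```python
def util_percent(io_ticks_start_ms, io_ticks_end_ms, interval_ms):
    """Percentage of the interval during which the device was busy with I/O."""
    busy_ms = io_ticks_end_ms - io_ticks_start_ms
    return 100.0 * busy_ms / interval_ms


def disk_is_suspect(util, threshold=60.0):
    """Rule of thumb from the walkthrough: above 60% usually hurts latency."""
    return util > threshold


# Hypothetical readings taken one second apart: the disk was busy for 780 ms.
util = util_percent(5_000, 5_780, 1_000)
print(util, disk_is_suspect(util))
```

The same two readings with only 30 ms of busy time would give 3%, which is comfortably below the threshold.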

High disk usage can be checked with iostat. Low amounts of available RAM mean the system will use swap space, which can be checked with vmstat. High network I/O can be verified with sar. The system time could also have been caused by system calls from the application itself, which could have been verified with strace. Trying to discover the root cause of performance problems means digging deeper into what the metrics are telling us. Let's move on.

In this second demo, the client tells us that his application is taking forever. Let's take a look at this one. Let's start with vmstat to get an overview of the system.

The r column looks good. The values are very low, so there are no processes having to wait for the CPU. Memory seems fine too: there is plenty of free, buffer, and cache memory to go around, and both swap columns are zero, as they should be. However, when we look at the CPU stats, something becomes evident: there is no idle time, and there is a large amount of user time and system time. Specifically, around 55% is user time and around 45% is system time. Let's run a different command to verify this. When we run mpstat, we see practically the same information.

This machine is spending all of its time in user time and system time. Now let's make sure that it is our process that is using these CPU resources. In this case, our process is called lab003. After running pidstat, we can see that it is our application that is using all the CPU. What could be causing this? Let's first narrow down what is causing so much system time. First let's check disk I/O. After running iostat, we can see that disk utilization is within acceptable limits. Up to three percent utilization is perfectly fine, and the wait time is also only milliseconds.

Next, let's check network I/O. The network is also basically idle, so there is nothing to report here either. What could be the cause? We know from pidstat above that a significant amount of system time is being used, and we have already ruled out swap, disk, and network as causes. However, there is another angle: we can check the system calls made by our application. Let's verify that with strace. We run strace with specific parameters. Tracing is a bit resource intensive, so we limit the output to the first 100 lines. We also pass a flag to add timestamps.

We can see from the output that a read system call is being made on file descriptor 3, but it requests 0 bytes every time. We have found the problem: the application is stuck in an endless loop, trying to read a file zero bytes at a time. We reached this conclusion after recognizing that there was a large amount of system time, and after eliminating the other possibilities of swap, disk, and network. We discovered that the application was executing an excessive number of system calls.
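The pattern spotted in the strace output can also be picked out mechanically. The following is a small sketch that counts system calls in timestamped strace-style text and flags reads that request zero bytes. The sample lines are illustrative and were not captured from the demo, and the function name and regular expression are assumptions that only cover simple lines of this shape.

```python
import re
from collections import Counter

# Illustrative strace lines with timestamps; not captured from the demo.
SAMPLE = """\
14:02:11.000101 read(3, "", 0) = 0
14:02:11.000143 read(3, "", 0) = 0
14:02:11.000185 read(3, "", 0) = 0
14:02:11.000227 write(1, "ok", 2) = 2
14:02:11.000269 read(3, "", 0) = 0
"""

# Optional timestamp, then name(args) = integer return value.
LINE = re.compile(r"^(?:[\d:.]+\s+)?(\w+)\((.*)\)\s+=\s+(-?\d+)")


def summarize(text):
    """Count calls per syscall name and count reads that ask for 0 bytes."""
    counts = Counter()
    zero_byte_reads = 0
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip anything that is not a simple completed call
        name, args, _ret = m.groups()
        counts[name] += 1
        # For read(fd, buf, count), the last argument is the requested size.
        if name == "read" and args.rsplit(",", 1)[-1].strip() == "0":
            zero_byte_reads += 1
    return counts, zero_byte_reads


counts, zero_byte_reads = summarize(SAMPLE)
print(counts.most_common(), zero_byte_reads)
```

A summary dominated by one syscall, with nearly all of its calls requesting zero bytes, is the same signal the demo reads off the raw trace by eye.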

Let's continue. In the next demo, something mysterious is eating up the CPU, and it's our job to figure out what's wrong. We start with the basic top command to get an overview of the system. Right away we can see that the CPUs are very busy: around 90% user time and 9.6% system time. Note that there is basically no idle time either. Oddly enough, top doesn't show us what is consuming so much CPU. Normally it would appear in the CPU column below. It may show up if we use a different command.

After running mpstat, we see the same behavior: a lot of user time and some system time. But this command didn't give us any additional information on what the problem may be. Maybe we can trace where the system time comes from. The disks are basically idle, so the system time is coming from a different source. Let's check the network next. The network is also basically idle, so there is nothing to report here. Let's check memory next. There is plenty of memory, and the swap columns are also 0, which indicates no paging or swapping is going on.

Okay, so we know the CPUs are busy. Let's try profiling with perf to see what is going on. Here we are running perf record at 99 Hertz, on all CPUs, with call graphs, for 10 seconds. We run it at 99 Hertz instead of 100 because we don't want to accidentally sample in lockstep with activity that runs at regular intervals, which could cause us to miss it.

Now we can see the results, and we have finally found the culprit. Around 82% of our CPU time is spent in the checksum command. So why couldn't we see it in top earlier? It is because these are short-lived processes. This is the weakness of top. The checksum process highlighted above only exists for a couple of milliseconds. To avoid this, use alternative commands like atop, or execsnoop from perf-tools.

So originally we knew something mysterious was eating up the CPU. We looked at all the possible system time offenders, but the real problem was much harder for our favorite tool to detect. top couldn't show the problem right away because it isn't great with short-lived processes. We were only able to narrow things down after profiling with perf.
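The reason for sampling at 99 Hertz rather than 100 can be shown with a tiny simulation. This sketch assumes a hypothetical workload that wakes up every 10 ms (100 times per second) and runs for 1 ms each time, so it is really on-CPU 10% of the time. The function name and all parameters are made up for illustration, and exact fractions are used so that rounding does not blur the result.

```python
from fractions import Fraction


def sampled_busy_fraction(sample_hz, seconds=10,
                          period=Fraction(1, 100),     # activity repeats every 10 ms
                          busy=Fraction(1, 1000),      # and runs for 1 ms each time
                          offset=Fraction(5, 1000)):   # first sample lands mid-period
    """Fraction of samples that catch the periodic activity on-CPU."""
    total = sample_hz * seconds
    hits = 0
    for k in range(total):
        t = offset + Fraction(k, sample_hz)  # exact sample time in seconds
        if t % period < busy:                # inside the busy part of the period?
            hits += 1
    return Fraction(hits, total)


print(float(sampled_busy_fraction(100)))  # locked to the same phase every time
print(float(sampled_busy_fraction(99)))   # phase drifts across the whole period
```

Under these assumptions, the 100 Hertz sampler always lands on the same point in the cycle and never sees the activity at all, while the 99 Hertz sampler drifts through the cycle and reports roughly the true 10%.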

Make sure you keep all the tools at your disposal in mind. For example, atop can be used to cover some of top's major weaknesses. And that's it, thanks for watching. I hope you found this helpful. If you want more resources on this kind of thing, I've included some reference materials in the description below. This video was made using Video Puppet. Be sure to check them out.
