Sunday, July 14, 2013

UNIX Load Average

UNIX Load Average:


Today suddenly I got alert from Nagios Like "**Current Load is CRITICAL **". After that I have verified the LA on system and Nagios configuration as well. 

After doing some further research I have come to know about some interesting topic i.e. "What is Load Average" and "How LA is being calculated". So sharing the thing that I have found.

Actually, load average is not a UNIX command in the conventional sense. Rather it’s an embedded metric that appears in the output of other UNIX commands like uptime and procinfo. These commands are commonly used by UNIX sysadmin’s to observe system resource consumption. 

Uptime :
The uptime shell command produces the following output:

[pax:~]% uptime
9:40am up 9 days, 10:36, 4 users, load average: 0.02, 0.01, 0.00

It shows the time since the system was last booted, the number of active user processes and
something called the load average.

W
The w(ho) command produces the following output:

[pax:~]% w
9:40am up 9 days, 10:35, 4 users, load average: 0.02, 0.01, 0.00
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
mir ttyp0 :0.0 Fri10pm 3days 0.09s 0.09s bash
neil ttyp2 12-35-86-1.ea.co 9:40am 0.00s 0.29s 0.15s w
...
Notice that the first line of the output is identical to the output of the uptime command.

Top
The top command is anothor UNIX command set that ranks processes according to the amount of CPU time they consume. It produces the following output:

4:09am up 12:48, 1 user, load average: 0.02, 0.27, 0.17
58 processes: 57 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: 0.5% user, 0.9% system, 0.0% nice, 98.5% idle
Mem: 95564K av, 78704K used, 16860K free, 32836K shrd, 40132K buff
Swap: 68508K av, 0K used, 68508K free, 14508K cched
PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
5909 neil 13 0 720 720 552 R 0 1.5 0.7 0.01 top
1 root 0 0 396 396 328 S 0 0.0 0.4 0.02 init
2 root 0 0 0 0 0 SW 0 0.0 0.0 0.00 kflushd
3 root -12 -12 0 0 0 SW< 0 0.0 0.0 0.00 kswapd

The standard Nagios plugins include a "check_load" command which will raise a warning or error if the load averages for the target machine exceed some threshold. Today after getting alert I was thinking that what thresholds should be.

The usage for the check_load command is as follows:
Usage: check_load -w WLOAD1,WLOAD5,WLOAD15 -c CLOAD1,CLOAD5,CLOAD15
Without looking at the source I'm pretty sure that the program is either just opening a pipe to uptime or using the /dev/proc file system to read the load averages for the past 1, 5 and 15 minutes. Should be safe to assume then that Nagios' concept of load is exactly the same as uptime's and that the figures are ultimately is coming from the kernel scheduler. (Note: Yup. :-) Just checked the source.)

So the first question is: what does "load" actually measure?

UNIX AND LINUX LOAD

Breifly, when Unix machines report their "load" (usually through uptimetop or who) they are reporting a weighted average of the number of processes either running or waiting for the CPU (Linux will also count processes that may be blocked waiting on I/O). This average is calculated over 1, 5 and 15 minutes (hence the three values) based on values that are sampled every 5 seconds (on Linux at least). Dr Neil Gunther has written more than you might ever want to know about how those load averages are calculated and what they mean. It's an excellent series of articles (see also the inevitable Wikipedia article).

So assuming we have a single-core CPU, a load value of "1.0" would suggest that the CPU has been 100% utilised over whatever reporting period that figure was calculated for. A load of "2.0" would mean that whenever one process had the CPU there was another that was forced to wait. However, if we have 2 cores, the same "2.0" load value would suggest that both processes got the CPU time they needed, while a load of "1.0" would suggest the CPU had only been at 50% capacity.

On a simple web server, running a single 2-core CPU a load average of "2.0, 1.0, 0.5" suggests that, over the last minute, the CPU has been 100% utilised; over the last 5 minutes it's been 50% utilised; and over the last 15 minutes, it's been 25% utilised. Halve those values if 4 cores are available and double them if only one is in the system.

You can see then that sensible threshold values for warning and critical states requires you to consider how many CPUs and CPU cores your system has. You're therefore probably going to want to set your thresholds per machine or at least set them differently for each different type of configuration.

For example, one of our Solaris boxes has 12 cores so a load of "6.0" is nothing to be concerned about. However that same load figure on another, single-core box might be worthy of a warning or even critical alert, depending on how sensitive we were to process queue lengths on that box. Except if that box is a Linux box with a lot of I/O and slow devices (like a tape drive) and is counting processes that are sitting idle and waiting for an I/O operation to finish. And what is the application running on it? Is it threaded? How is your kernel counting threads in that total -- or is it just counting processes?

SETTING THE CHECK_LOAD THRESHOLDS

So determining an appropriate warning and critical set of threshold values for check_load will depend on what you think a reasonable process queue length will be; how your specific system treats threads; how your applications on that system behave (and their expected responsiveness levels); and how many CPUs / cores your system has. Oh -- and your performance targets or SLAs.

This is why experienced admins use a time honoured, complicated heuristic process to set an initial value and then continually adjust that value based on the correlation of alerts raised and actual performance and hence user impact.

In other words: we rub our bellies and take a guess and then change the values if we get too many or too few alerts. We're experienced sysadmins -- how much time do you think we have? :-)

In our case, for web servers, we decided that over 5 and 15 minute periods we expect spare capacity on the box -- but we only want to be alerted if the box is basically maxing out on CPU over a significant period. Over 1 minute we expect the occasional spike and don't really want an alert unless it's way beyond expectations. We're using Apache with no threading so 1 load point = 1 process using or waiting for CPU.

We've set warning levels for 15 minute load average at number of CPU cores times 2 (plus one!). For 5 minutes increase the threshold by 5. For one minute, increase it by 5 again. Critical threshold starts at number of CPU cores times 4 and then follows the same pattern for the 5 and 1 minute warning.

Here's a sample nrpe.cfg config file for a web server with 2 cores:
command[check_load]=/path/check_load -w 15,10,5 -c 30,25,20

It's important to actually test this set up. Use ApacheBench or JMeter or similar tool to get your load average up and test performance under those thresholds to see if it's acceptable. If your application is unacceptably slow from a user perspective at lower load values then lower your thresholds.

Special Thanks to Hisso Hathair!!


No comments:

Post a Comment

Integrate Jenkins with Azure Key Vault

Jenkins has been one of the most used CI/CD tools. For every tool which we are using in our daily life, it becomes really challenges when ...