Wednesday, June 6, 2012

HowTo Understand Linux CPU Load - When Should You Be Worried?


You might already be familiar with Linux load averages. The term “load average” usually refers to three numbers that represent the load on the system's CPU. In this article I’ll try to make these three numbers clearer and easier to understand.

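The Linux load average is driven by three kinds of processes, namely those that:
  • Run on, or are waiting for, the CPU
  • Perform disk I/O
  • Perform network I/O (mainly when the wait is uninterruptible, as with network file systems such as NFS)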
But how do you interpret a load average that seems too high? The first step is to look at the CPU utilization. If it is below 100% and the load average is above the number of CPUs in the system, the load average is primarily driven by processes performing disk I/O, network I/O, or a combination of both. Finding the processes responsible for most of the I/O isn't straightforward, but there are many tools available to help you do so.

If the CPU utilization is 100% and the load average is above the number of CPUs in the system, the load average is driven either entirely by processes running on, or waiting for, the CPU, or by a combination of such processes and processes performing I/O (which could in turn be a combination of disk and network I/O).
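The two cases above can be sketched as a small decision function. This is only an illustration: the function name, the labels it returns, and the 95% threshold standing in for "100% utilization" are assumptions made for the example, not part of any standard tool.

```python
def classify_load(load_avg, cpu_count, cpu_util_percent, busy_threshold=95.0):
    """Roughly interpret a load average using the reasoning in the text.

    busy_threshold is a hypothetical cut-off standing in for "100% CPU",
    since real measurements rarely hit exactly 100.
    """
    if load_avg <= cpu_count:
        return "load is within the number of CPUs"
    if cpu_util_percent < busy_threshold:
        return "high load, driven mainly by I/O"
    return "high load, driven by CPU or by CPU plus I/O"


# A 4-CPU machine with a load of 6.0 but only 35% CPU utilization.
print(classify_load(6.0, 4, 35.0))
```

The function only narrows down the cause. Telling disk I/O from network I/O still needs a closer look at the individual processes.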

The easiest way to see the “load average” of your system is with uptime. It also appears in top and can be graphed in the console with tload. In all three cases the load average refers to a group of three numbers. For example, in the following output of uptime

10:41:47 up 9 days, 48 min, 1 user, load average: 0.82, 0.71, 0.66
the last three numbers are the “load average”. Each number represents the system's load as a moving average over the last 1, 5 and 15 minutes respectively. The important thing now is to understand what is being averaged: the load metric.
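If you want these three numbers in a script rather than on screen, a minimal sketch in Python could look like the following. The helper names are made up for this example, and the regular expression assumes the English "load average:" label with a dot as the decimal separator.

```python
import os
import re


def parse_load_average(uptime_line):
    """Extract the 1, 5 and 15 minute load averages from an uptime line."""
    match = re.search(
        r"load average:\s*(\d+\.\d+),\s*(\d+\.\d+),\s*(\d+\.\d+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def current_load_average():
    """Ask the operating system directly (available on Unix-like systems)."""
    return os.getloadavg()


sample = "10:41:47 up 9 days, 48 min, 1 user, load average: 0.82, 0.71, 0.66"
one_min, five_min, fifteen_min = parse_load_average(sample)
print(one_min, five_min, fifteen_min)
```

In practice os.getloadavg() is usually the simpler choice. Parsing is mostly useful when all you have is saved uptime output.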

The metric that represents the load at a given point in time is the number of processes queued to run at that moment (including the process that is currently running). Generally speaking, on a single-core machine, a load average of up to 1.00 can be read as the CPU utilization percentage when multiplied by 100. For example, a load average of 0.50 over the last minute means that the CPU was idle for half of that minute because it had no process to run. On the other hand, a load average of 2.50 means that over the last minute an average of 1.5 processes were waiting for their turn to run, so the CPU was overloaded by 150%.
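The single-core arithmetic can be written out as a short sketch. The function name is hypothetical, and the split into "busy" and "waiting" follows the simplified reading used in this article rather than anything the kernel reports directly.

```python
def single_core_view(load_avg):
    """Split a single-core load average into busy percentage and queue length."""
    # One core can run at most one process, so utilization is capped at 100%.
    busy_percent = min(load_avg, 1.0) * 100
    # Anything above 1.0 is, on average, waiting for its turn.
    waiting = max(load_avg - 1.0, 0.0)
    return busy_percent, waiting


for load in (0.50, 2.50):
    busy, waiting = single_core_view(load)
    print(f"load {load:.2f}: CPU busy {busy:.0f}% of the time, "
          f"{waiting:.1f} processes waiting on average")
```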

On multi-core systems things are a bit different, but to avoid unnecessary complications you can usually divide the load average by the number of cores and treat the result as the load average of a single-core machine. For example, let’s say the load average of a two-core machine was 3.00 2.00 0.50. This means that over the last minute there were on average three runnable processes, so one process, on average, was queued, since the two cores can run only two processes at a time. The machine therefore had a load of 150% of its capacity. Over the last 5 minutes, the load average of 2.00 means that roughly two processes were running at any time, so the machine was fully utilized but not overloaded. Over the last 15 minutes, on the other hand, the load average of 0.50 means that the machine could have handled 4 times that load without overloading the CPU; it had only (0.50/2)*100 = 25% CPU utilization in those 15 minutes.
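The divide-by-cores rule of thumb is easy to reproduce. The sketch below uses a made-up helper name. It falls back to os.cpu_count() when no core count is given, and that call reports logical CPUs, which may differ from the number of physical cores.

```python
import os


def load_percent_per_core(load_averages, cpu_count=None):
    """Express each load average as a percentage of the machine's capacity."""
    if cpu_count is None:
        # Logical CPU count; fall back to 1 if it cannot be determined.
        cpu_count = os.cpu_count() or 1
    return [load / cpu_count * 100 for load in load_averages]


# The two-core example from the text: 3.00 2.00 0.50
print(load_percent_per_core((3.00, 2.00, 0.50), cpu_count=2))
# prints [150.0, 100.0, 25.0]
```

Values above 100 mean that, on average, more processes wanted to run than there were cores to run them.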

I hope the example above made the load average a bit clearer. Load average is an important metric for measuring system performance, and a good understanding of it is beneficial.


Note that this document comes without warranty of any kind, but every effort has been made to provide information that is as accurate as possible. I welcome emails from readers with comments, suggestions, and corrections at webmaster_at admin@linuxhowto.in

Copyright © 2012 LINUXHOWTO.IN

