37 Replies. Latest reply: Sep 25, 2012 6:03 PM by Mike McDonough
HummerBoy Rank: Green Belt 116 posts since
Dec 17, 2008

Mar 24, 2009 7:58 PM

ssCpuIdle value showing zero

I keep getting 'threshold of low CPU idle not met' warnings on a few servers, one in particular, which constantly shows a value of zero. If I run a manual snmpget against ssCpuIdle (OID 1.3.6.1.4.1.2021.11.11.0) I get the same result, but I can see from all the usual tools that the CPU isn't maxed out (i.e., idle time is not zero).
When I search the net, all I find are documents telling us to switch to ssCpuRawIdle, but that counter is known to max out and give invalid readings too, plus you have to take two readings and work out the percentage yourself.
I fixed this for a short period (minutes) by restarting SNMP on the affected server. I got a value of 84 for a few minutes, and now it is back to zero again. The server is a SUSE 9 server, much the same as most of the other servers on our network, running Nagios if you can't figure that out from the name.
Any ideas?

 

 

zenoss@crt-monitor:~> snmpget -c snmpstring -v1 crt-nagios.blahblah.gov.au 1.3.6.1.4.1.2021.11.11.0
UCD-SNMP-MIB::ssCpuIdle.0 = INTEGER: 84
zenoss@crt-monitor:~> snmpget -c snmpstring -v1 crt-nagios.blahblah.gov.au 1.3.6.1.4.1.2021.11.11.0
UCD-SNMP-MIB::ssCpuIdle.0 = INTEGER: 0

  • szogu Rank: White Belt 19 posts since
    Mar 17, 2009
    1. Jul 30, 2009 4:56 AM (in response to HummerBoy)
    RE: ssCpuIdle value showing zero
    Hi,
    Did you resolve this problem?

    Regards,
    Tommy
  • marzlarz Rank: White Belt 9 posts since
    Jun 16, 2010
    3. Aug 3, 2010 10:46 AM (in response to HummerBoy)
    Re: ssCpuIdle value showing zero

    Hello,

    We are also experiencing this on RHEL 4 & 5. I originally thought it was a net-snmp bug, but it appears that the ssCpu* OIDs are deprecated and the ssCpuRaw* OIDs should be used instead.

    See RHEL bug: https://bugzilla.redhat.com/show_bug.cgi?id=473824

    Can we get this updated in Zenoss to use ssCpuRaw*? If so, how? This bug is causing our development groups to wonder why our systems are 100% utilized when they actually are not.

    Please advise,

    Thanks

    LJ

  • Dave_the_Dude Rank: White Belt 39 posts since
    Aug 3, 2010
    4. Aug 5, 2010 3:10 PM (in response to HummerBoy)
    Re: ssCpuIdle value showing zero

    What you're seeing isn't a bug in net-snmp.  The net-snmp docs have stated for a long time that ssCpuIdle and the other ssCpu* OIDs are deprecated and should not be used.  Instead, you should be using the ssCpuRaw* OIDs (ssCpuRawIdle, ssCpuRawUser, ssCpuRawSystem, etc...).  You will get much better data from these OIDs as well.  Instead of getting a "rough estimate" percentage (rounded down to an integer), you can actually see how many clock ticks have been used on each, which will give you MUCH greater accuracy.

     

    This should be considered a ZenOSS bug in the default "Server/Linux" device template, but it's easy enough to fix yourself.  Just edit the Devices->Servers->Linux Template, look in the Data Sources section for the ssCpu checks.  Change the SNMP OID being used to the appropriate ssCpuRaw* OID.  For instance, the ssCpuIdle OID is .1.3.6.1.4.1.2021.11.11.0.  This OID should be changed in your Data Sources to the OID for ssCpuRawIdle, which is .1.3.6.1.4.1.2021.11.53.0.
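For anyone editing several data sources at once, the swap described above can be written down as a small lookup table. This is only a sketch: the ssCpuIdle to ssCpuRawIdle pair comes from the post, while the user and system pairs are filled in from UCD-SNMP-MIB and should be checked against the snmpwalk output on your own agent. The names `RAW_REPLACEMENT` and `raw_oid_for` are made up for illustration.

```python
# Prefix shared by the UCD-SNMP-MIB systemStats entries discussed in this thread.
SYSTEM_STATS = ".1.3.6.1.4.1.2021.11"

# Deprecated percentage OID -> raw tick-counter replacement.
# Only the ssCpuIdle pair is given in the post; the other two are assumptions
# taken from UCD-SNMP-MIB and should be verified with snmpwalk.
RAW_REPLACEMENT = {
    SYSTEM_STATS + ".9.0": SYSTEM_STATS + ".50.0",   # ssCpuUser   -> ssCpuRawUser
    SYSTEM_STATS + ".10.0": SYSTEM_STATS + ".52.0",  # ssCpuSystem -> ssCpuRawSystem
    SYSTEM_STATS + ".11.0": SYSTEM_STATS + ".53.0",  # ssCpuIdle   -> ssCpuRawIdle
}


def raw_oid_for(deprecated_oid):
    """Return the ssCpuRaw* OID for a deprecated ssCpu* OID.

    Accepts the OID with or without a leading dot.
    """
    key = "." + deprecated_oid.lstrip(".")
    return RAW_REPLACEMENT[key]
```

Calling `raw_oid_for("1.3.6.1.4.1.2021.11.11.0")` gives the ssCpuRawIdle OID quoted in the post.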

     

    You can find the right OIDs by running the following command from any Linux machine with the net-snmp utilities installed:

    snmpwalk -v2c -c $COMMUNITY $HOSTNAME .1.3.6.1.4.1.2021.11

    This will show you the complete list of available CPU OIDs.  To get the OID as a numerical value rather than text, try this (options are placed before the hostname so that they are parsed the same way on every platform):

    snmpwalk -v2c -c $COMMUNITY -On $HOSTNAME ssCpuRawIdle

     

    You can swap "ssCpuRawIdle" above with any of the other OIDs listed in the first command (ssCpuRawSystem, ssCpuRawUser, etc)

     

    One thing to remember: all ssCpu* OIDs are integer "gauge" values.  The ssCpuRaw* OIDs are all "counter" values.  Changing from a gauge data source to a counter data source requires some changes to your graphs.
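To illustrate the gauge versus counter difference: a gauge is plotted as-is, while a counter only means something as the change between two polls divided by the poll interval. The sketch below shows that arithmetic, assuming the ssCpuRaw* values are 32-bit counters that wrap back to zero; the function name and the 300-second interval in the example are assumptions, not anything Zenoss-specific.

```python
COUNTER32_MODULUS = 2 ** 32  # assumed width of the ssCpuRaw* counters


def counter_rate(prev_value, curr_value, interval_seconds):
    """Return ticks per second between two polls of a wrapping 32-bit counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # The modulo keeps the delta correct when the counter wrapped once
    # between the two polls (curr_value < prev_value).
    delta = (curr_value - prev_value) % COUNTER32_MODULUS
    return delta / interval_seconds
```

For example, `counter_rate(1000, 1600, 300)` is 2.0 ticks per second, and a poll that straddles a wrap, such as `counter_rate(2**32 - 100, 200, 300)`, still comes out as 1.0 rather than a huge negative number.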

     

    Hope this helps...

  • marzlarz Rank: White Belt 9 posts since
    Jun 16, 2010
    5. Aug 3, 2010 2:30 PM (in response to Dave_the_Dude)
    Re: ssCpuIdle value showing zero

    Dave,

     

    This is very useful information. Thanks for taking the time to write it up !

     

    Now I just need to work with Zenoss to switch to ssCpuRaw*  and update our graphs to a counter rather than a gauge ( without losing historical data )

     

     

    Thanks again!

  • Dave_the_Dude Rank: White Belt 39 posts since
    Aug 3, 2010
    6. Aug 3, 2010 3:35 PM (in response to marzlarz)
    Re: ssCpuIdle value showing zero

    Trying to convert from ssCpu gauges to ssCpuRaw counters without losing historical data will be very challenging.  Perhaps I just didn't put enough time into it.  I know I could have pulled the data from the old ssCpu RRD files using rrdtool, then manually added it to the new ssCpuRaw RRD files using rrdtool as well, with a bit of math in the middle.  Just didn't want to put the time and effort into it, since we are still in "testing" phase with ZenOSS and our historical data was not critical.

     

    Also, you should be polling the full set of ssCpuRaw values, which includes more data (and additional data sources) than the default ssCpu stats.  They're not really compatible.  The good news is that if you set up your graphs properly using the ssCpuRaw stats, you will actually get more accurate info (by the clock tick), as well as a number of new data sources.

    If I were you, I'd just start over with your CPU stats.  Anything you were seeing from ssCpu was inaccurate garbage anyway, as the net-snmp docs clearly state.  If you are running a multi-core system, those ssCpu stats are especially useless (e.g., 400% idle!).

     

    I actually ended up writing a script that grabs the per-core/cpu statistics as an index.  This tells me when one of the CPU cores is maxed, even if the others are seeing low usage.  Averaging load across all cores obfuscates actual CPU utilization, making it pointless to monitor CPU utilization.  I want to see when one of my CPUs is being thrashed by a process.  I don't care if the other CPUs/cores are fine.  Until *NIX kernels support balancing the CPU load of a process across multiple CPUs/cores, I don't care what total percentage of free CPU time I have across the entire system.  I'm concerned about each of the CPUs/cores, and whether ANY of them are being thrashed by some poorly written java, a runaway DB query, etc.  The only way to do that is measure each CPU separately.  The easiest way to do that on *nix-like systems (Solaris, Linux, Mac OS X) is by using the "iostat" and "mpstat" commands (part of the sysstat package).
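The point about averages hiding a pegged core can be shown with a few lines. This is not the script mentioned above (which is not included in the thread), just a sketch that assumes you already have one idle percentage per core, for example from mpstat; the function names are hypothetical.

```python
def busiest_core(idle_percent_by_core):
    """Return (core_index, used_percent) for the core with the least idle time."""
    if not idle_percent_by_core:
        raise ValueError("need at least one core")
    index = min(range(len(idle_percent_by_core)),
                key=lambda i: idle_percent_by_core[i])
    return index, 100.0 - idle_percent_by_core[index]


def average_used(idle_percent_by_core):
    """Return the system-wide used percentage, averaged over all cores."""
    if not idle_percent_by_core:
        raise ValueError("need at least one core")
    return 100.0 - sum(idle_percent_by_core) / len(idle_percent_by_core)
```

With idle values of `[97.0, 3.0, 95.0, 99.0]`, the average shows only 26.5% used, while `busiest_core` reports core 1 at 97% used, which is the situation a system-wide threshold would miss.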

     

  • marzlarz Rank: White Belt 9 posts since
    Jun 16, 2010
    9. Aug 4, 2010 8:50 AM (in response to HummerBoy)
    Re: ssCpuIdle value showing zero

    Hummer,

     

    If you read the above thread, Dave the Dude stated:

     

    "The net-snmp docs have stated for a long time that ssCpuIdle and the other ssCpu* OIDs are deprecated and should not be used"

  • marzlarz Rank: White Belt 9 posts since
    Jun 16, 2010
    11. Aug 5, 2010 8:46 AM (in response to HummerBoy)
    Re: ssCpuIdle value showing zero

    I have a ticket in with Zenoss right now about this issue.

    I have updated my Data Sources to use the New OIDs, and updated my Graphs to be Counters, rather than Gauges, but still no dice.

    I'll be sure to post what the resolution is here once I get it.

     

     

    Thanks!

  • Dave_the_Dude Rank: White Belt 39 posts since
    Aug 3, 2010
    12. Aug 5, 2010 3:09 PM (in response to HummerBoy)
    Re: ssCpuIdle value showing zero

    HummerBoy wrote:

     

    This doesn't explain why this OID was working fine for at least the past 249 days (uptime on the latest one with this issue) and it stops working now. Why does it stop working on all four servers at a site all at the same time?

    A figure of 1942502991 ticks means nothing to me, whereas 90% idle does.

    What I'm trying to tell you is that if it ever did work, you weren't getting valid results anyway.  You cannot rely on the SNMP OID ssCpu* to give you an accurate "percentage" of CPU usage.  You need to calculate it yourself using the values returned by ssCpuRaw*.
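As a rough idea of what "calculate it yourself" involves: take two samples of the raw tick counters, subtract them, and express each delta as a share of the total. The sketch below assumes the caller passes non-overlapping states (for example user, system, idle and wait) and that the counters are 32 bits wide; the dictionary keys and the function name are illustrative only.

```python
COUNTER32_MODULUS = 2 ** 32  # assumed counter width


def cpu_percentages(first, second):
    """Turn two samples of raw tick counters into percentages.

    first, second: dicts mapping a state name to its raw tick count,
    taken some seconds apart. Both must use the same keys.
    """
    deltas = {name: (second[name] - first[name]) % COUNTER32_MODULUS
              for name in first}
    total = sum(deltas.values())
    if total == 0:
        raise ValueError("no ticks elapsed between samples")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}
```

A single raw figure such as 1942502991 really does mean nothing on its own; it is only the difference between two readings that turns into something like "80% idle".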

     

    I wrote a Nagios-compatible script for Linux devices that can be used from within ZenOSS (as a COMMAND data source) to grab the correct values and give you pretty accurate results (to the hundredth of a percentage).  It grabs a configurable number of seconds of data and calculates from the delta, so it will require X number of seconds to run (configure the $sample_seconds variable to set the sample period).  I've attached it to this post.

    The ZenOSS Administrator's Guide explains specifically how to use scripts as a COMMAND data source.  You will need to follow those instructions to set up the data sources for your graphs, but this script will give you the correct values.  Also, you must have the Net-SNMP utilities and perl installed on whatever system will be running this script.  Make sure to run the script once from the command line as the zenoss user to see if it works.

     

    The script would be run like (set your Command Template to):

    check_linux_cpu.pl ${here/zSnmpVer} ${here/zSnmpCommunity} ${here/manageIp}

     

    You can run it from the command line to test your server:

    check_linux_cpu.pl $SNMP_VERSION $SNMP_COMMUNITY $HOSTNAME

     

    $SNMP_VERSION = the appropriate SNMP version (use v2c)

    $SNMP_COMMUNITY = the community string you are using

    $HOSTNAME = the hostname or IP address of the server you want to poll for data

     

    It will return values in the following format:

    |Count=2 TotalUsed=4.705 User=4.004 Kernel=0.701 Idle=95.295 Wait=0.000

     

    Count = the number of CPU cores on your system

    TotalUsed = the total percentage of "Used" CPU time

    User = percent CPU used on "User" processes

    Kernel = percent CPU used on "Kernel" tasks

    Idle = percent CPU idle (total percentage "Free" CPU)

    Wait = percent CPU iowait (amount of time CPU spends waiting for I/O)
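If you want to sanity-check the output before wiring it into a data source, the line format shown above is easy to split into name/value pairs. This sketch assumes the output is exactly the pipe-prefixed, space-separated `Name=value` form from the sample line; `parse_perfdata` is a hypothetical helper, not part of the script.

```python
def parse_perfdata(line):
    """Parse a '|Name=value Name=value ...' line into a dict of floats."""
    fields = {}
    # Drop the leading pipe, then split on whitespace into Name=value tokens.
    for token in line.strip().lstrip("|").split():
        name, _, value = token.partition("=")
        fields[name] = float(value)
    return fields
```

Parsing the sample line gives `Count` as 2.0 and `Idle` as 95.295, and `TotalUsed` plus `Idle` add up to 100, which is a quick consistency check on any output you get.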

     

    Note that it only works for Linux.  If you want to check Solaris, BSD or Mac OS X machines (etc), you'd need to hack this script to specify which of the ssCpuRaw OIDs are important, and what OIDs you want to check.

     

    Good luck!

    Attachments:
  • Dave_the_Dude Rank: White Belt 39 posts since
    Aug 3, 2010
    13. Aug 5, 2010 3:18 PM (in response to HummerBoy)
    Re: ssCpuIdle value showing zero

    HummerBoy wrote:

     

    An update.

    As a bit of background to the site: ESX 3.5 on an eight-core machine, with various guest VMs including these 4 Linux problem machines (just one of the places we have the issue). I also realise that this issue isn't directly caused by Zenoss; it is being caused by the system being monitored. I'm just trying to find out if anyone knows what net-snmp is talking to that is giving the zero values.

    Today I restarted one of the four servers at this site and Zenoss has now cleared the CPU idle issue. Many manual snmpgets later it is still showing a non-zero figure. This probably explains why all four Linux servers had this issue happen at the same time: they all came up at the same time when the ESX host was started. It appears the issue is related to the uptime of the guest OS. As to what 'it' is, I'm stumped.

    Is there an SNMP transaction count limit that needs to be reset?

    VMware guests have their own problems with time-keeping.  It's the same as the issues Net-SNMP has with obtaining accurate values for the ssCpu* stats.  Namely, there isn't a fixed clock, so there is no way for the kernel to know how many ticks are in a clock cycle.  If you play with ntpdate on a Linux VMware guest, you will find your time drift growing exponentially.  This is due to the kernel not calculating the clock cycle properly.  It's a known issue, and one out of many reasons the Net-SNMP project abandoned ssCpu stats a LONG time ago.
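One more angle on the "uptime related" observation, offered only as a possibility that nothing in this thread confirms: if the agent keeps the tick count in a signed 32-bit value and the kernel counts 100 ticks per second, that value overflows after roughly 248.5 days, which is close to the 249 days of uptime mentioned earlier. Both the 100 Hz tick rate and the 31 usable bits are assumptions in the sketch below.

```python
SECONDS_PER_DAY = 86_400


def days_until_wrap(ticks_per_second=100, counter_bits=31):
    """Days of uptime before a tick counter of the given width overflows.

    Defaults are assumptions: 100 ticks per second and a signed 32-bit
    value (31 usable bits).
    """
    return (2 ** counter_bits) / ticks_per_second / SECONDS_PER_DAY
```

With the defaults this comes out at about 248.55 days; an unsigned 32-bit counter (`counter_bits=32`) would last twice as long. Checking the uptime of each affected server against those figures is a cheap way to see whether this explanation fits.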
