I have Zenoss 3.0.2 installed on three machines.
I have a "Master" instance on one machine and two separate "Collector" Machines.
I have 1600 devices.
Everything works fine for 23 1/2 hours a day, no issues at all.
Then, about an hour before midnight, the collector I/O writes go through the roof and zenperfsnmp misses data for about 20 minutes.
Any help is appreciated.
I would check a few things (a couple of example commands follow the lists below):
1) verify that a ZenPack isn't causing the I/O
2) verify that a massive report hasn't been scheduled, resulting in I/O
3) verify that the RRD cleanup script isn't causing the I/O
If all is well there, then I would check:
1) the config interval / cycle time
2) verify that a backup isn't scheduled and running, resulting in the I/O
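A few quick ways to check those on the collector, sketched below; $ZENHOME and the zenoss user are assumptions for a default 3.x install, so check zenperfsnmp --help for the options your build actually supports:

    # See whether a backup or report job is scheduled around the 11pm window
    crontab -l -u zenoss
    # Watch which process is generating the writes while the spike happens
    iostat -x 5
    top -u zenoss
    # Review the collector's configured intervals (non-comment lines only)
    grep -v '^#' $ZENHOME/etc/zenperfsnmp.conf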
Thank you for the response. I will try all of the above and report back.
It appears to be collector-dependent, not server-dependent. No luck yet.
What is the "rrd cleanup script"? How do I find out when it runs?
What version are you running?
The RRD cleanup is a function of zenperfsnmp.py. After a quick look and a recap of your symptoms, I don't think this would be the culprit for the I/O pileup. Would it be possible to put your zenperfsnmp in debug so we can look at activity during that time? An lsof at that time would also prove revealing, I think.
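If you want to try that, something like this on the collector should do it; the pid-file name below is a guess based on the usual daemon-monitor naming, so adjust it to whatever actually sits in $ZENHOME/var:

    # Toggle debug logging on the running daemon (or set logseverity 10
    # in zenperfsnmp.conf and restart for the same effect)
    zenperfsnmp debug
    tail -f $ZENHOME/log/zenperfsnmp.log
    # While the I/O spike is underway, snapshot the daemon's open files
    lsof -p $(cat $ZENHOME/var/zenperfsnmp-localhost.pid)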
Thanks for all your help. I found the problem (although not the fix yet).
I miss data for about 4-6 five-minute cycles every day.
I noticed the cycle time is 300+ seconds, but more importantly, about 98% of the devices are still waiting to be queued.
It looks like Zenoss queries SNMP for the first 20 devices and then times out, never hearing back from them.
About every 4 days the collectors actually just stop collecting: the daemons are still running, but Zenoss seems to be no longer connected to them, and when I shut down with zenoss-stack I have to kill the "lost" processes manually (see the commands below).
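For anyone hitting the same hang, this is roughly how I find and clear the orphaned daemons; the wildcard pid path is an assumption, match it to your monitor name:

    # Compare running daemons against the recorded pid files
    ps auxww | grep '[z]enperfsnmp'
    cat $ZENHOME/var/zenperfsnmp*.pid
    # Kill whatever the stack script lost track of
    kill <orphaned-pid>    # escalate to kill -9 only if it won't exit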
At the same time there is another process running on the same machine.
I noticed the nice value of the Zenoss zenperfsnmp process is 15/16.
I can only imagine the other process is pushing Zenoss into the background. I will investigate further and let you know if I solve it.
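As a quick test of that theory, renicing the collector back to normal priority should tell me whether scheduling is the problem (run as root; the pid lookup is just an example):

    # Reset zenperfsnmp to the default priority of 0
    renice -n 0 -p $(cat $ZENHOME/var/zenperfsnmp*.pid)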
Thanks again, Shane.
Glad you've found the root cause. One way to help ensure zenperfsnmp doesn't end up just giving up on life is to set the maxparallel value in zenperfsnmp.conf on the collector to 250. The default is 500, but Twisted will sometimes die unexpectedly once a certain number of requests are deferred. This should help alleviate that particular problem.
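For reference, that's a one-line change on the collector; the path below assumes the default config layout under $ZENHOME/etc:

    # $ZENHOME/etc/zenperfsnmp.conf
    maxparallel 250

Then restart the daemon with zenperfsnmp restart so it picks the new value up.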