Hopefully someone will have an answer for me... A few weeks ago I started seeing gaps in the graphs (see attachment). I tried reindex() and commit() but that didn't work, and now I don't know what else to try. Honestly I don't want to delete the .rrd files and recreate them again; I already did that once, and it works fine for a couple of weeks, then the problem comes back. Sometimes the graph works fine, then gaps, then it's OK, then gaps again, or the graph doesn't draw correctly. Usually it happens for some devices, BUT sometimes it happens for ALL the devices (same time, same day), strangely...
I have also checked the network side, and everything looks fine (it's a VM server). We have another, older tool (Cricket) on the same network and subnet, and it draws its graphs perfectly: no gaps, no drawing issues. So I believe it's something related to Zenoss...
I'm not an expert in Linux or Zenoss, so it would be great if you could point me at where to look...
Any help will be greatly appreciated!
Thanks for the info... this is what I'm seeing:
2012-05-31 11:42:36,921 INFO zen.zenperfsnmp: success:48 fail:1 pending:20 todo:9
2012-05-31 11:42:37,292 INFO zen.zenperfsnmp: success:49 fail:1 pending:20 todo:8
2012-05-31 11:42:38,338 INFO zen.zenperfsnmp: success:50 fail:1 pending:20 todo:7
2012-05-31 11:42:39,239 INFO zen.zenperfsnmp: success:51 fail:1 pending:20 todo:6
2012-05-31 11:42:40,781 INFO zen.zenperfsnmp: success:52 fail:1 pending:20 todo:5
2012-05-31 11:42:41,938 INFO zen.zenperfsnmp: success:53 fail:1 pending:20 todo:4
2012-05-31 11:42:42,022 INFO zen.zenperfsnmp: success:54 fail:1 pending:20 todo:3
2012-05-31 11:42:42,983 INFO zen.zenperfsnmp: success:55 fail:1 pending:20 todo:2
2012-05-31 11:42:43,821 INFO zen.zenperfsnmp: success:56 fail:1 pending:20 todo:1
2012-05-31 11:42:45,110 INFO zen.zenperfsnmp: success:57 fail:1 pending:20 todo:0
2012-05-31 11:42:46,699 INFO zen.zenperfsnmp: success:58 fail:1 pending:19 todo:0
2012-05-31 11:42:48,227 INFO zen.zenperfsnmp: success:59 fail:1 pending:18 todo:0
2012-05-31 11:42:49,563 INFO zen.zenperfsnmp: success:60 fail:1 pending:17 todo:0
2012-05-31 11:42:49,764 INFO zen.zenperfsnmp: success:61 fail:1 pending:16 todo:0
2012-05-31 11:42:49,995 INFO zen.zenperfsnmp: success:62 fail:1 pending:15 todo:0
2012-05-31 11:42:50,721 INFO zen.zenperfsnmp: success:63 fail:1 pending:14 todo:0
2012-05-31 11:42:51,824 INFO zen.zenperfsnmp: success:64 fail:1 pending:13 todo:0
2012-05-31 11:42:53,523 INFO zen.zenperfsnmp: success:65 fail:1 pending:12 todo:0
It says "fail:1" but I don't know exactly what that means, or where I should look?
Adding some performance graphs for the monitor (localhost).
Even this graph has gaps...
Hmm... so maybe it's not just zenperfsnmp, but something else that's dropping things. I'm new to Zenoss myself, so I'm not sure where to check. See if events.log has anything interesting? Also run top to make sure you don't have a lot of processes eating up all your CPU, and free to make sure you've got enough memory and aren't swapping a bunch to disk.
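If you want numbers you can line up against the gap times, here's a minimal sketch (assuming you can install the psutil package; it isn't part of Zenoss) that snapshots CPU, memory, and swap:

import psutil

# Sample CPU over one second, then read current memory and swap usage.
cpu = psutil.cpu_percent(interval=1)
mem = psutil.virtual_memory()
swap = psutil.swap_memory()

print("cpu: %.1f%%" % cpu)
print("mem: %.1f%% of %d MB used" % (mem.percent, mem.total // (1024 * 1024)))
print("swap: %.1f%% used" % swap.percent)

# Heavy swap use with memory near 100% would point at the Zenoss server
# itself starving, rather than the network or the monitored devices.

Run it from cron every few minutes and see whether swap activity matches the gaps.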
I think the monitoring processes (SNMP, WMI, ...) have a lower priority than "regular" services, meaning that when a box is busy or running out of resources, it will stop sending monitoring information in an attempt to prioritize its resources. (I have no real proof that this is happening and have not found it discussed anywhere, but it seems to fit the evidence.) If all the graphs are dropping at the same time and the lights are still on, I would suspect the Zenoss server is the one running low on some resource.
One example of this I have seen is a Windows machine with a disk going bad that takes a long time to respond to requests. The server/workstation just sits there, responses to mouse clicks can literally take minutes, but Task Manager / Performance Monitor show nothing unusual when they do respond.
Can you detail here some of the graphs that have gaps? The data may be collected by SNMP, in which case explodinglemur's suggestion to check zenperfsnmp.log should show any real oddities. You often do find a few devices in the fail count because they were busy / down / ... There is often a message at the end of the polling cycle saying which devices the fails were for.
There may be other Zenoss daemons collecting your data, though. zencommand gets data through SSH and is also used under the covers by some of the other data collection types. For example, if you install the HTTP ZenPack there is an HTTP data collector, but it actually uses zencommand underneath. So check zencommand.log too.
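If you want to pull those failure messages out of the logs quickly, here's a rough scan over both files. The paths assume a default install under /opt/zenoss, and the keywords are guesses at the wording, so adjust them to whatever your logs actually say:

import re

# Default $ZENHOME/log locations; adjust for your install.
LOGS = ["/opt/zenoss/log/zenperfsnmp.log",
        "/opt/zenoss/log/zencommand.log"]

# Guessed keywords; skip the "success:65 fail:1 ..." summary lines
# so only the per-device messages show up.
pattern = re.compile(r"timeout|timed out|fail|error", re.IGNORECASE)

for path in LOGS:
    with open(path) as fh:
        for line in fh:
            if pattern.search(line) and "success:" not in line:
                print(line.rstrip())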
Another thing you can do is turn up the logging level on zenperfsnmp and/or zencommand. From ADVANCED -> Settings -> Daemons, select the daemon and use the edit config button. Change the logging level from Info to Debug and restart the daemon. If you have lots of devices this will generate LOTS of output, but you should see far more of what is going on. Don't leave this logging level turned on once you have the problem resolved.
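If you prefer the command line, the same setting lives in the daemon's config file (paths assume a default install with $ZENHOME=/opt/zenoss; for the Zenoss daemons the numeric logseverity levels are 10 = Debug and 20 = Info):

# in /opt/zenoss/etc/zenperfsnmp.conf
logseverity 10

Then restart with "zenperfsnmp restart" as the zenoss user, and remember to set it back to 20 when you're done.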
If the data does come back periodically then there is obviously nothing fundamentally wrong, so there is no point in deleting your RRD files.
You could also check your event console and see if there are any heartbeat events from the data collection daemons such as zenperfsnmp and zencommand. If there have been issues, you might check their respective log files to see whether the daemons died at times that match your data gaps.
Another thing that can wreck data collection is changing the cycle time on command templates (the zenperfsnmp ones are fixed, typically at 5 minutes, by the parameters of the zenperfsnmp daemon, though with 3.2.1 you can now change that on a per-device basis). The underlying RRD technology does not play nicely with changes to the cycle time (what it calls the STEP parameter). If someone has changed the cycle time in a template and then changed it back again, you probably end up with lots of NaN values in your RRD files, which would match your graph gaps.
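One way to check for that is to look at the step and the stored values directly with rrdtool. Here's a minimal sketch; the .rrd path is just an example, so point it at one of the datapoints behind a gappy graph (it assumes the rrdtool binary is on the PATH):

import subprocess

# Example path; Zenoss keeps per-device RRDs under $ZENHOME/perf/Devices/.
RRD = "/opt/zenoss/perf/Devices/router1/ifInOctets_ifInOctets.rrd"

# The step should match your collection cycle: 300 seconds for 5 minutes.
info = subprocess.check_output(["rrdtool", "info", RRD]).decode()
for line in info.splitlines():
    if line.startswith("step"):
        print(line)

# Count unknown samples in the dump; long runs of NaN are your gaps.
dump = subprocess.check_output(["rrdtool", "dump", RRD]).decode()
print("NaN values: %d" % dump.count("NaN"))

If the step doesn't match the template's cycle time, or the NaN count keeps growing, the RRD file itself is where the gaps are coming from.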
Thanks, everyone, for taking the time to answer my question... here are some thoughts:
- Debug logging for zenperfsnmp does show more information, but I was still seeing "fail:1" and not the reason why it failed...
- same for zencommand: nothing out of the ordinary...
- the event console shows some errors that I'm aware of, for example SVI interfaces on the routers that we are not monitoring and that are disabled...
- after reviewing all those logs, the only "reason" I can see is that it could be related to resources. top doesn't show high CPU (around 10% or less), but memory can be added; it's a VM, so I will need to shut it down to add more RAM...
I don't know if this will fix the issue, but I will let you know. By the way, I restarted all the Zenoss services, and strangely, for a good 10 minutes some devices didn't draw lines on their graphs... then they started drawing again...
My last resort is building another VM in parallel... honestly I don't want to do that...