I've been chasing this for about two weeks now and coming up dry. Has anyone else seen the following problem? Any idea where to look?
About once every other day, we get a spattering of query timeouts. Pick any random ten or twenty queries, and they all time out. The event logs indicate that they were unable to get a response in time. By the time we get the alerts and log in, the problem has long since cleared.
1. I've set up constant ping and TCP monitoring between some of the affected systems and proved that there was no networking outage when the timeouts occurred.
2. Many of the services that reported failures are ones where a real outage would be noticed by other systems (e.g. DB servers would write DB failure messages to their logs), and this simply doesn't occur.
3. The timing of the messages is completely random and unrelated to load. In fact, they have happened during off-peak periods more often than during peak load.
In short, we've isolated that this "timeout" seems to be occurring inside Zenoss itself, and is not actually a problem with the remote service. Some sort of internal locking?
1. This started about two weeks ago, and there had been zero other changes to the system for many months. Not related to a change.
2. This server does NOTHING except run Zenoss. It has no cron scripts unrelated to Zenoss, etc.
3. Zenoss and SAR monitoring of the system indicate no resource consumption issues -- plenty of free memory, CPU, etc.
Zenoss Stack 3.2.1
24 GB main memory
Nobody has seen this before?
Are these SNMP queries?
SNMP queries. SQL queries. Localhost commands. Everything and anything. There's no consistency in this; it appears to be "every check command run in that one-minute interval".
The nature of it made me suspect that we were running out of memory, file handles, or something like that, but after setting up some extensive reporting I am certain that this is not what is happening. I/O doesn't spike; it actually drops during the outage period. No issues hitting file descriptor or any other limits.
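For reference, this kind of reporting can be approximated with a small sampler like the one below. It is a sketch only, assuming a Linux host with /proc mounted; the pid you pass in (for example, that of one of the Zenoss daemons) and the field names are placeholders, and it is not the reporting actually set up here.

```python
import os
import time


def count_open_fds(pid):
    """Open file descriptors for `pid`, or None if /proc/<pid>/fd is unreadable.

    Linux-only: relies on the /proc filesystem.
    """
    try:
        return len(os.listdir("/proc/%d/fd" % pid))
    except OSError:
        # process is gone, permission denied, or /proc is not available
        return None


def sample(pid):
    """Take one timestamped snapshot of fd count and 1-minute load average."""
    load1, _load5, _load15 = os.getloadavg()
    return {
        "time": time.time(),
        "pid": pid,
        "open_fds": count_open_fds(pid),
        "load1": load1,
    }
```

Logging sample() every few seconds makes it possible to check afterwards whether the fd count or load moved at all during a timeout window.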
My best guess is that there is some global lock contention within Zenoss itself that we are slamming into.
DHCP or network card issues? Is it using a static IP?
Static IP. No networking problems. No packet loss to the system during these outages (left a spray running), and tests running on the local system that use no networking fail at the same time.
Whatever the problem is, it's internal to the queueing mechanism for running tests.
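One way to narrow this down further is to run a trivial local command on a fixed schedule outside of Zenoss and see whether it also stalls in the same window. The sketch below is an assumption about how that could be done, not something already in place: time_command() times a local command, and find_gaps() flags any pair of consecutive heartbeat timestamps that are further apart than expected. Both names are made up for illustration.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a local command; return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode, time.monotonic() - start


def find_gaps(timestamps, interval, tolerance=0.5):
    """Return (previous, current) pairs where consecutive heartbeats
    are more than interval + tolerance seconds apart."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > interval + tolerance:
            gaps.append((prev, cur))
    return gaps
```

If the independent heartbeat shows no gaps while the Zenoss checks all time out, that points at Zenoss's own scheduling rather than the host. If both stall together, the host itself is pausing.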