Help for figuring out why collectors are failing
Identify the Problem
Check the logs.
Your first step in troubleshooting a daemon is to look in the logs for errors. The Zenoss logs are in $ZENHOME/logs. Starting in version 1.1, you can also view the logs through the web user interface under the About link.
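If there are many log files, a small helper can show which ones deserve attention first. This is a minimal sketch, not something Zenoss ships: the function name "count_log_errors", the ".log" file suffix, and the search patterns (ERROR, CRITICAL, Traceback) are all assumptions, so adjust them to what your logs actually contain.

```shell
# Count suspicious lines per log file in a directory, busiest file first.
# Hypothetical helper: the patterns and the *.log suffix are assumptions
# about what your log files look like.
count_log_errors() {
    logdir="$1"
    for f in "$logdir"/*.log; do
        [ -f "$f" ] || continue
        # grep -c exits non-zero when nothing matches, hence "|| true".
        n=$(grep -c -E 'ERROR|CRITICAL|Traceback' "$f" || true)
        if [ "$n" -gt 0 ]; then
            echo "$n $f"
        fi
    done | sort -rn
}
```

Calling it as count_log_errors "$ZENHOME/logs" prints one line per file that has matches, with the match count in front of the file name.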
Get more information.
You can get additional logging by lowering the logging threshold with the -v option and running the daemon in foreground mode.
For example, this command will run zenperfsnmp:
$ $ZENHOME/bin/zenperfsnmp run -v 10
See this page for a list of command line options for the remaining Zenoss daemons.
You will get more (sometimes, a lot more) log messages, but zenperfsnmp will only perform a single scan of all the devices.
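Because a verbose run can scroll past quickly, it may help to keep a copy of the output for later searching. The wrapper below is a sketch under that assumption; "run_and_capture" is a hypothetical name, and the output file is whatever path you pass in.

```shell
# Run a command in the foreground and keep a copy of everything it prints.
# Hypothetical helper: the first argument is the file to write, the rest
# is the command to run.
run_and_capture() {
    outfile="$1"
    shift
    # 2>&1 merges stderr into stdout so messages sent to either stream
    # end up in the file as well as on the terminal.
    "$@" 2>&1 | tee "$outfile"
}
```

For example, run_and_capture /tmp/zenperfsnmp-debug.log $ZENHOME/bin/zenperfsnmp run -v 10 shows the messages as usual and leaves them in the named file, which here is only an example path.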
Find out what the program is _really_ doing.
If a program is hanging, or not behaving as you expect, you can eavesdrop on what the program is asking the operating system to do on its behalf. This is a good way to determine if a helper program is failing, or system errors are not propagating up to the log file.
On Linux machines, the command to trace these system calls is "strace". On the appliance you can add strace to your system with:
$ conary update strace
Other POSIX-like operating systems have their own commands (truss on Solaris, ktrace on OS X). For example, you can verify that zenperfsnmp is really sending packets:
$ strace -f -e trace=sendto \
Don't forget the largest and most complex Zenoss daemon: MySQL.
Check the version against the one needed by Zenoss. If you see "Lost Connection to Zenoss" in the dashboard, it is likely a MySQL connection problem.
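Comparing version strings by eye is error prone (5.0.9 is older than 5.0.10, for instance), so a small comparison helper can be useful. This is only a sketch: "version_at_least" is a hypothetical name, the required version is not stated here and must come from the requirements for your Zenoss release, and "sort -V" is a GNU coreutils extension that may be missing on other systems.

```shell
# Succeed when version $1 is at least version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_at_least() {
    # Sort the two versions; if the required one ($2) comes out first
    # (or they are equal), the installed one ($1) is new enough.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

You would call it with the version you have installed and the version you need, both supplied by you. Note that "mysql --version" reports the client version, which is not necessarily the version of the server that Zenoss connects to.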
Narrow the Problem
If you have narrowed a problem down to a single device, or simply suspect a device because it runs an odd configuration or was recently added, most of the active collectors let you scan just that device:
$ $ZENHOME/bin/zenperfsnmp run -v 10 --device SomeDevice
If the problem is related to a long-running server, you can run the server in foreground mode while keeping the normal endless cycle:
$ $ZENHOME/bin/zenperfsnmp run -v 10 --cycle
Look for Conflicts
Are you running more than one copy of a daemon? During debugging and before version 1.1, Zenoss could lose track of background processes.
Stop Zenoss and look for stray processes:
$ ps auxww | grep /z
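The plain grep above also lists the grep process itself, which is easy to mistake for a stray daemon. The filter below is a sketch that hides it; "list_matching_procs" is a hypothetical name, and the column positions assume the usual Linux "ps aux" layout (PID in column 2, command from column 11 onward).

```shell
# Read "ps auxww" output on stdin and print the PID and command line of
# every process whose command contains the given text.
# Hypothetical helper: assumes the Linux "ps aux" column layout.
list_matching_procs() {
    awk -v pat="$1" '
        NR == 1 { next }                 # skip the header row
        {
            # Rebuild the command line from column 11 onward.
            cmd = ""
            for (i = 11; i <= NF; i++) cmd = cmd (i > 11 ? " " : "") $i
            # Ignore the grep or awk doing the searching.
            if (index(cmd, pat) && cmd !~ /^(grep|awk) /) print $2, cmd
        }'
}
```

It is used in place of grep, for example: ps auxww | list_matching_procs /z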
Are you resource limited? Is the file system full? Do you have free memory? My favorite tool for this is "top" under Linux:
$ top
This program constantly updates the display with a list of the most CPU-hungry programs. You can also sort the list by memory usage.
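For the file system question, a one-line check may be enough. The sketch below wraps "df"; "disk_used_pct" is a hypothetical name, and it assumes the POSIX output format selected by "df -P", where the fifth column of the second line is the percentage used.

```shell
# Print the percentage of space used on the file system holding a path.
# "df -P" selects the POSIX output format: one line per file system,
# with the capacity figure in column 5.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, disk_used_pct "$ZENHOME" prints a number between 0 and 100. On Linux, "free -m" gives a comparable quick view of memory.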
Reproduce the Problem
The ability to reproduce a problem with a consistent set of steps will help enormously. Often the only way to find the problem is to use "binary search". Reproduce the problem, take away "half" of the configuration, and check whether the problem is still there. Keep whichever "half" still shows the problem and repeat until a single element remains.
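The halving procedure can be sketched as a small loop. Everything here is illustrative: "bisect_culprit" is a hypothetical name, the check command is one you write yourself, and the sketch assumes that exactly one item is responsible, that it triggers the problem on its own, and that item names contain no whitespace or wildcard characters.

```shell
# Find the single item that makes a check fail, by repeated halving.
# Usage: bisect_culprit CHECK item1 item2 ...
# CHECK is a command of your own: it receives a subset of the items as
# arguments and must exit 0 when the problem does NOT occur.
bisect_culprit() {
    check="$1"
    shift
    items=$(printf '%s\n' "$@")
    count=$#
    while [ "$count" -gt 1 ]; do
        half=$(( count / 2 ))
        first=$(printf '%s\n' "$items" | head -n "$half")
        # Word splitting of $first is intentional: one argument per item.
        if "$check" $first; then
            # First half is clean, so the culprit is in the remainder.
            items=$(printf '%s\n' "$items" | tail -n "$(( count - half ))")
            count=$(( count - half ))
        else
            # The problem shows up in the first half: keep only that.
            items=$first
            count=$half
        fi
    done
    printf '%s\n' "$items"
}
```

With N items this needs roughly log2(N) runs of the check rather than N, which matters when each reproduction takes minutes.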
Search the Mailing Lists
Hopefully someone has seen a similar problem. Their pain can save you time and energy. Also, we are always looking for patterns across users and that will help narrow down an issue.
Report the Problem
Some problems should be reported even in the absence of detailed information, because they are almost certainly bugs:
- the Python interpreter crashes (a segmentation fault, for example)
- a Python traceback in a log file
- a daemon regularly drops heartbeats
- a daemon's size grows over time to consume all resources
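For the last item in the list above, a few numbers attached to the report make the growth much easier to confirm. The sketch below samples a process's resident memory on Linux; the function names are hypothetical, and it assumes the kernel exposes a VmRSS line in /proc/PID/status, which is Linux-specific.

```shell
# Print the resident memory of a process in kB (Linux only).
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Print "epoch-seconds rss-kB" for a process COUNT times,
# INTERVAL seconds apart.
sample_rss() {
    pid="$1"
    interval="$2"
    count="$3"
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$(date +%s) $(rss_kb "$pid")"
        i=$(( i + 1 ))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, sample_rss 1234 60 30 records thirty readings, one per minute, for the process with PID 1234 (a placeholder). A steadily rising second column supports the suspicion that the daemon is growing.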
Customers _always_ come first.