Recently, I've been toying with the idea of adding some self-monitoring to Zenoss. Previously we used Enterprise 3.2.1, but have since upgraded to 4.1.1. In both versions, we've come across a particular issue. We have a large number of reports that we send to users via cron + reportmail. From time to time, we get messages from users about data missing from reports. Sure enough, we dig around and find that performance data hasn't been collected for a particular device, and there is nothing substantial in the event history to suggest a problem.
In an effort to be proactive instead of reactive about these issues, my immediate thought is to run a cron job that checks when each RRD file was last touched. If a file hasn't been touched in X amount of time, it fires off a message to someone. I leaned on Zenoss support a bit to see if they had any suggestions, but they simply stated that there was talk of adding such functionality in future releases. I'm curious whether anyone else has implemented anything along these lines in their environment to keep Zenoss honest.
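For anyone wanting to try the same thing, here is a rough sketch of that check in Python. It is only an illustration, not something Zenoss ships: the function names, the `.rrd` suffix filter, and the idea of pointing it at the directory where your install keeps performance data (often something like $ZENHOME/perf/Devices, but verify that on your own system) are all my assumptions.

```python
import os
import time


def find_stale_rrds(root, max_age_seconds, now=None):
    """Return a sorted list of (path, age_seconds) for .rrd files under
    root whose modification time is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".rrd"):
                continue
            path = os.path.join(dirpath, name)
            try:
                age = now - os.path.getmtime(path)
            except OSError:
                # File disappeared between the walk and the stat; skip it.
                continue
            if age > max_age_seconds:
                stale.append((path, age))
    return sorted(stale)


def main(argv):
    """argv is [root_directory, max_age_seconds] (hypothetical interface).
    Prints one line per stale file and returns 1 if any were found."""
    root, max_age = argv[0], float(argv[1])
    stale = find_stale_rrds(root, max_age)
    for path, age in stale:
        print("%s not updated for %d seconds" % (path, age))
    return 1 if stale else 0
```

To run it from cron, add a line such as `sys.exit(main(sys.argv[1:]))` (with `import sys`) at the bottom and pass the directory and threshold as arguments. Cron will mail any output if MAILTO is set, so a quiet run means every file is fresh. Pick a threshold of a few collection cycles so that one slow poll doesn't page anyone.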
This feature is planned and on the roadmap.
What I do, in the meantime, is test the process table and use predictive thresholds on the collector graphs. The datapoints graph should never change abruptly, so it's perfect for detecting collection failures.
Along those lines, I've just added some basic daemon monitoring using monit. It can also detect file timestamp changes, but I'm not using that right now. http://mmonit.com/monit/documentation/monit.html#timestamp_testing