We recently migrated to Amazon EC2 and have found Zenoss to be fantastic. However, integration with Amazon CloudWatch doesn’t work particularly well: Amazon throttles the requests, and as a result Zenoss raises many false alarms.
As an example:
ERROR:boto:400 Bad Request
The workaround was to avoid using the ZenPack and fall back to a combination of the standard ‘Device’ monitoring templates and our own templates.
Are there any other suggestions to help resolve this?
An FYI to third parties reading this: this conversation started as an email thread, where there is more detail, so keep in mind that not all of the context may be included here.
I have worked with the EC2 ZenPack, and I know about that throttling. You can ask Amazon's CloudWatch support team to adjust this throttling, and they usually will. If you need a contact there, let me know, though I think they have a support arm of some kind that you should try first. Especially if you are paying them for their service and you explain that you are attempting to use a third-party monitoring tool that interfaces via CloudWatch, this request should be no problem.
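If you also poll CloudWatch from your own scripts, retrying throttled requests with an increasing delay can cut down on false alarms while you wait on Amazon. Below is a minimal sketch of that idea. It is not what the ZenPack does, and `ThrottledError`, `call_with_backoff`, and `fetch` are hypothetical names of mine, not part of boto; you would need to map whatever error your client raises when throttled onto `ThrottledError`.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for the error your client raises when throttled."""


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fetch(); on ThrottledError, wait and retry with exponential backoff.

    fetch is any zero-argument callable (an assumption for this sketch).
    sleep and rng are injectable so the function can be exercised without
    real delays.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ThrottledError:
            if attempt == max_attempts - 1:
                # Out of retries: re-raise so a genuine failure is still reported.
                raise
            # The upper bound doubles on each attempt, capped at max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: wait a random fraction of the bound so that many
            # pollers do not all retry at the same moment.
            sleep(cap * rng())
```

Only raising an alarm after the retries are exhausted means a single throttled request no longer looks like an outage.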
I was at a DevOps meeting last night and, during a relevant discussion, I passed on what you said: that you found the normal SNMP, SSH, etc. monitoring to be more specific and reliable than the limited EC2 CloudWatch stats. Another person commented that auto-scaling in the cloud sometimes cannot be based on simple statistics like CPU, memory, etc. Instead, he found that he had to look at the limits of his applications and services. When those reach their limits, you spin resources up or down. This fits Zenoss well. Note also that you can run commands when thresholds are crossed or events occur, for example spawning a new instance when app-related connections get too high.
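To illustrate scaling on an application limit rather than on CPU or memory, here is a rough sketch of the decision a command script could make when it is handed a connection count. The function name, the thresholds, and the returned action strings are all hypothetical; actually launching or stopping an instance, and wiring the script into Zenoss, are left out.

```python
def scaling_action(current_connections, connection_limit,
                   scale_up_at=0.8, scale_down_at=0.3):
    """Decide whether to add or remove capacity from an application-level limit.

    All names and default thresholds are illustrative assumptions.
    Returns "scale_up", "scale_down", or "hold".
    """
    if connection_limit <= 0:
        raise ValueError("connection_limit must be positive")

    # Fraction of the application's connection limit currently in use.
    utilisation = current_connections / float(connection_limit)

    if utilisation >= scale_up_at:
        return "scale_up"    # close to the limit: add an instance
    if utilisation <= scale_down_at:
        return "scale_down"  # mostly idle: an instance can be stopped
    return "hold"            # between the two thresholds: do nothing


if __name__ == "__main__":
    # Example: 850 of 1000 allowed connections in use.
    print(scaling_action(850, 1000))
```

Keeping a gap between the two thresholds avoids flapping, where an instance is started and stopped repeatedly around a single cut-off.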
You also mentioned: "What does work particularly well in these environments are zenoss maintenance windows as we stop/start servers when processing capacity is required."
I don't understand the maintenance window usage. Isn't processing capacity unpredictable, whereas maintenance windows have to be set in advance? Or is your processing capacity predictable enough that you can plan these ahead of time?