Well, as it states, I have a custom script that checks MySQL for the seconds behind the master. I need to tweak it a bit, because if the slave stops it returns NULL and I need to turn that into some kind of critical alert. For this thread, though, here is my problem.
The MySQL template has a mysql_check_slave command script that returns a value, which is the seconds behind. When I select a server that has the MySQL template and then click Graphs, I get one like image1. You will see I stopped the slave for a little while, it went about 700 or so seconds behind the master, and that is clearly shown in the image. The legend shows a max of 731m, so I guess that was the max, but I'm not sure what the m is. The same is reflected on the Y axis, where you see 600m, etc.
Now, I also have that exact graph in a multi-graph so I can see all 30+ MySQL servers and how they're running. After I tested and verified the 700 seconds behind, both via the command line and the graph shown in image1, I went to my multi-graph report and looked at that server. On the default page load you get image2, and you can see how the graph seems to pin at 200 where it should be 700! The min is set at 0 and the max is -1 (it's the same graph, as I said), and the legend shows the same max, so we know that part is correct.
I then used the zoom-in button, which gives image3, and that is where it gets even odder and less helpful. You still see the spike at the same time and the graph still pins at 200, but the max in the legend changes to 234.84m.
The other admins and I use the multi-graph because it's so much quicker and makes it easier to see what's going on, but the data being shown is not correct, so it doesn't really help. So the 2 directly related questions are:
1. Why is this happening? How can it be fixed, prevented, or rewritten so the multi-graph shows the real max (as in image1)?
2. Looking at the max value (that 731.37m), can I use that as an alert value? Actually, I would rather use the current value, so if the current was > 1000 seconds or so, it would fire a critical and we would be notified.
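To make question 2 and the NULL case concrete, here is a minimal sketch of the kind of check being described. It is not the actual script: the function name `evaluate_lag`, the 1000-second default, and the Nagios-style exit codes are all assumptions for illustration.

```python
# Hypothetical helper, not the real mysql_check_slave script.
OK, CRITICAL = 0, 2  # Nagios-style exit codes (assumed convention)


def evaluate_lag(raw_value, critical_seconds=1000):
    """Map a seconds-behind-master reading to (exit_code, message)."""
    # A stopped slave reports NULL, so treat NULL or empty as critical.
    text = "" if raw_value is None else str(raw_value).strip()
    if text == "" or text.upper() == "NULL":
        return CRITICAL, "CRITICAL: slave stopped (seconds behind is NULL)"

    seconds = int(text)
    if seconds > critical_seconds:
        return CRITICAL, "CRITICAL: %d seconds behind master" % seconds
    return OK, "OK: %d seconds behind master" % seconds


# Example readings: a normal lag, a large lag, and a stopped slave.
for reading in ("731", "1500", "NULL"):
    print(evaluate_lag(reading))
```

The point of the sketch is that the check works on the current raw reading, not on the graphed max, so it would not be affected by how the graph is drawn.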
I did tweak the COMMAND: the default is 300 seconds, which wasn't much help, so I am using 40 seconds, which seems to be much more reliable. Thanks for any comments, help, or advice.
You're being bitten by RRDs. Basically, RRDs always normalize data over time, and the resolution also drops as the data ages. By default you'll have a datapoint every 5 minutes. That gets less detailed (it throws out data) after 50 hours, if I recall correctly. So if you look back a week later, you'll no longer have a datapoint every 5 minutes; you'll have a datapoint every hour, so the averaged numbers change.
Also, when you zoom in, you're not zooming in the way you expect. You're regenerating the graph for a different range of time, so again the details change and the averages change. It gets more complicated, but read the RRD website and search these forums for RRD topics and you'll find MANY discussions of the oddities.
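A rough sketch of why averaging flattens a spike, using made-up numbers: the sample values and the helper `consolidate_average` are hypothetical, and this only imitates the idea of averaging 5-minute points into hourly ones rather than calling any RRD code.

```python
def consolidate_average(samples, points_per_bucket):
    """Average consecutive groups of samples into one coarser point each."""
    buckets = []
    for start in range(0, len(samples), points_per_bucket):
        group = samples[start:start + points_per_bucket]
        buckets.append(sum(group) / len(group))
    return buckets


# Hypothetical data: two hours of 5-minute readings with one short spike.
five_minute = [0.0] * 24
five_minute[5] = 350.0
five_minute[6] = 700.0
five_minute[7] = 350.0

# Twelve 5-minute points make up one hourly point.
hourly = consolidate_average(five_minute, 12)

print("peak at 5-minute resolution:", max(five_minute))
print("peak at hourly resolution:", round(max(hourly), 2))
```

The spike is still there in the hourly data, but its peak is much lower because it is averaged with the quiet readings around it. That is the same kind of effect as a graph showing a smaller max than the raw value you saw on the command line.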
LEPP Computer Group