[rrd-users] More observations and questions on COUNTER
philip at vogon.net
Sat Oct 23 17:24:54 CEST 2010
On 10/23/2010 1:06 AM, Simon Hobson wrote:
> Philip Peake wrote:
>> The fix I used was one suggested by Alex van den Bogaerdt, which was
>> essentially to insert a NaN to indicate that the counter is now in an
>> unknown state, followed by a zero, so that the next (real) value will be
>> represented correctly.
>> This worked for my tests, so I deployed the fix.
>> Now, I use a DB which actually holds one month (4 weeks) of data, with a
>> 30 second sampling period.
>> I use this DB to display three graphs:
>> Last month
>> Last day
>> Last hour
>> I do this by just setting the start to the appropriate value from <now>.
>> Strangely, I have noticed that this fix doesn't always work.
>> What I see if I look back over the data is a sequence looking like this
>> (simplified, with three data sources):
>> T1  1000  1004   997
>> T2  1010  1020  1003
>> T3   NaN   NaN   NaN
>> T4   NaN   NaN   NaN
>> T5     0     0     0
>> T6     0     0     0
>> T7     0     0     0
>> T8   4E6   4E6   4E6
>> T9    15    12    10
>> No spike is displayed on the month or day graphs, but one is displayed
>> on the hour graph.
>> Two odd things (to me) - Why is rrd still recording a counter roll-over?
>> Why does the same data show a spike on one graph, but not on the other two?
>> I suppose the third question might be why isn't the roll-over recorded
>> with the first zero rather than the first non-zero?
> I suspect all three questions may be related. There is a distinct but
> small time period where your updates may get out of sync. If an
> update occurs between you writing NaN and zero, then your zero won't
> work and the previous count doesn't get properly reset. In fact,
> depending on the timing, it's entirely possible an update is missing
> because it failed due to "time standing still" (ie two updates with
> the same timestamp).
> In fact, if you are updating every 30 seconds, there is a 1 in 15
> chance of a clash. Your reset script will take two seconds of time in
> the rrd file to do its work (ie update to NaN at time t, update to 0
> at time t+1second). Thus two seconds of time are not available (in a
> 30 second window) for your script to update the file.
> I'd be inclined to add some logging statement to your scripts to log
> the actual update statements they are using to a text file - that
> way, when you next see the problem occur, you can refer to the text
> file and see what actual updates were done - and replay them into a
> fresh file a step at a time while monitoring the result.
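A minimal sketch of the logging idea above, in Python. The function names and the log layout are hypothetical; the only rrdtool-specific assumptions are the "timestamp:value:value" argument form of "rrdtool update" and "U" as the marker for an unknown value.

```python
def format_update(timestamp, values):
    """Build the 'timestamp:v1:v2:...' argument for 'rrdtool update'.

    None is written as 'U', rrdtool's marker for an unknown value.
    """
    fields = ["U" if v is None else str(v) for v in values]
    return ":".join([str(int(timestamp))] + fields)


def log_update(log_path, rrd_path, timestamp, values):
    """Append the exact update argument to a text log and return it.

    log_path and rrd_path are hypothetical file names chosen by the caller.
    """
    arg = format_update(timestamp, values)
    with open(log_path, "a") as log:
        log.write(f"{rrd_path} {arg}\n")
    return arg
```

Each logged line can later be fed by hand to "rrdtool update" against a freshly created file, one line at a time, dumping the file after each step.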
Simon, the script forces log data on 30 second boundaries; I use
calculated times, not "now".
This applies to the NaN value written when a data source disappears, and
to the zero values entered every 30 seconds until the source comes back.
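The update pattern just described (calculated 30 second boundaries, one unknown, then zeros) could be sketched as below. This is an illustration only, not the actual script; the function names and the three-source default are assumptions.

```python
STEP = 30  # seconds; the sampling period mentioned in this thread


def align(timestamp, step=STEP):
    """Round a Unix timestamp down to the previous step boundary."""
    ts = int(timestamp)
    return ts - ts % step


def outage_updates(first_missed, n_steps, n_sources=3, step=STEP):
    """Update strings for an outage lasting n_steps samples.

    The first row marks every source unknown ('U'); each later row
    writes zeros, one row per step boundary.
    """
    start = align(first_missed, step)
    rows = []
    for i in range(n_steps):
        value = "U" if i == 0 else "0"
        rows.append(":".join([str(start + i * step)] + [value] * n_sources))
    return rows
```

Because every timestamp is derived from a boundary rather than from the wall clock, two updates generated this way cannot share a timestamp.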
I have dumped the DB values, and see exactly what I expect - increasing
values, a NaN (well, I actually got two NaNs), then a string of zeros
followed by a HUGE number (rrd thinks a counter rollover occurred, but
only when it sees a non-zero data value ????) followed by data source
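A simplified model of how a COUNTER rate is derived may help when reading the dump. It is a sketch, not rrdtool's code: it ignores heartbeat and interpolation, assumes one reading per step, and the raw readings in the usage note are invented to resemble the table quoted above.

```python
def counter_rates(samples, step=30):
    """Per-second rates for a list of raw counter readings.

    None stands for an unknown reading. A rate is only defined when
    both the current and the previous reading are known.
    """
    rates = []
    prev = None
    for cur in samples:
        if prev is None or cur is None:
            rate = None
        else:
            delta = cur - prev
            if delta < 0:
                # Reading went down: assume a 32-bit wrap first,
                # then a 64-bit wrap if the result is still negative.
                delta += 2**32
                if delta < 0:
                    delta += 2**64 - 2**32
            rate = delta / step
        rates.append(rate)
        prev = cur
    return rates
```

With the hypothetical readings [30000, 30300, None, 0, 0, 0, 120000000, 120000450] the model gives two unknown rates for the single unknown reading, then zeros, then 4000000.0 at the first non-zero reading, then 15.0. In this model the large value needs no wrap at all: it is simply the difference between the inserted zero and whatever the counter really holds when the source returns, divided by 30.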