[rrd-users] More observations and questions on COUNTER

Sat Oct 23 17:24:54 CEST 2010

On 10/23/2010 1:06 AM, Simon Hobson wrote:
> Philip Peake wrote:
>
>> The fix I used was one suggested by Alex van den Bogaerdt, which was
>> essentially to insert a NaN to indicate that the counter is now in an
>> unknown state, followed by a zero, so that the next (real) value will be
>> represented correctly.
>>
>> This worked for my tests, so I deployed the fix.
>>
>> Now, I use a DB which actually holds one month (4 weeks) of data, with a
>> 30 second sampling period.
>> I use this DB to display three graphs:
>>
>> Last month
>> Last day
>> Last hour
>>
>> I do this by just setting the start to the appropriate value from <now>.
>>
>> Strangely, I have noticed that this fix doesn't always work.
>>
>> What I see if I look back over the data is a sequence looking like this
>> (simplified, with three data sources):
>>
>> T1    1000    1004    997
>> T2    1010    1020    1003
>> T3    NaN     NaN     NaN
>> T4    NaN     NaN     NaN
>> T5    0       0       0
>> T6    0       0       0
>> T7    0       0       0
>> T8    4E6     4E6     4E6
>> T9    15      12      10
>> No spike is displayed on the month or day graphs, but one is displayed
>> on the hour graph.
>>
>> Two odd things (to me) - Why is rrd still recording a counter roll-over
>> value?
>> Why does the same data show a spike on one graph, but not on the other two?
>>
>> I suppose the third question might be why isn't the roll-over recorded
>> with the first zero rather than the first non-zero?
> I suspect all three questions may be related. There is a distinct but 
> small time period where your updates may get out of sync. If an 
> update occurs between you writing NaN and zero, then your zero won't 
> work and the previous count doesn't get properly reset. In fact, 
> depending on the timing, it's entirely possible an update is missing 
> because it failed due to "time standing still" (ie two updates with 
> the same timestamp).
>
> In fact, if you are updating every 30 seconds, there is a 1 in 15 
> chance of a clash. Your reset script will take two seconds of time in 
> the rrd file to do its work (ie update to NaN at time t, update to 0 
> at time t+1second). Thus two seconds of time are not available (in a 
> 30 second window) for your script to update the file.
>
> I'd be inclined to add some logging statement to your scripts to log 
> the actual update statements they are using to a text file - that 
> way, when you next see the problem occur, you can refer to the text 
> file and see what actual updates were done - and replay them into a 
> fresh file a step at a time while monitoring the result.
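
A minimal sketch of the kind of logging suggested above, assuming the collection script is (or could be) written in Python. The function names, file names and sample values are hypothetical; the only rrdtool detail relied on is the documented `rrdtool update file timestamp:value[:value...]` form, with `U` for an unknown value. Nothing here actually runs rrdtool; it only builds the command and records it.

```python
import time

def format_update(rrd_path, timestamp, values):
    """Build the argument list for one 'rrdtool update' call.

    None stands for an unknown reading and is written as 'U'.
    """
    fields = ["U" if v is None else str(v) for v in values]
    return ["rrdtool", "update", rrd_path, f"{timestamp}:{':'.join(fields)}"]

def log_update(log_path, argv):
    """Append the exact command line to a text log, one update per line.

    Each line is prefixed with the wall-clock time at which it was logged,
    so out-of-order or duplicate timestamps can be spotted later.
    """
    with open(log_path, "a") as log:
        log.write(f"{int(time.time())} {' '.join(argv)}\n")

if __name__ == "__main__":
    # Hypothetical file names and readings, for illustration only.
    argv = format_update("traffic.rrd", 1287847470, [1000, None, 997])
    log_update("rrd_updates.log", argv)
    print(" ".join(argv))
```

Replaying the logged lines one at a time into a fresh file, and dumping it after each step, should show which update produces the spike.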

Simon, the script forces log data onto 30 second boundaries; I use
calculated times, not "now".
This includes the NaN value written when a data source disappears, and the
zero values entered every 30 seconds until the source comes back
online.

I have dumped the DB values, and see exactly what I expect - increasing
values, a NaN (well, I actually got two NaNs), then a string of zeros
followed by a HUGE number (rrd thinks a counter rollover occurred, but
only when it sees a non-zero data value?), followed by data source
readings.
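
For reference, a rough sketch of how a per-second rate is derived from two COUNTER readings, going by the RRDtool documentation's description of overflow handling (a negative difference is treated as a wrap at the 32-bit or 64-bit border). The function name and the sample numbers are made up for illustration; this is not RRDtool's own code. Under this model, a zero followed by a raw counter reading needs no wrap to produce a huge rate, because the difference is simply the whole counter value. That may be one way to read the 4E6 at T8.

```python
import math

def counter_rate(prev, cur, seconds):
    """Per-second rate between two COUNTER readings.

    Returns NaN when either reading is unknown (None or NaN).
    """
    if prev is None or cur is None or math.isnan(prev) or math.isnan(cur):
        return math.nan
    delta = cur - prev
    if delta < 0:
        delta += 2**32            # assume the counter wrapped at 32 bits
    if delta < 0:
        delta += 2**64 - 2**32    # still negative: assume a 64-bit wrap
    return delta / seconds

if __name__ == "__main__":
    print(counter_rate(1000, 1300, 30))          # ordinary increase
    print(counter_rate(4_294_967_000, 4, 30))    # wrap at the 32-bit border
    print(counter_rate(0, 120_000_000, 30))      # stored zero, then a raw reading
    print(counter_rate(None, 0, 30))             # unknown previous value
```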