[rrd-developers] mmap considerations

Wed Jun 13 11:59:23 CEST 2007

On Wed, Jun 13, 2007 at 11:03:45AM +0200, Tobias Oetiker wrote:
>Hi Bernhard,
>
>the time bit is treacherous ... I HAVE to do other stuff ... alas
>... but I sneak out every now and then ...

sounds familiar..

>the reason rrd_create is 'critical' is not performance as such, but
>cache pollution ... by creating just a few rrd files cache gets
>thrown out of whack quite badly .. I have been running my
>performance.pl script which creates a bunch of rrd files and then
>updates them ... I found that the memory system takes quite a long
>time to figure out which data it should keep and which it should drop
>... a single rrd_create can have quite a long lasting negative
>effect ...

ok. I will switch rrd_create to the new accessors, sounds good?
>
>I assume the kernel prefers keeping a whole file in memory over
>just a few scattered blocks in many files ... (which makes sense in
>general, just not for rrdtool)
>
>the test case shows that the mmap code goes to about 20k updates
>per second when stuff is in cache while 1.2.24dev goes to 12k
>updates per second ...

Yes, this is about the same figure that i was seeing. rrdtool-1.3 is
about 80% faster for e.g. updates than rrdtool-1.2. Worst-case speedup
was around 30%. I consider this a feature :)

>
>I have not quite figured out the logic behind it all ...
>
>if you care to play as well, I have attached my testing code ...

I'll have a look if time permits (perhaps on sunday, we'll see).

>regarding the graphing part, the only bit where it accesses the
>filesystem directly is when writing out the png/pdf/svg/... file
>the rest of the interaction goes through rrd_fetch ...
>
>not sure about dropping the png file ... this only makes sense for
>people who create many pngs and not on demand ...

I assume that the created picture will soon be delivered through a
network-pipe to a browser on the client-side, at least usually. Thus
keeping it fully in cache should be beneficial, i think.
>
>dropping rrd_fetch (except the hot blocks) does make sense in my opinion
>since it stops cache pollution ... should be configurable for cases
>where the user knows that he is going to re-read the same rrd file
>twice in a row ...

I will look at the call-graph of rrd_graph. I suppose a command-line
option (--keep, -k or the like) to keep the whole RRD in hot cache
can be implemented for rrd_fetch() and rrd_graph(). Sounds ok?
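
A minimal sketch of how such a switch could behave, assuming Linux/glibc. The function name `release_after_read`, the `hot_bytes` parameter, and the idea that the "hot blocks" sit at the start of the file are all assumptions made for illustration; none of them is taken from the rrdtool sources.

```c
#define _GNU_SOURCE 1
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical post-read hook, not an rrdtool function.
 * 'hot_bytes' is the size of the leading region that should stay cached
 * (assumption: the hot blocks are the start of the file).
 * 'keep' mirrors the proposed --keep option.
 * Returns 0 on success or the positive error number from posix_fadvise(). */
static int release_after_read(int fd, off_t hot_bytes, int keep)
{
    assert(fd >= 0 && hot_bytes >= 0);
    if (keep)
        return 0;   /* caller expects to re-read the file soon */
    /* len 0 means "from hot_bytes to the end of the file" */
    return posix_fadvise(fd, hot_bytes, 0, POSIX_FADV_DONTNEED);
}
```

A caller would invoke this once after it has finished reading, passing `keep = 1` when the user asked for the whole RRD to stay in cache.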

cheers,
Bernhard
-- 
>
>
>Today Bernhard Fischer wrote:
>
>> On Tue, Jun 12, 2007 at 11:20:39PM +0200, Tobias Oetiker wrote:
>> >Hi Bernhard,
>> [I'm changing the subject a little bit; Please feel free to move this
>> discussion to the list, if you prefer. Discussing this a little bit in
>> private mode is of course fine with me. Nice that you seem to have a
>> little bit time to work again on rrdtool :)]
>> >
>> >looking at the new code, I find that configure does not check for
>> >posix_fadvise if mmap code is active ...
>> >
>> >the effect is that in rrd_create no fdatasync and dontneed happens,
>> >which makes rrd_create faster but also evicts all previously cached
>> >data quite effectively ...
>> >
>> >is there a reason to not check for posix_fadvise if mmaping is
>> >active ?
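
For reference, the "fdatasync and dontneed" step mentioned above could look roughly like the sketch below, assuming Linux/glibc. `flush_and_drop` is a hypothetical helper name and is not claimed to match what rrd_create actually does.

```c
#define _GNU_SOURCE 1
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper, not part of rrdtool: flush a freshly written
 * file and hint that its pages are no longer needed in the page cache.
 * Returns 0 on success, -1 if the flush fails, or the positive error
 * number reported by posix_fadvise(). */
static int flush_and_drop(int fd)
{
    assert(fd >= 0);
    /* Dirty pages cannot be dropped, so write them out first. */
    if (fdatasync(fd) != 0)
        return -1;
    /* offset 0 and len 0 cover the whole file */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

The order matters: without the flush, the DONTNEED hint has little effect on pages that are still dirty.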
>>
>> My way of thinking is that for mmap we generally do not want to fadvise
>> but madvise, where possible. It is correct that rrd_create in its
>> current incarnation should use fadvise (since it doesn't [yet] use
>> mmap).
>>
>> I suggest we do one of these:
>> 1) rewrite rrd_create to not use filps (FILE*) but FD/mmap based I/O
>>    Up until now i didn't implement this since in my POV rrd_create is
>>    not really performance critical.
>>    The advantage is that potentially this would use less memory than the
>>    current implementation.
>>    open the new file, with O_CREAT (see rrd_resize as an example how to
>>    rrd_open() a new file in the new accessor impl).
>>    This is IMHO the preferred way to go but a bit more work than the
>>    alternative below.
>> 2) leave rrd_create alone and check for fadvise also for the mmap case.
>>    Change all HAVE_POSIX_FADVISE blocks in sources which provide the
>>    new accessor methods so that fadvise is not called there. The net
>>    effect is that
>>    a) functions not yet updated (create, graph come to mind) use fadvise
>>    b) updated functions (update, resize, etc) do *not* call fadvise but
>>    use only madvise.
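
The split between the two advice calls can be sketched as below, assuming Linux/glibc. The struct `sketch_file` and the function `advise_random` are hypothetical and only stand in for whatever the new accessors really use; the RANDOM hint is chosen purely for illustration. In real code each branch would presumably sit behind a configure-time guard such as the HAVE_POSIX_FADVISE macro mentioned above.

```c
#define _GNU_SOURCE 1
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical wrapper around an open file; rrdtool's real accessor
 * struct may look different. */
struct sketch_file {
    int    fd;   /* always valid */
    void  *map;  /* NULL when the file is not mmap()ed */
    size_t len;  /* file length in bytes */
};

/* Tell the kernel that access will be random.
 * Mapped files get madvise(); fd-based I/O gets posix_fadvise().
 * Returns 0 on success, non-zero on failure. */
static int advise_random(const struct sketch_file *f)
{
    assert(f != NULL && f->fd >= 0);
    if (f->map != NULL)
        return madvise(f->map, f->len, MADV_RANDOM);
    return posix_fadvise(f->fd, 0, 0, POSIX_FADV_RANDOM);
}
```

Keeping the decision in one place means functions that already moved to mmap never touch fadvise, while the ones still using plain file I/O keep it.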
>>
>> thoughts?
>> PS: while i think that rrd_create is not too performance critical,
>> rrd_graph certainly is since it is potentially called very often (for
>> obvious reasons, i.e. users :). So, while switching rrd_create to the
>> new accessor functions would be nice to have, updating rrd_graph is
>> overall more beneficial, i assume.
>>
>> cheers,
>> Bernhard
>>
>>
>
>-- 
>Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten
>http://it.oetiker.ch tobi at oetiker.ch ++41 62 213 9902
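>#define _GNU_SOURCE 1
>#include <stdio.h>
>#include <unistd.h>
>#include <fcntl.h>
>/* Ask the kernel to drop the cached pages of the file named in argv[1]. */
>int main(int argc, char *argv[]) {
>    int fd;
>    if (argc < 2) {
>        fprintf(stderr, "usage: %s FILE\n", argv[0]);
>        return 1;
>    }
>    fd = open(argv[1], O_RDONLY);
>    if (fd < 0) {
>        perror(argv[1]);
>        return 1;
>    }
>    posix_fadvise64(fd, 0, 0, POSIX_FADV_DONTNEED); /* len 0: whole file */
>    close(fd);
>    return 0;
>}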

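>#include <stdio.h>
>#include <stdlib.h>
>#include <fcntl.h>
>#include <sys/types.h>
>#include <sys/stat.h>
>#include <unistd.h>
>#include <sys/mman.h>
>
>/* Print the indices of the pages of argv[1] that are in the page cache. */
>int main(int argc, char *argv[]) {
>   int fd;
>   struct stat file_stat;
>   void *file_mmap;
>   unsigned char *mincore_vec;
>   size_t page_size = getpagesize();
>   size_t page_count;
>   size_t page_index;
>   if (argc < 2) {
>      fprintf(stderr, "usage: %s FILE\n", argv[0]);
>      return 1;
>   }
>   fd = open(argv[1], O_RDONLY);
>   if (fd < 0 || fstat(fd, &file_stat) != 0) {
>      perror(argv[1]);
>      return 1;
>   }
>   /* one vector entry per page, rounding the file size up */
>   page_count = (file_stat.st_size+page_size-1)/page_size;
>   file_mmap = mmap((void *)0, file_stat.st_size, PROT_NONE, MAP_SHARED, fd, 0);
>   mincore_vec = calloc(1, page_count);
>   if (file_mmap == MAP_FAILED || mincore_vec == NULL) {
>      perror("mmap/calloc");
>      return 1;
>   }
>   mincore(file_mmap, file_stat.st_size, mincore_vec);
>   printf("Cached Blocks of %s: ",argv[1]);
>   /* only the first page_count entries of the vector are valid */
>   for (page_index = 0; page_index < page_count; page_index++) {
>      if (mincore_vec[page_index]&1) {
>         printf("%lu ", (unsigned long)page_index);
>      }
>   }
>   printf("\n");
>   free(mincore_vec);
>   munmap(file_mmap, file_stat.st_size);
>   close(fd);
>   return 0;
>}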

>#! /usr/bin/perl
>#
># $Id:$
>#
># Created By Tobi Oetiker <tobi at oetiker.ch>
># Date 2006-10-27
>#
>#makes program work AFTER install
>
>use lib qw( ../bindings/perl-shared/blib/lib ../bindings/perl-shared/blib/arch );
>
>print <<NOTE;
>
>RRDtool Performance Tester
>--------------------------
>Running on $RRDs::VERSION;
>
>RRDtool update performance is ultimately disk-bound. Since very little data
>does actually get written to disk in a single update, the performance
>is highly dependent on the cache situation in your machine.
>
>This test tries to cater for this. It works like this:
>
>1) Create RRD file tree
>
>2) Update RRD files several times in a row.
>
>NOTE
>
>use strict;
>use Time::HiRes qw(time);
>use RRDs;
>use IO::File;
>use Time::HiRes qw( usleep );
>
>sub create($$){
>  my $file = shift;
>  my $time = shift;
>  my $start = time; #since we loaded HiRes
>  RRDs::create  ( $file.".rrd", "-b$time", qw(
>			-s300                        
>		        DS:in:GAUGE:400:U:U
>		        DS:out:GAUGE:400:U:U
>		        RRA:AVERAGE:0.5:1:600
>		        RRA:AVERAGE:0.5:6:600
>		        RRA:MAX:0.5:6:600
>		        RRA:AVERAGE:0.5:24:600
>		        RRA:MAX:0.5:24:600
>		        RRA:AVERAGE:0.5:144:600
>		        RRA:MAX:0.5:144:600
>		));
>   my $total = time - $start;
>   my $error =  RRDs::error;
>   die $error if $error;
>   return $total;
>}
>
>sub update($$){
>  my $file = shift;
>  my $time = shift;
>  my $in = rand(1000);
>  my $out = rand(1000);
>  my $start = time;
>  my $ret = RRDs::updatev($file.".rrd", $time.":$in:$out");
>  my $total = time - $start;
>  my $error =  RRDs::error;
>  die $error if $error;
>  return $total;
>}
>
>sub tune($){
>  my $file = shift;
>  my $start = time;
>  RRDs::tune ($file.".rrd", "-a","in:U","-a","out:U","-d","in:GAUGE","-d","out:GAUGE");
>  my $total = time - $start;
>  my $error =  RRDs::error;
>  die $error if $error;
>  return $total;
>}
>
>sub infofetch($){
>  my $file = shift;
>  my $start = time;
>  my $info = RRDs::info ($file.".rrd");
>  my $error =  RRDs::error;
>  die $error if $error;
>  my $lasttime =  $info->{last_update} - $info->{last_update} % $info->{step};           
>  my $fetch = RRDs::fetch ($file.".rrd",'AVERAGE','-s',$lasttime-1,'-e',$lasttime);
>  my $total = time - $start;
>  $error =  RRDs::error;
>  die $error if $error;
>  return $total;
>}
>
>sub stddev ($$$){ #http://en.wikipedia.org/wiki/Standard_deviation
>  my $sum = shift;
>  my $squaresum = shift;
>  my $count = shift;
>  return sqrt( 1 / $count * ( $squaresum - $sum*$sum / $count ))
>}
>
>sub makerrds($$$$){
>    my $count = shift;
>    my $total = shift;
>    my $list = shift;
>    my $time = shift;
>    my @files;
>    my $now = int(time);
>    for (1..$count){
>        my $id = sprintf ("%07d",$total);
>        $id =~ s/^(.)(.)(.)(.)(.)//;
>        push @$list, "$1/$2/$3/$4/$5/$id";    
>        -d "$1" or mkdir "$1";
>        -d "$1/$2" or mkdir "$1/$2";
>        -d "$1/$2/$3" or mkdir "$1/$2/$3";
>        -d "$1/$2/$3/$4" or mkdir "$1/$2/$3/$4";
>        -d "$1/$2/$3/$4/$5" or mkdir "$1/$2/$3/$4/$5";
>	push @files, $list->[$total];
>        create $list->[$total++],$time-2;
>	if ($now < int(time)){
>	  $now = int(time);
>	  print STDERR $count - $_," rrds to go. \r";
>        }
>    }
>    return $count;
>}
> 
>sub main (){
>    mkdir "db-$$" or die $!;
>    chdir "db-$$";
>
>    my $step = 10000; # number of rrds to create for every round
>    
>    my @path;
>    my $time=int(time);
>
>    my $tracksize = 0;
>    my $uppntr = 0;
>
>    
>    my %squaresum = ( cr => 0, up => 0 );
>    my %sum = ( cr => 0, up => 0 );
>    my %count =( cr => 0, up => 0 );
>
>    my $printtime = time;
>    my %step;
>    for (qw(1 6 24 144)){
>          $step{$_} = int($time / 300 / $_);
>    }
>    
>    for (0..2) {
>        # enhance the track
>        $time += 300;
>        $tracksize += makerrds $step,$tracksize,\@path,$time;            
>        # run benchmark
>    
>        for (0..50){
>      	    $time += 300;
>            my $count = 0;
>            my $sum = 0;
>            my $squaresum = 0;
>            my $prefix = "";
>            for (qw(1 6 24 144)){
>                if (int($time / 300 / $_) > $step{$_})  {
>                    $prefix .= "$_  ";
>                    $step{$_} = int($time / 300 / $_);
>                 }
>                 else {
>                    $prefix .= (" " x length("$_")) . "  ";
>                 }   
>            }
>            my $now = int(time);
>            for (my $i = 0; $i<$tracksize;$i ++){
>               my $ntime = int(time);
>               if ($now < $ntime){
>                   printf STDERR "$prefix %7d \r",$i;
>                   $now = $ntime;
>               }
>               my $elapsed = update($path[$i],$time); 
>               $sum += $elapsed;
>               $squaresum += $elapsed**2;
>               $count++;
>            };
>            my $ups = $count/$sum;
>            my $sdv = stddev($sum,$squaresum,$count);
>            printf STDERR "$prefix %7d %6.0f Up/s (%6.5f sdv)\n",$count,$ups,$sdv;
>        }
>	print STDERR "\n";
>    }
>}
>
>main;