[smokeping-users] Re: Scalability

Dan Tucny dan at tucny.com
Sat Jun 22 03:29:19 MEST 2002


See my reply to the 'question on probe timing' thread about this;
however, I'll go into some more detail specific to your problems here...

The theoretical limit with default fping settings would be 600 hosts in
5 minutes.
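
As a rough sketch of where that figure comes from (assuming one packet is sent every 25ms, 20 pings per host as in the -C 20 tests in this message, and a 5 minute step; the function name is just made up for illustration):

```python
# Assumed inputs, taken from the figures quoted in this thread.
STEP_MS = 5 * 60 * 1000   # one 5 minute polling interval, in milliseconds
PINGS_PER_HOST = 20       # matches the -C 20 used in the tests in this message


def theoretical_host_limit(gap_ms, pings=PINGS_PER_HOST, step_ms=STEP_MS):
    """Hosts that fit in one step if fping sends one packet every gap_ms.

    This ignores reply handling, timeouts and Smokeping's own processing,
    so it is an upper bound and not a prediction.
    """
    return step_ms / (pings * gap_ms)


print(theoretical_host_limit(25.0))   # default 25ms gap between packets
print(theoretical_host_limit(12.5))   # gap halved with -i 12.5
```

Each host costs 20 x 25ms = 0.5 seconds of sending time, so 300 seconds covers 600 hosts; halving the gap doubles that.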

Failures shouldn't affect timing, as there is a 500ms timeout on pings,
so with 1 second between pings to the same host that leaves a whole
half second of spare time between the packet being flagged MIA and the
host next being sampled...

In reality I'd treat something closer to 500 hosts in 5 mins as the limit
for fping, based on some very simple tests I have done, i.e.

fping -C 20 -q -s <40 reachable, 40 unreachable hosts>

      80 targets
      40 alive
      40 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
    1600 ICMP Echos sent
     800 ICMP Echo Replies received
      53 other ICMP received

 0.38 ms (min round trip time)
 0.45 ms (avg round trip time)
 1.94 ms (max round trip time)
       48.886 sec (elapsed real time)

fping -C 20 -q -s <80 reachable hosts>

      80 targets
      80 alive
       0 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
    1600 ICMP Echos sent
    1600 ICMP Echo Replies received
       0 other ICMP received

 0.35 ms (min round trip time)
 0.42 ms (avg round trip time)
 1.95 ms (max round trip time)
       48.856 sec (elapsed real time)

fping -C 20 -q -s <80 unreachable hosts>

      80 targets
       0 alive
      80 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
    1600 ICMP Echos sent
       0 ICMP Echo Replies received
      89 other ICMP received

 0.00 ms (min round trip time)
 0.00 ms (avg round trip time)
 0.00 ms (max round trip time)
       48.951 sec (elapsed real time)

To increase the sample rate, I've also tried fping with, for example, -i
12.5 to reduce the per packet wait time from the default 25ms to 12.5ms,
which should result in a theoretical limit of 1200 hosts in 5 minutes.
However, from the results I've obtained here, it looks closer to 750
hosts...

fping -C 20 -q -s -i 12.5 <40 reachable, 40 unreachable hosts>

      80 targets
      40 alive
      40 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
    1600 ICMP Echos sent
     800 ICMP Echo Replies received
      36 other ICMP received

 0.35 ms (min round trip time)
 0.42 ms (avg round trip time)
 1.65 ms (max round trip time)
       32.863 sec (elapsed real time)

fping -C 20 -q -s -i 12.5 <80 reachable hosts>

      80 targets
      80 alive
       0 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
    1600 ICMP Echos sent
    1600 ICMP Echo Replies received
       0 other ICMP received

 0.36 ms (min round trip time)
 0.41 ms (avg round trip time)
 2.05 ms (max round trip time)
       32.866 sec (elapsed real time)

fping -C 20 -q -s -i 12.5 <80 unreachable hosts>

      80 targets
       0 alive
      80 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
    1600 ICMP Echos sent
       0 ICMP Echo Replies received
      62 other ICMP received

 0.00 ms (min round trip time)
 0.00 ms (avg round trip time)
 0.00 ms (max round trip time)
       32.958 sec (elapsed real time)
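
As a rough cross-check of the 500 and 750 host estimates, the elapsed times above can be scaled up to a 5 minute step. This is only a sketch: it assumes the runtime grows linearly with the number of hosts, which these 80-host tests don't actually prove, and the names used are made up for illustration.

```python
STEP_S = 300.0       # one 5 minute polling interval, in seconds
HOSTS_PER_RUN = 80   # every test run above used 80 targets

# Elapsed real time in seconds, copied from the fping -s summaries above.
elapsed = {
    "default, 40 up / 40 down": 48.886,
    "default, 80 up": 48.856,
    "default, 80 down": 48.951,
    "-i 12.5, 40 up / 40 down": 32.863,
    "-i 12.5, 80 up": 32.866,
    "-i 12.5, 80 down": 32.958,
}


def hosts_per_step(elapsed_s, hosts=HOSTS_PER_RUN, step_s=STEP_S):
    """Extrapolate linearly: hosts that would fit in one step at this rate."""
    return hosts * step_s / elapsed_s


for label, secs in elapsed.items():
    print(f"{label:26s} {hosts_per_step(secs):6.0f} hosts per 5 minutes")
```

The spread between the reachable and unreachable runs is well under one percent in both sets, which is the point of the comparison: unreachable hosts barely change the runtime.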

The debug output you have below is due to fping always reporting errors,
even when run with -q. This shouldn't affect the runtime of fping itself,
though.

This is, of course, all purely looking at fping. There is also the time
taken for Smokeping to process these results to take into
consideration, though I don't have any timings for that...

I hope this is helpful to you...

Dan

-----Original Message-----
From: smokeping-users-bounce at list.ee.ethz.ch
[mailto:smokeping-users-bounce at list.ee.ethz.ch] On Behalf Of Marc Powell
Sent: 20 June 2002 01:30
To: Tobias Oetiker
Cc: Smokeping
Subject: [smokeping-users] Re: Scalability

Sure thing. Here is what I have done: I created a test smokeping binary
that points to my original config file with 546 hosts on this particular
data collector. I ran it with -debug and -nodaemon (I think debug
implies nodaemon, but I wanted to cover all bases).
 
# [smokep at dc2 ~/bin]date ; ./smokeping.test -debug -nodaemon ; date

Wed Jun 19 19:20:10 CDT 2002
### fping seems to report in 1 miliseconds
Launched successfully
FPing: probing 546 targets
Wed Jun 19 19:28:11 CDT 2002
 
This 8 minute duration seems to be fairly consistent, at least right now
;)
 
Here's a snippet of truss about 4 minutes into the run:
 
[smokep at dc2 ~]$ date
Wed Jun 19 19:23:49 CDT 2002
[smokep at dc2 ~]$ truss -fea -p 14837
14837:  psargs: /usr/local/bin/perl -w ./smokeping.test -debug -nodaemon
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 70
14837:  read(7, 0x004E380C, 5120)       (sleeping...)
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 69
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 70
14837:  read(7, 0x004E380C, 5120)       (sleeping...)
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 70
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 74
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 70
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 75
14837:  read(7, 0x004E380C, 5120)       (sleeping...)
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 70
14837:  read(7, 0x004E380C, 5120)       (sleeping...)
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 18
14837:  read(7, "   f r o m  ", 5120)                   = 6
14837:  read(7, " 1 0 . 5 5 . 0 . 1 1", 5120)           = 10
14837:  read(7, "   f o r   I C M P   E c".., 5120)     = 23
14837:  read(7, " 1 7 2 . 3 1 . 5 6 . 2", 5120)         = 11
14837:  read(7, "\n", 5120)                             = 1
14837:  read(7, 0x004E380C, 5120)       (sleeping...)
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 70
14837:  read(7, 0x004E380C, 5120)       (sleeping...)
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 70
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 74
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 70
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 75
14837:  read(7, 0x004E380C, 5120)       (sleeping...)
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 70
14837:  read(7, 0x004E380C, 5120)       (sleeping...)
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 69
14837:  read(7, " I C M P   T i m e   E x".., 5120)     = 70
14837:  read(7, 0x004E380C, 5120)       (sleeping...)
^C[smokep at dc2 ~]$ date
Wed Jun 19 19:24:44 CDT 2002

If there is anything else that I can provide that would be of
assistance, please don't hesitate to let me know.
 
Thanks,
 
Marc

	-----Original Message----- 
	From: Tobias Oetiker [mailto:oetiker at ee.ethz.ch] 
	Sent: Wed 6/19/2002 5:26 PM 
	To: Marc Powell 
	Cc: Smokeping 
	Subject: Re: [smokeping-users] Re: Scalability
	
	

	Yesterday Marc Powell wrote: 

	> The only major problem I am having is that I see gaps in the graphs
	> (10-15 minutes) for those regions with relatively high numbers of hosts
	> down (20-30). We're monitoring schools so it's the off season here in
	> the US and the routers fluctuate depending on what maintenance is going
	> on at the schools, whether the janitor has spilt his coffee in the
	> router, etc... I am attributing the gaps to the slower response time for
	> ICMP UNREACHABLE's from fping, which lengthens the overall time it takes
	> before smokeping spawns the next run to 10-15 minutes or longer. Since
	> smokeping appears to wait until the fping process terminates before
	> writing to the RRDs or spawning the next fping process, the gaps are
	> appearing for all hosts in a region. To minimize the number of hosts
	> affected by this problem, I have just implemented unique configurations
	> for each alphabetical grouping per region so that I can spawn a
	> smokeping daemon for each grouping as opposed to each region (i.e. 5
	> smokeping processes per data collector). As a result, I have a feature
	> request or two to make things easier:

	try running smokeping by hand, at least in theory it should ping
	ALL the hosts in your config in parallel. The time it will wait for
	a 'lost' packet is about 1 second at most so this means in theory a
	fping run is over in 20 seconds regardless of the number of
	machines involved. Now there is a small gap between each icmp
	packet sent out from fping, so there is an impact per machine but
	it should not at all depend on how long the machine has to answer
	... this after all is the whole motivation behind fping ...

	>       1) Add a pidfile directive to either complement or replace
	> piddir. Currently, it is necessary to either create a directory for each
	> pid file specifically or remove the pidfile before starting the next
	> smokeping process.

	running multiple smokeping processes is not the solution ... if 
	fping has a bug, we will fix fping ... 

	>       2) The ability to INCLUDE external files within a config file.
	> This should help cut down on the number of unique files I'm having to
	> create.

	this is already there ... check the documentation on 
	ISG::ParseConfig 

	cheers 
	tobi 

	-- 
	 ______    __   _ 
	/_  __/_  / /  (_) Oetiker, OETIKER+PARTNER AG, Gallusstrasse 25
	 / // _ \/ _ \/ / CH-4600 Olten, phoneto:+41(0)62 213 9909
	/_/ \.__/_.__/_/ tobi at oetiker.ch http://google.com/search?q=tobi



--
Unsubscribe mailto:smokeping-users-request at list.ee.ethz.ch?subject=unsubscribe
Help        mailto:smokeping-users-request at list.ee.ethz.ch?subject=help
Archive     http://www.ee.ethz.ch/~slist/smokeping-users
WebAdmin    http://www.ee.ethz.ch/~slist/lsg2.cgi

