From tobi at oetiker.ch  Mon Feb  3 11:08:32 2014
From: tobi at oetiker.ch (Tobias Oetiker)
Date: Mon, 3 Feb 2014 11:08:32 +0100 (CET)
Subject: [smokeping-users] mod_fcgid: HTTP request length xxx (so far)
 exceeds MaxRequestLen (yyyy)
In-Reply-To: <CAHYeK0ft9=+LBs6ibTveK0x36DDyN6ZkfEcGV=EjWwr8xoT5bg@mail.gmail.com>
References: <CAHYeK0ft9=+LBs6ibTveK0x36DDyN6ZkfEcGV=EjWwr8xoT5bg@mail.gmail.com>
Message-ID: <alpine.DEB.2.02.1402031105590.13918@froburg.oetiker.ch>

Hi Paul,

Jan 27 Paul Mansfield wrote:

> We have had a significant number of problems with smokeping client
> consuming all the memory on a server, we've had these grow to 12GB in
> the worst case!
>
> There appear to be two problems, probably related.
>
> * One is that the slave can't communicate with the master for a while
> and the smokeping slave cache builds up a backlog of work, and then it
> is unable to send the data.
>
> * The other problem is that the slave process, e.g. usually FPing,
> gets wedged and becomes larger and larger.
>
>
> the clue to the first problem are lines like this in the Apache error log:
>
> [Mon Jan 27 12:12:42 2014] [warn] [client w.x.y.z] mod_fcgid: HTTP
> request length 394304 (so far) exceeds MaxRequestLen (393216)
>
> To resolve this, you have to kill -9 the smokeping slave, and any child
> processes like FPing, then remove everything in the ~smokeping/cache
> directory. Then it's safe to restart.
>
>
> It's probably worth setting a larger number than the default in apache
> config anyway to allow slaves to catch up... e.g.
>     FcgidMaxRequestLen 131072
>
> but if you find anything larger than that it probably means something
> else has gone wrong.

hmmm, never seen such a problem ... 12GB is a LOT ... for how
long has the slave been 'offline' when this happens? and how large
is a normal 'submission' by the slave?

cheers
tobi
-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900


From paul.mansfield+smokeping at grapeshot.co.uk  Mon Feb  3 11:29:51 2014
From: paul.mansfield+smokeping at grapeshot.co.uk (Paul Mansfield)
Date: Mon, 3 Feb 2014 10:29:51 +0000
Subject: [smokeping-users] mod_fcgid: HTTP request length xxx (so far)
 exceeds MaxRequestLen (yyyy)
In-Reply-To: <alpine.DEB.2.02.1402031105590.13918@froburg.oetiker.ch>
References: <CAHYeK0ft9=+LBs6ibTveK0x36DDyN6ZkfEcGV=EjWwr8xoT5bg@mail.gmail.com>
	<alpine.DEB.2.02.1402031105590.13918@froburg.oetiker.ch>
Message-ID: <CAHYeK0eDknT7Tw1pY2LPbePXpG5ANmL1616mgQT85P-b4eh=gA@mail.gmail.com>

The default maximum length of a request is 131072 bytes, i.e. 128KiB
http://httpd.apache.org/mod_fcgid/mod/mod_fcgid.html#fcgidmaxrequestlen

I think I increased it to four times that before I realised something
had gone wrong!
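
For anyone checking their own logs against these numbers, here is a minimal
Python sketch that pulls the two byte counts out of a warning like the one
quoted earlier in this thread. The function name parse_fcgid_warning is made
up for illustration, and the pattern is written against that single quoted
line only, so treat it as an assumption rather than a description of every
mod_fcgid version.

```python
import re

# Pattern written against the one warning quoted in this thread; other
# mod_fcgid versions may word the message differently.
FCGID_WARNING = re.compile(
    r"HTTP\s+request\s+length\s+(\d+)\s+\(so far\)\s+"
    r"exceeds\s+MaxRequestLen\s+\((\d+)\)"
)

DEFAULT_MAX_REQUEST_LEN = 131072  # 128 KiB, the default quoted above


def parse_fcgid_warning(line):
    """Return (bytes_received_so_far, configured_limit), or None if no match."""
    match = FCGID_WARNING.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = ("[Mon Jan 27 12:12:42 2014] [warn] [client w.x.y.z] mod_fcgid: HTTP "
        "request length 394304 (so far) exceeds MaxRequestLen (393216)")
so_far, limit = parse_fcgid_warning(line)
print(so_far - limit)                   # bytes over the limit when cut off
print(limit / DEFAULT_MAX_REQUEST_LEN)  # limit as a multiple of the default
```

For the logged line this gives 1088 bytes over a limit of 393216, which is
three times the 131072 default.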

I don't know how long the bad slaves had been offline; weeks, I think.
We have quite a few of them.

Now, we've imposed ulimit on the "smokeping" userID, so that the
slaves will die instead of eating the entire system.
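
The same kind of cap can also be applied from a small launcher instead of a
per-user ulimit. The following is only a sketch, assuming the limit in
question is on virtual memory (what the shell calls ulimit -v, RLIMIT_AS in
Python's resource module) and a Unix host; the helper names cap_address_space
and run_capped are hypothetical.

```python
import resource
import subprocess


def cap_address_space(max_bytes):
    """Return a callable that caps RLIMIT_AS; intended for use as preexec_fn."""
    def apply_limit():
        _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        # Never try to raise the limit above an existing hard limit.
        if hard != resource.RLIM_INFINITY:
            limit = min(max_bytes, hard)
        else:
            limit = max_bytes
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    return apply_limit


def run_capped(argv, max_bytes, **kwargs):
    """Run argv with its address space capped at max_bytes (Unix only)."""
    return subprocess.run(
        argv, preexec_fn=cap_address_space(max_bytes), **kwargs
    )
```

A process started this way gets allocation failures once it reaches the cap,
rather than growing until the whole machine is out of memory.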


From dalgibbard at gmail.com  Mon Feb 17 16:38:49 2014
From: dalgibbard at gmail.com (Darren Gibbard)
Date: Mon, 17 Feb 2014 15:38:49 +0000
Subject: [smokeping-users] Managing a lot of historic data
Message-ID: <CA+zHMm8TN+2VG2Ov4iikcyEOfsNhTm1mzrxn69VguM7FPr+AVw@mail.gmail.com>

Hello all,
I'm running a 'master' Smokeping (v.2.004002) instance, with slaves located
across 20 sites; each site has four targets of its own, which are checked by
all of the slave instances...

This setup has been running for quite some time now, and I've come to
realise that each site totals about 3 - 4GB of data - about 80GB
in total.

Are there any documents floating about with regards to
rotating/cleaning/compressing files, and/or clearing out any data more than
2yrs old for example?

I noted especially that the "*.slave_cache" files are the large ones,
whereas the actual RRD files are comparatively small - but I'm failing to
find any specific documentation about what these slave_cache files are
actually used for?

Thanks in advance!
Darren.
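
To put numbers on the split between RRD files and slave_cache files, a short
Python sketch like the one below can total the bytes per file suffix. The
function name usage_by_suffix and the idea of pointing it at the SmokePing
data directory are assumptions for illustration; only the two suffixes come
from the message above.

```python
import os
from collections import defaultdict


def usage_by_suffix(data_dir, suffixes=(".rrd", ".slave_cache")):
    """Sum file sizes under data_dir, grouped by filename suffix."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            for suffix in suffixes:
                if name.endswith(suffix):
                    path = os.path.join(root, name)
                    totals[suffix] += os.path.getsize(path)
                    break
    return dict(totals)
```

Calling usage_by_suffix on the data directory returns a dict mapping each
suffix to its total size in bytes; suffixes with no matching files are
simply absent from the result.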

From pwehunt at gmail.com  Fri Feb 21 05:42:49 2014
From: pwehunt at gmail.com (Philip Wehunt)
Date: Thu, 20 Feb 2014 23:42:49 -0500
Subject: [smokeping-users] edgetrigger
Message-ID: <D3797550-94A6-4686-80C1-E407D35D0824@gmail.com>


I am currently building out our smokeping implementation and all is going fantastically. However, due to specific needs, I am piping alerts via edgetrigger to an external Python script. I pulled my hair out for nearly five hours debugging my script because the 'cleared' event was not firing my Python script although the built-in alerts would. I then discovered, with a two-line bash script that echoed the args from smokeping, that the expected '0' on cleared is not being passed, only the '1' when raised.

I did my due diligence searching the list archives and of course Google. However, I was only able to find one mention of the issue but no fix or remedy.

Hopefully someone can point me in the right direction. 

Thanks!


From gregs at sloop.net  Fri Feb 21 06:11:00 2014
From: gregs at sloop.net (Gregory Sloop)
Date: Thu, 20 Feb 2014 21:11:00 -0800
Subject: [smokeping-users] edgetrigger
In-Reply-To: <D3797550-94A6-4686-80C1-E407D35D0824@gmail.com>
References: <D3797550-94A6-4686-80C1-E407D35D0824@gmail.com>
Message-ID: <711503562.20140220211100@sloop.net>


PW> I am currently building out our smokeping implementation and all
PW> is going fantastic. However, due to specific needs, I am piping
PW> alerts via edgetrigger to an external python script. I pulled my
PW> hair out for nearly five hours debugging my script because the
PW> 'cleared' argument was not firing my python script although the
PW> built in alerts would.  I then discovered with a two liner bash
PW> script that echo'd the args from smokeping that the expected '0'
PW> on cleared is not being passed--only the 1 when raised. 

PW> I did my due diligence searching the list archives and of course
PW> Google. However, I was only able to find one mention of the issue but no fix or remedy.

PW> Hopefully someone can point me in the right direction. 

I can't offer any guidance - my solution to the very basic reporting
in SP was to query the RRDs with a Nagios plug-in and use Nagios for
reporting/alerting.

Nagios can't generate alerts with the same elaborate criteria that SP
does, but basic criteria work fine for me.

In short, I think trying to handle reporting/alerting with SP is kind
of nuts. [No offense to you, I tried too at one point - and I gave up.
So, if anyone is nuts, I'm grouping myself with the "nuts" too.]

I'd guess with 5 more hours, you could integrate this all in Nagios...
:)

[And I should mention that I can't get the detail I can get in SP with
Nagios, so I don't use Nagios to actually gather stats on these targets,
only SP. I use each tool where its strengths lie. SP for stats, and
Nagios for alerts/reports.]

Perhaps you're doing something else in your Python script, but I
thought I'd offer my work-around for SP's minimal alerting.

HTH

-Greg


From pwehunt at gmail.com  Fri Feb 21 06:42:13 2014
From: pwehunt at gmail.com (Philip Wehunt)
Date: Fri, 21 Feb 2014 00:42:13 -0500
Subject: [smokeping-users] edgetrigger
In-Reply-To: <711503562.20140220211100@sloop.net>
References: <D3797550-94A6-4686-80C1-E407D35D0824@gmail.com>
	<711503562.20140220211100@sloop.net>
Message-ID: <11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>

Thanks for the reply. You helped me realize that in my initial post I left out a key part of why I am scripting the alerts. Our current needs require an MTR to fire and catch a glimpse of each hop when our thresholds set in SP are hit. So basically I have my Python script parsing the args from SP in an argparse-based function and passing them to a function that uses the parsed args to create my email, run mtr ten or so times with the --report flag, email the result, and log it to a log file. We frequently need this granular data to escalate with our upstream BW providers.
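
The mtr step described above might look roughly like this in Python. It is a
sketch, not the script from this thread: the helper names mtr_command and
mtr_report are hypothetical, and it uses a single mtr run with ten report
cycles (mtr's --report-cycles option) rather than ten separate runs.

```python
import subprocess


def mtr_command(host, cycles=10):
    """Build an mtr invocation that prints a report after `cycles` probes."""
    return ["mtr", "--report", "--report-cycles", str(cycles), host]


def mtr_report(host, cycles=10, timeout=120):
    """Run mtr and return its report text; requires mtr to be installed."""
    result = subprocess.run(
        mtr_command(host, cycles),
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout
```

The text returned by mtr_report can then be placed in the body of the alert
email and appended to a log file.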

Works perfectly when the 'raise' arg passes the '1' when SP triggers an alert. But it only passes 5 arguments on the cleared run, so my script dies because it expects 6 args. Hence my script never informs me that the issue has cleared.

I could hackishly work around this in my Python but I wanted to identify whether I am doing something wrong on the SP side or whether it is a bug. Mainly in the spirit of KISS. I don't like to let hackish scripts linger.

Agreed on Nagios; however, we are a Science Logic/EM7 shop (I voted Nagios).

Thanks for the reply. 




From bluewind at xinu.at  Fri Feb 21 10:50:05 2014
From: bluewind at xinu.at (Florian Pritz)
Date: Fri, 21 Feb 2014 10:50:05 +0100
Subject: [smokeping-users] edgetrigger
In-Reply-To: <11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>
References: <D3797550-94A6-4686-80C1-E407D35D0824@gmail.com>	<711503562.20140220211100@sloop.net>
	<11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>
Message-ID: <5307214D.4000709@xinu.at>

On 21.02.2014 06:42, Philip Wehunt wrote:
> I could hackishly work around this in my python but I wanted to
> identify if I am doing something wrong on the SP side or if it is a
> bug. Mainly in the spirit of KISS. I don't like to let hackish
> scripts linger.

You probably found the same script on gist, but here's my version[1]
which doesn't fail when the 6th arg is missing. It will not add "
cleared" to the subject without the arg, but it will send you the report.

[1]: https://git.server-speed.net/users/flo/bin/tree/smokemtr.py

From the documentation in smokeping_config I'd say this is a bug, but
given I get my mails I didn't bother fixing it yet.
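
A minimal sketch of how a hook script can tolerate the missing sixth
argument, using argparse as mentioned earlier in the thread. The five
positional names follow one reading of the smokeping_config documentation and
should be treated as assumptions; treating an absent sixth argument as
'cleared' is likewise an assumption based on the behaviour reported here, and
only makes sense with edgetrigger enabled.

```python
import argparse


def parse_alert_args(argv):
    """Parse the positional arguments handed to an alert program.

    Argument names are assumptions; only the argument counts (six on raise,
    five on clear) come from this thread.
    """
    parser = argparse.ArgumentParser(description="SmokePing alert hook (sketch)")
    for name in ("alert", "target", "loss_pattern", "rtt_pattern", "hostname"):
        parser.add_argument(name)
    # Sixth argument: reported as '1' on raise and absent on clear, so it is
    # optional here and defaults to '0' (cleared).
    parser.add_argument("raised", nargs="?", default="0")
    args = parser.parse_args(argv)
    args.state = "ALERT" if args.raised == "1" else "CLEARED"
    return args
```

In a real hook this would be called as parse_alert_args(sys.argv[1:]), and
args.state could go straight into the email subject.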


From gregs at sloop.net  Fri Feb 21 15:38:00 2014
From: gregs at sloop.net (Greg Sloop <gregs@sloop.net>)
Date: Fri, 21 Feb 2014 06:38:00 -0800
Subject: [smokeping-users] edgetrigger
In-Reply-To: <11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>
References: <D3797550-94A6-4686-80C1-E407D35D0824@gmail.com>
	<711503562.20140220211100@sloop.net>
	<11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>
Message-ID: <CAAjorqVvmt2ZjD_DE-MNMm+rFpSaC1cH=Y8U6QYFSjEJz9pkrQ@mail.gmail.com>

I'd love to have your script when it's done. Provided you're willing to
share..

I've been meaning to use an MTR capture just as you are doing, but haven't
done it yet..  Thus having yours as a template would be fab!

Thanks

From pwehunt at gmail.com  Sat Feb 22 00:15:55 2014
From: pwehunt at gmail.com (Philip Wehunt)
Date: Fri, 21 Feb 2014 18:15:55 -0500
Subject: [smokeping-users] edgetrigger
In-Reply-To: <5307214D.4000709@xinu.at>
References: <D3797550-94A6-4686-80C1-E407D35D0824@gmail.com>
	<711503562.20140220211100@sloop.net>
	<11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>
	<5307214D.4000709@xinu.at>
Message-ID: <CAHUVhaOmSWOP+m2RuB7GieV=95NKvEc+DM1Dk=767KxkPSPg-Q@mail.gmail.com>

Interestingly enough, I did reference that particular script on gist, but
many moons ago when I was cutting my Python teeth; I used it for the email
part.  I have revisited it and taken a gander at your version.  I actually
used your options on the 6th argument as inspiration and I have added other
tweaks to make it report 'ALERT' or 'CLEARED' when used as an edgetrigger
script.

I have tweaked it a bit and added a few more things throughout the script.
I plan on officially forking it on github (to give the original author
credit) and committing my changes.  When I do this, I will post the link
here.

Many thanks for the input and link; it certainly kept me from making a
mountain out of a mole-hill.  :-)



From pwehunt at gmail.com  Sat Feb 22 00:16:26 2014
From: pwehunt at gmail.com (Philip Wehunt)
Date: Fri, 21 Feb 2014 18:16:26 -0500
Subject: [smokeping-users] edgetrigger
In-Reply-To: <CAAjorqVvmt2ZjD_DE-MNMm+rFpSaC1cH=Y8U6QYFSjEJz9pkrQ@mail.gmail.com>
References: <D3797550-94A6-4686-80C1-E407D35D0824@gmail.com>
	<711503562.20140220211100@sloop.net>
	<11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>
	<CAAjorqVvmt2ZjD_DE-MNMm+rFpSaC1cH=Y8U6QYFSjEJz9pkrQ@mail.gmail.com>
Message-ID: <CAHUVhaMbG=GGncXdnft8scAGF3udPfx_my93vkqRKng+KH=Kjg@mail.gmail.com>

I have spent part of the day tweaking the script on github referenced by
Florian.  I will fork and commit my version sometime this weekend.  I will
post the link here when I have done so.



From pwehunt at gmail.com  Sun Feb 23 20:13:24 2014
From: pwehunt at gmail.com (Philip Wehunt)
Date: Sun, 23 Feb 2014 14:13:24 -0500
Subject: [smokeping-users] Python alert script
Message-ID: <CAHUVhaOGh4vzqUY59EViD9H2eUBBqpCSoroDMECf_rxGiBuDNg@mail.gmail.com>

As discussed in my original thread, I am sharing the link to my updated
version of the Python alert script discussed.  I am certainly not a
programmer by trade; however, as a sysadmin I know enough to make my life
easier.  My updates make the script work with edgetrigger and fix the
issue I was having with smokeping not passing the sixth arg on 'clear.'

https://gist.github.com/ixgeek/9144930