From tobi at oetiker.ch Mon Feb 3 11:08:32 2014
From: tobi at oetiker.ch (Tobias Oetiker)
Date: Mon, 3 Feb 2014 11:08:32 +0100 (CET)
Subject: [smokeping-users] mod_fcgid: HTTP request length xxx (so far) exceeds MaxRequestLen (yyyy)
In-Reply-To:
References:
Message-ID:

Hi Paul,

Jan 27 Paul Mansfield wrote:

> We have had a significant number of problems with smokeping client
> consuming all the memory on a server, we've had these grow to 12GB in
> the worst case!
>
> There appear to be two problems, probably related.
>
> * One is that the slave can't communicate with the master for a while
> and the smokeping slave cache builds up a backlog of work, and then it
> is unable to send the data.
>
> * The other problem is that the slave process, e.g. usually FPing,
> gets wedged and becomes larger and larger.
>
>
> the clue to the first problem are lines like this in the Apache error log:
>
> [Mon Jan 27 12:12:42 2014] [warn] [client w.x.y.z] mod_fcgid: HTTP
> request length 394304 (so far) exceeds MaxRequestLen (393216)
>
> To resolve this, you have to kill -9 the smokeping slave, and any child
> processes like FPing, then remove everything in the ~smokeping/cache
> directory. Then it's safe to restart.
>
>
> It's probably worth setting a larger number than the default in apache
> config anyway to allow slaves to catch up... e.g.
> FcgidMaxRequestLen 131072
>
> but if you find anything larger than that it probably means something
> else has gone wrong.

hmmm never seen such a problem ... 12GB is VERY much ... for how
long has the slave been 'offline' when this happens? and how large
is a normal 'submission' by the slave ?

cheers
tobi

--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900

From paul.mansfield+smokeping at grapeshot.co.uk Mon Feb 3 11:29:51 2014
From: paul.mansfield+smokeping at grapeshot.co.uk (Paul Mansfield)
Date: Mon, 3 Feb 2014 10:29:51 +0000
Subject: [smokeping-users] mod_fcgid: HTTP request length xxx (so far) exceeds MaxRequestLen (yyyy)
In-Reply-To:
References:
Message-ID:

the default length of a request is 131072 bytes, i.e. 128KiB
http://httpd.apache.org/mod_fcgid/mod/mod_fcgid.html#fcgidmaxrequestlen

I think I increased it to four times that before I realised something
had gone wrong!

I don't know how long the bad slaves had gone offline, weeks, I think.
We have quite a few of them.

Now, we've imposed ulimit on the "smokeping" userID, so that the slaves
will die instead of eating the entire system.

From dalgibbard at gmail.com Mon Feb 17 16:38:49 2014
From: dalgibbard at gmail.com (Darren Gibbard)
Date: Mon, 17 Feb 2014 15:38:49 +0000
Subject: [smokeping-users] Managing a lot of historic data
Message-ID:

Hello all,

I'm running a 'master' Smokeping (v.2.004002) instance, with Slaves
located across 20 sites, and each site has four targets itself, and are
checked by all of the slave instances...

This setup has been running for quite some time now, and it's come to my
realisation that each site is totalling about 3 - 4GB of data - about
80GB total.

Are there any documents floating about with regards to
rotating/cleaning/compressing files, and/or clearing out any data more
than 2yrs old for example?

I noted especially that the "*.slave_cache" files are the large ones,
whereas the actual RRD files are comparatively small - but I'm failing
to find any specific documentation about what these slave_cache files
are actually used for?

Thanks in advance!

Darren.

From pwehunt at gmail.com Fri Feb 21 05:42:49 2014
From: pwehunt at gmail.com (Philip Wehunt)
Date: Thu, 20 Feb 2014 23:42:49 -0500
Subject: [smokeping-users] edgetrigger
Message-ID:

I am currently building out our smokeping implementation and all
is going fantastic. However, due to specific needs, I am piping
alerts via edgetrigger to an external python script. I pulled my
hair out for nearly five hours debugging my script because the
'cleared' argument was not firing my python script although the
built in alerts would. I then discovered with a two liner bash
script that echo'd the args from smokeping that the expected '0'
on cleared is not being passed--only the 1 when raised.

I did my due diligence searching the list archives and of course
google. However, I was only able to find one mention of the issue
but no fix or remedy.

Hopefully someone can point me in the right direction.

Thanks!

From gregs at sloop.net Fri Feb 21 06:11:00 2014
From: gregs at sloop.net (Gregory Sloop)
Date: Thu, 20 Feb 2014 21:11:00 -0800
Subject: [smokeping-users] edgetrigger
In-Reply-To:
References:
Message-ID: <711503562.20140220211100@sloop.net>

PW> I am currently building out our smokeping implementation and all
PW> is going fantastic. However, due to specific needs, I am piping
PW> alerts via edgetrigger to an external python script. I pulled my
PW> hair out for nearly five hours debugging my script because the
PW> 'cleared' argument was not firing my python script although the
PW> built in alerts would. I then discovered with a two liner bash
PW> script that echo'd the args from smokeping that the expected '0'
PW> on cleared is not being passed--only the 1 when raised.

PW> I did my due diligence searching the list archives and of course
PW> google. However, I was only able to find one mention of the issue but no fix or remedy.
PW> Hopefully someone can point me in the right direction.

I can't offer any guidance - my solution to the very basic reporting
in SP was to query the RRD's with a Nagios plug-in and use Nagios for
reporting/alerting.

Nagios can't generate alerts with the same elaborate criteria that SP
does, but basic criteria work fine for me.

In short, I think trying to handle reporting/alerting with SP is kind
of nuts. [No offense to you, I tried too at one point - and I gave up.
So, if anyone is nuts, I'm grouping myself with the "nuts" too.]

I'd guess with 5 more hours, you could integrate this all in Nagios...
:)

[And I should mention that I can't get the detail I can get in SP with
Nagios, so I don't use Nagios to actually gather stats on these targets,
only SP. I use each tool where its strengths lie. SP for stats, and
Nagios for alerts/reports.]

But perhaps you're doing something else in your python script - but
thought I'd offer my work-around for SP's minimal alerting.

HTH

-Greg

From pwehunt at gmail.com Fri Feb 21 06:42:13 2014
From: pwehunt at gmail.com (Philip Wehunt)
Date: Fri, 21 Feb 2014 00:42:13 -0500
Subject: [smokeping-users] edgetrigger
In-Reply-To: <711503562.20140220211100@sloop.net>
References: <711503562.20140220211100@sloop.net>
Message-ID: <11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>

Thanks for the reply. You helped me realize in my initial post I left out
a key part of why I am scripting the alerts. Our current needs require an
MTR to fire and catch a glimpse of each hop when our thresholds set in SP
are hooked. So basically I have my python script parsing the args from SP
in an argparse based function and passing that to a function that uses the
parsed args to create my email, iterate mtr ten or so times with the
--report flag and email it--as well as log it to a log file. We frequently
need this granular data to escalate with our upstream BW providers.

Works perfect when the 'raise' arg passes the '1' when SP triggers alert.
But it only passes 5 arguments on the cleared run--so my script dies
because it expects 6 args. Hence it doesn't fire my script to inform us
that the issue has cleared.

I could hackishly work around this in my python but I wanted to identify
if I am doing something wrong on the SP side or if it is a bug. Mainly in
the spirit of KISS. I don't like to let hackish scripts linger.

Agreed on the Nagios --however, we are a Science Logic/EM7 shop (I voted
nagios)

Thanks for the reply.

From bluewind at xinu.at Fri Feb 21 10:50:05 2014
From: bluewind at xinu.at (Florian Pritz)
Date: Fri, 21 Feb 2014 10:50:05 +0100
Subject: [smokeping-users] edgetrigger
In-Reply-To: <11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>
References: <711503562.20140220211100@sloop.net> <11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>
Message-ID: <5307214D.4000709@xinu.at>

On 21.02.2014 06:42, Philip Wehunt wrote:
> I could hackishly work around this in my python but I wanted to
> identify if I am doing something wrong on the SP side or if it is a
> bug. Mainly in the spirit of KISS. I don't like to let hackish
> scripts linger.

You probably found the same script on gist, but here's my version[1]
which doesn't fail when the 6th arg is missing. It will not add "
cleared" to the subject without the arg, but it will send you the report.

[1]: https://git.server-speed.net/users/flo/bin/tree/smokemtr.py

From the documentation in smokeping_config I'd say this is a bug, but
given I get my mails I didn't bother fixing it yet.

From gregs at sloop.net Fri Feb 21 15:38:00 2014
From: gregs at sloop.net (Greg Sloop)
Date: Fri, 21 Feb 2014 06:38:00 -0800
Subject: [smokeping-users] edgetrigger
In-Reply-To: <11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>
References: <711503562.20140220211100@sloop.net> <11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>
Message-ID:

I'd love to have your script when it's done. Provided you're willing to
share..

I've been meaning to use an MTR capture just as you are doing, but haven't
done it yet.. Thus having yours as a template would be fab!

Thanks

From pwehunt at gmail.com Sat Feb 22 00:15:55 2014
From: pwehunt at gmail.com (Philip Wehunt)
Date: Fri, 21 Feb 2014 18:15:55 -0500
Subject: [smokeping-users] edgetrigger
In-Reply-To: <5307214D.4000709@xinu.at>
References: <711503562.20140220211100@sloop.net> <11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com> <5307214D.4000709@xinu.at>
Message-ID:

Interestingly enough, I did reference that particular script on gist--but
many moons ago when I was cutting my python teeth--I used it for the email
part. I have revisited it and taken a gander at your version. I actually
used your options on the 6th argument as inspiration and I have added
other tweaks to make it report 'ALERT' or 'CLEARED' when used as edge
trigger script. I have tweaked it a bit and added a few more things
throughout the script. I plan on officially forking it on github (to give
the original author credit) and committing my changes. When I do this, I
will post the link here.

Many thanks for the input and link--certainly helped me from making a
mountain out of a mole-hill. :-)

From pwehunt at gmail.com Sat Feb 22 00:16:26 2014
From: pwehunt at gmail.com (Philip Wehunt)
Date: Fri, 21 Feb 2014 18:16:26 -0500
Subject: [smokeping-users] edgetrigger
In-Reply-To:
References: <711503562.20140220211100@sloop.net> <11FFB054-C597-436F-83C3-F7287A1B5473@gmail.com>
Message-ID:

I have spent part of the day tweaking the script on github referenced by
Florian. I will fork and commit my version sometime this weekend. I will
post link here when I have done so.

From pwehunt at gmail.com Sun Feb 23 20:13:24 2014
From: pwehunt at gmail.com (Philip Wehunt)
Date: Sun, 23 Feb 2014 14:13:24 -0500
Subject: [smokeping-users] Python alert script
Message-ID:

As discussed in my original thread, I am sharing the link to my updated
version of the python alert script discussed. I am certainly not a
programmer by trade, however, as a sysadmin I know enough to make my life
easier. My updates make the script work with edgetrigger and fix the
issue I am having with smokeping not passing the sixth arg on 'clear.'

https://gist.github.com/ixgeek/9144930
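For readers who do not want to follow the links, the workaround discussed
in this thread (accept the call even when the sixth argument is absent)
can be sketched roughly as below. This is not the code from either linked
script. The function names, the dictionary keys, and the order of the
five fixed arguments are assumptions made for illustration; the thread
only states that five arguments arrive on 'clear' and six on 'raise'.
Treating a missing sixth argument as "cleared" is likewise an assumption,
and it only makes sense when edgetrigger is enabled.

```python
FIXED_ARGS = 5  # number of arguments the thread reports on a 'cleared' call


def parse_alert_args(argv):
    """Split the alert arguments, tolerating a missing sixth one.

    The names given to the five fixed arguments are hypothetical labels.
    """
    if len(argv) < FIXED_ARGS:
        raise ValueError(
            "expected at least %d arguments, got %d" % (FIXED_ARGS, len(argv))
        )
    alert_name, target, loss_pattern, rtt_pattern, hostname = argv[:FIXED_ARGS]
    # Assumption: '1' in the sixth slot means raised; anything else,
    # including no sixth argument at all, is treated as cleared.
    raised = len(argv) > FIXED_ARGS and argv[FIXED_ARGS] == "1"
    return {
        "alert_name": alert_name,
        "target": target,
        "loss_pattern": loss_pattern,
        "rtt_pattern": rtt_pattern,
        "hostname": hostname,
        "raised": raised,
    }


def subject_for(alert):
    """Build a mail subject that says whether the alert raised or cleared."""
    state = "ALERT" if alert["raised"] else "CLEARED"
    return "[%s] %s on %s" % (state, alert["alert_name"], alert["hostname"])
```

A script would call parse_alert_args(sys.argv[1:]) and pass the result to
subject_for() before running mtr and sending the mail. Reading the
arguments by position rather than with a strict six-argument parser is
what keeps the 'cleared' call from dying.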