RE: [squid-users] Squid Performance (with Polygraph) from Dave Raven on 2007-11-09 (squid-users)

From: Dave Raven <dave@dont-contact.us>
Date: Fri, 9 Nov 2007 17:15:42 +0200

Hi Adrian,

It works for the full 4 hours with a null cache directory. How would
I see any kind of stats/information on disk IO? From the stats I can see so
far, the disk stats don't change at all when it fails ...

I'm currently using COSS, but I've also tried this with ufs and diskd (with
the same results, just different times that it fails after).

Thanks
Dave

-----Original Message-----
From: Adrian Chadd [mailto:adrian@creative.net.au]
Sent: Friday, November 09, 2007 3:35 PM
To: Dave Raven
Cc: squid-users@squid-cache.org
Subject: Re: [squid-users] Squid Performance (with Polygraph)

Rightio; this reads like you're running out of disk IO.
Try running the test with a null cache dir and make sure the box can handle
that load.

Squid unfortunately has crap disk IO code for what's available these days.

Adrian

On Fri, Nov 09, 2007, Dave Raven wrote:
> Hi all,
> Okay I managed to do a lot more testing at the office today. Firstly
> some of the questions asked --
>
> CPU Usage: The CPU usage is around 30% during the test; when the unit
> begins to fail it actually goes down a bit.
>
> Mbufs/Clusters: All fine - they do rise quickly after the problem happens,
> but this is because the established network connections are still coming
> in at 600 a second, but only being satisfied at a rate of say 200 a second.
> The send queues then get big, and mbuf usage goes up - this is not the
> cause of the failure though, it's a side effect. For the first x minutes
> it's between 250 and 3000 mbufs (and clusters) used, and my max is 65k/32k.
>
> As for system logs there are none - there is nothing suspicious anywhere
> until the side effects kick in, e.g. mbufs running out etc. Squid also
> logs nothing at all. I've also checked if I'm using too much memory and
> that's not the case - swap is not used at all during the entire test.
>
> This is the process of what happens --
>
> 1. PolyClt + PolySrv begin, 800 RPS.
>
> 2. ESTABLISHED netstat connections are around 2000 once 800RPS is reached
> (about 20 seconds). CPU load is 30%, mbufs are available etc.
>
> 3. Once memory becomes full (quickly) disk drive usage begins - squid -z
> puts the TPS per drive at well over 1000/s when I run it, but when the
> cache is doing 800 RPS the TPS is about 30 per drive (low..).
>
> 4. After a period of time (almost always the same (+/- 60 seconds)
> depending on RPS) the ESTABLISHED connections start rising, at the exact
> same time the PolyClt starts showing less RPS. This is the "slow down" as
> such.
>
> 5. Because of this, polyclt continues to send requests which the unit
> continues to accept - quickly all available sockets are used, and the unit
> will then crash
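
A rough back-of-envelope sketch of the numbers in the steps above, in case it helps frame the slowdown. It applies Little's law to the ~2000 ESTABLISHED connections at 800 RPS, and uses the 600/s arrival vs ~200/s completion figures from the mbuf paragraph to estimate how fast connections pile up. The function names are made up for this note, and `socket_limit` is a purely hypothetical value (the thread never states the real limit), so treat the last figure as illustration only.

```python
# Back-of-envelope check using figures quoted in this thread.
# NOTE: socket_limit is a hypothetical number, not taken from the thread.

def mean_residence_seconds(established: float, rps: float) -> float:
    """Little's law (L = lambda * W) rearranged to W = L / lambda."""
    return established / rps


def backlog_growth_per_second(arrival_rps: float, service_rps: float) -> float:
    """Connections piling up per second when arrivals outpace completions."""
    return max(0.0, arrival_rps - service_rps)


def seconds_until_limit(current: float, limit: float, growth: float) -> float:
    """Seconds until `current` open connections reach `limit` at constant growth."""
    if growth <= 0:
        return float("inf")
    return max(0.0, limit - current) / growth


if __name__ == "__main__":
    # ~2000 ESTABLISHED at 800 RPS while things are healthy
    print("mean time per connection (s):", mean_residence_seconds(2000, 800))

    # "coming in at 600 a second, satisfied at say 200 a second"
    growth = backlog_growth_per_second(600, 200)
    print("backlog growth (conn/s):", growth)

    socket_limit = 16000  # hypothetical, for illustration only
    print("seconds to hit limit:", seconds_until_limit(2000, socket_limit, growth))
```

If those figures are roughly right, each connection lives about 2.5 seconds while healthy, and once completions drop the backlog grows by hundreds of connections per second, which would fit the sockets running out "quickly" in step 5.
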
>
> Interestingly enough though - if I stop the polyclt when this happens and
> restart it - in under 10 seconds - it continues on for another x minutes
> without problem. If I leave it running the unit never comes right.
>
> I have used "systat -vmstat 1", "systat -tcp 1", "systat -iostat 1" and
> all the stats from Munin, and a MRTG graphing config for squid and they
> all show nothing of interest. The only result that changes between working
> time and slow down is that the connections go through the roof as
> explained above...
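
Since the connection count is the one signal that moves, a small sketch of tallying TCP states from captured `netstat -an` text might make it easier to log that signal with timestamps and line it up against the other graphs. This assumes the usual layout where TCP lines start with `tcp` and carry the state in the last column; `count_tcp_states` is a name invented for this note and the sample addresses are made up.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Tally TCP connection states from `netstat -an` style text.

    Assumes TCP lines start with "tcp" and end with the state column,
    e.g. "tcp4  0  0  10.0.0.1.3128  10.0.0.2.40001  ESTABLISHED".
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[-1]] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample text, not output captured from the box in this thread.
    sample = """\
Proto Recv-Q Send-Q Local Address      Foreign Address    (state)
tcp4       0      0 10.0.0.1.3128      10.0.0.2.40001     ESTABLISHED
tcp4       0      0 10.0.0.1.3128      10.0.0.2.40002     ESTABLISHED
tcp4       0      0 10.0.0.1.3128      10.0.0.2.40003     TIME_WAIT
tcp4       0      0 *.3128             *.*                LISTEN
udp4       0      0 *.3130             *.*
"""
    print(count_tcp_states(sample))
```

Running something like this once a second from the console (rather than over the network) would sidestep the problem of monitoring dying once the ports fill up.
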
>
> I have also seen it fail at 300RPS, but only after 82 minutes - which
> seems like a very long time if it was going to fail because of disk load.
> The entire time the disks are very underloaded. That said, if I use a null
> cache directory this doesn't happen....
>
> I know that sounds like it's clearly the drives - but 82 minutes ??
>
> Thanks for all the help
> Dave
>
> -----Original Message-----
> From: Adrian Chadd [mailto:adrian@creative.net.au]
> Sent: Friday, November 09, 2007 11:55 AM
> To: Dave Raven
> Cc: 'Adrian Chadd'; squid-users@squid-cache.org
> Subject: Re: [squid-users] Squid Performance (with Polygraph)
>
> Check netstat -mb and see if you're running out of mbufs?
> You haven't mentioned whether the CPU is being pegged at this point?
>
>
>
> Adrian
>
> On Fri, Nov 09, 2007, Dave Raven wrote:
> > Hi all,
> > Okay I've done some of what you requested, and unfortunately failed
> > to find anything specific. I can pretty much guarantee the times at
> > which the requests will slow down now. 600 RPS = 15 minutes, 800 RPS =
> > 11 minutes, 400 RPS = ~80 minutes.
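
One way to read those three timings is to convert each into the total number of requests handled before the slowdown. The sketch below does only that arithmetic on the figures quoted above; the helper name is invented for this note, and the 400 RPS figure is approximate in the original ("~80 minutes").

```python
# Requests served before the slowdown, from the timings quoted above.
# Keys are requests per second, values are minutes until the slowdown.
observations = {600: 15, 800: 11, 400: 80}  # the 400 RPS time is approximate


def requests_before_slowdown(rps: int, minutes: float) -> int:
    """Total requests handled at a constant rate over the given minutes."""
    return int(rps * minutes * 60)


if __name__ == "__main__":
    for rps, minutes in sorted(observations.items()):
        total = requests_before_slowdown(rps, minutes)
        print(f"{rps} RPS for {minutes} min -> {total} requests")
```

The 600 and 800 RPS runs come out at 540000 and 528000 requests, within about 2% of each other, which might hint at something of fixed size filling up. The 400 RPS run (about 1920000 requests) does not fit that pattern, so this is only a hint and not a diagnosis.
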
> >
> > During that time (before and during the problem) systat -vmstat 1 shows
> > the same interrupts - about 4000 on em1 (ifac) and 250 on hptmv0 - my
> > controller for the SATA drives.
> >
> > If I use a systat -iostat 1 I can see that none of the drives are 100%
> > utilized at any time during the test. Systat -tcp 1 also doesn't show me
> > anything out of the ordinary. I have set up munin to monitor the host but
> > unfortunately it's not showing much.
> >
> > Also the problem is that when the problem begins, it starts filling up
> > network connections - once it fills all the available ports nothing can
> > monitor it :/
> >
> > I'm going to try using a different network card, then a different
> > motherboard etc - try some different setups today. Thanks again for all
> > the help and please let me know if anyone has any ideas...
> >
> > Thanks
> > Dave
> >
> > -----Original Message-----
> > From: Adrian Chadd [mailto:adrian@creative.net.au]
> > Sent: Friday, November 09, 2007 4:08 AM
> > To: Dave Raven
> > Cc: squid-users@squid-cache.org
> > Subject: Re: [squid-users] Squid Performance (with Polygraph)
> >
> > On Thu, Nov 08, 2007, Dave Raven wrote:
> > > Hi Adrian,
> > > What would cause it to fail after a specific time though - if the
> > > cache_mem is already full and it's using the drives? I would have
> > > thought it would fail immediately?
> > >
> > > Also there are no log messages about failures or anything...
> >
> > Who knows :) It's hard without having remote access, or lots of
> > logging/statistics to correlate the trouble times with.
> >
> > Try installing munin and graph all the system-specific stuff. See what
> > correlates against the failure time. You might notice something, like
> > out of memory/paging, or an increase in interrupts, or something. ;)
> >
> > That's all I can offer at the present time, sorry.
> >
> >
> >
> > Adrian
> >
> > --
> > - Xenion - http://www.xenion.com.au/ - VPS Hosting - Commercial Squid
> > Support -
> > - $25/pm entry-level VPSes w/ capped bandwidth charges available in WA -
>


Received on Fri Nov 09 2007 - 08:16:06 MST
