Hi,
Alexey and Herbert - thanks for the replies.
Alexey wrote:
> > similar things on a router - it has died 3 or 4 times (over a period
> > of a few months) with such an error with very little traffic passing
> > through it and a stream of the 'dst cache overflow' errors on the screen.
>
> Actually, it is quite unusual. The problems with garbage collection
> are all transient, essentially you do not see anything bad but
> some annoying messages.
>
> If the machine dies... Well, it cannot be the reason of death.
> I would even suspect that "dst cache overflow" was not reason of death,
> but rather a consequence.
>
> If the death means just a loss of network connectivity, it could
> mean that you experience a _true_ (not related to gc problems)
> dst cache overflow i.e. it happens because some part of kernel leaks
> dst cache entries. It is the first thing to check, see below.
I think the machine was still alive, but since all it does is
route there wasn't much to tell. It had certainly stopped
routing (most?) traffic about 10 hours before I got to it and
was still very ill, so it isn't a transient thing.
The router handles outgoing traffic and routing between two
small subnets (probably 200 or so IPs on each); it doesn't
open any connections itself and it isn't directly connected
to the outside world.
One thought: the day before it fell into this state there had
been a minor screw-up on one of the networks, where someone
mispatched two subnets together (one of which was one of the
ones connected to this box). That may have caused a lot of
ARPing and general unhappiness, but it all seemed to resolve
itself, and I don't think similar problems had happened before
the previous failures.
> > a patch by Denis Lunev that is currently in one of the 2.6.13-pre's
> > ('Fix too aggressive backoff in dst garbage collection' git commit number
> > f0098f7863f814a5adc0b9cb271605d063cad7fa )
>
> It will not help, it is a transient problem.
OK.
> Plus run "ip route ls cache" periodically.
OK, I'll add that to some monitoring.
> The first thing, which you should watch is difference between
> number of entries shown by "ip route ls cache" (alive entries)
> and rtstat (it shows all, including lost ones).
>
> If the difference gradually grows with time, we definitely see a leakage.
OK.
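A minimal sketch of how that comparison might be scripted for the
monitoring, in Python. The helper names are hypothetical, and it assumes
that each entry printed by "ip route ls cache" starts on a non-indented
line with its details on indented continuation lines. The growth test is
deliberately crude: it only flags a gap that rises at every sample.

```python
def count_cache_entries(ip_route_output):
    """Count entries in the text printed by "ip route ls cache".

    Assumption: each entry starts on a non-indented line, and any
    continuation lines (such as the "cache ..." details) are indented.
    """
    return sum(
        1
        for line in ip_route_output.splitlines()
        if line.strip() and not line[0].isspace()
    )


def gap_is_growing(samples, min_growth=1):
    """Return True if (total - alive) rose between every pair of samples.

    samples is a list of (alive, total) pairs in time order, where
    alive comes from "ip route ls cache" and total from rtstat.
    min_growth is a hypothetical threshold, not a kernel value.
    """
    gaps = [total - alive for alive, total in samples]
    if len(gaps) < 2:
        return False  # one sample cannot show a trend
    return all(later - earlier >= min_growth
               for earlier, later in zip(gaps, gaps[1:]))
```

Feeding it one (alive, total) pair per monitoring run and alerting when
gap_is_growing() stays true over many samples would match the "difference
gradually grows with time" test described above.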
> <explanation of route.c and dst.c>
Thanks for that explanation - it helps somewhat - one thing I was
confused by was why the timer mechanism for the garbage collection
was so elaborate; why does it do all that back off stuff and
adjusting itself? Why not just run at some fixed rate?
* Herbert Xu ([EMAIL PROTECTED]) wrote:
> Alexey Kuznetsov <[EMAIL PROTECTED]> wrote:
> >
> > Really bad overflow happens when lots of entries remain in use, because
> > someone forgot to release the references to dst cache entries.
> > It is the first thing to check.
>
> Yes. I once had a situation where a buggy user-land program held
> many sockets open each of which had ancient packets stuck in their
> receive queues. The result was a lot of dst entries hanging around.
Nod - I don't think it is that in this case because the machine
doesn't open any connections itself.
> In such cases checking /proc/slabinfo could be useful.
I will try to remember that next time it goes, or add it to
the monitoring scripts.
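For the monitoring scripts, a rough sketch of pulling the dst numbers
out of /proc/slabinfo might look like this. It assumes the IPv4 dst
entries live in a slab named "ip_dst_cache" and that each data line
starts with the name followed by the active and total object counts;
the function name is hypothetical.

```python
def dst_slab_counts(slabinfo_text, cache_name="ip_dst_cache"):
    """Return (active_objs, num_objs) for cache_name, or None if absent.

    Assumption: a data line looks like "<name> <active> <total> ...",
    and header lines never begin with the cache name.
    """
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == cache_name:
            return int(fields[1]), int(fields[2])
    return None
```

It takes the file contents as a string, so the caller would read
/proc/slabinfo itself and log the pair alongside the route cache counts.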
Thank you for your suggestions; if I'm unlucky you'll
see a question from me (with some more debug) in a month
or two if it does it again!
Dave
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/