On Wed, 2007-01-03 at 15:46 -0800, Andrew Morton wrote:
> 
> Begin forwarded message:
> 
> Date: Wed, 3 Jan 2007 11:54:26 +0000
> From: Steve Hill <[EMAIL PROTECTED]>
> To: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
> Subject: Intermittent SCTP multihoming breakage
> 
> 
> 
> Apologies if I'm posting to the wrong list - the lksctp lists seem to be a
> bit dead these days and a bit of Googling seemed to indicate that SCTP
> development discussions might have moved here.

No, the lksctp-developers mailing list is still the best place for
SCTP-related discussions. You can subscribe and browse the archives at
  http://lists.sourceforge.net/lists/listinfo/lksctp-developers

> 
> I'm running under the 2.6.16.1 kernel and have an intermittent problem
> with the SCTP stack.  Having reviewed the git logs I can't see any
> indication that the problem has been fixed in more recent kernels, but it
> is very difficult to test since it is so intermittent.

If possible, I would suggest moving to the latest mainline kernel, 2.6.19.
That said, 2.6.16.1 should work OK for simple multihoming cases.

> 
> I am running a multihomed connection between 2 machines, (2 NICs on
> each machine, so 2 paths for the connection) and tcpdump shows heartbeat
> requests and acks on both paths.  Putting data over the link correctly
> sends it over the first path.
How are the 2 machines connected? Are they connected directly or
via a router?

Once the association is established, do you see both addresses on each
peer when you run cat /proc/net/sctp/assocs?
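
The address check above can be scripted. This is only a minimal sketch in
Python: it assumes each association line in /proc/net/sctp/assocs lists the
local addresses, then a "<->" separator, then the remote addresses, and that
a primary address may carry a leading "*". Some kernels append further
columns after the remote addresses, so parsing stops at the first token that
is not an IP address. The function names and the sample line are made up for
illustration; check the header line on your own kernel before relying on it.

```python
import ipaddress


def _is_ip(token):
    """Return True if token (optionally prefixed with '*') is an IP address."""
    try:
        ipaddress.ip_address(token.lstrip("*"))
        return True
    except ValueError:
        return False


def parse_assoc_line(line):
    """Split one association line into (local_addrs, remote_addrs).

    Assumption: the two address lists are separated by '<->'.
    """
    left, sep, right = line.partition("<->")
    if not sep:
        raise ValueError("no '<->' separator found in line")
    # Left of the separator, only the address tokens parse as IP addresses;
    # pointers, counters and ports do not.
    local = [t.lstrip("*") for t in left.split() if _is_ip(t)]
    remote = []
    for t in right.split():
        if not _is_ip(t):
            break  # extra columns may follow the remote addresses
        remote.append(t.lstrip("*"))
    return local, remote


# Made-up sample line, not captured from a real system.
sample = ("c1a2b3c0 c4d5e6f0 2 1 3 0 1 0 0 0 12345 5000 5001 "
          "10.0.0.1 192.168.1.1 <-> *10.0.0.2 192.168.1.2")
local, remote = parse_assoc_line(sample)
print("local:", local)
print("remote:", remote)
if len(local) < 2 or len(remote) < 2:
    print("warning: association does not look multihomed")
```

To use it on a live system, read /proc/net/sctp/assocs, skip the header
line, and pass each remaining line to parse_assoc_line on both peers.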

> 
> If I drop the traffic on one of the NICs then most of the time it
> correctly fails over to the second path and I see the data being sent
> and acknowledged correctly on the second path.  However, I also
> intermittently see two failure conditions:

How are you dropping traffic? You could try simulating failover by
bringing down the interface or physically removing the link.

> 
> 1. Sometimes, just after failing over to the second path I see an ABORT.
This seems to indicate that somehow the app has terminated.

> 2. More frequently, the association stays up indefinitely, with heartbeat
> requests and acks on the second path, but no data chunks are sent even
> though the transmit queue on the transmitting end appears to be full and
> the socket is blocking writes.
This is strange. Can you collect tcpdump traces on sender and receiver when 
this happens?
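
To timestamp the stall so it can be lined up with the traces, the sending
application could wait for writability with a timeout rather than block
forever. The following is a minimal sketch, not taken from the reporter's
application: the function names and the 5 second default are hypothetical,
and it works on any Python socket object, SCTP or otherwise.

```python
import select
import socket
import time


def wait_writable(sock, timeout):
    """Return True if sock becomes writable within timeout seconds."""
    _, writable, _ = select.select([], [sock], [], timeout)
    return bool(writable)


def send_or_report_stall(sock, data, timeout=5.0):
    """Send data if the socket accepts it in time, otherwise log a stall.

    Returns True if send() was called, False if the socket stayed
    unwritable for the whole timeout. send() may accept only part of
    the data; a real sender would loop on the remainder.
    """
    if not wait_writable(sock, timeout):
        print("%.3f: socket unwritable for %.1fs, possible stall"
              % (time.time(), timeout))
        return False
    sock.send(data)
    return True
```

The printed timestamp marks roughly when the transmit queue stopped
draining, which narrows down where to look in the tcpdump output from
both ends.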

Thanks
Sridhar

> 
> I have been adding debugging to the kernel in an attempt to track down the
> source of the second failure condition, and I am wondering if anyone else
> has seen similar behaviour?
> 
> --
>  - Steve Hill
>    Software Engineer
>    Dialogic
>    Fordingbridge, Hampshire, UK
>    +44-1425-651392
>    [EMAIL PROTECTED]
