Re: [tcpdump-workers] [net-2.6 PATCH] af_packet: move strict addr_len check right before dev_[mc/unicast]_[add/del]

2010-03-02 Thread Eric Dumazet
On Wednesday, 3 March 2010 at 07:40 +0100, Jiri Pirko wrote:
> Subject: [net-2.6 PATCH] af_packet: move strict addr_len check right before 
> dev_[mc/unicast]_[add/del]
> 
> My previous patch 914c8ad2d18b62ad1420f518c0cab0b0b90ab308 incorrectly changed
> the length check in packet_mc_add to be stricter. The problem is that
> userspace does not fill this field (it stays zeroed) when setting
> PACKET_MR_PROMISC or PACKET_MR_ALLMULTI. So move the strict check to the point
> in the path where addr_len must be set correctly.
> 
> Signed-off-by: Jiri Pirko 
> 

I am not sure it addresses Pavel Roskin's concern, but some credit should be
given to him :)

Reported-by: Pavel Roskin 

Thanks
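
For context, the userspace pattern that tripped over the stricter check looks roughly like the sketch below: the caller requests PACKET_MR_PROMISC and leaves mr_alen/mr_address zeroed, which a strict addr_len check at the top of packet_mc_add() would reject. This is only an illustration (the interface name and error handling are assumptions), not code from the thread.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

int main(void)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	struct packet_mreq mreq;

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&mreq, 0, sizeof(mreq));
	mreq.mr_ifindex = if_nametoindex("eth0");   /* "eth0" is an assumed name */
	mreq.mr_type    = PACKET_MR_PROMISC;        /* mr_alen/mr_address stay zero */

	/* For PROMISC/ALLMULTI no hardware address is involved, so a strict
	 * "mr_alen must equal dev->addr_len" check rejects this otherwise
	 * reasonable request. */
	if (setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP,
		       &mreq, sizeof(mreq)) < 0)
		perror("PACKET_ADD_MEMBERSHIP");

	close(fd);
	return 0;
}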




Re: [tcpdump-workers] twice past the taps, thence out to net?

2011-12-15 Thread Eric Dumazet
On Thursday, 15 December 2011 at 10:32 -0800, Rick Jones wrote:
> > More exactly, we call dev_queue_xmit_nit() from dev_hard_start_xmit()
> > _before_ giving the skb to the device driver.
> >
> > If the device driver returns NETDEV_TX_BUSY, and a qdisc was set up on the
> > device, the packet is requeued.
> >
> > Later, when the queue is allowed to send packets again, the packet is
> > retransmitted (and traced a second time in dev_queue_xmit_nit()).
> 
> Is this then an unintended consequence bug, or a known feature?
> 

It's a well-known feature; some people have attempted to remove it ;)

http://answers.softpicks.net/answers/topic/-PATCH-tcpdump-may-trace-some-outbound-packets-twice--2204640-1.htm





Re: [tcpdump-workers] twice past the taps, thence out to net?

2011-12-15 Thread Eric Dumazet
On Wednesday, 14 December 2011 at 18:12 -0800, Vijay Subramanian wrote:
> On 14 December 2011 11:27, Rick Jones  wrote:
> > While looking at "something else" with tcpdump/tcptrace, tcptrace emitted
> > lots of notices about hardware duplicated packets being detected (same TCP
> > sequence number and IP datagram ID).  Sure enough, if I go into the tcpdump
> > trace (taken on the sender) I can find instances of what it was talking
> > about, separated in time by rather less than I would expect to be the RTO,
> > and as often as not with few if any intervening ACKs arriving to trigger
> > anything like fast retransmit.  And besides, those would have a different IP
> > datagram ID, no?
> >
> > I did manage to reproduce the issue with plain netperf tcp_stream tests. I
> > had one sending system with 30 concurrent netperf tcp_stream tests to 30
> > other receiving systems.  There are "hardware duplicates" in the sending
> > trace, but no duplicate segments (that I can find thus far) in the two
> > receiver-side traces I took.  Of course that doesn't mean "conclusively"
> > that there were two actual sends, but it suggests there weren't.
> >
> > While I work through the "obtain permission" path to post the packet traces
> > (don't ask...) I thought I would ask if anyone else has seen something
> > similar.
> >
> > In this case, all the systems are running a 2.6.38-8 Ubuntu kernel (the same
> > sorts of issues which delay my just putting the traces up on netperf.org
> > preclude a later kernel, and I've no other test systems :( ), with Intel
> > 82576 interfaces being driven by:
> >
> > $ sudo ethtool -i eth0
> > driver: igb
> > version: 2.1.0-k2
> > firmware-version: 1.8-2
> > bus-info: :05:00.0
> >
> > All the systems were connected to the same switch.
> >
> 
> Rick,
> This may be of help.
> http://www.tcptrace.org/faq_ans.html#FAQ%2021

More exactly, we call dev_queue_xmit_nit() from dev_hard_start_xmit()
_before_ giving the skb to the device driver.

If the device driver returns NETDEV_TX_BUSY, and a qdisc was set up on the
device, the packet is requeued.

Later, when the queue is allowed to send packets again, the packet is
retransmitted (and traced a second time in dev_queue_xmit_nit()).

You can see the 'requeues' counter in the "tc -s -d qdisc" output:

qdisc mq 0: dev eth2 root 
 Sent 29421597369 bytes 20301716 pkt (dropped 0, overlimits 0 requeues 371) 
 backlog 0b 0p requeues 371 
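
To make the ordering above easier to follow, here is a tiny userspace model of it. Nothing below is kernel code; the names are invented stand-ins, and the model only mimics the tap-before-driver ordering and the requeue-on-NETDEV_TX_BUSY retry.

#include <stdio.h>
#include <stdbool.h>

enum { TX_OK, TX_BUSY };

static int tap_copies;

/* Stand-in for dev_queue_xmit_nit(): every attempt passes the tap. */
static void tap(void)
{
	tap_copies++;
}

/* Stand-in for the driver: the first attempt finds the TX ring full. */
static int driver_xmit(void)
{
	static bool ring_full = true;

	if (ring_full) {
		ring_full = false;
		return TX_BUSY;
	}
	return TX_OK;
}

/* Stand-in for dev_hard_start_xmit() plus the qdisc requeue loop. */
static void xmit_one_packet(void)
{
	for (;;) {
		tap();                          /* tap sees the frame first ... */
		if (driver_xmit() == TX_OK)     /* ... then the driver gets it  */
			return;
		/* TX_BUSY: the qdisc requeues and retries later */
	}
}

int main(void)
{
	xmit_one_packet();
	printf("tap recorded %d copies of 1 packet\n", tap_copies);   /* prints 2 */
	return 0;
}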




Re: [tcpdump-workers] twice past the taps, thence out to net?

2011-12-15 Thread Eric Dumazet
On Thursday, 15 December 2011 at 14:22 -0800, Rick Jones wrote:
> On 12/15/2011 11:00 AM, Eric Dumazet wrote:
> >> Devices work better if the driver proactively manages
> >> stop_queue/wake_queue.
> >> Old devices used TX_BUSY, but newer devices tend to manage the queue
> >> themselves.
> >>
> >
> > Some 'new' drivers like igb can be fooled when an skb is gso-segmented?
> >
> > Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
> > descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but at
> > MAX_SKB_FRAGS * 4
> >
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> > index 89d576c..989da36 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -4370,7 +4370,7 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
> > igb_tx_map(tx_ring, first, hdr_len);
> >
> > /* Make sure there is space in the ring for the next send. */
> > -   igb_maybe_stop_tx(tx_ring, MAX_SKB_FRAGS + 4);
> > +   igb_maybe_stop_tx(tx_ring, MAX_SKB_FRAGS * 4);
> >
> > return NETDEV_TX_OK;
> 
> 
> Is there a minimum transmit queue length here?  I get the impression
> that MAX_SKB_FRAGS is at least 16, and is 18 on a system with 4096-byte
> pages.  The previous addition then would be OK so long as the TX queue
> was always at least 22 entries in size, but now it would always have to
> be at least 72?
> 
> I guess things are "OK" at the moment:
> 
> raj@tardy:~/net-next/drivers/net/ethernet/intel/igb$ grep IGB_MIN_TXD *.[ch]
> igb_ethtool.c:new_tx_count = max_t(u16, new_tx_count, IGB_MIN_TXD);
> igb.h:#define IGB_MIN_TXD   80
> 
> but is that getting a little close?
> 
> rick jones

Sure!

I only pointed out a possible problem and did not give a full patch, since
we also need to change the opposite threshold (where we XON the queue at
TX completion).

You can see it's not even consistent with the minimum for a single TSO
frame! Most probably your high requeue numbers come from this too-low
value, given the real requirement of the hardware (4 + nr_frags
descriptors per skb):

/* How many Tx Descriptors do we need to call netif_wake_queue ? */ 
#define IGB_TX_QUEUE_WAKE   16
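
The effect of that wake threshold can be modelled in a few lines. The figures below follow the thread (4 + nr_frags descriptors per skb, an 80-entry ring, a wake threshold of 16); the ring bookkeeping itself is invented for illustration and is not igb source.

#include <stdio.h>

#define MAX_SKB_FRAGS   18                     /* assumed: 4096-byte pages   */
#define DESC_NEEDED     (MAX_SKB_FRAGS + 4)    /* worst case for one skb     */
#define WAKE_THRESHOLD  16                     /* old IGB_TX_QUEUE_WAKE      */

struct ring { int size, used, stopped; };

static int unused_desc(const struct ring *r)
{
	return r->size - r->used;
}

/* Transmit side: accept the skb if it fits, then stop the queue if a
 * worst-case skb might not fit next time (the igb_maybe_stop_tx() idea). */
static int xmit(struct ring *r, int nr_frags)
{
	int need = 4 + nr_frags;

	if (unused_desc(r) < need)
		return -1;                      /* NETDEV_TX_BUSY -> requeue   */
	r->used += need;
	if (unused_desc(r) < DESC_NEEDED)
		r->stopped = 1;                 /* netif_stop_subqueue()       */
	return 0;
}

/* Completion side: wake the queue once WAKE_THRESHOLD descriptors are
 * free -- which, at 16, is less than one worst-case TSO skb needs. */
static void clean(struct ring *r, int freed)
{
	r->used -= freed;
	if (r->stopped && unused_desc(r) >= WAKE_THRESHOLD)
		r->stopped = 0;                 /* netif_wake_subqueue()       */
}

int main(void)
{
	struct ring r = { .size = 80 };         /* IGB_MIN_TXD-sized ring      */

	while (!r.stopped && xmit(&r, 0) == 0)  /* fill with small skbs        */
		;
	clean(&r, 1);                           /* a completion frees one desc */
	printf("after wake: free=%d stopped=%d\n", unused_desc(&r), r.stopped);
	printf("TSO xmit: %s\n",
	       xmit(&r, MAX_SKB_FRAGS) == 0 ?
	       "fits" : "returns NETDEV_TX_BUSY (requeued)");
	return 0;
}

With a wake threshold of at least one worst-case skb (such as the 4 * MAX_SKB_FRAGS value proposed in the patch below), the queue stays stopped in that situation and the requeue, with its extra pass through the taps, never happens.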


Maybe we should CC the Intel guys.

Could you try the following patch?

Thanks!

diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index c69feeb..93ce118 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -51,8 +51,8 @@ struct igb_adapter;
 /* TX/RX descriptor defines */
 #define IGB_DEFAULT_TXD  256
 #define IGB_DEFAULT_TX_WORK 128
-#define IGB_MIN_TXD   80
-#define IGB_MAX_TXD 4096
+#define IGB_MIN_TXD   max_t(unsigned, 80U, IGB_TX_QUEUE_WAKE * 2)
+#define IGB_MAX_TXD 4096
 
 #define IGB_DEFAULT_RXD  256
 #define IGB_MIN_RXD   80
@@ -121,8 +121,11 @@ struct vf_data_storage {
 #define IGB_RXBUFFER_16384 16384
 #define IGB_RX_HDR_LEN IGB_RXBUFFER_512
 
-/* How many Tx Descriptors do we need to call netif_wake_queue ? */
-#define IGB_TX_QUEUE_WAKE  16
+/* How many Tx Descriptors should be available
+ * before calling netif_wake_subqueue() ?
+ */
+#define IGB_TX_QUEUE_WAKE  (MAX_SKB_FRAGS * 4)
+
 /* How many Rx Buffers do we bundle into one write to the hardware ? */
 #define IGB_RX_BUFFER_WRITE 16  /* Must be power of 2 */
 




Re: [tcpdump-workers] twice past the taps, thence out to net?

2011-12-16 Thread Eric Dumazet
On Friday, 16 December 2011 at 10:28 -0800, Jesse Brandeburg wrote:
> On Thu, Dec 15, 2011 at 8:27 PM, Eric Dumazet  wrote:
> > On Thursday, 15 December 2011 at 14:22 -0800, Rick Jones wrote:
> >> On 12/15/2011 11:00 AM, Eric Dumazet wrote:
> >> >> Devices work better if the driver proactively manages
> >> >> stop_queue/wake_queue.
> >> >> Old devices used TX_BUSY, but newer devices tend to manage the queue
> >> >> themselves.
> >> >>
> >> >
> >> > Some 'new' drivers like igb can be fooled when an skb is gso-segmented?
> >> >
> >> > Because igb_xmit_frame_ring() needs skb_shinfo(skb)->nr_frags + 4
> >> > descriptors, igb should stop its queue not at MAX_SKB_FRAGS + 4, but at
> >> > MAX_SKB_FRAGS * 4
> 
> Can you please help me understand the need for MAX_SKB_FRAGS * 4 as
> the requirement?  Currently the driver uses logic like:
> 
> in hard_start_tx: hey I just finished a tx, I should stop the qdisc if
> I don't have room (in tx descriptors) for a worst case transmit skb
> (MAX_SKB_FRAGS + 4) the next time I'm called.
> when cleaning from interrupt: My cleanup is done, do I have enough
> free tx descriptors (should be MAX_SKB_FRAGS + 4) for a worst case
> transmit?  If yes, restart qdisc.
> 
> I'm missing the jump from the above logic to using MAX_SKB_FRAGS * 4
> (== (18 * 4) == 72) as the minimum number of descriptors I need for a
> worst case TSO.  Each descriptor can point to up to 16kB of contiguous
> memory, typically we use 1 for offload context setup, 1 for skb->data,
> and 1 for each page.  I think we may be overestimating with
> MAX_SKB_FRAGS + 4, but that should be no big deal.

Did you read my second patch?

The problem is that you wake up the queue too soon (16 available descriptors,
while a full TSO packet needs more than that).

How would you explain the high 'requeues' number if this were not the problem?

Also, it's suboptimal to wake up the queue when the available space is very
low, since only _one_ packet may be dequeued from the qdisc (you pay a high
cost in cache-line bouncing).

My first patch was about a very rare event: a full TSO packet is segmented in
gso_segment() [say, if you dynamically disable sg on the eth device and an old
tcp buffer is retransmitted]: you end up with 16 skbs delivered to the NIC. In
this case we can hit the tx ring limit at the 4th or 5th skb, and Rick
complains that tcpdump outputs some packets several times ;)

Since igb needs 4 descriptors for a linear skb, I said 4 * MAX_SKB_FRAGS, but
the real problem is addressed in my second patch, I believe.
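
As a back-of-the-envelope check of those figures (all numbers are the ones quoted in the thread, not taken from the driver source):

#include <stdio.h>

#define MAX_SKB_FRAGS 18   /* assumed value for 4096-byte pages */

/* Per-skb estimate quoted in the thread: ~4 descriptors (context,
 * skb->data, slack) plus one per page fragment. */
static unsigned int desc_per_skb(unsigned int nr_frags)
{
	return 4 + nr_frags;
}

int main(void)
{
	/* A single, fully fragmented TSO skb. */
	printf("one TSO skb            : %u descriptors\n",
	       desc_per_skb(MAX_SKB_FRAGS));

	/* The rare gso_segment() fallback: the same payload shows up as
	 * ~16 linear skbs, each needing ~4 descriptors, which is roughly
	 * where the 4 * MAX_SKB_FRAGS stop threshold comes from. */
	printf("software-segmented TSO : %u descriptors\n",
	       16 * desc_per_skb(0));
	return 0;
}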





Re: [tcpdump-workers] twice past the taps, thence out to net?

2011-12-16 Thread Eric Dumazet
On Friday, 16 December 2011 at 11:35 -0800, Rick Jones wrote:

> I would *love* to.  All my accessible igb-driven hardware is in an 
> environment locked to the kernels already there :(  Not that it makes it 
> more possible for me to do it, but I suspect it does not require 30 
> receivers to reproduce the dups with netperf TCP_STREAM.  Particularly 
> if the tx queue len is at 256, it may only take 6 or 8. In fact, let me
> try that now...
> 
> Yep, with just 8 destinations/concurrent TCP_STREAM tests from the one 
> system one can still see the duplicates in the packet trace taken on the 
> sender.
> 
> Perhaps we can trouble the Intel guys to try to reproduce what I've seen?
> 

I do have an igb card somewhere (in fact, two dual-port ones); I'll do the
test myself!

Thanks




Re: [tcpdump-workers] vlan tagged packets and libpcap breakage

2012-12-13 Thread Eric Dumazet
On Tue, 2012-12-11 at 14:36 -0800, Ani Sinha wrote:
> >
> > It is possible to test for support of the new vlan bpf extensions by
> > attempting to load a filter that uses them.  As only valid filters can
> > be loaded, old kernels that do not support filtering of vlan tags will
> > fail to load a test filter that uses them.
> 
> Unfortunately I do not see this. sk_chk_filter() does not have a default
> case in its switch statement, so the check will not detect an unknown
> instruction. The filter only fails when it is run, and as far as I can see,
> the packet will then be dropped. Something like this might help?
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index c23543c..96338aa 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -548,6 +548,8 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>   return -EINVAL;
>   /* Some instructions need special checks */
>   switch (code) {
> + /* for unknown instruction, return EINVAL */
> + default : return -EINVAL;
>   case BPF_S_ALU_DIV_K:
>   /* check for division by zero */
>   if (ftest->k == 0)

This patch is wrong.

Check lines 546, 547 and 548, where we do the check for unknown instructions:

code = codes[code];
if (!code)
return -EINVAL;

If you want to test the possible ANCILLARY values, it's already too late, as
old kernels won't have any such patch anyway.
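
To make the point concrete: an ancillary load such as the vlan-tag extension is encoded as an ordinary BPF_LD|BPF_W|BPF_ABS instruction whose k field selects the extension, so the opcode-table check quoted above accepts it on any kernel; whether the extension actually works only shows up when the filter runs. A sketch of such a probe filter follows (it assumes headers new enough to define SKF_AD_VLAN_TAG; the socket plumbing is only illustrative):

#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/filter.h>
#include <linux/if_ether.h>

/* Accept a packet only if its ancillary vlan tag is non-zero.  The first
 * instruction's opcode is plain BPF_LD|BPF_W|BPF_ABS; only the negative k
 * value marks it as an extension, which is why sk_chk_filter()'s
 * code-table check cannot reject it on old kernels. */
static struct sock_filter vlan_probe[] = {
	BPF_STMT(BPF_LD  | BPF_W   | BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, 0),           /* tag == 0: drop   */
	BPF_STMT(BPF_RET | BPF_K, 0xffff),      /* tag != 0: accept */
};

static struct sock_fprog vlan_prog = {
	.len    = sizeof(vlan_probe) / sizeof(vlan_probe[0]),
	.filter = vlan_probe,
};

int main(void)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	if (fd < 0)
		return 1;
	/* Per the discussion above, attaching succeeds even on kernels
	 * without the extension; there the filter simply matches nothing,
	 * so attach failure is not a usable capability probe. */
	if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
		       &vlan_prog, sizeof(vlan_prog)) < 0)
		perror("SO_ATTACH_FILTER");
	return 0;
}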





Re: [tcpdump-workers] [PATCH net 1/2] net: dev_queue_xmit_nit: fix skb->vlan_tci field value

2013-01-09 Thread Eric Dumazet
On Wed, 2013-01-09 at 11:27 -0800, Ani Sinha wrote:

> This is wrong. Accelerated or not, the kernel code was organized to
> have the tags in the packet aux data. So I think this is how user land
> should be coded as well.

You have your opinion; that's good.

My opinion as a kernel developer is that the network tap is there to provide
a copy of the exact frame given to the _device_.

Because in the end, users will complain to netdev, giving us tcpdump traces.
And if those traces have nothing to do with what was given to the device,
they are almost useless.

If you want other taps that catch frames before/after the various netfilter
hooks, segmentation, vlan acceleration, or tunnels, or before the GRO layer,
that's a totally different request.

A packet can be modified by a lot of layers in the kernel.

And yes, BPF filters can be incredibly complex, but it appears the kernel is
not a piece of cake either.
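
For reference, the "packet aux data" Ani refers to is the PACKET_AUXDATA control message, through which the kernel already exposes skb->vlan_tci to userspace alongside the (untagged) frame data. A minimal sketch of that pattern follows; the struct and flag names come from <linux/if_packet.h>, the rest is illustrative and most error handling is omitted.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

int main(void)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	int one = 1;
	char frame[2048];
	union {
		char buf[CMSG_SPACE(sizeof(struct tpacket_auxdata))];
		struct cmsghdr align;
	} ctrl;
	struct iovec iov = { .iov_base = frame, .iov_len = sizeof(frame) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
	};
	struct cmsghdr *cmsg;

	if (fd < 0)
		return 1;

	/* Ask for per-packet metadata (including the vlan TCI) as a cmsg. */
	setsockopt(fd, SOL_PACKET, PACKET_AUXDATA, &one, sizeof(one));

	if (recvmsg(fd, &msg, 0) < 0)
		return 1;

	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		struct tpacket_auxdata aux;

		if (cmsg->cmsg_level != SOL_PACKET ||
		    cmsg->cmsg_type != PACKET_AUXDATA)
			continue;
		memcpy(&aux, CMSG_DATA(cmsg), sizeof(aux));
		if (aux.tp_status & TP_STATUS_VLAN_VALID)
			printf("vlan tci 0x%04x\n", aux.tp_vlan_tci);
	}
	return 0;
}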


