OK, here's a stronger attempt.  I might have gone a little wild with the patch
description :).  Two questions and a data point:

1) It occurs to me as I write this that I don't know how the trailer_len stuff
   works if a fragment's final data bits are page frags in the skb.  Did I
   screw this up?  How did the old code handle it?  Maybe I'm just paranoid.

2) I changed the final-frag test to be length + fraggap, as the math later on
   seemed to match that (see the condensed restatement below).  Is that OK?

I tested this with two back-to-back e1000s, ping-ponging UDP messages of sizes
1 ... 65507 across MTUs of 576, 1500, 9000, and 16110.  That did shake out a
few bugs.

So how's this looking?

- z

--------

Avoid multi-page allocs for large MTU datagram sends

Previously the size of the first skb allocated for an IP fragment was bounded
by the MTU of the outgoing device.  With large MTUs (32k, say) this results in
multi-order allocations.  We were seeing these fail on heavily loaded machines
after about a week of uptime.

This patch clamps the size of the first skb allocated to a single page,
including the headers and skb accounting overhead.  SKB_DATA_KMALLOC_BYTES()
is introduced to let callers know how large an allocation will actually be.
ip_append_data() uses it to restrict the amount of payload it puts in the
initial skb so that the whole thing just fits a page-sized allocation.  This
is only done if the device supports scatter-gather, so that the rest of the
fragment can be built from more page-sized allocs hung off the skb.

The confusing case is the inclusion of the trailer_len overhead in the
allocation for the final fragment.  The risk is a page-sized initial skb that
has room for the full payload but not enough room for the trailer_len.  We
avoid this by always including the trailer_len overhead in the allocation if
the payload fits the MTU.  If we then shrink datalen, this won't be the final
fragment and it won't have a trailer, but we've already allocated space for
one.  We reuse that space for payload only if doing so doesn't grow datalen
back so much that it swallows the entire payload again.  In the case where we
can't reuse it, we end up with an initial fragment that doesn't fill the MTU
and a second fragment carrying a small amount of payload plus the trailer.
Sheesh.
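
A worked example, with numbers invented purely for illustration (assume
PAGE_SIZE = 4096, SMP_CACHE_BYTES = 64, sizeof(struct skb_shared_info) = 256,
trailer_len = 64, and fragheaderlen + hh_len + 15 = 448):

  - Final-fragment candidate with lots of payload: alloclen = 448 + 64 = 512
    and datalen = 8000.  SKB_DATA_KMALLOC_BYTES(512 + 8000) is way over 4096,
    so we clamp.  SKB_MAX_ORDER(512, 0) = 3328 and 8000 - 3328 = 4672 >
    trailer_len, so we drop the trailer reservation (alloclen = 448) and get
    datalen = SKB_MAX_ORDER(448, 0) = 3392: the freed 64 bytes became payload.

  - Final-fragment candidate with barely too much payload: alloclen = 512 and
    datalen = 3380.  SKB_DATA_KMALLOC_BYTES(512 + 3380) = 4160 > 4096, so we
    clamp again, but 3380 - 3328 = 52 <= trailer_len, so we keep the trailer
    reservation and shrink datalen to 3328.  The first fragment doesn't fill
    the MTU, and the remaining 52 bytes plus the trailer ride in the next
    (actually final) fragment.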

And finally, since we're introducing SKB_DATA_KMALLOC_BYTES() anyway, we use
it in some other code paths that previously had the same functionality coded
by hand.
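
For instance, straight from the hunks below, __alloc_skb() goes from

	size = SKB_DATA_ALIGN(size);
	data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);

to the one-liner

	data = ____kmalloc(SKB_DATA_KMALLOC_BYTES(size), gfp_mask);

and atm_guess_pdu2truesize() disappears entirely in favour of the macro.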

Index: 2.6.16-mm2-bigmtu/net/ipv4/ip_output.c
===================================================================
--- 2.6.16-mm2-bigmtu.orig/net/ipv4/ip_output.c
+++ 2.6.16-mm2-bigmtu/net/ipv4/ip_output.c
@@ -891,32 +891,50 @@ alloc_new_skb:
                        datalen = length + fraggap;
                        if (datalen > mtu - fragheaderlen)
                                datalen = maxfraglen - fragheaderlen;
-                       fraglen = datalen + fragheaderlen;
 
-                       if ((flags & MSG_MORE) && 
-                           !(rt->u.dst.dev->features&NETIF_F_SG))
-                               alloclen = mtu;
-                       else
-                               alloclen = datalen + fragheaderlen;
+                       alloclen = fragheaderlen + hh_len + 15;
 
                        /* The last fragment gets additional space at tail.
                         * Note, with MSG_MORE we overallocate on fragments,
                         * because we have no idea what fragment will be
                         * the last.
                         */
-                       if (datalen == length)
+                       if (datalen == length + fraggap)
                                alloclen += rt->u.dst.trailer_len;
 
+                       if ((rt->u.dst.dev->features&NETIF_F_SG) &&
+                           (SKB_DATA_KMALLOC_BYTES(alloclen + datalen)
+                                                               > PAGE_SIZE)) {
+                               /*
+                                * datalen is shrinking so there won't be a
+                                * trailer, but alloclen already accounted for
+                                * it.  use that space for payload but only if
+                                * it wasn't the trailer_len itself that pushed
+                                * the alloc beyond a single page.
+                                */
+                               if (datalen - SKB_MAX_ORDER(alloclen, 0) >
+                                   rt->u.dst.trailer_len)
+                                       alloclen -= rt->u.dst.trailer_len;
+
+                               datalen = SKB_MAX_ORDER(alloclen, 0);
+                       }
+
+                       fraglen = datalen + fragheaderlen;
+
+                       if ((flags & MSG_MORE) &&
+                           !(rt->u.dst.dev->features&NETIF_F_SG))
+                               alloclen += mtu - fragheaderlen;
+                       else
+                               alloclen += datalen;
+
                        if (transhdrlen) {
-                               skb = sock_alloc_send_skb(sk, 
-                                               alloclen + hh_len + 15,
+                               skb = sock_alloc_send_skb(sk, alloclen,
                                                (flags & MSG_DONTWAIT), &err);
                        } else {
                                skb = NULL;
                                if (atomic_read(&sk->sk_wmem_alloc) <=
                                    2 * sk->sk_sndbuf)
-                                       skb = sock_wmalloc(sk, 
-                                                          alloclen + hh_len + 15, 1,
+                                       skb = sock_wmalloc(sk, alloclen, 1,
                                                           sk->sk_allocation);
                                if (unlikely(skb == NULL))
                                        err = -ENOBUFS;
Index: 2.6.16-mm2-bigmtu/include/linux/skbuff.h
===================================================================
--- 2.6.16-mm2-bigmtu.orig/include/linux/skbuff.h
+++ 2.6.16-mm2-bigmtu/include/linux/skbuff.h
@@ -39,6 +39,10 @@
 
 #define SKB_DATA_ALIGN(X)      (((X) + (SMP_CACHE_BYTES - 1)) & \
                                 ~(SMP_CACHE_BYTES - 1))
+/* ip_append_data() relies on this matching SKB_MAX_ORDER(), so the mask is
+ * applied after adding in sizeof(struct skb_shared_info), not before. */
+#define SKB_DATA_KMALLOC_BYTES(X) SKB_DATA_ALIGN((X) + \
+                                               sizeof(struct skb_shared_info))
 #define SKB_MAX_ORDER(X, ORDER)        (((PAGE_SIZE << (ORDER)) - (X) - \
                                  sizeof(struct skb_shared_info)) & \
                                  ~(SMP_CACHE_BYTES - 1))
Index: 2.6.16-mm2-bigmtu/net/core/skbuff.c
===================================================================
--- 2.6.16-mm2-bigmtu.orig/net/core/skbuff.c
+++ 2.6.16-mm2-bigmtu/net/core/skbuff.c
@@ -148,8 +148,7 @@ struct sk_buff *__alloc_skb(unsigned int
                goto out;
 
        /* Get the DATA. Size must match skb_add_mtu(). */
-       size = SKB_DATA_ALIGN(size);
-       data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+       data = ____kmalloc(SKB_DATA_KMALLOC_BYTES(size), gfp_mask);
        if (!data)
                goto nodata;
 
@@ -1486,7 +1485,7 @@ void skb_insert(struct sk_buff *old, str
 void skb_add_mtu(int mtu)
 {
        /* Must match allocation in alloc_skb */
-       mtu = SKB_DATA_ALIGN(mtu) + sizeof(struct skb_shared_info);
+       mtu = SKB_DATA_KMALLOC_BYTES(mtu);
 
        kmem_add_cache_size(mtu);
 }
Index: 2.6.16-mm2-bigmtu/drivers/atm/iphase.c
===================================================================
--- 2.6.16-mm2-bigmtu.orig/drivers/atm/iphase.c
+++ 2.6.16-mm2-bigmtu/drivers/atm/iphase.c
@@ -1306,7 +1306,7 @@ static void rx_dle_intr(struct atm_dev *
           {
              atomic_inc(&vcc->stats->rx_err);
              dev_kfree_skb_any(skb);
-             atm_return(vcc, atm_guess_pdu2truesize(len));
+             atm_return(vcc, SKB_DATA_KMALLOC_BYTES(len));
              goto INCR_DLE;
            }
           // get real pkt length  pwang_test
@@ -1320,7 +1320,7 @@ static void rx_dle_intr(struct atm_dev *
              IF_ERR(printk("rx_dle_intr: Bad  AAL5 trailer %d (skb len %d)", 
                                                             length, skb->len);)
              dev_kfree_skb_any(skb);
-             atm_return(vcc, atm_guess_pdu2truesize(len));
+             atm_return(vcc, SKB_DATA_KMALLOC_BYTES(len));
              goto INCR_DLE;
           }
           skb_trim(skb, length);
Index: 2.6.16-mm2-bigmtu/include/linux/atmdev.h
===================================================================
--- 2.6.16-mm2-bigmtu.orig/include/linux/atmdev.h
+++ 2.6.16-mm2-bigmtu/include/linux/atmdev.h
@@ -418,17 +418,6 @@ void atm_dev_deregister(struct atm_dev *
 void vcc_insert_socket(struct sock *sk);
 
 
-/*
- * This is approximately the algorithm used by alloc_skb.
- *
- */
-
-static inline int atm_guess_pdu2truesize(int size)
-{
-       return (SKB_DATA_ALIGN(size) + sizeof(struct skb_shared_info));
-}
-
-
 static inline void atm_force_charge(struct atm_vcc *vcc,int truesize)
 {
        atomic_add(truesize, &sk_atm(vcc)->sk_rmem_alloc);
Index: 2.6.16-mm2-bigmtu/net/atm/atm_misc.c
===================================================================
--- 2.6.16-mm2-bigmtu.orig/net/atm/atm_misc.c
+++ 2.6.16-mm2-bigmtu/net/atm/atm_misc.c
@@ -28,7 +28,7 @@ struct sk_buff *atm_alloc_charge(struct 
     gfp_t gfp_flags)
 {
        struct sock *sk = sk_atm(vcc);
-       int guess = atm_guess_pdu2truesize(pdu_size);
+       int guess = SKB_DATA_KMALLOC_BYTES(pdu_size);
 
        atm_force_charge(vcc,guess);
        if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf) {