Hi,

This version does not do blatantly stupid things in hardware irq context, is more efficient, and... wow, the patch is smaller! (That never happens.)
I don't mark skbs as being allocated from reserve any more. That works, but it is slightly bogus, because it doesn't matter which skb came from reserve, it only matters that we put one back. So I just count them and don't mark them.

The tricky issue that had to be dealt with is the possibility that a massive number of skbs could in theory be queued by the hardware interrupt before the softnet softirq gets around to delivering them to the protocol. But we can only allocate a limited number of skbs from reserve memory. If we run out of reserve memory we have no choice but to fail skb allocations, and that can cause packets to be dropped. Since we don't know which packets are block IO packets and which are not at that point, we could be so unlucky as to always drop block IO packets and always let other junk through. The other junk won't get very far though, because those packets will be dropped as soon as the protocol headers are decoded, which will reveal that they do not belong to a memalloc socket. This short circuit ought to help take away the cpu load that caused the softirq constipation in the first place. What is actually going to happen is that a few block IO packets might be randomly dropped under such conditions, degrading the transport efficiency. Block IO progress will continue, unless we manage to accidentally drop every block IO packet and our softirqs continue to stay comatose, probably indicating a scheduler bug.

OK, we want to allocate skbs from reserve, but we cannot go infinitely into reserve. So we count how many packets a driver has allocated from reserve, in the net_device struct. If this goes over some limit, the skb allocation fails and the device driver may drop a packet because of that. Note that the e1000 driver will just be trying to refill its rx-ring at this point, and it will try to refill it again as soon as the next packet arrives, so it is still some ways away from actually dropping a packet. Other drivers may immediately drop a packet at this point, c'est la vie. Remember, this can only happen if the softirqs are backed up a silly amount.

The thing is, we have got our block IO traffic moving, by virtue of dipping into the reserve, and most likely moving at near-optimal speed. Normal memory-consuming tasks are not moving because they are blocked on vm IO. The things that can mess us up are cpu hogs - a scheduler problem - and tons of unhelpful traffic sharing the network wire, which we are killing off early as mentioned above.

What happens when a packet arrives at the protocol handler is a little subtle. At this point, if the interface is into reserve, we can always decrement the reserve count, regardless of what type of packet it is. If it is a block IO packet, it is still accounted for within the block driver's throttling, and we are sure that its resources will be returned to the common pool in an organized way. If it is some other random kind of packet, we drop it right away, also returning the resources to the common pool. Either way, it is no longer the responsibility of the interface to account for it.

I'll just reiterate what I'm trying to accomplish here:

  1) Guarantee network block IO forward progress
  2) Block IO throughput should not degrade much under low memory

This iteration of the patch addresses those goals nicely, I think. I have not yet shown how to drive this from the block IO layer, and I haven't shown how to be sure that all protocols on an interface (not just TCPv4, as here) can handle the reserve management semantics.
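To make the first of those concrete, driving this from the block IO layer should boil down to the network block driver flagging its transport socket, along the lines of the sketch below. This is not part of the patch; the helper name and the struct socket parameter are placeholders, and only SOCK_MEMALLOC, sock_set_flag() and is_memalloc_sock() come from the patch itself.

#include <linux/net.h>
#include <net/sock.h>

/*
 * Hypothetical hook in a network block driver: mark the transport socket
 * as a memalloc socket, so the protocol handler (tcp_v4_rcv below) will
 * accept its packets even when they arrived in skbs drawn from reserve.
 */
static void blkio_set_memalloc(struct socket *sock)
{
	sock_set_flag(sock->sk, SOCK_MEMALLOC);
}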
I have ignored all transports besides IP, though not much changes for other transports. I have some accounting code that is very probably racy and needs to be rewritten with atomic_t. I have ignored the many hooks that are possible in the protocol path. I have assumed that all receive skbs are the same size, and haven't accounted for the possibility that that size (MTU) might change. All these things need looking at, but the main point at the moment is to establish a solid sense of correctness and to get some real results on a vanilla delivery path. That in itself will be useful for cluster work, where configuration issues are kept under careful control.

As far as drivers are concerned, the new interface is dev_memalloc_skb, which is straightforward. It needs to know about the netdev for accounting purposes, so it takes it as a parameter and thoughtfully plugs it into the skb for you. (A rough usage sketch follows the patch.)

I am still using the global memory reserve, not mempool. But notice, now I am explicitly accounting and throttling how deep a driver dips into the global reserve. So GFP_MEMALLOC wins a point: the driver isn't just using the global reserve blindly, as has been traditional. The jury is still out though. Mainly, I need to present a sane way for drivers to declare their maximum reserve requirements to the vm system, so the vm can adjust the size of the global pool appropriately.

Regards,

Daniel

diff -up --recursive 2.6.12.3.clean/include/linux/gfp.h 2.6.12.3/include/linux/gfp.h
--- 2.6.12.3.clean/include/linux/gfp.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/gfp.h	2005-08-05 21:53:09.000000000 -0400
@@ -39,6 +39,7 @@ struct vm_area_struct;
 #define __GFP_COMP	0x4000u	/* Add compound page metadata */
 #define __GFP_ZERO	0x8000u	/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
+#define __GFP_MEMALLOC	0x20000u /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
diff -up --recursive 2.6.12.3.clean/include/linux/netdevice.h 2.6.12.3/include/linux/netdevice.h
--- 2.6.12.3.clean/include/linux/netdevice.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/linux/netdevice.h	2005-08-06 01:06:18.000000000 -0400
@@ -371,6 +371,8 @@ struct net_device
 	struct Qdisc		*qdisc_ingress;
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
+	int			rx_reserve;
+	int			rx_reserve_used;
 
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
@@ -929,6 +931,28 @@ extern void net_disable_timestamp(void)
 extern char *net_sysctl_strdup(const char *s);
 #endif
 
+static inline struct sk_buff *__dev_memalloc_skb(struct net_device *dev,
+	unsigned length, int gfp_mask)
+{
+	struct sk_buff *skb = __dev_alloc_skb(length, gfp_mask);
+	if (skb)
+		goto done;
+	if (dev->rx_reserve_used >= dev->rx_reserve)
+		return NULL;
+	if (!(skb = __dev_alloc_skb(length, gfp_mask|__GFP_MEMALLOC)))
+		return NULL;
+	dev->rx_reserve_used++;
+done:
+	skb->dev = dev;
+	return skb;
+}
+
+static inline struct sk_buff *dev_memalloc_skb(struct net_device *dev,
+	unsigned length)
+{
+	return __dev_memalloc_skb(dev, length, GFP_ATOMIC);
+}
+
 #endif	/* __KERNEL__ */
 
 #endif	/* _LINUX_DEV_H */
diff -up --recursive 2.6.12.3.clean/include/net/sock.h 2.6.12.3/include/net/sock.h
--- 2.6.12.3.clean/include/net/sock.h	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/include/net/sock.h	2005-08-05 21:53:09.000000000 -0400
@@ -382,6 +382,7 @@ enum sock_flags {
 	SOCK_NO_LARGESEND, /* whether to sent large segments or not */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_MEMALLOC, /* protocol can use memalloc reserve */
 };
 
 static inline void sock_set_flag(struct sock *sk, enum sock_flags flag)
@@ -399,6 +400,11 @@ static inline int sock_flag(struct sock
 	return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int is_memalloc_sock(struct sock *sk)
+{
+	return sock_flag(sk, SOCK_MEMALLOC);
+}
+
 static inline void sk_acceptq_removed(struct sock *sk)
 {
 	sk->sk_ack_backlog--;
diff -up --recursive 2.6.12.3.clean/mm/page_alloc.c 2.6.12.3/mm/page_alloc.c
--- 2.6.12.3.clean/mm/page_alloc.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/mm/page_alloc.c	2005-08-05 21:53:09.000000000 -0400
@@ -802,8 +802,8 @@ __alloc_pages(unsigned int __nocast gfp_
 
 	/* This allocation should allow future memory freeing. */
 
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
+	if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) {
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 			/* go through the zonelist yet again, ignoring mins */
 			for (i = 0; (z = zones[i]) != NULL; i++) {
diff -up --recursive 2.6.12.3.clean/net/ethernet/eth.c 2.6.12.3/net/ethernet/eth.c
--- 2.6.12.3.clean/net/ethernet/eth.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/ethernet/eth.c	2005-08-06 02:32:02.000000000 -0400
@@ -281,6 +281,7 @@ void ether_setup(struct net_device *dev)
 	dev->mtu		= 1500; /* eth_mtu */
 	dev->addr_len		= ETH_ALEN;
 	dev->tx_queue_len	= 1000;	/* Ethernet wants good queues */
+	dev->rx_reserve		= 50;
 	dev->flags		= IFF_BROADCAST|IFF_MULTICAST;
 
 	memset(dev->broadcast,0xFF, ETH_ALEN);
diff -up --recursive 2.6.12.3.clean/net/ipv4/tcp_ipv4.c 2.6.12.3/net/ipv4/tcp_ipv4.c
--- 2.6.12.3.clean/net/ipv4/tcp_ipv4.c	2005-07-15 17:18:57.000000000 -0400
+++ 2.6.12.3/net/ipv4/tcp_ipv4.c	2005-08-06 00:45:07.000000000 -0400
@@ -1766,6 +1766,12 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	if (!sk)
 		goto no_tcp_socket;
 
+	if (skb->dev->rx_reserve_used) {
+		skb->dev->rx_reserve_used--; // racy
+		if (!is_memalloc_sock(sk))
+			goto discard_and_relse;
+	}
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
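Here is the rough usage sketch mentioned above: roughly how a driver's rx-ring refill path would call dev_memalloc_skb. It is not part of the patch; the ring helpers (my_ring_has_room, my_ring_push) and RX_BUF_LEN are made-up stand-ins for whatever the real driver (e1000 or otherwise) uses, and only dev_memalloc_skb and the per-netdev accounting come from the patch itself.

/*
 * Hypothetical rx-ring refill loop.  If dev_memalloc_skb() fails, the
 * driver simply stops refilling and tries again on the next receive
 * interrupt, as described above for e1000.
 */
static void my_rx_refill(struct net_device *dev)
{
	while (my_ring_has_room(dev)) {
		struct sk_buff *skb = dev_memalloc_skb(dev, RX_BUF_LEN);

		if (!skb)
			break;			/* reserve throttled or truly out of memory */
		skb_reserve(skb, 2);		/* usual IP header alignment */
		my_ring_push(dev, skb);		/* hand the buffer to the hardware */
	}
}

Note that dev_memalloc_skb already plugs dev into the skb, so the driver does not have to.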