I'd had this idea in the back of my head for a while and finally I tried it out yesterday to see how it would look.
Basically, we store the write queue packets in an RB tree keyed by start sequence number. The objective is to use this information to parse the SACK blocks more efficiently and get rid of the "hints" we currently use to optimize that code. The big win is that we can now find the start of any sequence of packets in the retransmit queue in O(log n) time.

The downsides are numerous:

1) Increased state stored in sk_buff: we need to store an rb_node there, which is 3 pointers :-((( Perhaps we could alias the rb_node with the existing next/prev pointers in sk_buff, but like the VMA code in the Linux MM, I decided to keep the linked list around since that's the fastest for all the other operations. Otherwise rb_next() and rb_prev() would need to be used, and those aren't the cheapest.

2) We eat an O(log n) insert and delete for every packet now, even when no SACK processing occurs. I think this part can be sped up. New packets are inserted at the tail, so we can take the existing tail packet and do something similar to rb_next() to find the insertion point. But we'd still have the cost of the potential tree rotations.

Even if none of the RB stuff is useful, the first patch, which abstracts all of the write queue accesses, should probably go in because it allows experimentation in this area to be done quite effortlessly.

One thing I'm not sure about wrt. the RB tree stuff is that although we key on the start sequence of the SKB, we do change this when trimming the head of TSO packets. I think this is "OK" because such changes would never change the position of the SKB within the RB tree, but I'd like someone else to think that is true too :-) Worst case we could key on the end sequence, which never ever changes.

Actually, the whole business of TCP SKB chopping and mid-queue insert/delete, and its effect on the RB tree entries, needs to be audited if we are to take this idea seriously.

I wonder if there is an algorithm better suited to this application.
It's an interval search, which needs fast insert to the tail and fast delete from the head.

Another aspect of this patch is the per-SKB packet counts (I named them "fack_count" but I'd rename that to "packet_count" if I ever committed it for real). The idea is that this can eliminate most of the packet count hints in the tcp_sock. The algorithm is simple:

1) On skb insert: if the write queue is empty, assign a packet_count of zero; else, assign a packet_count of tail_skb->packet_count + tcp_skb_pcount(tail_skb).

2) To get the normalized packet_count of an arbitrary SKB: (skb->packet_count - head_skb->packet_count).

You can see in patch 4 how fastpath_cnt_hint is replaced by that logic.

I added the packet count to TCP_SKB_CB() and that takes it up to the limit of 48 bytes of skb->cb[] on 64-bit architectures. I wanted to steal TCP_SKB_CB()->ack_seq for this, but that's used for some SACK logic already.

There are some pains necessary to keep the counts correct, and in some cases I just recalculate the whole queue's packet counts after the insert point. I'm sure this can be improved.

Back to the RB tree: there is another way to go after at least the SACK processing overhead, and that is to maintain a not-sacked table which is the inverse of the SACK blocks. There are a few references out there which discuss that kind of approach.

Some of the other hints are hard to eliminate. In particular, the cached retransmit state is non-trivial to replace with logic that does not require state. The RB tree can't help, and we can't even use the per-SKB packet_count for the count hints, because one of them wants to remember how many packets in the queue were marked lost, for example. If we could get rid of all the hints, that would be an easy 44 bytes killed in tcp_sock on 64-bit.

Anyways, here come the patches, enjoy.