Andi> Perhaps a good start of that discussion David asked for
    Andi> would be if you could give us an overview of the differences
    Andi> and how you avoid the TOE problems.

Well, here's a quick overview, leaving out some of the details.  The
difference between TOE and iWARP/RDMA is really the interface that
they present.

A TOE ("TCP Offload Engine") is a piece of hardware that offloads TCP
processing from the main system to handle regular sockets.  There is
either some way to hand off a socket from the host stack to the TOE,
or a socket is created on the TOE to start with, but in both cases,
the TOE is handling processing for normal TCP sockets.  This means
that the TOE has some hardware and/or firmware to do stateful TCP
processing.

An iWARP device, or RNIC (RDMA NIC), also usually has hardware and/or
firmware TCP processing, but this isn't exposed through the BSD socket
interface.  Instead, an RNIC presents an interface more like an
InfiniBand HCA: work requests (sends, receives, RDMA operations) are
passed to the RNIC via work queues, and completion notification is
returned asynchronously via completion queues.  An iWARP connection
can handle both send/receive ("two-sided") and get/put (RDMA or
"one-sided") operations.
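
To make the queue model concrete, here is a toy sketch in C.  It is
not the verbs API or any real driver interface: every name in it
(toy_post, toy_poll, toy_device_process_one and the structs) is made
up for illustration, and the "device" is just a function that moves
one entry from the work queue to the completion queue.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; this only models the queue pattern. */
enum toy_opcode { TOY_SEND, TOY_RECV, TOY_RDMA_WRITE, TOY_RDMA_READ };

struct toy_work_request {
    unsigned long wr_id;        /* caller's cookie, echoed in the completion */
    enum toy_opcode opcode;
    void *buf;
    size_t len;
};

struct toy_completion {
    unsigned long wr_id;
    enum toy_opcode opcode;
    int status;                 /* 0 means success in this sketch */
};

#define TOY_DEPTH 8             /* power of two, so index wraparound is safe */

struct toy_wq { struct toy_work_request ring[TOY_DEPTH]; unsigned head, tail; };
struct toy_cq { struct toy_completion ring[TOY_DEPTH]; unsigned head, tail; };

/* Queue a work request without blocking; returns 0, or -1 if the queue is full. */
static int toy_post(struct toy_wq *wq, const struct toy_work_request *wr)
{
    assert(wq != NULL && wr != NULL);
    if (wq->tail - wq->head == TOY_DEPTH)
        return -1;
    wq->ring[wq->tail % TOY_DEPTH] = *wr;
    wq->tail++;
    return 0;
}

/* Stand-in for the hardware: consume one request, report one completion.
 * Returns 1 if a request was processed, 0 if there was nothing to do. */
static int toy_device_process_one(struct toy_wq *wq, struct toy_cq *cq)
{
    assert(wq != NULL && cq != NULL);
    if (wq->head == wq->tail || cq->tail - cq->head == TOY_DEPTH)
        return 0;
    const struct toy_work_request *wr = &wq->ring[wq->head % TOY_DEPTH];
    struct toy_completion *c = &cq->ring[cq->tail % TOY_DEPTH];
    c->wr_id = wr->wr_id;
    c->opcode = wr->opcode;
    c->status = 0;
    wq->head++;
    cq->tail++;
    return 1;
}

/* Fetch one completion if available; returns 1 if *out was filled, else 0. */
static int toy_poll(struct toy_cq *cq, struct toy_completion *out)
{
    assert(cq != NULL && out != NULL);
    if (cq->head == cq->tail)
        return 0;
    *out = cq->ring[cq->head % TOY_DEPTH];
    cq->head++;
    return 1;
}
```

The point of the sketch is only that posting and completing are
decoupled: toy_post() returns immediately, and the caller later
matches completions to requests by wr_id when it calls toy_poll().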

For full details of the protocol used for this, you can look at the
drafts from the IETF rddp working group, but basically an RDMA protocol
is layered on top of a connected stream protocol -- usually TCP, but
SCTP could be used as well.

A lot of the performance of iWARP comes from the RDMA/direct placement
capabilities -- for example an NFS/RDMA server can process requests
out of order and put data directly into the buffer that's waiting for
it, without using any CPU on the destination -- but even send/receive
operations can be useful.
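
As a rough illustration of direct placement, the sketch below models
a buffer that the receiver has advertised, plus a function that drops
an incoming payload at a given offset in it.  This is a toy under
stated assumptions, not the wire protocol or a real RNIC: the names
toy_region and toy_place and the integer "key" are invented here, and
memcpy stands in for what the hardware would do by DMA.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a buffer the receiver has advertised. */
struct toy_region {
    unsigned key;       /* handle the peer quotes when it writes */
    char *base;         /* start of the destination buffer */
    size_t len;         /* size of the destination buffer */
};

/* Place a payload at (key, offset) in the advertised buffer.
 * Returns 0 on success, -1 if the key or the bounds do not match. */
static int toy_place(const struct toy_region *r, unsigned key,
                     size_t offset, const void *payload, size_t n)
{
    assert(r != NULL && payload != NULL);
    if (key != r->key || offset > r->len || n > r->len - offset)
        return -1;
    memcpy(r->base + offset, payload, n);   /* stands in for device DMA */
    return 0;
}
```

Because each payload carries its own offset, the chunks of a reply
can land in any order: placing bytes 4..7 before bytes 0..3 still
leaves the buffer correctly filled, with no reassembly copy needed.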

As a side note, an RNIC will also typically support the same sort of
kernel bypass as an IB HCA, where work queues can be safely mapped
into a userspace process's memory so that work requests can be posted
without a system call.  In fact, people usually use RDMA as
shorthand for the combination of the three features I described:
asynchronous work queues and completion queues, connections that
support both send/receive and RDMA, and kernel bypass.

In any case, RNIC support can be added to the existing IB stack with
fairly minor modifications -- you can search the netdev archives for
the patchsets posted by Steve Wise, but nearly all of the new code is
in the low-level hardware driver for the specific iWARP devices.

The real issues for netdev are things like Steve Wise's patch to add
route change notifiers, which could be used to tell RNICs when to
update the next hop for a connection they're handling.  More
generally, it would be interesting to see if it's possible to tie an
RNIC into the kernel's packet filtering, so that disallowed
connections don't get set up.  This seems very similar in spirit to
the problems around packet filtering that were raised for VJ netchannels.
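
To show the shape of the route change idea, here is a toy notifier
chain in C.  It is not Steve Wise's patch and not the kernel's
notifier API: all of the names (toy_notifier_chain, toy_register,
toy_route_changed, toy_rnic_on_route_change) and the use of plain
integers for destinations and next hops are assumptions made for the
sake of a small, self-contained example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NOTIFIERS 4

/* Hypothetical event: traffic to 'dst' should now go via 'new_next_hop'. */
struct toy_route_event { unsigned dst; unsigned new_next_hop; };

typedef void (*toy_route_cb)(const struct toy_route_event *ev, void *ctx);

struct toy_notifier_chain {
    toy_route_cb cb[TOY_MAX_NOTIFIERS];
    void *ctx[TOY_MAX_NOTIFIERS];
    size_t count;
};

/* Register a callback; returns 0, or -1 if the chain is full. */
static int toy_register(struct toy_notifier_chain *ch, toy_route_cb cb, void *ctx)
{
    assert(ch != NULL && cb != NULL);
    if (ch->count == TOY_MAX_NOTIFIERS)
        return -1;
    ch->cb[ch->count] = cb;
    ch->ctx[ch->count] = ctx;
    ch->count++;
    return 0;
}

/* Called by the (toy) routing code whenever a route changes. */
static void toy_route_changed(struct toy_notifier_chain *ch,
                              unsigned dst, unsigned new_next_hop)
{
    assert(ch != NULL);
    struct toy_route_event ev = { .dst = dst, .new_next_hop = new_next_hop };
    for (size_t i = 0; i < ch->count; i++)
        ch->cb[i](&ev, ch->ctx[i]);
}

/* Per-connection state an offloading device might keep (hypothetical). */
struct toy_conn { unsigned dst; unsigned next_hop; };
struct toy_rnic { struct toy_conn *conns; size_t nconns; };

/* Driver-side callback: update the next hop of every affected connection. */
static void toy_rnic_on_route_change(const struct toy_route_event *ev, void *ctx)
{
    struct toy_rnic *rnic = ctx;
    for (size_t i = 0; i < rnic->nconns; i++)
        if (rnic->conns[i].dst == ev->dst)
            rnic->conns[i].next_hop = ev->new_next_hop;
}
```

The design point is that the host stack stays the single source of
truth for routing, and the device only mirrors what it is told for
the connections it is handling.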

 - Roland