I've been working on an implementation of seekable sockets for TCP. A patch
of my work so far is attached. I believe it works correctly except for a
bug in tcp_collapse_seekable, whose call in tcp_prune_queue I have disabled
so that the bug cannot trigger for now.

This patch creates a new pseudo-protocol, SOCK_SEEK_STREAM. It isn't really
a new protocol, since it uses the same TCP stack as normal SOCK_STREAM
connections. All it does is declare that the socket is seekable. The
original TCP implementation still exists and, as far as I can tell, normal
sockets still follow the same code path as before, even though functions
such as tcp_recvmsg have been heavily modified. I have never had a problem
using normal SOCK_STREAM connections in my seeking kernels, so it should be
safe in that regard.

Once a socket is created with type SOCK_SEEK_STREAM, you can use recv (and,
I think, recvmsg, though I haven't tried it) normally. A new function,
seek_recv, implemented as the syscall sys_seek_recv, has the following
prototype:

ssize_t seek_recv(int s, void *buf, size_t len, int flags, size_t offset);

s, buf, len, and flags are unchanged from normal sys_recv.
offset is the number of bytes you wish to seek into the receive buffer.

The offset is always relative to the first byte you would receive if you
were to call a normal sys_recv. Example:

1. sys_recv(sockfd, buf, 10, 0);
2. sys_seek_recv(sockfd, buf, 10, 0, 10);
3. sys_seek_recv(sockfd, buf, 5, 0, 0);
4. sys_seek_recv(sockfd, buf, 4, 0, 3);
5. sys_recv(sockfd, buf, 6, 0);

The bytes in the stream are represented below by one digit each, with ^
pointing to the first byte in the stream and _ marking bytes that have been
removed from the stream and cannot be accessed again. The stream is
originally:

0123456789012345678901234567890123456789
^

After the first call, the stream looks like:

__________012345678901234567890123456789
          ^

After the second call, the stream looks like:

__________0123456789__________0123456789
          ^

After the third call:

_______________56789__________0123456789
               ^

After the fourth call:

_______________567______________23456789
               ^

After the fifth call:

___________________________________56789
                                   ^
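
The five calls above can be replayed in userspace with a toy model. This is
only a sketch of the semantics shown in the diagrams, not the kernel code:
toy_stream, toy_seek_recv, toy_recv and toy_render are hypothetical names,
and the rule that already-removed bytes do not count toward offset or len
is inferred from the picture after the fourth call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STREAM_LEN 40   /* same length as the example stream above */

/* Toy model of the receive queue (hypothetical, userspace only). */
struct toy_stream {
    char data[STREAM_LEN];          /* payload bytes '0'..'9' repeating */
    unsigned char gone[STREAM_LEN]; /* 1 once a byte has been removed */
};

static void toy_init(struct toy_stream *st)
{
    for (size_t i = 0; i < STREAM_LEN; i++)
        st->data[i] = (char)('0' + i % 10);
    memset(st->gone, 0, sizeof(st->gone));
}

/*
 * Model of seek_recv: skip `offset` bytes that are still in the stream,
 * then remove up to `len` bytes into buf. Bytes that were removed by
 * earlier calls are stepped over and count toward neither value
 * (assumption taken from the diagrams). Returns the number of bytes copied.
 */
static size_t toy_seek_recv(struct toy_stream *st, char *buf,
                            size_t len, size_t offset)
{
    size_t copied = 0;

    assert(st != NULL && buf != NULL);
    for (size_t i = 0; i < STREAM_LEN && copied < len; i++) {
        if (st->gone[i])
            continue;           /* hole left by an earlier call */
        if (offset > 0) {
            offset--;           /* still seeking */
            continue;
        }
        buf[copied++] = st->data[i];
        st->gone[i] = 1;
    }
    return copied;
}

/* A normal receive is a seeking receive with offset 0. */
static size_t toy_recv(struct toy_stream *st, char *buf, size_t len)
{
    return toy_seek_recv(st, buf, len, 0);
}

/* Render the stream in the notation used above ('_' for removed bytes). */
static void toy_render(const struct toy_stream *st, char out[STREAM_LEN + 1])
{
    for (size_t i = 0; i < STREAM_LEN; i++)
        out[i] = st->gone[i] ? '_' : st->data[i];
    out[STREAM_LEN] = '\0';
}
```

Calling toy_recv(&st, buf, 10), toy_seek_recv(&st, buf, 10, 10),
toy_seek_recv(&st, buf, 5, 0), toy_seek_recv(&st, buf, 4, 3) and
toy_recv(&st, buf, 6) in that order, with toy_render after each call,
should reproduce the five pictures.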

This can be useful for programs such as MPI. In MPI, a server receives the
results of computations from clients, but it cannot control which client
sends data when. If the server needs the data from client A to know how to
process the data from client B, it will want the data from client A first.
Currently, if the data from client B arrives first, the MPI library copies
it into a library buffer in userspace, then copies the data from client A
into the server program, and finally copies the data from client B out of
its own buffer into the server program. With a seekable socket, if the data
from client B arrives first, we can seek past it, copy the data from client
A directly into the server program, and then copy the data from client B
directly into the server program. This saves a userspace-to-userspace copy
(and possibly an allocation in the MPI library). Other uses can also be
found.

A few things need to be done before using the seekable sockets on large
messages:

1. Change the maximum receive buffer size: echo "SOME_LARGE_NUMBER" >
/proc/sys/net/core/rmem_max
2. Set the receive buffer to the size you want using setsockopt()
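
For step 2, a minimal sketch using the standard SO_RCVBUF socket option is
shown here. set_rcvbuf is a hypothetical helper name, and the socket in the
usage note is an ordinary SOCK_STREAM one, since SOCK_SEEK_STREAM only
exists with the patch applied. On Linux the kernel doubles the requested
value for bookkeeping and caps the request at rmem_max, so the size
reported back can differ from the size requested.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: request a receive buffer of `bytes` on fd and
 * return the size the kernel reports afterwards, or -1 on error.
 */
static int set_rcvbuf(int fd, int bytes)
{
    int actual = 0;
    socklen_t optlen = sizeof(actual);

    assert(bytes > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    /* Read the value back; it may have been doubled or capped. */
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &optlen) < 0)
        return -1;
    return actual;
}
```

A caller would do something like set_rcvbuf(socket(AF_INET, SOCK_STREAM,
0), 4 * 1024 * 1024) before connecting, and compare the return value with
what was asked for to see whether rmem_max needs raising.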

Also, since seekable sockets disable the prequeue mechanism, if you want to
benchmark against normal receives you should turn off the prequeue
mechanism there as well: echo "1" > /proc/sys/net/ipv4/tcp_low_latency

I am new to kernel hacking and to the networking internals of the kernel,
so some of my implementation choices may not be the best or "correct" way
of doing things. I am posting this in the hope of receiving some critique
and possibly some help. The trickiest part of the implementation is dealing
with wrapped sequence numbers. Normally, the range of sequence numbers in
the receive queue is insignificant compared to the maximum value of a u32,
so some of the mechanisms in the normal receive path can fail once seekable
sockets are introduced. Say the first byte in the receive queue is always
seeked past, and we keep seeking past it for so long that the sequence
numbers wrap. Bad things can happen, and I have tried to fix them, but some
still elude me. The failure shows up especially in tcp_collapse_seekable,
at the BUG() check for offset < 0. I believe this wrapping is the cause of
the problem, but I am having trouble tracking the real bug down.
Unfortunately, I cannot seem to run either kgdb or kdb on my test machines
(they reboot when entering debugging mode), and I can't seem to reproduce
the problem on other systems.
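
To make the wrapping problem concrete, here is a small userspace sketch. It
is not code from the patch. seq_before mirrors the signed-difference
comparison that the kernel's before() helper uses, while seq_distance and
seq_demo are hypothetical names added for illustration. The comparison is
only meaningful while the two sequence numbers are less than 2^31 apart,
which a byte pinned at the head of a seekable queue can eventually violate.
The cast from uint32_t to int32_t assumes the usual two's complement
conversion that gcc and clang provide.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe "seq1 comes before seq2", valid for distances below 2^31. */
static int seq_before(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) < 0;
}

/* Hypothetical helper: forward distance from start to seq, modulo 2^32. */
static uint32_t seq_distance(uint32_t start, uint32_t seq)
{
    return seq - start;
}

/* Hypothetical demo of where the comparison holds and where it breaks. */
static void seq_demo(void)
{
    /* Normal case: a small window that straddles the u32 wrap point. */
    assert(seq_before(0xFFFFFFF0u, 0x00000010u));
    assert(seq_distance(0xFFFFFFF0u, 0x00000010u) == 0x20u);

    /* Head of queue pinned at 0 while newer data is still < 2^31 ahead. */
    assert(seq_before(0u, 0x7FFFFFFFu));

    /*
     * Once newer data is more than 2^31 ahead, the pinned head no longer
     * compares as "before" it, so ordering checks silently invert.
     */
    assert(!seq_before(0u, 0x80000001u));
}
```

A signed offset computed the same way would flip sign at the same point,
which may be one way an offset < 0 check could fire; that connection is a
guess on my part, not something confirmed against the patch.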

If anyone could help me diagnose and fix these problems I would appreciate
it.

Please let me know of your thoughts and suggestions.

Thank you

Attachment: linux-2.6.13-rc2-seek.patch.bz2
Description: Binary data
