Elliptics network: approaching the milestone
http://www.ioremap.net/node/197
By zbr - Posted on March 25th, 2009
What happened during the last four days? I did not blog for a while, slacking and thinking about the next steps I want to make with the elliptics network.

First, I implemented a set of testing tools for performance and integrity testing. While they are rather dumb, it is possible to check a store/load sequence of files of different sizes and manually verify that they are the same (by comparing their md5 checksums). I am thinking about a set of fully automatic tests.

Second, I implemented configurable server backends to store data. The backend used in the previous version, which stores transactions as files, was moved into the example server. I plan to add a BerkeleyDB backend and SMTP-store/IMAP-load backends.

During the testing I found that the file-based IO storage backend, when operated on top of an XFS partition, performs rather poorly with syncs turned on (syncing the file after each object is written). Turning syncs off resulted in 10 times higher numbers: with 10kb writes it jumped from 300 KB/s up to 2-3 MB/s. XFS is known to be slow in workloads where a huge number of rather small files is created in a directory (I use 256 directories indexed by the first byte of the transaction ID), but I still expected better results.

Playing with the system I found (actually proved again) a simple way to livelock two threads which send messages in a ping-pong manner between two different machines. Depending on the socket buffer size, the message length and the number of messages, the threads can block quite easily. After some work I effectively ended up with a solution where a dedicated thread reads the whole transaction from the network and queues it into a per-node list, which is then processed by a configurable number of worker threads. Since the receiving thread never sends any data back to the remote nodes, the described livelock is not possible. This technique noticeably improved performance, but it also introduced some bugs I am working on now.

In the meantime I also implemented server-side transformation functions, which may be used to force multiple data copies on behalf of the server. The server could also return its transformation functions to the client, so that the client could use them to fetch data from different nodes, either in parallel or as a failover when some nodes are not accessible. Only the first part of this idea (having server-side transformation functions) is implemented; they are not used by the server right now.

Those are the only tasks I have in mind for the next release. I also wrote a simple distributing 'expect' script, but the cluster I had access to moved away and will be accessible again only in about a week, so no fancy numbers with hundreds of nodes in the cloud for now; I will play that game later. Those are the news, expect more soon!
- [linuxkernelnewbies] Elliptics network: approaching the milesto... Peter Teoh

Figure out what happened to Fengguang Wu's adaptive read ahead patches for the kernel. Those patches will really help the type of server you are working on. They don't seem to be in the current kernel.
It will not help in that workload, since it was write-only.
For reading this could help, especially if XFS exported directory readahead via some syscall (that is not supported in ext* though).
Threads in the server are bad. Use libevent.
http://monkey.org/~provos/libevent/
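For context, the event-driven style that libevent wraps can be sketched with plain poll(), so the example stays dependency-free (a toy: a pipe stands in for a client socket, and the "handler" just reads the message):

```c
/* Sketch of an event-driven server step: wait for readability,
 * then run a non-blocking handler. libevent generalizes this loop
 * over many descriptors and callback types. */
#include <poll.h>
#include <unistd.h>

int echo_once(void)
{
    int p[2];
    char buf[64];
    if (pipe(p) < 0)
        return -1;
    if (write(p[1], "ping", 4) != 4)    /* pretend a client sent data */
        return -1;
    struct pollfd pfd = { .fd = p[0], .events = POLLIN };
    /* the "event loop": block until the descriptor is ready */
    if (poll(&pfd, 1, 1000) != 1 || !(pfd.revents & POLLIN))
        return -1;
    ssize_t n = read(p[0], buf, sizeof(buf));
    close(p[0]);
    close(p[1]);
    return (int)n;                      /* bytes the handler consumed */
}
```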
It is quite questionable whether any single technology has an absolute advantage over the others. With multi-core systems the equation moves toward threads.
But given modern scheduler games and the resulting huge regressions, this can be reconsidered.
It's not obvious that more cores help things like a file server which is usually IO bound. More cores definitely help with things like FastCGI processes. To do a file server on one core you need to use AIO. The process implementing the file server needs to never block while waiting for IO. Operations that are going to block are spun off into another process.
The fastest web servers (khttpd) run in interrupt context. The request comes in from the net card and causes an interrupt. In the interrupt handler a response is queued or a disk IO is scheduled. When the disk IO completes another interrupt is used to send the net traffic. Don't try writing one of these, they are way too hard to maintain. These are the ultimate event driven servers.
Don't forget about sendfile(). sendfile() is the most efficient way to get file data onto the wire from user space.
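A minimal sketch of sendfile() on Linux (here the destination is a regular tmpfile purely so the example is self-contained; a web server would point out_fd at the client socket, which is sendfile()'s original use):

```c
/* Sketch: sendfile() copies data between descriptors inside the kernel,
 * with no bounce buffer in userspace. */
#define _GNU_SOURCE
#include <sys/sendfile.h>
#include <stdio.h>
#include <unistd.h>

long copy_with_sendfile(void)
{
    FILE *src = tmpfile(), *dst = tmpfile();
    if (!src || !dst)
        return -1;
    fputs("hello, wire", src);      /* 11 bytes of "file data" */
    fflush(src);
    off_t off = 0;                  /* read offset in src, updated by the call */
    ssize_t n = sendfile(fileno(dst), fileno(src), &off, 11);
    fclose(src);
    fclose(dst);
    return n;                       /* bytes moved, or -1 */
}
```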
AIO would be a really good solution. But Linux AIO sucks for at least three reasons:
timerfd() and friends
I already wrote a whole kernel event-based subsystem, which among other things supported disk and network AIO, several years ago. It was called kevent, and this project would be a good application for it.
Regular file descriptors are also barely pollable compared to sockets or pipes, for example, so we have to have a pool of threads which do the IO and block when needed, although their number can be limited (right now there is at least one receiving thread and, optionally, a global number of threads in the IO pool).
As for sendfile() - I use it in the file storage backend for the data reading command and for file writing in the client. I could use the more generic splice() and improve things in some workloads (like data forwarding or writing from the socket into a file), but it is not portable, so I postponed this issue; I will try to extend the design to support this kind of feature.