http://www.ioremap.net/node/197

Elliptics network: approaching the milestone

What has happened during the last four days? I did not blog for a while, slacking and thinking about the next step I want to take with the elliptics network.
Let's look at things in order of appearance.

First, I implemented a set of tools for performance and integrity testing. While they are rather dumb, they make it possible to store and load files of different sizes and manually verify that they are the same (by comparing their md5 checksums). I am thinking about a set of fully automatic tests.

Second, I implemented configurable server backends to store data. The backend used in the previous version, which stores transactions as files, was moved into the example server. I plan to add a BerkeleyDB backend and an SMTP-store/IMAP-load backend.

During the testing I found that the file-based IO storage backend, when operated on top of an XFS partition, performs rather poorly with syncs turned on (syncing the file after each object is written). Turning off syncs resulted in numbers 10 times higher: with 10 KB writes the rate jumped from 300 KB/s up to 2-3 MB/s. XFS is known to be slow in workloads where a huge number of rather small files is created in a directory (I use 256 directories indexed by the first byte of the transaction ID). For comparison, /dev/shm used as a backend easily saturates the whole 1GigE link (more than 110 MB/s of pure data, not counting headers).
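
To make the layout concrete, here is a hedged sketch of what such a file backend can look like: the first byte of the transaction ID selects one of 256 directories, and the optional sync after the write is what the measurements above toggled. The function name, the do_sync flag and the naming scheme are illustrative, not the actual elliptics code.

/* Hypothetical sketch of the file backend layout described above. */
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <fcntl.h>
#include <unistd.h>

static int store_object(const unsigned char *id, size_t id_len,
			const void *data, size_t size, int do_sync)
{
	char path[PATH_MAX];
	ssize_t n;
	size_t i;
	int fd;

	/* First byte of the transaction ID selects one of 256 directories. */
	snprintf(path, sizeof(path), "%02x/", id[0]);

	/* Object name: hex representation of the full ID (illustrative). */
	for (i = 0; i < id_len && strlen(path) + 3 < sizeof(path); ++i)
		snprintf(path + strlen(path), 3, "%02x", id[i]);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	n = write(fd, data, size);

	/* Syncing every object is what made XFS crawl; turning it off
	 * gave the ~10x speedup mentioned above. */
	if (do_sync)
		fsync(fd);

	close(fd);
	return (n == (ssize_t)size) ? 0 : -1;
}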

And still I expected better results. Playing with the system I found (actually proved again) a simple way to livelock two threads which send messages to each other in a ping-pong manner on two different machines. Depending on the socket buffer size, the message length and the number of messages, the threads can block quite easily.
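
A simplified illustration of that livelock, not taken from the elliptics code: both peers run the same blocking send/receive loop, and once a message no longer fits into the remaining socket buffer space, both sides sit in send() while nobody drains the other end.

/* Both peers run this loop over a connected TCP socket.  If 'len'
 * exceeds the available send/receive buffer space, both sides can end
 * up blocked in send() while neither ever reaches recv(). */
#include <sys/types.h>
#include <sys/socket.h>

void ping_pong(int sock, char *buf, size_t len)
{
	for (;;) {
		/* May block forever: the peer is also stuck in send()
		 * and never drains its receive buffer. */
		if (send(sock, buf, len, 0) != (ssize_t)len)
			break;
		if (recv(sock, buf, len, MSG_WAITALL) != (ssize_t)len)
			break;
	}
}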

After some work I effectively ended up with a solution where a dedicated thread reads the whole transaction from the network and queues it into a per-node list, which is processed by a configurable number of worker threads. Since the receiving thread never sends any data back to the remote nodes, the livelock described above is not possible.
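
A minimal pthread-based sketch of that split, with hypothetical names and types: the receiving thread only appends decoded transactions to a shared list, and the worker pool takes them from there.

/* Sketch only: receive thread queues, workers consume.  Not the real code. */
#include <pthread.h>
#include <stdlib.h>

struct trans {
	struct trans *next;
	/* ... decoded transaction ... */
};

static struct trans *queue_head, *queue_tail;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_wait = PTHREAD_COND_INITIALIZER;

/* Called by the receiving thread: it never writes to the socket,
 * so the ping-pong livelock above cannot occur. */
static void queue_transaction(struct trans *t)
{
	t->next = NULL;
	pthread_mutex_lock(&queue_lock);
	if (queue_tail)
		queue_tail->next = t;
	else
		queue_head = t;
	queue_tail = t;
	pthread_cond_signal(&queue_wait);
	pthread_mutex_unlock(&queue_lock);
}

/* Body of each worker thread from the configurable pool. */
static void *worker(void *arg)
{
	(void)arg;
	for (;;) {
		struct trans *t;

		pthread_mutex_lock(&queue_lock);
		while (!queue_head)
			pthread_cond_wait(&queue_wait, &queue_lock);
		t = queue_head;
		queue_head = t->next;
		if (!queue_head)
			queue_tail = NULL;
		pthread_mutex_unlock(&queue_lock);

		/* process_transaction(t); -- may send replies, block on IO, etc. */
		free(t);
	}
	return NULL;
}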

This technique noticeably bumped the performance, but it also introduced some bugs I am working on now.
Another issue is related to the modulo arithmetic over the ring of IDs: right now, during the joining handshake, we have problems fetching IDs that would be less than zero (which on the ring wrap around to values near 2^128 - 1).
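
For illustration, arithmetic on such a ring can be done with plain byte-wise subtraction where the final borrow simply wraps; the representation below (big-endian 16-byte IDs) is an assumption for the sketch, not the actual elliptics layout.

/* Subtraction on the 128-bit ID ring: the result naturally wraps, so
 * "0 - 1" comes out as 2^128 - 1 instead of a negative value. */
#include <stdint.h>

#define ID_SIZE 16

static void id_sub(const uint8_t *a, const uint8_t *b, uint8_t *res)
{
	int i, borrow = 0;

	for (i = ID_SIZE - 1; i >= 0; --i) {
		int d = (int)a[i] - (int)b[i] - borrow;

		borrow = (d < 0);
		res[i] = (uint8_t)(d & 0xff);
	}
	/* A final borrow simply means we wrapped around the ring. */
}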

In the meantime I also implemented server-side transformation functions, which may be used to force multiple data copies on behalf of the server. The server could also return its transformation functions to the client, so that the client could use them to fetch data from different nodes, either in parallel or as a failover when some nodes are not accessible. Only the first part of this idea (having server-side transformation functions) is implemented; they are not used by the server right now.
It might also not be a bad idea to have an object deletion command.
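
As a hedged illustration of the transformation-function idea above, the server could keep a list of callbacks, each producing a destination ID for one copy of the data; the structure and names below are made up for the example and are not the real interface.

/* Illustrative only: one stored copy per registered transformation. */
#include <stddef.h>

struct transform {
	struct transform *next;
	/* Produce the destination ID for this copy of the data. */
	int (*transform)(void *priv, const void *data, size_t size,
			 unsigned char *id, unsigned int id_size);
	void *priv;
};

static int store_all_copies(struct transform *list,
			    const void *data, size_t size)
{
	unsigned char id[16];
	struct transform *t;
	int err;

	for (t = list; t; t = t->next) {
		err = t->transform(t->priv, data, size, id, sizeof(id));
		if (err)
			return err;
		/* store_object(id, sizeof(id), data, size, 1); */
	}
	return 0;
}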

Those are the only tasks I have in mind for the next release.

I also wrote a simple distribution 'expect' script, but the cluster I had access to has moved away and will be accessible again only in about a week, so no fancy numbers with hundreds of nodes in the cloud for now; I will play this game later.

That's the news; expect more soon!

Figure out what happened to Fengguang Wu's adaptive read ahead patches for the kernel. Those patches will really help the type of server you are working on. They don't seem to be in the current kernel.

It will not help in that workload, since it was write-only.

For reading this could help, especially if XFS exported directory readahead through some syscall (it is not supported in ext* though).

Threads in the server are bad. Use libevent.
http://monkey.org/~provos/libevent/

It is quite questionable whether any technology has an absolute advantage compared to others. With multi-core systems the equation moves into the thread area.

But given modern scheduler games and the resulting huge regressions, this can be reconsidered.

It's not obvious that more cores help things like a file server which is usually IO bound. More cores definitely help with things like FastCGI processes. To do a file server on one core you need to use AIO. The process implementing the file server needs to never block while waiting for IO. Operations that are going to block are spun off into another process.

The fastest web servers (khttpd) run in interrupt context. The request comes in from the net card and causes an interrupt. In the interrupt handler a response is queued or a disk IO is scheduled. When the disk IO completes another interrupt is used to send the net traffic. Don't try writing one of these, they are way too hard to maintain. These are the ultimate event driven servers.

Don't forget about sendfile(). sendfile() is the most efficient way to get file data onto the wire from user space.
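
For reference, a minimal sendfile(2) loop that pushes a whole file onto a connected socket without copying the data through user space:

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static int send_file(int sock, const char *name)
{
	struct stat st;
	off_t off = 0;
	int fd = open(name, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}

	/* The kernel moves file data straight to the socket. */
	while (off < st.st_size) {
		ssize_t n = sendfile(sock, fd, &off, st.st_size - off);
		if (n <= 0)
			break;
	}

	close(fd);
	return (off == st.st_size) ? 0 : -1;
}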

AIO would be a really good solution, but Linux AIO sucks for at least three reasons (a minimal io_submit() sketch follows the list):

  1. it does not work with regular files, only block devices
  2. its performance is limited by the design model (of the call restarts)
  3. the interface does not support vectorized submission of work (for multiple contexts), and IIRC it is not pollable, though maybe an AIO file descriptor was added along with timerfd() and friends
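
For reference, this is roughly what a single submission through that interface looks like with libaio; the file name, sizes and the O_DIRECT assumption are placeholders for the sketch (build with -laio).

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd;

	if (io_setup(16, &ctx))
		return 1;

	/* O_DIRECT on a placeholder file; kernel AIO is built around it. */
	fd = open("testfile", O_RDONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	if (posix_memalign(&buf, 512, 4096))
		return 1;

	io_prep_pread(&cb, fd, buf, 4096, 0);

	/* One context, one iocb array per call: no multi-context,
	 * vectorized submission as mentioned above. */
	if (io_submit(ctx, 1, cbs) != 1)
		return 1;

	/* Blocks until the read completes (no poll()-able descriptor here). */
	io_getevents(ctx, 1, 1, &ev, NULL);
	io_destroy(ctx);
	return 0;
}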

I already wrote a whole kernel event-based subsystem, which among other things supported disk and network AIO, several years ago. It was called kevent, and this project would be a good application for it.

Regular file descriptors are also barely pollable, compared to sockets or pipes for example, so we have to have a pool of threads to do the IO and block when needed, although their number can be limited (right now there is at least one receiving thread and optionally a global number of threads in the IO pool).

As for sendfile(): I use it in the file storage backend for the data reading command and for file writing in the client. I could use the more generic splice() and improve things in some workloads (like data forwarding or writing from a socket into a file), but it is not portable, so I have postponed this issue; I will try to extend the design to support this kind of feature.
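
For completeness, a Linux-only sketch of the socket-to-file path splice() would enable: the data goes through a pipe rather than a user-space buffer. The function name and error handling are illustrative only.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int splice_sock_to_file(int sock, int fd, size_t len)
{
	int p[2];
	ssize_t n;

	if (pipe(p))
		return -1;

	while (len) {
		/* Socket -> pipe. */
		n = splice(sock, NULL, p[1], NULL, len, SPLICE_F_MOVE);
		if (n <= 0)
			break;
		/* Pipe -> file. */
		if (splice(p[0], NULL, fd, NULL, n, SPLICE_F_MOVE) != n)
			break;
		len -= n;
	}

	close(p[0]);
	close(p[1]);
	return len ? -1 : 0;
}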

