Filesystems development

New filesystems in drivers/staging for 2.6.30

I believe drivers/staging should be renamed to fs/staging, since this tree will likely contain CEPH and NILFS2, as well as POHMELFS and DST.

Ceph is a distributed file system designed for reliability, scalability, and performance. The storage system consists of some (potentially large) number of storage servers (bricks), a smaller set of metadata server daemons, and a few monitor daemons for managing cluster membership and state.
It relies on BTRFS to store data and works closely with its internal features such as transactions and cloning.

NILFS2 is a log-structured file system supporting versioning of the entire file system and continuous snapshotting, which allows users to restore even files mistakenly overwritten or destroyed just a few seconds ago.
NILFS2 has lived in the -mm tree for a while already, so this may actually be a call for mainline inclusion directly into fs/.

More filesystems - good and different!

Is overwrite a bad decision? Distributed transactional filesystem

Enjoying the muscle pain switches the brain into thinking mode compared to the usual slacking one. This brought me a nice idea of combining a POSIX filesystem with the distributed transactional approach used in the elliptics network.

Every POSIX filesystem, as well as the usual writing applications, is supposed to overwrite data placed in the middle of an object. Transactional storage actually does the same: the elliptics network overwrites the local object, but it also creates a new object which stores the update transaction itself, potentially placed on different nodes in the cloud. With a simple extension it is possible not to overwrite the original object at all and to redirect all reads to fetch the individual transactions (or their parts) instead.

What if the POSIX filesystem did not actually overwrite the data? Overwriting requires either a complex cache-coherency protocol between the server and the multiple clients working with the same object, whose complexity quickly grows when we want multiple servers, or a write-through cache (still with races though), which kills performance compared to a write-back cache for local operations.

The basic idea is to never lock the object itself: it is never updated, only its history log, which is rather small and whose updates can be serialized. Every transaction is placed in a different location (potentially; it depends on the network configuration), so when we want to read some data from the object, we consult the history log instead, which contains the sizes, offsets and IDs of the data written, and fetch the needed transactions (or their parts) from the network rather than from the file itself.

First, this allows reading data in parallel even if the object itself was never mirrored to different nodes.
Second, updates will lock the history index only for a very short time; the writes themselves will not lock anything and will proceed in parallel to multiple nodes, since each transaction moves to a unique location.
Third, the history lock may be made distributed, since the overhead of its short acquire time should still be small compared to the time needed to write a huge transaction into the object while holding a lock over that operation.
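
To make the read path concrete, here is a minimal sketch of such a history log (the names and structures are hypothetical illustrations, not actual elliptics or POHMELFS code): each entry records which transaction covers which byte range of the object, and a read resolves an offset against the log instead of touching the object itself.

    /* Hypothetical sketch of a per-object history log; an illustration
     * of the idea, not actual POHMELFS/elliptics code. */
    #include <stdint.h>
    #include <stddef.h>

    #define ID_SIZE 20  /* elliptics IDs are 40 hex chars, i.e. 20 bytes */

    struct history_entry {
        uint8_t  tid[ID_SIZE];  /* ID of the transaction object */
        uint64_t offset;        /* offset of the update within the object */
        uint64_t size;          /* number of bytes written */
    };

    struct history_log {
        struct history_entry *entries;  /* entries[0..num-1], oldest first */
        size_t num;
    };

    /*
     * Find the transaction covering the byte at @offset: the newest
     * entry covering a byte wins, so scan from the tail of the log.
     * Returns NULL if the byte was never written.
     */
    static struct history_entry *
    history_resolve(const struct history_log *log, uint64_t offset)
    {
        size_t i;

        for (i = log->num; i > 0; --i) {
            struct history_entry *e = &log->entries[i - 1];

            if (offset >= e->offset && offset < e->offset + e->size)
                return e;
        }
        return NULL;
    }

A full read would then walk the requested range, split it into pieces covered by different transactions, and fetch each piece from the network, potentially from different nodes in parallel.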

Moreover, we can eliminate history update locking completely by versioning the object state: all clients who previously read that object still have a valid copy, just with an older version, so their states are consistent but not up-to-date. This may raise some concerns from the POSIX side, but the overall idea looks very appealing.
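
As a hypothetical sketch of that versioning idea (again, not actual code from either project, and building on the history_log sketch above): if a writer publishes a new log state with a single atomic version bump, readers never need a lock at all; they simply operate on whatever version they last observed.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Illustrative only: reuses struct history_log from the sketch above. */
    struct versioned_log {
        struct history_log log;
        _Atomic uint64_t version;   /* bumped on every published update */
    };

    /* A reader snapshots the version; log entries below that point are
     * immutable, so the read side takes no lock at all. */
    static uint64_t log_snapshot(struct versioned_log *vl)
    {
        return atomic_load(&vl->version);
    }

    /* A writer appends its entry first, then publishes it with one
     * atomic increment; clients holding older versions stay consistent,
     * just not up-to-date. */
    static void log_publish(struct versioned_log *vl)
    {
        atomic_fetch_add(&vl->version, 1);
    }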

On the negative side, this will force the POHMELFS server to stop working with local storage as we know it today: it will become part of the distributed network and thus will store all data (even when used in single-node mode, i.e. as a network rather than a distributed filesystem) in the strange format currently used in the elliptics network: directories full of files named by 40-character IDs instead of common names.

POSIX issues introduce potentially serious limitations, but the idea looks very promising so far, and I will definitely think about implementing it in POHMELFS.

POHMELFS vs NFS: dbench and the power of the local metadata cache

[Graph: POHMELFS vs NFS dbench performance]

Client and server machines have 8 GB of RAM, and I ran a single-threaded test to show the power of the local metadata cache, so take this data with a fair grain of salt.

POHMELFS served close to 160k operations over the network, while async in-kernel NFS ended up with this dump:

   1    643930    23.31 MB/sec  execute 149 sec   
   1    649695    23.37 MB/sec  cleanup 150 sec   
/bin/rm: cannot remove directory `/mnt//clients/client0/~dmtmp/ACCESS': Directory not empty
/bin/rm: cannot remove directory `/mnt//clients/client0/~dmtmp/SEED': Directory not empty
   1    649695    23.36 MB/sec  cleanup 150 sec   

More precise performance values:

POHMELFS   652.481 MB/sec
NFS         23.366 MB/sec

I've pushed a POHMELFS update to git which includes the long-awaited switch from its own path cache to the system's dcache (kudos to the one who exported dcache_lock to modules :)

It will likely be pushed into drivers/staging in a day or so.

NFS/credentials leak in 2.6.29-rc1 and thoughts on NFS performance

I decided to find out how NFS manages to have such fast random read performance (with such slow sequential reads), and started an 8 GB random IO test in IOzone. And the machine started to die. I have already killed it three times today, and the reason likely lies in the NFS server. Here is what slabtop shows on the server:

 Active / Total Objects (% used)    : 4741969 / 4755356 (99.7%)
 Active / Total Slabs (% used)      : 201029 / 201049 (100.0%)
 Active / Total Caches (% used)     : 91 / 162 (56.2%)
 Active / Total Size (% used)       : 750871.15K / 753121.28K (99.7%)
 Minimum / Average / Maximum Object : 0.01K / 0.16K / 4096.00K
 
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
1798890 1798672  99%    0.12K  59963       30    239852K cred_jar
1798320 1798307  99%    0.25K 119888       15    479552K size-256
1091430 1091401  99%    0.05K  16290       67     65160K buffer_head
  18824   17997  95%    0.28K   1448       13      5792K radix_tree_node

Both cred_jar and size-256 slabs grew constantly during the test, so I suppose there is a leak in the current kernel (IIRC there were no leaks in .28). While I'm waiting for comments from Trond Myklebust, I thought about how NFS is capable of higher random read performance than sequential.

The main theory is request combining on the client: when the system joins two random but close enough requests, the server sends not only the requested data but also the region between them, or some similar logic; essentially, increased readahead on both client and server.
If this theory is correct, then a simple way to match it is to increase readahead in POHMELFS, or rather not to shrink it under some conditions. I will try this idea...
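
To illustrate the theory (a hypothetical sketch; the threshold and names are made up, and this is not the actual NFS client logic): two random but nearby reads get merged into one larger request that also covers the gap between them.

    #include <stdbool.h>
    #include <stdint.h>

    struct read_req {
        uint64_t offset;
        uint64_t size;
    };

    #define MERGE_GAP_MAX (64 * 1024)  /* assumed threshold, for illustration */

    /*
     * Merge two reads into *out if the gap between them is small enough;
     * the merged request also fetches the bytes in between, which is
     * essentially an increased readahead. Returns true when merged.
     */
    static bool try_merge(const struct read_req *a, const struct read_req *b,
                          struct read_req *out)
    {
        const struct read_req *lo = (a->offset <= b->offset) ? a : b;
        const struct read_req *hi = (lo == a) ? b : a;
        uint64_t lo_end = lo->offset + lo->size;
        uint64_t hi_end = hi->offset + hi->size;

        if (hi->offset > lo_end + MERGE_GAP_MAX)
            return false;

        out->offset = lo->offset;
        out->size = (hi_end > lo_end ? hi_end : lo_end) - lo->offset;
        return true;
    }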

DST has been asked for drivers/staging merge

DST is fully self-contained and really is not expected to get any changes in the future :)

POHMELFS is a bit more complex a project: it requires two exports from the Linux VFS which are safe as-is, but I'm waiting for Linus/Andrew to confirm that (Andrew and I already talked about them some time ago though).

In parallel I'm testing POHMELFS, and while it still shows superior performance compared to async in-kernel NFS, one of my systems refuses to mount it. It just says that it does not know the 'pohmel' filesystem type, without even entering the kernel. I do not yet know what the problem is, but it worked OK with the previous kernel (some -rc tree). I will investigate further and prepare the patches.

Also, I would like to know what benchmark could be used for multi-user parallel testing. I use IOzone for single-user load.



