New filesystems in drivers/staging for 2.6.30
By zbr - Posted on March 10th, 2009
I believe drivers/staging should be renamed to fs/staging, since this tree will likely contain Ceph and NILFS2, as well as POHMELFS and DST.

Ceph is a distributed file system designed for reliability, scalability, and performance. The storage system consists of some (potentially large) number of storage servers (bricks), a smaller set of metadata server daemons, and a few monitor daemons for managing cluster membership and state.

NILFS2 is a log-structured file system supporting versioning of the entire file system and continuous snapshotting, which allows users to restore even files mistakenly overwritten or destroyed just a few seconds ago.

More filesystems - good and different!

Is overwrite a bad decision? Distributed transactional filesystem
By zbr - Posted on March 7th, 2009
Every POSIX filesystem, as well as the usual writing applications, is supposed to overwrite data placed in the middle of an object. Transactional storage actually does the same: the elliptics network overwrites the local object, but it also creates a new object which stores the update transaction itself, potentially placed on different nodes in the cloud. With a simple extension it is possible not to overwrite the original object at all, and instead redirect all reads to fetch the individual transactions (or their parts).

What if the POSIX filesystem did not actually overwrite the data? Overwriting requires either a complex cache-coherency protocol between the server and the multiple clients working with the same object - a complexity which quickly grows when we want to have multiple servers - or a write-through cache (still with races though), which kills performance compared to a write-back cache for local operations.

The basic idea is to never lock the object itself: it is never updated, only its history log, which is rather small and whose updates can be serialized. Every transaction is placed in a different place (potentially - it depends on the network configuration), so when we want to read some data from the object, we check the history log instead, which contains the sizes, offsets and IDs of the data written, and fetch the needed transactions (or their parts) from the network rather than from the file itself.

First, this allows reading data in parallel even if the object itself was never mirrored to different nodes. Moreover, we can eliminate history-update locking completely by versioning the object state: all clients who previously read the object still have a valid copy, but with a different version, so their states are consistent, although not up-to-date. This may raise some concerns from the POSIX side, but the overall idea looks very appealing.

On the negative side, this will force the POHMELFS server not to work with local storage as we know it today - it will become part of the distributed network and thus will store all the data (when used in single-node mode, i.e. as a network rather than a distributed filesystem) in the format currently used in the elliptics network: directories full of files named by 40-character IDs instead of common names.

POSIX issues introduce potentially serious limitations, but the idea looks very promising so far and I will definitely think about implementing it in POHMELFS.

POHMELFS vs NFS: dbench and the power of the local metadata cache
By zbr - Posted on February 2nd, 2009
[POHMELFS vs NFS dbench performance graph]

Client and server machines have 8 GB of RAM, and I ran a single-threaded test to show the power of the local metadata cache, so take this data with a fair grain of salt. POHMELFS served close to 160k operations over the network, while async in-kernel NFS ended up with this dump:

    1 643930 23.31 MB/sec execute 149 sec
    1 649695 23.37 MB/sec cleanup 150 sec
    /bin/rm: cannot remove directory `/mnt//clients/client0/~dmtmp/ACCESS': Directory not empty
    /bin/rm: cannot remove directory `/mnt//clients/client0/~dmtmp/SEED': Directory not empty
    1 649695 23.36 MB/sec cleanup 150 sec

More precise performance values:

    POHMELFS: 652.481 MB/sec
    NFS:       23.366 MB/sec

I've pushed a POHMELFS update to the GIT, which includes a long-awaited switch from its own path cache to the system's dcache (kudos to the one who exported […]). It will likely be pushed into the […]

NFS/credentials leak in 2.6.29-rc1 and thoughts on NFS performance
By zbr - Posted on January 20th, 2009
I decided to find out how NFS managed to have such fast random read performance (with so slow sequential read), and started an 8 GB random I/O test in IOzone. And the machine started to die. I have already killed it three times today, and the reason is likely in the NFS server. That's what the slab statistics show:

    Active / Total Objects (% used)    : 4741969 / 4755356 (99.7%)
    Active / Total Slabs (% used)      : 201029 / 201049 (100.0%)
    Active / Total Caches (% used)     : 91 / 162 (56.2%)
    Active / Total Size (% used)       : 750871.15K / 753121.28K (99.7%)
    Minimum / Average / Maximum Object : 0.01K / 0.16K / 4096.00K

       OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
    1798890 1798672  99%    0.12K  59963       30    239852K cred_jar
    1798320 1798307  99%    0.25K 119888       15    479552K size-256
    1091430 1091401  99%    0.05K  16290       67     65160K buffer_head
      18824   17997  95%    0.28K   1448       13      5792K radix_tree_node

The main theory is request combining on the client: when the system joins two random but close enough requests, the server sends not only the requested data, but also the additional region between them, or some similar logic - essentially increased readahead by both client and server.

DST has been asked for drivers/staging merge
By zbr - Posted on January 14th, 2009
DST is fully self-contained and really is not expected to get any changes in the future :) POHMELFS is a bit more complex project: it requires two exports from the Linux VFS, which are safe as is, but I'm waiting for Linus/Andrew to confirm that (we already talked about them with Andrew some time ago though).

In parallel I'm testing POHMELFS, and while it still shows superior performance compared to async in-kernel NFS, one of my systems refuses to mount it. It just says that it does not know the 'pohmel' filesystem type, without even entering the kernel. I do not yet know what the problem is, but it worked fine with the previous kernel (it was some -rc tree). I will investigate further and prepare the patches.

Also, I would like to know what benchmark could be used for multi-user parallel testing. I use iozone for the single-user load.
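Lacking a settled multi-user benchmark, here is a rough sketch of what such a parallel load could look like - plain Python with purely illustrative parameters (the worker function, chunk sizes, and process count are all made up for the example), not a replacement for a real benchmark like dbench:

```python
import os
import time
import tempfile
from multiprocessing import Pool

def worker(args):
    """Simulate one user: write a private file, fsync it, read it back.

    Returns the total number of bytes moved (written + read).
    """
    root, idx, chunk, count = args
    path = os.path.join(root, "user%d.dat" % idx)
    data = os.urandom(chunk)
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())
    read = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            read += len(buf)
    os.unlink(path)
    return chunk * count + read

def run(root, users=4, chunk=64 * 1024, count=16):
    """Run `users` concurrent workers under `root`; return (bytes, seconds)."""
    start = time.time()
    with Pool(users) as pool:
        jobs = [(root, i, chunk, count) for i in range(users)]
        moved = sum(pool.map(worker, jobs))
    elapsed = time.time() - start
    return moved, elapsed

if __name__ == "__main__":
    # Point root at the mounted filesystem under test, e.g. /mnt/pohmelfs;
    # a temporary directory is used here only so the sketch is runnable.
    with tempfile.TemporaryDirectory() as root:
        moved, elapsed = run(root)
        print("moved %.1f MB in %.2f sec" % (moved / 1048576.0, elapsed))
```

Each worker touches its own file, so this exercises parallel data I/O and per-client metadata operations (create, fsync, unlink) but not the shared-object contention discussed in the transactional-overwrite post above; a real multi-user benchmark would also need concurrent access to common files and directories.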

