On Friday 05 May 2006 11:36, Craig Tierney wrote:
> My concern with this setup isn't xfs, it would be the stability of
> the storage. Also, if there is a disk hiccup (which will happen) that
> repairing a 16 TB filesystem takes a long time. A distributed
> filesystem (PVFS2, Ibrix, etc) you would only have to fix the one
> volume, not the entire filesystem. There may be some filesystem
> consistency checks after repair, but not to the extent of a full
> filesystem check.
We have a single 35 TB Ibrix filesystem, served by 16 fileservers and backed by 64 SAN LUNs on a DataDirect Networks storage array.

The fsck protocol today is to run a full filesystem check first, and then make fixes if necessary. The LUN filesystems are modified ext3, so the "Phase I" fsck is 64 ext3 fscks run in parallel. The check-only Phase I run takes quite a while (ext3 fsck is fairly slow). Once the damaged LUN filesystems are identified, the repairs can optionally be restricted to those LUNs; because fewer LUNs are being accessed in parallel, the repair run can actually go much faster than the check-only run. Usually a post-repair consistency check is not necessary (Ibrix tech support advises us what to do in each case, depending on what the logs show).

There are two more fsck phases that are run separately; the second phase is somewhat faster than the first, and the third is very fast.

I'll leave out any further details of the Ibrix filesystem architecture and fsck, since I'm not sure how much of what they tell their supported and pre-sales customers they want kept private. You can talk to them yourself. :)

David
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
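The Phase I workflow (check every LUN in parallel, then restrict repairs to the damaged ones) can be sketched in Python. This is a hypothetical illustration, not Ibrix's tooling: it assumes each LUN is a block device that stock `e2fsck` can handle (Ibrix's modified ext3 may need its own fsck), and the function names and device paths are made up. It relies on `e2fsck`'s exit-status bits, where 4 means errors were left uncorrected (what a `-n` run reports on a damaged filesystem) and 8 means an operational error.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# e2fsck exit-status bits (see the e2fsck man page).
FSCK_ERRORS_UNCORRECTED = 4  # damage found but not fixed (e.g. under -n)
FSCK_OPERATIONAL_ERROR = 8   # fsck itself failed; needs a human to look


def check_lun(device: str) -> int:
    """Hypothetical read-only check: -f forces a full check, -n answers 'no' to every fix."""
    return subprocess.run(["e2fsck", "-f", "-n", device],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode


def repair_lun(device: str) -> int:
    """Hypothetical repair pass: -y answers 'yes' to every fix (assumption; follow vendor advice)."""
    return subprocess.run(["e2fsck", "-f", "-y", device],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode


def find_damaged(devices: List[str],
                 check: Callable[[str], int] = check_lun,
                 max_workers: int = None) -> Tuple[List[str], List[str]]:
    """Check-only pass over all LUNs in parallel.

    Returns (damaged, failed): LUNs reporting uncorrected errors, and LUNs
    where the check itself hit an operational error.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(check, devices))
    damaged = [d for d, c in zip(devices, codes) if c & FSCK_ERRORS_UNCORRECTED]
    failed = [d for d, c in zip(devices, codes) if c & FSCK_OPERATIONAL_ERROR]
    return damaged, failed


def repair_damaged(damaged: List[str],
                   repair: Callable[[str], int] = repair_lun,
                   max_workers: int = None) -> Dict[str, int]:
    """Repair pass restricted to the damaged LUNs; returns each LUN's exit status."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(repair, damaged))
    return dict(zip(damaged, codes))
```

Passing the check and repair functions in as parameters keeps the scheduling logic separate from the actual fsck invocation, so it can be exercised without real devices. The repair pass only touches the LUNs flagged in the check pass, which is why it can finish sooner than the full check-only run.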