Date: Thu, 12 Dec 2002 20:00:52 -0500
From: "John A. Tamplin" <[EMAIL PROTECTED]>

[...]
Disks don't write one byte at a time, so a system crash during a write can leave the entire block in an indeterminate state. It gets worse when you go through the filesystem rather than raw access to the disk, since data important to your file could share a physical disk block and be updated without your knowledge or control, not to mention the out-of-order writes problem. I haven't looked into the skiplist implementation, but fixing that problem isn't easy without a pre-image log and some sort of timestamp/sequence number at both ends of the page. Once you head down that road, you get very close to building a full database system, and then we are back to the SQL backend discussed earlier.
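The "sequence number at both ends of the page" idea mentioned above can be sketched roughly as follows. This is only an illustration, not anything taken from the skiplist code: the page size, the 8-byte stamp, and the names build_page, is_torn and simulate_torn_write are all assumptions made for the example. It also assumes a crash leaves a prefix of the new page followed by the rest of the old one, which is itself just one possible disk model.

```python
import struct

PAGE_SIZE = 4096            # assumed page size, purely illustrative
SEQ = struct.Struct("<Q")   # assumed 8-byte little-endian sequence number
PAYLOAD_SIZE = PAGE_SIZE - 2 * SEQ.size


def build_page(seq: int, payload: bytes) -> bytes:
    """Frame the payload with the same sequence number at the head and the tail."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload too large for one page")
    body = payload.ljust(PAYLOAD_SIZE, b"\0")
    stamp = SEQ.pack(seq)
    return stamp + body + stamp


def is_torn(page: bytes) -> bool:
    """A page whose head and tail stamps disagree was only partly written."""
    if len(page) != PAGE_SIZE:
        return True
    return page[:SEQ.size] != page[-SEQ.size:]


def simulate_torn_write(old: bytes, new: bytes, bytes_written: int) -> bytes:
    """Hypothetical crash model: only the first bytes_written bytes of new landed."""
    return new[:bytes_written] + old[bytes_written:]


old_page = build_page(7, b"old record")
new_page = build_page(8, b"new record")
half_written = simulate_torn_write(old_page, new_page, 512)
print("torn page detected:", is_torn(half_written))
```

Note that this only detects a torn page. Putting the old contents back still needs something like the pre-image log mentioned above.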
Obviously disks do not write one byte at a time; writes happen to blocks of data. However, the operating system will issue a block that is identical to the old block except for the byte (or word, or whatever) that I've changed. I don't think I've ever heard of a filesystem that mingles more than one file in a single block. (If one does, it's certainly news to me, and no reasonable model can be made that will deal with it.)

The "out-of-order writes problem" isn't a problem unless a disk claims to the operating system that it has performed a physical write when it has not; and if that is the case, obviously no durability guarantees can be made by higher-level software.

The question is whether a physical disk can fail to write a block in such a way that neither the old block nor the new block is completely there.

Berkeley DB's documentation has a long discussion of the various models software should assume of disks. No logging model is possible without some model of the underlying device. The "single byte write" theory is actually not such a bad one. It inhibits optimizations you might make under a "single block write" model (but it's hard to figure out what the disk considers a "block") or, even better, a "single page write" model (it's easier to figure out what a page is, but is it the unit the disk writes in?).

Larry
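The point that the reissued block is identical to the old one except for the changed byte can be checked with a small simulation. This is a sketch under loud assumptions: the block and sector sizes are made up, the function names are hypothetical, and the disk is assumed to write each sector either completely or not at all (in any order). That last assumption is exactly the open question raised above, so this shows what follows from the model and says nothing about whether real disks obey it.

```python
from itertools import product

BLOCK_SIZE = 4096    # assumed filesystem block size
SECTOR_SIZE = 512    # assumed unit the disk writes all-or-nothing


def rewrite_block(old_block: bytes, offset: int, value: int) -> bytes:
    """Return the block the OS would issue: the old block with one byte changed."""
    new_block = bytearray(old_block)
    new_block[offset] = value
    return bytes(new_block)


def possible_crash_states(old_block: bytes, new_block: bytes) -> set:
    """Every on-disk result if any subset of sectors got written before a crash."""
    sector_count = len(old_block) // SECTOR_SIZE
    states = set()
    for mask in product((False, True), repeat=sector_count):
        parts = []
        for i, took_new in enumerate(mask):
            source = new_block if took_new else old_block
            parts.append(source[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE])
        states.add(b"".join(parts))
    return states


old_block = bytes(BLOCK_SIZE)
new_block = rewrite_block(old_block, 1000, 0xFF)
states = possible_crash_states(old_block, new_block)
print("only old or new survives:", states == {old_block, new_block})
```

Under this model a one-byte change can only ever leave the old block or the new block on disk. A change touching bytes in two different sectors loses that property, because one sector can land without the other.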