Rich Freeman <ri...@gentoo.org> wrote:

> On Sun, Jun 22, 2014 at 7:44 AM, Kai Krakow <hurikha...@gmail.com> wrote:
>> I don't see where you could lose the volume management features. You
>> just add device on top of the bcache device after you initialized the
>> raw device with a bcache superblock and attached it. The rest works
>> the same, just that you use bcacheX instead of sdX devices.
>
> Ah, didn't realize you could attach/remove devices to bcache later.
> Presumably it handles device failures gracefully, ie exposing them to
> the underlying filesystem so that it can properly recover?
I'm not sure if multiple partitions can share the same cache device
partition, but more or less that's it: initialize bcache, then attach
your backing devices, then add those bcache devices to your btrfs (see
the command sketch further down). I don't know how errors are handled,
though. But as with every caching technique (even in ZFS), your data is
likely toast if the cache device dies in the middle of action. Thus,
you should put bcache on LVM RAID if you are going to use it for write
caching (i.e. write-back mode). Read caching should be okay
(write-through mode).

Bcache is a little slower than other flash-cache implementations
because it only reports data back to the FS as written once it has
reached stable storage (which can be the cache device, though, if you
are using write-back mode). It was also designed with unexpected
reboots in mind, read: it will replay transactions from its log on
reboot. This means you can have unstable data conditions on the raw
device, which is why you should never try to use that directly, e.g.
from a rescue disk. But since bcache wraps the partition with its own
superblock, this mistake should be impossible.

I'm not sure how gracefully device failures are handled. I suppose in
write-back mode you can get into trouble because it's too late for
bcache to tell the FS that there is a write error when it has already
confirmed that stable storage has been hit. Maybe it will just keep the
data around so you could swap devices, or it will report the error the
next time data is written to that location. It probably interferes with
btrfs RAID logic on that matter.

> The only problem with doing stuff like this at a lower level (both
> write and read caching) is that it isn't RAID-aware. If you write
> 10GB of data, you use 20GB of cache to do it if you're mirrored,
> because the cache doesn't know about mirroring.

Yes, it will write double the data to the cache then - but only if
btrfs actually read both copies, which it probably does not, because it
has checksums and does not need to compare data (and let's just ignore
the case that another process could try to read the same data from the
other RAID member later - that case should be optimized away by the OS
cache). Otherwise, both caches should work pretty much individually
with their own set of data, depending on how btrfs uses each device
individually. Remember that btrfs RAID is not a block-based RAID where
block locations would match 1:1 on each device. Btrfs RAID can place
the two copies of the same data at completely different locations on
the member devices (which is actually a good thing in case block errors
accumulate in specific locations for a "faulty" model of a disk).

In the case of write caching it will of course cache double the data
(because both members will be written to). But I think that's okay for
the same reasons, except it will wear your cache device faster. In that
case I suggest using individual SSDs for each btrfs member device
anyway. It's not optimal, I know. It could be useful to see some best
practices and pros/cons on that topic (individual cache device per
btrfs member vs. bcache on LVM RAID, with bcache partitions on the RAID
for all members). I think the best strategy depends on whether your
workload is write-mostly or read-mostly.

Thanks for mentioning it. Interesting thoughts. ;-)
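For reference, here is roughly what that setup could look like on the
command line. This is an untested sketch from my reading of the bcache
and LVM docs - all device, VG, and LV names are placeholders, and
<cset-uuid> stands for whatever bcache-super-show reports for your
cache set:

  # mirrored cache LV, so a dying SSD can't eat the write-back cache
  # (assumes a volume group "vg0" spanning two SSDs)
  lvcreate --type raid1 -m 1 -L 100G -n cache0 vg0

  # format the cache device and the (empty!) backing devices
  make-bcache -C /dev/vg0/cache0
  make-bcache -B /dev/sdb /dev/sdc

  # attach both bcache devices to the cache set
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo <cset-uuid> > /sys/block/bcache1/bcache/attach

  # write-back only makes sense with the mirrored cache from above,
  # otherwise stay with the default write-through
  echo writeback > /sys/block/bcache0/bcache/cache_mode
  echo writeback > /sys/block/bcache1/bcache/cache_mode

  # build btrfs on the bcache devices, never on the raw partitions
  mkfs.btrfs -d raid1 -m raid1 /dev/bcache0 /dev/bcache1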
> Offhand I'm not sure
> if there are any performance penalties as well around the need for
> barriers/etc with the cache not being able to be relied on to do the
> right thing in terms of what gets written out - also, the data isn't
> redundant while it is on the cache, unless you mirror the cache.

This is partially what I outlined above. I think in the case of write
caching, no barrier pass-through is needed. Bcache will confirm the
barriers, and that's all the FS needs to know (because bcache
supervises the FS: all requests go through the bcache layer, there is
no direct access to the backing device). Of course, it's then bcache's
job to ensure everything gets written out correctly in the background
(whenever it feels like doing so). But it can use its own write
barriers to ensure that for the underlying device - that's nothing the
FS has to care about. Performance should be faster anyway because,
well, you are writing to a faster device - that is what bcache is all
about, isn't it? ;-)

I don't think write barriers are needed for read caching, at least not
from the point of view of the FS. The caching layer, though, will use
them internally for its caching structures. Whether that has a bad
effect on performance probably depends on the implementation, but my
intuition says: no performance impact, because putting read data into
the cache can be deferred and the data then written in the background
(write-behind).

> Granted, if you're using it for write intent logging then there isn't
> much getting around that.

Well, sure, for bcache. But I think in the case of FS-internal write
caching devices that case could be handled gracefully (the method you'd
prefer). Since in the internal case the cache has knowledge of the FS's
bad block handling, it can just retry writing the data to another
location/disk, or keep it around until the admin has fixed the problem
with the backing device.

BTW: SSD firmwares usually suffer from problems similar to those
outlined above, because they do writes in the background after they
have already confirmed persistence to the OS layer. This is why SSD
failures are usually much more severe than HDD failures. Do some
research and you should find tests on that topic; especially consumer
SSD firmwares have a big problem with that. So I'm not sure if it
really should be bcache's job to fix that particular problem. You
should just ensure good firmware and proper failure protection at the
hardware level if you want to do fancy caching stuff - the FTL should
be able to hide those problems before the whole thing explodes, and
report errors before it reaches the point where it can no longer ensure
correct persistence. I suppose that is also a detail where
enterprise-grade SSDs behave differently. HDDs have related issues
(SATA vs. enterprise SCSI vs. SAS, keywords: IO timeouts and bad
blocks, and why you should not use consumer hardware for RAIDs). I
think all the same holds true for ZFS.
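By the way, regarding "never use the raw device directly": before you
even think about touching the backing device (rescue disk, swapping
devices), you can at least check for and drain dirty write-back data
through sysfs. A sketch, with paths taken from the kernel's bcache
documentation, bcache0 assumed:

  # is there data that only lives in the cache so far?
  cat /sys/block/bcache0/bcache/state        # "clean" vs. "dirty"
  cat /sys/block/bcache0/bcache/dirty_data

  # tell the write-back thread to flush everything to the backing device
  echo 0 > /sys/block/bcache0/bcache/writeback_percent

  # or detach the cache entirely - bcache flushes dirty data first
  echo 1 > /sys/block/bcache0/bcache/detach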
>> Having to prepare devices for bcache is kind of a show-stopper
>> because it is no drop-in component that way. But OTOH I like that
>> approach better than dm-cache because it protects from using the
>> backing device without going through the caching layer (which could
>> otherwise severely damage your data), and you get along with fewer
>> devices and don't need to size a meta device (which probably needs
>> to grow later if you add devices, I don't know).
>
> And this is the main thing keeping me away from it. It is REALLY
> painful to migrate to/from. Having it integrated into the filesystem
> delivers all the same benefits of not being able to mount it without
> the cache present.

The migration pain is what currently keeps me away, too. Otherwise I
would just buy one of those fancy new cheap-but-still-speedy Crucial
SSDs and "just enable" bcache... :-\

> Now excuse me while I go fix my btrfs (I tried re-enabling snapper and
> it again got the filesystem into a worked-up state after trying to
> clean up half a dozen snapshots at the same time - it works fine until
> you go and try to write a lot of data to it, then it stops syncing
> though you don't necessarily notice until a few hours later when the
> write cache exhausts RAM and on reboot your disk reverts back a few
> hours). I suspect that if I just treat it gently for a few hours
> btrfs will clean up the mess and it will work normally again, but the
> damage apparently persists after a reboot if you go heavy in the disk
> too quickly...

You should report that to the btrfs list. You could try "echo w >
/proc/sysrq-trigger" and look at the blocked processes list in dmesg
afterwards. I'm sure one important btrfs thread will be in a blocked
state then...

--
Replies to list only preferred.