Rich Freeman <ri...@gentoo.org> wrote:

> On Sun, Jun 22, 2014 at 7:44 AM, Kai Krakow <hurikha...@gmail.com> wrote:
>> I don't see where you could lose the volume management features. You
>> just add device on top of the bcache device after you initialized the
>> raw device with a bcache superblock and attached it. The rest works
>> the same, just that you use bcacheX instead of sdX devices.
>
> Ah, didn't realize you could attach/remove devices to bcache later.
> Presumably it handles device failures gracefully, ie exposing them to
> the underlying filesystem so that it can properly recover?
I'm not sure if multiple partitions can share the same cache device
partition, but more or less that's it: initialize bcache, then attach
your backing devices, then add those bcache devices to your btrfs (see
the command sketch further down). I don't know how errors are handled,
though. But as with every caching technique (even in ZFS), your data is
likely toast if the cache device dies in the middle of action. Thus,
you should put bcache on LVM RAID if you are going to use it for write
caching (i.e. write-back mode). Read caching should be okay
(write-through mode).

Bcache is a little slower than other flash-cache implementations
because it only reports data back to the FS as written once it has
reached stable storage (which can be the cache device, though, if you
are using write-back mode). It was also designed with unexpected
reboots in mind, read: it will replay transactions from its log on
reboot. This means you can have unstable data conditions on the raw
device, which is why you should never try to use that directly, e.g.
from a rescue disk. But since bcache wraps the partition with its own
superblock, this mistake should be impossible.

I'm not sure how gracefully device failures are handled. I suppose in
write-back mode you can get into trouble because it's too late for
bcache to tell the FS that there is a write error when it has already
confirmed that stable storage has been hit. Maybe it will just keep the
data around so you could swap devices, or it will report the error the
next time data is written to that location. It probably interferes with
btrfs RAID logic on that matter.

> The only problem with doing stuff like this at a lower level (both
> write and read caching) is that it isn't RAID-aware. If you write
> 10GB of data, you use 20GB of cache to do it if you're mirrored,
> because the cache doesn't know about mirroring.

Yes, it will write double the data to the cache then - but only if
btrfs actually read both copies, which it probably does not, because it
has checksums and does not need to compare data (and let's just ignore
the case that another process could try to read the same data from the
other RAID member later - that case should be optimized away by the OS
cache). Otherwise, both caches should work pretty much individually
with their own set of data, depending on how btrfs uses each device
individually. Remember that btrfs RAID is not a block-based RAID where
block locations would match 1:1 on each device. Btrfs RAID can place
the two copies of the same data at completely different locations on
the member devices (which is actually a good thing in case block errors
accumulate in specific locations for a "faulty" model of a disk).

In the case of write caching it will of course cache double the data
(because both members will be written to). But I think that's okay for
the same reasons, except it will wear your cache device faster. In that
case I suggest using individual SSDs for each btrfs member device
anyway. It's not optimal, I know. It could be useful to see some best
practices and pros/cons on that topic (individual cache device per
btrfs member vs. bcache on LVM RAID, with bcache partitions on the RAID
for all members). I think the best strategy depends on whether your
workload is write-mostly or read-mostly.

Thanks for mentioning it. Interesting thoughts. ;-)
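For reference, here is roughly what that setup could look like on the
command line. This is an untested sketch from my reading of the bcache
and LVM docs - all device, VG, and LV names are placeholders, and
<cset-uuid> stands for whatever bcache-super-show reports for your
cache set:

  # mirrored cache LV, so a dying SSD can't eat the write-back cache
  # (assumes a volume group "vg0" spanning two SSDs)
  lvcreate --type raid1 -m 1 -L 100G -n cache0 vg0

  # format the cache device and the (empty!) backing devices
  make-bcache -C /dev/vg0/cache0
  make-bcache -B /dev/sdb /dev/sdc

  # attach both bcache devices to the cache set
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo <cset-uuid> > /sys/block/bcache1/bcache/attach

  # write-back only makes sense with the mirrored cache from above,
  # otherwise stay with the default write-through
  echo writeback > /sys/block/bcache0/bcache/cache_mode
  echo writeback > /sys/block/bcache1/bcache/cache_mode

  # build btrfs on the bcache devices, never on the raw partitions
  mkfs.btrfs -d raid1 -m raid1 /dev/bcache0 /dev/bcache1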
> Offhand I'm not sure
> if there are any performance penalties as well around the need for
> barriers/etc with the cache not being able to be relied on to do the
> right thing in terms of what gets written out - also, the data isn't
> redundant while it is on the cache, unless you mirror the cache.

This is partially what I outlined above. I think in the case of write
caching, no barrier pass-through is needed. Bcache will confirm the
barriers, and that's all the FS needs to know (because bcache
supervises the FS: all requests go through the bcache layer, there is
no direct access to the backing device). Of course, it's then bcache's
job to ensure everything gets written out correctly in the background
(whenever it feels like doing so). But it can use its own write
barriers to ensure that for the underlying device - that's nothing the
FS has to care about. Performance should be faster anyway because,
well, you are writing to a faster device - that is what bcache is all
about, isn't it? ;-)

I don't think write barriers are needed for read caching, at least not
from the point of view of the FS. The caching layer, though, will use
them internally for its caching structures. Whether that has a bad
effect on performance probably depends on the implementation, but my
intuition says: no performance impact, because putting read data into
the cache can be deferred and the data then written in the background
(write-behind).

> Granted, if you're using it for write intent logging then there isn't
> much getting around that.

Well, sure, for bcache. But I think in the case of FS-internal write
caching devices that case could be handled gracefully (the method you'd
prefer). Since in the internal case the cache has knowledge of the FS's
bad block handling, it can just retry writing the data to another
location/disk, or keep it around until the admin has fixed the problem
with the backing device.

BTW: SSD firmwares usually suffer from problems similar to those
outlined above, because they do writes in the background after they
have already confirmed persistence to the OS layer. This is why SSD
failures are usually much more severe than HDD failures. Do some
research and you should find tests on that topic; especially consumer
SSD firmwares have a big problem with that. So I'm not sure if it
really should be bcache's job to fix that particular problem. You
should just ensure good firmware and proper failure protection at the
hardware level if you want to do fancy caching stuff - the FTL should
be able to hide those problems before the whole thing explodes, and
report errors before it reaches the point where it can no longer ensure
correct persistence. I suppose that is also a detail where
enterprise-grade SSDs behave differently. HDDs have related issues
(SATA vs. enterprise SCSI vs. SAS, keywords: IO timeouts and bad
blocks, and why you should not use consumer hardware for RAIDs). I
think all the same holds true for ZFS.
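By the way, regarding "never use the raw device directly": before you
even think about touching the backing device (rescue disk, swapping
devices), you can at least check for and drain dirty write-back data
through sysfs. A sketch, with paths taken from the kernel's bcache
documentation, bcache0 assumed:

  # is there data that only lives in the cache so far?
  cat /sys/block/bcache0/bcache/state        # "clean" vs. "dirty"
  cat /sys/block/bcache0/bcache/dirty_data

  # tell the write-back thread to flush everything to the backing device
  echo 0 > /sys/block/bcache0/bcache/writeback_percent

  # or detach the cache entirely - bcache flushes dirty data first
  echo 1 > /sys/block/bcache0/bcache/detach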
>> Having to prepare devices for bcache is kind of a show-stopper
>> because it is no drop-in component that way. But OTOH I like that
>> approach better than dm-cache because it protects from using the
>> backing device without going through the caching layer (which could
>> otherwise severely damage your data), and you get along with fewer
>> devices and don't need to size a meta device (which probably needs
>> to grow later if you add devices, I don't know).
>
> And this is the main thing keeping me away from it. It is REALLY
> painful to migrate to/from. Having it integrated into the filesystem
> delivers all the same benefits of not being able to mount it without
> the cache present.

The migration pain is what currently keeps me away, too. Otherwise I
would just buy one of those fancy new cheap-but-still-speedy Crucial
SSDs and "just enable" bcache... :-\

> Now excuse me while I go fix my btrfs (I tried re-enabling snapper and
> it again got the filesystem into a worked-up state after trying to
> clean up half a dozen snapshots at the same time - it works fine until
> you go and try to write a lot of data to it, then it stops syncing
> though you don't necessarily notice until a few hours later when the
> write cache exhausts RAM and on reboot your disk reverts back a few
> hours). I suspect that if I just treat it gently for a few hours
> btrfs will clean up the mess and it will work normally again, but the
> damage apparently persists after a reboot if you go heavy in the disk
> too quickly...

You should report that to the btrfs list. You could try "echo w >
/proc/sysrq-trigger" and look at the blocked processes list in dmesg
afterwards. I'm sure one important btrfs thread will be in a blocked
state then...

--
Replies to list only preferred.