On Fri, Jul 10, 2015 at 9:34 PM, Chris Cappuccio <ch...@nmedia.net> wrote:
> My first impression, offlining the drive after a single chunk failure
> may be too aggressive as some errors are a result of issues other than
> drive failures.

Indeed, it may look too aggressive, but is the analysis I wrote in the comment correct? I mean: if a write to one or more chunks fails for whatever reason, and we ignore the failure completely because at least one write succeeded, then the array is left in an inconsistent state where some drives hold the correct data and others hold the previous data. Since reads are done in round-robin fashion, there is a chance that you will read the old data in the future.

If this is correct, then I think it calls for a fix. If you do not like off-lining drives after a single failure, then perhaps the correct approach is to restart the whole work unit and force the write again? We could even have a threshold at which we stop and consider the problematic block really not writable. Would something like that be a better solution?

Thanks,
Karel
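
P.S. A toy Python model of the scenario above, not softraid code. The Mirror class, the policy names ("ignore", "offline", "retry"), the fail_plan argument and max_retries are all hypothetical names made up for illustration, and the "retry" policy is only one possible reading of "restart the whole work unit, up to a threshold". It models a single block mirrored across the chunks and reads that rotate round-robin.

```python
class Mirror:
    """Toy RAID 1 set: every chunk holds one copy of a single block."""

    def __init__(self, nchunks, policy, max_retries=3):
        self.data = ["old"] * nchunks     # current content of each chunk
        self.online = [True] * nchunks
        self.policy = policy              # "ignore", "offline" or "retry"
        self.max_retries = max_retries    # hypothetical retry threshold
        self.next_read = 0                # round-robin cursor

    def write(self, value, fail_plan):
        """Write value to all online chunks.

        fail_plan maps a chunk index to the number of write attempts
        that will fail on that chunk (simulated I/O errors).
        Returns the chunks that still failed on the last attempt.
        """
        remaining = dict(fail_plan)
        attempts = 1 + (self.max_retries if self.policy == "retry" else 0)
        failed = []
        for _ in range(attempts):
            failed = []
            # Each attempt redoes the whole work unit on every online chunk.
            for i in range(len(self.data)):
                if not self.online[i]:
                    continue
                if remaining.get(i, 0) > 0:
                    remaining[i] -= 1
                    failed.append(i)
                else:
                    self.data[i] = value
            if not failed:
                break
        # "ignore" leaves failed chunks online with their previous data.
        if self.policy != "ignore":
            for i in failed:
                self.online[i] = False
        return failed

    def read(self):
        """Return the block from the next online chunk, round-robin."""
        n = len(self.data)
        for _ in range(n):
            i = self.next_read
            self.next_read = (self.next_read + 1) % n
            if self.online[i]:
                return self.data[i]
        raise OSError("no online chunks left")


def demo(policy, fail_plan):
    """Write "new" once on a two-chunk mirror, then read four times."""
    m = Mirror(2, policy)
    m.write("new", fail_plan)
    return [m.read() for _ in range(4)]


if __name__ == "__main__":
    for policy in ("ignore", "offline", "retry"):
        print(policy, demo(policy, {1: 1}))
```

With a single failed write on chunk 1, the "ignore" policy returns "new" and "old" alternately, which is the stale read described above. "offline" and "retry" return only "new"; the difference is that "retry" keeps both chunks online when the error turns out to be transient, and falls back to off-lining only once the threshold is exhausted.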