On Sun, Jun 28, 2009 at 12:17:50PM +0100, Dave Love wrote:

> > and for disks run a smartctl test and see if a disk is showing
> > symptoms which might make it fail in future.
>
> What I typically see from smartd is alerts when one or more sectors has
> already gone bad, although that tends not to be something that will
> clobber the running job.  How should it be configured to do better
> (without noise)?

That isn't noise, that's signal. You're just lucky that your running job
doesn't need the data off the bad sector. You can try waiting until the
job finishes before taking the node out of service; from the sounds of
it, you will usually win. But if you don't have application-level
end-to-end checksums of your data, how do you know if you won or not?

In my big MapReduce cluster (800 data disks), about 2/3 of the time I'll
see an I/O error in my application, or checksum failure, and 1/3 of the
time I will see a smartd error and no application error.

-- greg
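
For anyone wondering what "application-level end-to-end checksums" might look like in practice, here is a minimal sketch in Python. It is only an illustration, not what the MapReduce cluster above actually runs: the function names and the `.sha256` sidecar-file convention are assumptions made up for this example. The idea is simply that the application records a digest when it writes the data and recomputes it when it reads the data back, so a bad sector shows up as a mismatch (or an I/O error) instead of silently wrong results.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_with_checksum(path, data):
    """Write data, plus a sidecar file holding its digest.

    The ".sha256" sidecar name is a hypothetical convention for this sketch.
    """
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(data).hexdigest())


def verify(path):
    """Return True if the file still matches the digest recorded at write time."""
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return file_digest(path) == expected
```

A job would call `write_with_checksum()` when producing its output and `verify()` before consuming it; a False return (or an exception from the read) means the node should not be counted as a "win".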
