On Monday, 10 September 2018 2:23:18 PM AEST Jonathan Engwall wrote:

> If it is helpful there are a few similar bugs, generally considered
> unreproducible. One thread calls it bogus xcomp_bv...the kernel clobbers
> itself writing zeroes when that is not the state. And spectre came up. One
> suggestion is to disable IBRS; according to other sources IBRS is dangerous
> to disable and should protect against Spectre. Maybe the OpenFOAM is to
> blame.

Yeah, I suspect what we're seeing is different to that; it looks like something manages to generate a SIMD exception whilst the kernel is dealing with an APIC timer interrupt.

A colleague has backported this patch that I found to our CentOS kernel in case it helps:

https://lore.kernel.org/patchwork/patch/953364/

For now we've constrained this user's workload to a handful of nodes as they are trying to get some project work done.

All the best!
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf