On Monday, 10 September 2018 2:23:18 PM AEST Jonathan Engwall wrote:

> If it is helpful there are a few similar bugs, generally considered
> unreproducible. One thread calls it bogus xcomp_bv...the kernel clobbers
> itself writing zeroes when that is not the state. And spectre came up. One
> suggestion is to disable IBRS; according to other sources IBRS is dangerous
> to disable and should protect against Spectre. Maybe the OpenFOAM is to
> blame.

Yeah, I suspect what we're seeing is different to that; it looks like 
something manages to generate a SIMD exception whilst the kernel is dealing 
with an APIC timer interrupt. A colleague has backported this patch that I 
found to our CentOS kernel in case it helps:

https://lore.kernel.org/patchwork/patch/953364/

For now we've constrained this user's workload to a handful of nodes, as they 
are trying to get some project work done.

All the best!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
