I notice that the store crosses a cacheline boundary on an ARMv5 CPU with 32-byte cache lines.
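
To illustrate why the alignment matters for this observation, here is a minimal sketch of the address arithmetic. The helper name crosses_cache_line and the CACHE_LINE_SIZE macro are made up for this example, and the 32-byte line size is taken from the statement above. An 8-byte store that is only 4-byte aligned can straddle two lines (for example at offset 28 within a line), while an 8-byte-aligned one never can.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache lines, as on the ARMv5 CPU discussed here. */
#define CACHE_LINE_SIZE 32u

/* Hypothetical helper: returns true if an access of `size` bytes starting
 * at `addr` touches more than one cache line. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

With these definitions, crosses_cache_line(28, 8) is true and crosses_cache_line(24, 8) is false.
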

I see that the xorin8 function on line 104 of https://fossies.org/linux/tor/src/ext/keccak-tiny/keccak-tiny-unrolled.c assumes that the 'dst' pointer has 8-byte alignment, but the gdb output only shows 4-byte alignment, which matches the data structure definition for keccak_state in https://fossies.org/linux/tor/src/ext/keccak-tiny/keccak-tiny.h

I would suggest adding __attribute__((aligned(8))) to the structure definition to force 8-byte alignment, which would make the code more portable and avoid undefined behavior in C (casting a pointer to a type with stricter alignment).

I don't think this is actually supposed to be undefined behavior on an ARMv5 CPU, as long as the destination for the 'strd' instruction has at least 4-byte alignment, but since gcc never creates this instruction sequence for valid code, a hardware erratum may have gone unnoticed for a long time.

      Arnd
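
For reference, a minimal sketch of what the suggested attribute looks like. The structure below is a hypothetical stand-in, not the real keccak_state definition. The member names and sizes are assumptions chosen so that the natural alignment would be 4 bytes without the attribute.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignof (C11) */
#include <stdint.h>

/* Hypothetical stand-in for keccak_state: a byte buffer that is later
 * accessed through uint64_t pointers. Without the attribute, the struct
 * would only get the alignment of its most strictly aligned member. */
struct example_state {
    uint8_t buf[200];
    uint32_t offset;
} __attribute__((aligned(8))); /* GCC/Clang extension */

/* Compile-time check that 64-bit accesses to the start of the struct
 * are properly aligned. */
static_assert(alignof(struct example_state) >= alignof(uint64_t),
              "state must be aligned for 64-bit accesses");

/* Hypothetical helper: runtime check that a pointer is 8-byte aligned. */
static int is_aligned_8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}
```

The attribute is a compiler extension. On a C11 compiler, alignas(8) on the first member should have a similar effect, though I have not checked this against the actual tor sources.
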