https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149
Arnd Bergmann <arnd at linaro dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #40546|0                           |1
        is obsolete|                            |

--- Comment #2 from Arnd Bergmann <arnd at linaro dot org> ---
Created attachment 40554
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40554&action=edit
wp512 reference source code, standalone version

After checking a bit more, I found that the reference source code
implementation does behave exactly like the in-kernel version after all, and I
was able to do some performance timing (using qemu-user) on it as well.

Building Whirlpool.c using

  mips64el-linux-gnuabi64-gcc-5 -O2 -Wframe-larger-than=100 Whirlpool.c
    -o Whirlpool-mips-smallstack -fno-sched-critical-path-heuristic
    -fno-sched-dep-count-heuristic

produces a binary that uses 256 bytes of stack in processBuffer and runs for
87 seconds doing 10000000 iterations in qemu, while the version built without
"-fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic" takes 230
seconds and needs 1520 bytes of stack. The extra time is apparently spent
spilling registers to the stack.

The same test with arm32 shows a less significant version of the same
behavior, with the stack shrinking from 832 to 352 bytes, and the time
improving from 301 seconds to 217 seconds.

Obviously it would be helpful to do the same tests on actual hardware, as
benchmarking in an emulated machine can be very misleading.
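
For a quick side-by-side comparison, the figures quoted in this comment can be turned into ratios. The following is only an illustrative sketch: the dictionary keys, the `ratios` helper, and the output format are invented here and are not part of the bug report or the attachment. The input numbers are the qemu-user timings and stack sizes reported above, so the same caveat about emulated benchmarking applies.

```python
# Figures as reported in the comment: (seconds for 10000000 iterations,
# bytes of stack used in processBuffer). The key names are hypothetical
# labels chosen for this illustration.
RESULTS = {
    "mips64el": {"default": (230, 1520), "no_heuristics": (87, 256)},
    "arm32": {"default": (301, 832), "no_heuristics": (217, 352)},
}


def ratios(target):
    """Return (speedup, stack_shrink) of the build with both scheduler
    heuristics disabled, relative to the default -O2 build."""
    default_time, default_stack = RESULTS[target]["default"]
    tuned_time, tuned_stack = RESULTS[target]["no_heuristics"]
    return default_time / tuned_time, default_stack / tuned_stack


if __name__ == "__main__":
    for target in RESULTS:
        speedup, shrink = ratios(target)
        print(f"{target}: {speedup:.2f}x faster, {shrink:.2f}x smaller stack")
```

On these numbers the mips64el build is roughly 2.6 times faster with about a sixth of the stack, while arm32 improves by roughly 1.4 times with a bit under half the stack.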