On Mon, Oct 16, 2017 at 10:25:28 -0700, Richard Henderson wrote:
> From: Richard Henderson <[email protected]>
>
> This avoids having to allocate external memory for each temporary.
>
> Signed-off-by: Richard Henderson <[email protected]>
> ---
Unfortunately, this patch undoes the small performance gains we made so far
in this series.
We end up executing more instructions, presumably because of the loops that
set the per-temp states (whereas earlier we just had a memset).
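To make the memset-versus-loop contrast concrete, here is a minimal sketch in C. All names in it (`Temp`, `TS_DEAD`, `reset_external`, `reset_per_temp`) are hypothetical stand-ins chosen for illustration, not the actual TCG definitions, and the layout of the struct is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the real TCG definitions. */
#define TS_DEAD 1u              /* assumed marker value for a dead temp */

typedef struct Temp {
    uintptr_t state;            /* liveness state stored inside the temp */
    void *state_ptr;            /* per-temp scratch pointer */
    /* other fields elided */
} Temp;

/* Before: state lives in a separate byte array, reset with one memset. */
static void reset_external(uint8_t *temp_state, size_t nb_temps)
{
    memset(temp_state, TS_DEAD, nb_temps);
}

/* After: state lives inside each temp, so the reset walks every temp. */
static void reset_per_temp(Temp *temps, size_t nb_temps)
{
    for (size_t i = 0; i < nb_temps; i++) {
        temps[i].state = TS_DEAD;
    }
}
```

Both functions leave every temp marked dead. The second one issues one strided store per temp instead of a single bulk fill over a dense byte array, which would be consistent with the higher instruction count, though this sketch does not prove that is the cause.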
Same aarch64 boot benchmark, 10 runs:
Before:
       7125.400889      task-clock (msec)         #    0.998 CPUs utilized        ( +- 0.15% )
            21,654      context-switches          #    0.003 M/sec                ( +- 0.12% )
                 1      cpu-migrations            #    0.000 K/sec
             8,034      page-faults               #    0.001 M/sec                ( +- 1.22% )
    30,050,759,263      cycles                    #    4.217 GHz                  ( +- 0.15% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    53,764,201,351      instructions              #     1.79 insns per cycle      ( +- 0.09% )
     9,677,042,191      branches                  # 1358.105 M/sec                ( +- 0.09% )
       170,903,903      branch-misses             #    1.77% of all branches      ( +- 0.16% )

       7.136617151 seconds time elapsed                                           ( +- 0.17% )
After:
       7326.945822      task-clock (msec)         #    0.999 CPUs utilized        ( +- 0.24% )
            21,997      context-switches          #    0.003 M/sec                ( +- 0.16% )
                 1      cpu-migrations            #    0.000 K/sec
             8,400      page-faults               #    0.001 M/sec                ( +- 4.63% )
    30,900,509,346      cycles                    #    4.217 GHz                  ( +- 0.23% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    55,736,672,258      instructions              #     1.80 insns per cycle      ( +- 0.16% )
     9,989,723,969      branches                  # 1363.423 M/sec                ( +- 0.16% )
       179,662,782      branch-misses             #    1.80% of all branches      ( +- 0.16% )

       7.335805286 seconds time elapsed                                           ( +- 0.24% )
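For reference, the relative change between the two runs can be computed from the counters above. The sketch below is only a convenience for reading the numbers; the helper names `pct_change` and `print_deltas` are made up for this example, and the constants are copied from the Before and After output.

```c
#include <stdio.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Print the deltas for the counters most relevant to this review. */
static void print_deltas(void)
{
    printf("instructions:  %+.2f%%\n",
           pct_change(53764201351.0, 55736672258.0));
    printf("cycles:        %+.2f%%\n",
           pct_change(30050759263.0, 30900509346.0));
    printf("branches:      %+.2f%%\n",
           pct_change(9677042191.0, 9989723969.0));
    printf("branch-misses: %+.2f%%\n",
           pct_change(170903903.0, 179662782.0));
    printf("time elapsed:  %+.2f%%\n",
           pct_change(7.136617151, 7.335805286));
}
```

Calling `print_deltas()` from a `main` shows roughly 3.7% more instructions and about 2.8% more cycles and elapsed time, well outside the reported run-to-run variation of at most 0.24% on those counters.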
I tried merging .state into the bitfield, but that didn't help (the dcache isn't
the issue here).
Anyway, we use .state_ptr later in this series, so:
Reviewed-by: Emilio G. Cota <[email protected]>
E.