Mans Rullgard <mans.rullg...@linaro.org> writes:

> On 22 May 2013 05:13, Michael Hudson-Doyle <michael.hud...@canonical.com> 
> wrote:
>> Hi all,
>>
>> I've spent a little while porting an optimization from Python 3 to
>> Python 2.7 (http://bugs.python.org/issue4753).  The idea of the patch is
>> to improve performance by dispatching opcodes on computed labels rather
>> than a big switch -- and so confusing the branch predictor less.
>>
>> The problem with this is that the last bit of code for each opcode ends
>> up being the same, so common subexpression elimination wants to coalesce
>> all these bits, which neatly and completely nullifies the point of the
>> optimization.
>
> The branches added by this would be unconditional and should thus not
> add any load on the branch predictor.
>
>> From playing around with builds made directly from source, it
>> seems that -fno-gcse prevents gcc from doing this, and the resulting
>> interpreter shows a small performance improvement over a build that does
>> not include the patch.
>>
>> However, when I build a debian package containing the patch, I see no
>> improvement at all.  My theory, and I'd like you guys to tell me if this
>> makes sense, is that this is because the Debian package uses link time
>> optimization, and so even though I carefully compile ceval.c with
>> -fno-gcse, the common subexpression elimination happens anyway at link
>> time.  I've tried staring at the disassembly to confirm or deny this, but I
>> don't know ARM assembly very well and the compiled function is roughly
>> 10k instructions long, so I didn't get very far (I can supply
>> the disassembly if someone wants to see it!).
>>
>> Is there some way I can tell GCC not to perform CSE on a section
>> of code?  I guess I could make sure that the whole program, link step
>> and all, is compiled with -fno-gcse, but that seems a bit of a blunt
>> hammer.
>
> When using LTO, most of the optimisations happen, as the name implies,
> during linking.  The optimisation flags provided there, whether explicit
> or default, are used for everything.

OK.  I wasn't sure initially whether the optimizations that were
performed at link time were the same as the ones that are traditionally
performed at compile time, but reading the docs again makes it clear
(ish) that they are.
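For reference, here is a toy sketch of the two dispatch styles being discussed. The opcodes, `run_switch`, `run_goto` and `DISPATCH()` are hypothetical names made up for illustration; this is not the code from ceval.c or from the issue4753 patch. It only shows the shape of the technique: with a switch, every opcode funnels back to one shared indirect jump, while with computed gotos (the GCC/Clang "labels as values" extension) each opcode body ends in its own copy of the jump. Those identical tails are what GCSE can merge back together.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy opcodes, for illustration only (not CPython's). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

#define STACK_MAX 16

/* Switch dispatch: every opcode body ends with "break", so control
   always returns to the single indirect jump generated for the switch. */
static long run_switch(const int *pc)
{
    long stack[STACK_MAX];
    size_t sp = 0;

    for (;;) {
        switch (*pc++) {
        case OP_PUSH:
            assert(sp < STACK_MAX);
            stack[sp++] = *pc++;
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(!"unknown opcode");
            return -1;
        }
    }
}

/* Computed-goto dispatch: each opcode body ends with its own copy of
   DISPATCH(), i.e. its own indirect jump. These identical tails are what
   GCSE may coalesce into one shared jump, undoing the optimization.
   Opcodes are not range-checked in this toy version. */
static long run_goto(const int *pc)
{
    static void *const targets[] = {
        [OP_PUSH] = &&do_push,
        [OP_ADD]  = &&do_add,
        [OP_MUL]  = &&do_mul,
        [OP_HALT] = &&do_halt,
    };
    long stack[STACK_MAX];
    size_t sp = 0;

#define DISPATCH() goto *targets[*pc++]

    DISPATCH();
do_push:
    assert(sp < STACK_MAX);
    stack[sp++] = *pc++;
    DISPATCH();
do_add:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();
do_mul:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] *= stack[sp];
    DISPATCH();
do_halt:
    assert(sp >= 1);
    return stack[sp - 1];

#undef DISPATCH
}
```

Both functions compute the same result for a program such as `{OP_PUSH, 6, OP_PUSH, 7, OP_MUL, OP_HALT}` (42). The GCC manual's description of -fgcse itself notes that programs using computed gotos may run faster with -fno-gcse, which matches the behaviour described in this thread.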

> If you need to disable CSE for part of the code, you might want to try
> your luck with __attribute__((optimize("no-gcse"))) on the relevant
> functions.
>
>> I'd also be interested if you think this class of optimization makes
>> little sense on ARM and then I'll stop and find something else to do :-)
>
> I suggest running some benchmarks under perf and counting branch prediction
> misses.  Maybe it's not as much of a problem as you think.

Well, I recompiled with -fno-gcse globally and the change now does
result in a reasonable performance increase, in the 3-7% range.  perf
stat suggests that this is because it reduces the overall number of
branches rather than the rate of branch misses, though...
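A minimal sketch of the per-function alternative suggested above, assuming GCC: `NO_GCSE`, `eval_loop` and the opcodes are hypothetical names, and the loop is a stand-in for the real interpreter function, not code from the patch. The macro expands to nothing on Clang, which does not implement the `optimize` attribute. Whether the per-function setting is still honoured after an LTO link may depend on the GCC version, so checking the disassembly is still worthwhile. Recent GCC manuals also describe the `optimize` attribute as intended for debugging rather than production code.

```c
#include <assert.h>

#if defined(__GNUC__) && !defined(__clang__)
/* GCC only: request the equivalent of -fno-gcse for a single function. */
#define NO_GCSE __attribute__((optimize("no-gcse")))
#else
/* Other compilers: no per-function override, plain definition. */
#define NO_GCSE
#endif

/* Hypothetical toy opcodes, for illustration only. */
enum { OP_INC, OP_DEC, OP_HALT };

/* Hypothetical stand-in for the interpreter loop: computed-goto dispatch
   compiled without GCSE, so each handler should keep its own jump. */
static NO_GCSE long eval_loop(const unsigned char *pc, long acc)
{
    static void *const targets[] = {
        [OP_INC]  = &&op_inc,
        [OP_DEC]  = &&op_dec,
        [OP_HALT] = &&op_halt,
    };

#define DISPATCH() goto *targets[*pc++]

    DISPATCH();
op_inc:
    acc++;
    DISPATCH();
op_dec:
    acc--;
    DISPATCH();
op_halt:
    return acc;

#undef DISPATCH
}
```

Only `eval_loop` gets the override; the rest of the translation unit (and the rest of the LTO link) keeps whatever -O level and flags were given globally.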

Cheers,
mwh

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
