Enabling -ftree-slp-vectorize on -O2/Os
I brought this subject up earlier, and was told to suggest it again for gcc 9, so I have attached the preliminary changes.

My studies have shown that with generic x86-64 optimization it reduces binary size by around 0.5%, and when optimizing for x86-64 targets with SSE4 or better, it reduces binary size by 2-3% on average. The performance changes are negligible however*, and I haven't been able to detect changes in compile time big enough to penetrate general noise on my platform, but perhaps someone has a better setup for that?

* I believe that is because it currently works best on non-optimized code; it is better at big basic blocks doing all kinds of things than at tightly written inner loops.

Anything else I should test or report?

Best regards
'Allan


diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index beba295bef5..05851229354 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -7612,6 +7612,7 @@ also turns on the following optimization flags:
 -fstore-merging @gol
 -fstrict-aliasing @gol
 -ftree-builtin-call-dce @gol
+-ftree-slp-vectorize @gol
 -ftree-switch-conversion -ftree-tail-merge @gol
 -fcode-hoisting @gol
 -ftree-pre @gol
@@ -7635,7 +7636,6 @@ by @option{-O2} and also turns on the following optimization flags:
 -floop-interchange @gol
 -floop-unroll-and-jam @gol
 -fsplit-paths @gol
--ftree-slp-vectorize @gol
 -fvect-cost-model @gol
 -ftree-partial-pre @gol
 -fpeel-loops @gol
@@ -8932,7 +8932,7 @@ Perform loop vectorization on trees.  This flag is enabled by default at
 @item -ftree-slp-vectorize
 @opindex ftree-slp-vectorize
 Perform basic block vectorization on trees.  This flag is enabled by default at
-@option{-O3} and when @option{-ftree-vectorize} is enabled.
+@option{-O2} or higher, and when @option{-ftree-vectorize} is enabled.
 
 @item -fvect-cost-model=@var{model}
 @opindex fvect-cost-model
diff --git a/gcc/opts.c b/gcc/opts.c
index 33efcc0d6e7..11027b847e8 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -523,6 +523,7 @@ static const struct default_options default_options_table[] =
     { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_fstore_merging, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
 
     /* -O3 optimizations.  */
     { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
@@ -539,7 +540,6 @@ static const struct default_options default_options_table[] =
     { OPT_LEVELS_3_PLUS, OPT_floop_unroll_and_jam, NULL, 1 },
     { OPT_LEVELS_3_PLUS, OPT_fgcse_after_reload, NULL, 1 },
     { OPT_LEVELS_3_PLUS, OPT_ftree_loop_vectorize, NULL, 1 },
-    { OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
     { OPT_LEVELS_3_PLUS, OPT_fvect_cost_model_, NULL, VECT_COST_MODEL_DYNAMIC },
     { OPT_LEVELS_3_PLUS, OPT_fipa_cp_clone, NULL, 1 },
     { OPT_LEVELS_3_PLUS, OPT_ftree_partial_pre, NULL, 1 },
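As a quick illustration for readers unfamiliar with basic-block (SLP) vectorization, the sketch below shows the kind of straight-line code the pass targets; it is not from the original mail, and the type and function names are made up.

/* Four isomorphic, independent scalar adds in one basic block; with
   -O2 -ftree-slp-vectorize (or -O3) on an SSE-capable x86-64 target,
   GCC can merge them into a single vector add, which is where the
   size reductions reported above tend to come from.  */
struct vec4 { float x, y, z, w; };

void add4(struct vec4 *restrict d, const struct vec4 *restrict a,
          const struct vec4 *restrict b)
{
  d->x = a->x + b->x;
  d->y = a->y + b->y;
  d->z = a->z + b->z;
  d->w = a->w + b->w;
}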
Re: Why is REG_ALLOC_ORDER not defined on Aarch64
Andrew Pinski writes:
> On Fri, May 25, 2018 at 3:35 PM, Steve Ellcey wrote:
>> I was curious if there was any reason that REG_ALLOC_ORDER is not
>> defined for Aarch64.  Has anyone tried this to see if it could help
>> performance?  It is defined for many other platforms.
>
> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01815.html
> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01822.html

It looks like the immediate reason for reverting was the effect of listing the argument registers in reverse order. I wonder how much that actually helps with IRA and LRA? They track per-register costs, and would be able to increase the cost of a pseudo that conflicts with a hard-register call argument. It just felt like it might have been a "best practice" idea passed down from the old local.c and global.c days.

Thanks,
Richard
Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP
Steve Ellcey writes:
> On Wed, 2018-05-16 at 22:11 +0100, Richard Sandiford wrote:
>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED is the only current way
>> of saying that an rtl instruction preserves the low part of a
>> register but clobbers the high part.  We would need something like
>> Alan H's CLOBBER_HIGH patches to do it using explicit clobbers.
>>
>> Another approach would be to piggy-back on the -fipa-ra infrastructure
>> and record that vector PCS functions only clobber Q0-Q7.  If -fipa-ra
>> knows that a function doesn't clobber Q8-Q15 then that should override
>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  (I'm not sure whether it does
>> in practice, but it should :-)  And if it doesn't that's a bug that's
>> worth fixing for its own sake.)
>>
>> Thanks,
>> Richard
>
> Alan,
>
> I have been looking at your CLOBBER_HIGH patches to see if they
> might be helpful in implementing the ARM SIMD Vector ABI in GCC.
> I have also been looking at the -fipa-ra flag and how it works.
>
> I was wondering if you considered using the ipa-ra infrastructure
> for the SVE work that you are currently trying to support with
> the CLOBBER_HIGH macro?
>
> My current thought for the ABI work is to mark all the floating
> point / vector registers as caller saved (the lower half of V8-V15
> are currently callee saved) and remove
> TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  This should work but would
> be inefficient.
>
> The next step would be to split get_call_reg_set_usage up into
> two functions so that I don't have to pass in a default set of
> registers.  One function would return call_used_reg_set by
> default (but could return a smaller set if it had actual used
> register information) and the other would return regs_invalidated_by_call
> by default (but could also return a smaller set).
>
> Next I would add a 'largest mode used' array to the call_cgraph_rtl_info
> structure in addition to the current function_used_regs register set.
>
> Then I could turn the get_call_reg_set_usage replacement functions
> into target-specific functions and, with the information in the
> call_cgraph_rtl_info structure and any simd attribute information on
> a function, I could modify what registers are really being
> used/invalidated without being saved.
>
> If the called function only uses the bottom half of a register it would
> not be marked as used/invalidated.  If it uses the entire register and
> the function is not marked as simd, then the register would be marked
> as used/invalidated.  If the function was marked as simd, the register
> would not be marked, because a simd function would save both the upper
> and lower halves of a callee-saved register (whereas a non-simd function
> would only save the lower half).
>
> Does this sound like something that could be used in place of your
> CLOBBER_HIGH patch?

One of the advantages of CLOBBER_HIGH is that it can be attached to arbitrary instructions, not just calls. The motivating example was tlsdesc_small_, which isn't treated as a call but as a normal instruction. (And I don't think we want to change that, since it's much easier for rtl optimisers to deal with normal instructions compared to calls. In general a call is part of a longer sequence of instructions that includes setting up arguments, etc.)

The other use case (not implemented in the posted patches) would be to represent the effect of syscalls, which clobber the "SVE part" of all vector registers. In that case the clobber would need to be attached to an inline asm insn.
On the wider point about changing the way call clobber information is represented: I agree it would be good to generalise what we have now. But if possible I think we should avoid target hooks that take a specific call, and instead make it an inherent part of the call insn itself, much like CALL_INSN_FUNCTION_USAGE is now. E.g. we could add a field that points to an ABI description, with -fipa-ra effectively creating ad-hoc ABIs. That ABI description could start out with whatever we think is relevant now and could grow over time. Thanks, Richard
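As background (an illustration added here, not part of the original exchange), the vector PCS discussed above applies to the vector variants GCC creates for OpenMP declare simd functions; a minimal example of such a declaration is:

/* With -fopenmp or -fopenmp-simd, GCC emits vector variants of this
   function in addition to the scalar one.  Those vector variants follow
   the vector function ABI being discussed, so callers of them need to
   know which vector registers such a callee may clobber.  */
#pragma omp declare simd notinbranch
float scale(float x, float y)
{
  return x * y + 1.0f;
}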
Re: Enabling -ftree-slp-vectorize on -O2/Os
On May 26, 2018 11:32:29 AM GMT+02:00, Allan Sandfeld Jensen wrote:
>I brought this subject up earlier, and was told to suggest it again for
>gcc 9, so I have attached the preliminary changes.
>
>My studies have shown that with generic x86-64 optimization it reduces
>binary size by around 0.5%, and when optimizing for x86-64 targets with
>SSE4 or better, it reduces binary size by 2-3% on average.  The
>performance changes are negligible however*, and I haven't been able to
>detect changes in compile time big enough to penetrate general noise on
>my platform, but perhaps someone has a better setup for that?
>
>* I believe that is because it currently works best on non-optimized
>code; it is better at big basic blocks doing all kinds of things than
>at tightly written inner loops.
>
>Anything else I should test or report?

If you have access to SPEC CPU I'd like to see performance, size and compile-time effects of the patch on that.

Embedded folks may want to run their favorite benchmark and report results as well.

Richard.

>Best regards
>'Allan
>
>[quoted patch snipped -- see the diff in the original mail above]
Re: Enabling -ftree-slp-vectorize on -O2/Os
* Allan Sandfeld Jensen:

> Anything else I should test or report?

Interaction with -mstackrealign on i386, where it is required for system libraries to support applications which use the legacy ABI without stack alignment, if you compile with -msse2 or -march=x86-64 -mtune=generic (and -mfpmath=sse).
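A hedged sketch of the flag combination being described (added for illustration; the file name is a placeholder and the exact library build flags vary by distribution):

gcc -m32 -O2 -msse2 -mfpmath=sse -mstackrealign -ftree-slp-vectorize -S lib.c

The concern is that enabling SLP vectorization by default generates more 16-byte-aligned vector accesses in such libraries, so the stack realignment path gets exercised more often by callers that only guarantee the legacy 4-byte stack alignment.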
RISC-V ELF multilibs
Hello,

I built a riscv64-rtems5 GCC (it uses gcc/config/riscv/t-elf-multilib). The following multilibs are built:

riscv64-rtems5-gcc -print-multi-lib
.;
rv32i/ilp32;@march=rv32i@mabi=ilp32
rv32im/ilp32;@march=rv32im@mabi=ilp32
rv32iac/ilp32;@march=rv32iac@mabi=ilp32
rv32imac/ilp32;@march=rv32imac@mabi=ilp32
rv32imafc/ilp32f;@march=rv32imafc@mabi=ilp32f
rv64imac/lp64;@march=rv64imac@mabi=lp64
rv64imafdc/lp64d;@march=rv64imafdc@mabi=lp64d

If I print out the built-in defines and search paths for the default settings and for -march=rv64imafdc, and compare the results, I get:

riscv64-rtems5-gcc -E -P -v -dD empty.c > def.txt 2>&1
riscv64-rtems5-gcc -E -P -v -dD empty.c -march=rv64imafdc > rv64imafdc.txt 2>&1
diff -u def.txt rv64imafdc.txt
--- def.txt 2018-05-26 14:53:26.277760090 +0200
+++ rv64imafdc.txt 2018-05-26 14:53:47.705638409 +0200
@@ -4,8 +4,8 @@
 Configured with: ../gcc-7.3.0/configure --prefix=/opt/rtems/5 --bindir=/opt/rtems/5/bin --exec_prefix=/opt/rtems/5 --includedir=/opt/rtems/5/include --libdir=/opt/rtems/5/lib --libexecdir=/opt/rtems/5/libexec --mandir=/opt/rtems/5/share/man --infodir=/opt/rtems/5/share/info --datadir=/opt/rtems/5/share --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=riscv64-rtems5 --disable-libstdcxx-pch --with-gnu-as --with-gnu-ld --verbose --with-newlib --disable-nls --without-included-gettext --disable-win32-registry --enable-version-specific-runtime-libs --disable-lto --enable-newlib-io-c99-formats --enable-newlib-iconv --enable-newlib-iconv-encodings=big5,cp775,cp850,cp852,cp855,cp866,euc_jp,euc_kr,euc_tw,iso_8859_1,iso_8859_10,iso_8859_11,iso_8859_13,iso_8859_14,iso_8859_15,iso_8859_2,iso_8859_3,iso_8859_4,iso_8859_5,iso_8859_6,iso_8859_7,iso_8859_8,iso_8859_9,iso_ir_111,koi8_r,koi8_ru,koi8_u,koi8_uni,ucs_2,ucs_2_internal,ucs_2be,ucs_2le,ucs_4,ucs_4_internal,ucs_4be,ucs_4le,us_ascii,utf_16,utf_16be,utf_16le,utf_8,win_1250,win_1251,win_1252,win_1253,win_1254,win_1255,win_1256,win_1257,win_1258 --enable-threads --disable-plugin --enable-libgomp --enable-languages=c,c++,ada
 Thread model: rtems
 gcc version 7.3.0 20180125 (RTEMS 5, RSB a3a6c34c150a357e57769a26a460c475e188438f, Newlib 3.0.0) (GCC)
-COLLECT_GCC_OPTIONS='-E' '-P' '-v' '-dD' '-march=rv64gc' '-mabi=lp64d'
- /opt/rtems/5/libexec/gcc/riscv64-rtems5/7.3.0/cc1 -E -quiet -v -P -imultilib rv64imafdc/lp64d empty.c -march=rv64gc -mabi=lp64d -dD
+COLLECT_GCC_OPTIONS='-E' '-P' '-v' '-dD' '-march=rv64imafdc' '-mabi=lp64d'
+ /opt/rtems/5/libexec/gcc/riscv64-rtems5/7.3.0/cc1 -E -quiet -v -P -imultilib rv64imafdc/lp64d empty.c -march=rv64imafdc -mabi=lp64d -dD
 ignoring nonexistent directory "/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/../../../../riscv64-rtems5/sys-include"
 #include "..." search starts here:
 #include <...> search starts here:
@@ -338,4 +338,4 @@
 #define __ELF__ 1
 COMPILER_PATH=/opt/rtems/5/libexec/gcc/riscv64-rtems5/7.3.0/:/opt/rtems/5/libexec/gcc/riscv64-rtems5/7.3.0/:/opt/rtems/5/libexec/gcc/riscv64-rtems5/:/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/:/opt/rtems/5/lib/gcc/riscv64-rtems5/:/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/../../../../riscv64-rtems5/bin/
 LIBRARY_PATH=/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/rv64imafdc/lp64d/:/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/../../../../riscv64-rtems5/lib/rv64imafdc/lp64d/:/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/:/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/../../../../riscv64-rtems5/lib/:/lib/:/usr/lib/
-COLLECT_GCC_OPTIONS='-E' '-P' '-v' '-dD' '-march=rv64gc' '-mabi=lp64d'
+COLLECT_GCC_OPTIONS='-E' '-P' '-v' '-dD' '-march=rv64imafdc' '-mabi=lp64d'

This looks pretty much the same, and the documentation says that G == IMAFD. Why are the default multilib and this variant identical? Most variants include the C extension. Would it be possible to add -march=rv32g and -march=rv64g variants?

--
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax     : +49 89 189 47 41-09

This message is not a business communication within the meaning of the EHUG.
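One quick way to check which multilib directory a given -march/-mabi pair actually selects (added here for illustration; not part of the original mail) is GCC's -print-multi-directory option, run once per combination of interest:

riscv64-rtems5-gcc -march=rv64g -mabi=lp64d -print-multi-directory
riscv64-rtems5-gcc -march=rv64gc -mabi=lp64d -print-multi-directory

If both print the same directory, the two -march values are being mapped onto the same set of multilib libraries.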
Re: PR80155: Code hoisting and register pressure
On Fri, May 25, 2018 at 5:54 PM, Richard Biener wrote:
> On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law wrote:
>> On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni wrote:
On 23 May 2018 at 18:37, Jeff Law wrote:
> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>> On 23 May 2018 at 13:58, Richard Biener wrote:
>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>> Hi, I am trying to work on PR80155, which exposes a problem with
>>>> code hoisting and register pressure on a leading embedded benchmark
>>>> for ARM cortex-m7, where code-hoisting causes an extra register spill.
>>>> I have attached two test-cases which (hopefully) are representative
>>>> of the original test-case.  The first one (trans_dfa.c) is bigger and
>>>> somewhat similar to the original test-case, and trans_dfa_2.c is a
>>>> hand-reduced version of trans_dfa.c.  There are 2 spills caused with
>>>> trans_dfa.c and one spill with trans_dfa_2.c due to the smaller number
>>>> of cases.  The test-cases in the PR are probably not relevant.
>>>>
>>>> Initially I thought the spill was happening because of "too many
>>>> hoistings" taking place in the original test-case thus increasing the
>>>> register pressure, but it seems the spill is possibly caused because
>>>> the expression gets hoisted out of a block that is on a loop exit.
>>>>
>>>> For example, the following hoistings take place with trans_dfa_2.c:
>>>>
>>>> (1) Inserting expression in block 4 for code hoisting:
>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>
>>>> (2) Inserting expression in block 4 for code hoisting:
>>>> {plus_expr,_4,1} (0006)
>>>>
>>>> (3) Inserting expression in block 4 for code hoisting:
>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>
>>>> (4) Inserting expression in block 3 for code hoisting:
>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>
>>>> The issue seems to be hoisting of (*tab + 1), which consists of the
>>>> first two hoistings in block 4 from blocks 5 and 9, and which causes
>>>> the extra spill.  I verified that by disabling hoisting into block 4,
>>>> which resulted in no extra spills.
>>>>
>>>> I wonder if that's because the expression (*tab + 1) is getting
>>>> hoisted from blocks 5 and 9, which are on a loop exit?  So the
>>>> expression that was previously computed in a block on a loop exit
>>>> gets hoisted outside that block, which possibly makes the allocator
>>>> more defensive?  Similarly, disabling hoisting of expressions which
>>>> appeared in blocks on a loop exit in the original test-case prevented
>>>> the extra spill.  The other hoistings didn't seem to matter.
>>>
>>> I think that's simply co-incidence.  The only thing that makes
>>> a block that also exits from the loop special is that an
>>> expression could be sunk out of the loop and hoisting (commoning
>>> with another path) could prevent that.  But that isn't what is
>>> happening here and it would be a pass ordering issue as
>>> the sinking pass runs only after hoisting (no idea why exactly
>>> but I guess there are cases where we want to prefer CSE over
>>> sinking).  So you could try if re-ordering PRE and sinking helps
>>> your testcase.
>>
>> Thanks for the suggestions.  Placing the sink pass before PRE works
>> for both these test-cases!  Sadly it still causes the spill for the
>> benchmark -:(
>> I will try to create a better approximation of the original test-case.
>>>
>>> What I do see is a missed opportunity to merge the successors
>>> of BB 4.
>>> After PRE we have
>>>
>>>   [local count: 159303558]:
>>>   pretmp_123 = *tab_37(D);
>>>   _87 = pretmp_123 + 1;
>>>   if (c_36 == 65)
>>>     goto ; [34.00%]
>>>   else
>>>     goto ; [66.00%]
>>>
>>>   [local count: 54163210]:
>>>   *tab_37(D) = _87;
>>>   _96 = MEM[(char *)s_57 + 1B];
>>>   if (_96 != 0)
>>>     goto ; [89.00%]
>>>   else
>>>     goto ; [11.00%]
>>>
>>>   [local count: 105140348]:
>>>   *tab_37(D) = _87;
>>>   _56 = MEM[(char *)s_57 + 1B];
>>>   if (_56 != 0)
>>>     goto ; [89.00%]
>>>   else
>>>     goto ; [11.00%]
>>>
>>> here at least the stores and loads can be hoisted.  Note this
>>> may also point at the real issue of the code hoisting which is
>>> tearing apart the RMW operation?
>>
>> Indeed, this possibility seems much more likely than block being
>> on loop exit.
>> I will try to "hardcode" the load/store hoists into block 4 for this
>> specific test-case to check if that prevents the spill.
>
> Even if it prevents the spill in this case, it's likely a good thing to
> do.  The statements prior to the conditional in bb5 and bb8 should be
> hoisted, leaving bb5 and bb8 with just
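For readers without the attached test-cases, here is a rough C sketch of the read-modify-write pattern being discussed, reconstructed from the GIMPLE above; it is not the actual trans_dfa_2.c and is only meant to show the shape of the problem.

/* Code hoisting can common the load of *tab and the "+ 1" into the
   block before the branch, while the store *tab = ... stays duplicated
   in both arms (as in the GIMPLE above).  The hoisted value then has to
   stay live across the branch, which is the extra register pressure
   under discussion.  */
int run(int *tab, const char *s)
{
  int state = 0;
  while (*s)
    {
      if (*s == 'A')
        {
          (*tab)++;   /* read-modify-write on this path ...          */
          state = 1;
        }
      else
        {
          (*tab)++;   /* ... and the same read-modify-write here.    */
          state = 2;
        }
      s++;
    }
  return state;
}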
Re: Enabling -ftree-slp-vectorize on -O2/Os
On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
> I brought this subject up earlier, and was told to suggest it again for
> gcc 9, so I have attached the preliminary changes.
>
> My studies have shown that with generic x86-64 optimization it reduces
> binary size by around 0.5%, and when optimizing for x86-64 targets with
> SSE4 or better, it reduces binary size by 2-3% on average.  The
> performance changes are negligible however*, and I haven't been able to
> detect changes in compile time big enough to penetrate general noise on
> my platform, but perhaps someone has a better setup for that?
>
> * I believe that is because it currently works best on non-optimized
> code; it is better at big basic blocks doing all kinds of things than
> at tightly written inner loops.
>
> Anything else I should test or report?

What does it do on other architectures?

Segher
Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP
On Sat, May 26, 2018 at 11:09:24AM +0100, Richard Sandiford wrote:
> On the wider point about changing the way call clobber information
> is represented: I agree it would be good to generalise what we have
> now.  But if possible I think we should avoid target hooks that take
> a specific call, and instead make it an inherent part of the call insn
> itself, much like CALL_INSN_FUNCTION_USAGE is now.  E.g. we could add
> a field that points to an ABI description, with -fipa-ra effectively
> creating ad-hoc ABIs.  That ABI description could start out with
> whatever we think is relevant now and could grow over time.

Somewhat related: there still is PR68150 open for problems with HARD_REGNO_CALL_PART_CLOBBERED in postreload-gcse (it ignores it).

Segher
Re: Enabling -ftree-slp-vectorize on -O2/Os
On Sunday, 27 May 2018 00:05:32 CEST, Segher Boessenkool wrote:
> On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
>> I brought this subject up earlier, and was told to suggest it again for
>> gcc 9, so I have attached the preliminary changes.
>>
>> My studies have shown that with generic x86-64 optimization it reduces
>> binary size by around 0.5%, and when optimizing for x86-64 targets with
>> SSE4 or better, it reduces binary size by 2-3% on average.  The
>> performance changes are negligible however*, and I haven't been able to
>> detect changes in compile time big enough to penetrate general noise on
>> my platform, but perhaps someone has a better setup for that?
>>
>> * I believe that is because it currently works best on non-optimized
>> code; it is better at big basic blocks doing all kinds of things than
>> at tightly written inner loops.
>>
>> Anything else I should test or report?
>
> What does it do on other architectures?

I believe NEON would do the same as SSE4, but I can do a check. For architectures without SIMD it essentially does nothing.

'Allan
Re: Enabling -ftree-slp-vectorize on -O2/Os
On Sun, May 27, 2018 at 01:25:25AM +0200, Allan Sandfeld Jensen wrote:
> On Sunday, 27 May 2018 00:05:32 CEST, Segher Boessenkool wrote:
>> On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
>>> I brought this subject up earlier, and was told to suggest it again for
>>> gcc 9, so I have attached the preliminary changes.
>>>
>>> My studies have shown that with generic x86-64 optimization it reduces
>>> binary size by around 0.5%, and when optimizing for x86-64 targets with
>>> SSE4 or better, it reduces binary size by 2-3% on average.  The
>>> performance changes are negligible however*, and I haven't been able to
>>> detect changes in compile time big enough to penetrate general noise on
>>> my platform, but perhaps someone has a better setup for that?
>>>
>>> * I believe that is because it currently works best on non-optimized
>>> code; it is better at big basic blocks doing all kinds of things than
>>> at tightly written inner loops.
>>>
>>> Anything else I should test or report?
>>
>> What does it do on other architectures?
>
> I believe NEON would do the same as SSE4, but I can do a check.  For
> architectures without SIMD it essentially does nothing.

Sorry, I wasn't clear. What does it do to performance on other architectures? Is it (almost) always a win (or neutral)? If not, it doesn't belong in -O2, not for the generic options at least.

(We'll test it on Power soon, it's weekend now :-) ).

Segher
Re: Enabling -ftree-slp-vectorize on -O2/Os
On May 27, 2018 1:25:25 AM GMT+02:00, Allan Sandfeld Jensen wrote:
>On Sunday, 27 May 2018 00:05:32 CEST, Segher Boessenkool wrote:
>> On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
>>> I brought this subject up earlier, and was told to suggest it again
>>> for gcc 9, so I have attached the preliminary changes.
>>>
>>> My studies have shown that with generic x86-64 optimization it reduces
>>> binary size by around 0.5%, and when optimizing for x86-64 targets
>>> with SSE4 or better, it reduces binary size by 2-3% on average.  The
>>> performance changes are negligible however*, and I haven't been able
>>> to detect changes in compile time big enough to penetrate general
>>> noise on my platform, but perhaps someone has a better setup for that?
>>>
>>> * I believe that is because it currently works best on non-optimized
>>> code; it is better at big basic blocks doing all kinds of things than
>>> at tightly written inner loops.
>>>
>>> Anything else I should test or report?
>>
>> What does it do on other architectures?
>
>I believe NEON would do the same as SSE4, but I can do a check.  For
>architectures without SIMD it essentially does nothing.

By default it combines integer ops where possible into word_mode registers. So yes, almost nothing.

Richard.

>'Allan
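A minimal illustration of that word_mode case (added for clarity; not from the original reply): on a 64-bit target with no vector unit, the two adjacent 32-bit copies below can still be combined into a single 64-bit register operation.

/* Two adjacent 32-bit accesses; SLP can treat them as one word_mode
   (64-bit) load/store pair even when no SIMD registers exist.  */
struct pair { int a, b; };

void copy_pair(struct pair *restrict d, const struct pair *restrict s)
{
  d->a = s->a;
  d->b = s->b;
}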