Enabling -ftree-slp-vectorize on -O2/Os

2018-05-26 Thread Allan Sandfeld Jensen
I brought this subject up earlier, and was told to suggest it again for gcc 9, 
so I have attached the preliminary changes.

My studies have shown that with generic x86-64 optimization it reduces binary 
size by around 0.5%, and when optimizing for x86-64 targets with SSE4 or 
better, it reduces binary size by 2-3% on average. The performance changes, 
however, are negligible*, and I haven't been able to detect changes in 
compile time big enough to rise above the general noise on my platform, but 
perhaps someone has a better setup for that?

* I believe that is because it currently works best on non-optimized code: it 
is better at big basic blocks doing all kinds of things than at tightly 
written inner loops.
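
To make this concrete, here is a minimal sketch of the kind of straight-line 
code SLP vectorization targets (illustrative example only, not taken from the 
measurements above):

/* Illustrative only: with -O2 -ftree-slp-vectorize and a SIMD target
   (e.g. SSE2), the four independent scalar adds on adjacent elements
   below can be combined into a single vector add.  No loop required.  */
void add4(int *restrict a, const int *restrict b)
{
  a[0] += b[0];
  a[1] += b[1];
  a[2] += b[2];
  a[3] += b[3];
}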

Anything else I should test or report?

Best regards
'Allan


diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index beba295bef5..05851229354 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -7612,6 +7612,7 @@ also turns on the following optimization flags:
 -fstore-merging @gol
 -fstrict-aliasing @gol
 -ftree-builtin-call-dce @gol
+-ftree-slp-vectorize @gol
 -ftree-switch-conversion -ftree-tail-merge @gol
 -fcode-hoisting @gol
 -ftree-pre @gol
@@ -7635,7 +7636,6 @@ by @option{-O2} and also turns on the following optimization flags:
 -floop-interchange @gol
 -floop-unroll-and-jam @gol
 -fsplit-paths @gol
--ftree-slp-vectorize @gol
 -fvect-cost-model @gol
 -ftree-partial-pre @gol
 -fpeel-loops @gol
@@ -8932,7 +8932,7 @@ Perform loop vectorization on trees. This flag is enabled by default at
 @item -ftree-slp-vectorize
 @opindex ftree-slp-vectorize
 Perform basic block vectorization on trees. This flag is enabled by default at
-@option{-O3} and when @option{-ftree-vectorize} is enabled.
+@option{-O2} or higher, and when @option{-ftree-vectorize} is enabled.
 
 @item -fvect-cost-model=@var{model}
 @opindex fvect-cost-model
diff --git a/gcc/opts.c b/gcc/opts.c
index 33efcc0d6e7..11027b847e8 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -523,6 +523,7 @@ static const struct default_options default_options_table[] =
 { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
 { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
 { OPT_LEVELS_2_PLUS, OPT_fstore_merging, NULL, 1 },
+{ OPT_LEVELS_2_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
 
 /* -O3 optimizations.  */
 { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
@@ -539,7 +540,6 @@ static const struct default_options default_options_table[] =
 { OPT_LEVELS_3_PLUS, OPT_floop_unroll_and_jam, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_fgcse_after_reload, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_ftree_loop_vectorize, NULL, 1 },
-{ OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_fvect_cost_model_, NULL, VECT_COST_MODEL_DYNAMIC },
 { OPT_LEVELS_3_PLUS, OPT_fipa_cp_clone, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_ftree_partial_pre, NULL, 1 },





Re: Why is REG_ALLOC_ORDER not defined on Aarch64

2018-05-26 Thread Richard Sandiford
Andrew Pinski writes:
> On Fri, May 25, 2018 at 3:35 PM, Steve Ellcey wrote:
>> I was curious if there was any reason that REG_ALLOC_ORDER is not
>> defined for Aarch64.  Has anyone tried this to see if it could help
>> performance?  It is defined for many other platforms.
>
> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01815.html
> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01822.html

It looks like the immediate reason for reverting was the effect of
listing the argument registers in reverse order.  I wonder how much that
actually helps with IRA and LRA?  They track per-register costs, and
would be able to increase the cost of a pseudo that conflicts with a
hard-register call argument.

It just felt like it might have been a "best practice" idea passed down
from the old local.c and global.c days.
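
For readers who haven't seen it, REG_ALLOC_ORDER is just an array initializer
in a target header; here is a hypothetical sketch of the shape of such a
definition (register numbers invented, not the reverted aarch64 patch):

/* Hypothetical sketch only -- the numbering does not match aarch64.
   REG_ALLOC_ORDER lists hard registers in decreasing allocation
   preference; listing the argument registers last and in reverse order
   is the kind of tweak the reverted patch made.  */
#define REG_ALLOC_ORDER                                             \
  {                                                                 \
    8, 9, 10, 11, 12, 13, 14, 15, /* caller-saved temporaries */    \
    16, 17, 18, 19,               /* callee-saved registers */      \
    7, 6, 5, 4, 3, 2, 1, 0        /* argument registers, reversed */ \
  }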

Thanks,
Richard


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-26 Thread Richard Sandiford
Steve Ellcey writes:
> On Wed, 2018-05-16 at 22:11 +0100, Richard Sandiford wrote:
>> 
>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED is the only current way
>> of saying that an rtl instruction preserves the low part of a
>> register but clobbers the high part.  We would need something like
>> Alan H's CLOBBER_HIGH patches to do it using explicit clobbers.
>> 
>> Another approach would be to piggy-back on the -fipa-ra
>> infrastructure
>> and record that vector PCS functions only clobber Q0-Q7.  If -fipa-ra
>> knows that a function doesn't clobber Q8-Q15 then that should
>> override
>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  (I'm not sure whether it does
>> in practice, but it should :-)  And if it doesn't that's a bug that's
>> worth fixing for its own sake.)
>> 
>> Thanks,
>> Richard
>
> Alan,
>
> I have been looking at your CLOBBER_HIGH patches to see if they
> might be helpful in implementing the ARM SIMD Vector ABI in GCC.
> I have also been looking at the -fipa-ra flag and how it works.
>
> I was wondering if you considered using the ipa-ra infrastructure
> for the SVE work that you are currently trying to support with 
> the CLOBBER_HIGH macro?
>
> My current thought for the ABI work is to mark all the floating
> point / vector registers as caller saved (the lower half of V8-V15
> are currently callee saved) and remove
> TARGET_HARD_REGNO_CALL_PART_CLOBBERED.
> This should work but would be inefficient.
>
> The next step would be to split get_call_reg_set_usage up into
> two functions so that I don't have to pass in a default set of
> registers.  One function would return call_used_reg_set by
> default (but could return a smaller set if it had actual used
> register information) and the other would return regs_invalidated_by_call
> by default (but could also return a smaller set).
>
> Next I would add a 'largest mode used' array to call_cgraph_rtl_info
> structure in addition to the current function_used_regs register
> set.
>
> Then I could turn the get_call_reg_set_usage replacement functions
> into target specific functions and with the information in the
> call_cgraph_rtl_info structure and any simd attribute information on
> a function I could modify what registers are really being used/invalidated
> without being saved.
>
> If the called function only uses the bottom half of a register it would not
> be marked as used/invalidated.  If it uses the entire register and the
> function is not marked as simd, then the register would be marked as
> used/invalidated.  If the function was marked as simd the register would not
> be marked because a simd function would save both the upper and lower halves
> of a callee saved register (whereas a non simd function would only save the
> lower half).
>
> Does this sound like something that could be used in place of your 
> CLOBBER_HIGH patch?

One of the advantages of CLOBBER_HIGH is that it can be attached to
arbitrary instructions, not just calls.  The motivating example was
tlsdesc_small_<mode>, which isn't treated as a call but as a normal
instruction.  (And I don't think we want to change that, since it's much
easier for rtl optimisers to deal with normal instructions compared to
calls.  In general a call is part of a longer sequence of instructions
that includes setting up arguments, etc.)

The other use case (not implemented in the posted patches) would be
to represent the effect of syscalls, which clobber the "SVE part"
of all vector registers.  In that case the clobber would need to be
attached to an inline asm insn.

On the wider point about changing the way call clobber information
is represented: I agree it would be good to generalise what we have
now.  But if possible I think we should avoid target hooks that take
a specific call, and instead make it an inherent part of the call insn
itself, much like CALL_INSN_FUNCTION_USAGE is now.  E.g. we could add
a field that points to an ABI description, with -fipa-ra effectively
creating ad-hoc ABIs.  That ABI description could start out with
whatever we think is relevant now and could grow over time.
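
A rough sketch of that idea, with the type and field names invented for
illustration (this is not from any posted patch):

/* Hypothetical sketch: an ABI description that a call insn could point
   to, with -fipa-ra effectively creating ad-hoc instances per callee.
   HARD_REG_SET as in GCC's hard-reg-set.h.  */
struct call_abi_desc
{
  /* Registers the call clobbers outright.  */
  HARD_REG_SET full_clobbers;
  /* Registers whose high parts are clobbered while the low parts are
     preserved (e.g. Q8-Q15 under the AArch64 vector PCS).  */
  HARD_REG_SET partial_clobbers;
};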

Thanks,
Richard


Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-26 Thread Richard Biener
On May 26, 2018 11:32:29 AM GMT+02:00, Allan Sandfeld Jensen wrote:
>I brought this subject up earlier, and was told to suggest it again for
>gcc 9, so I have attached the preliminary changes.
>
>My studies have shown that with generic x86-64 optimization it reduces
>binary size by around 0.5%, and when optimizing for x86-64 targets with
>SSE4 or better, it reduces binary size by 2-3% on average. The
>performance changes are negligible however*, and I haven't been able to
>detect changes in compile time big enough to rise above the general
>noise on my platform, but perhaps someone has a better setup for that?
>
>* I believe that is because it currently works best on non-optimized
>code: it is better at big basic blocks doing all kinds of things than
>at tightly written inner loops.
>
>Anything else I should test or report?

If you have access to SPEC CPU I'd like to see performance, size and 
compile-time effects of the patch on that. Embedded folks may want to run their 
favorite benchmark and report results as well. 

Richard. 

>Best regards
>'Allan
>
>
>diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>index beba295bef5..05851229354 100644
>--- a/gcc/doc/invoke.texi
>+++ b/gcc/doc/invoke.texi
>@@ -7612,6 +7612,7 @@ also turns on the following optimization flags:
> -fstore-merging @gol
> -fstrict-aliasing @gol
> -ftree-builtin-call-dce @gol
>+-ftree-slp-vectorize @gol
> -ftree-switch-conversion -ftree-tail-merge @gol
> -fcode-hoisting @gol
> -ftree-pre @gol
>@@ -7635,7 +7636,6 @@ by @option{-O2} and also turns on the following 
>optimization flags:
> -floop-interchange @gol
> -floop-unroll-and-jam @gol
> -fsplit-paths @gol
>--ftree-slp-vectorize @gol
> -fvect-cost-model @gol
> -ftree-partial-pre @gol
> -fpeel-loops @gol
>@@ -8932,7 +8932,7 @@ Perform loop vectorization on trees. This flag is
>
>enabled by default at
> @item -ftree-slp-vectorize
> @opindex ftree-slp-vectorize
>Perform basic block vectorization on trees. This flag is enabled by
>default 
>at
>-@option{-O3} and when @option{-ftree-vectorize} is enabled.
>+@option{-O2} or higher, and when @option{-ftree-vectorize} is enabled.
> 
> @item -fvect-cost-model=@var{model}
> @opindex fvect-cost-model
>diff --git a/gcc/opts.c b/gcc/opts.c
>index 33efcc0d6e7..11027b847e8 100644
>--- a/gcc/opts.c
>+++ b/gcc/opts.c
>@@ -523,6 +523,7 @@ static const struct default_options 
>default_options_table[] =
> { OPT_LEVELS_2_PLUS, OPT_fipa_ra, NULL, 1 },
> { OPT_LEVELS_2_PLUS, OPT_flra_remat, NULL, 1 },
> { OPT_LEVELS_2_PLUS, OPT_fstore_merging, NULL, 1 },
>+{ OPT_LEVELS_2_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
> 
> /* -O3 optimizations.  */
>{ OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
>@@ -539,7 +540,6 @@ static const struct default_options 
>default_options_table[] =
> { OPT_LEVELS_3_PLUS, OPT_floop_unroll_and_jam, NULL, 1 },
> { OPT_LEVELS_3_PLUS, OPT_fgcse_after_reload, NULL, 1 },
> { OPT_LEVELS_3_PLUS, OPT_ftree_loop_vectorize, NULL, 1 },
>-{ OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
>{ OPT_LEVELS_3_PLUS, OPT_fvect_cost_model_, NULL,
>VECT_COST_MODEL_DYNAMIC 
>},
> { OPT_LEVELS_3_PLUS, OPT_fipa_cp_clone, NULL, 1 },
> { OPT_LEVELS_3_PLUS, OPT_ftree_partial_pre, NULL, 1 },



Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-26 Thread Florian Weimer
* Allan Sandfeld Jensen:

> Anythhing else I should test or report?

Interaction with -mstackrealign on i386, where it is required for
system libraries to support applications that use the legacy ABI
without stack alignment, if you compile with -msse2 or -march=x86-64
-mtune=generic (and -mfpmath=sse).
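
As a concrete illustration (hypothetical example, assuming -m32 -msse2): a
function with SSE locals may need 16-byte-aligned spill slots, while
legacy-ABI callers only guarantee 4-byte stack alignment, which is why system
libraries built this way need -mstackrealign.

/* Illustrative only: built with e.g. gcc -m32 -msse2 -O2, the vector
   temporaries may be spilled to 16-byte-aligned stack slots.  A caller
   following the legacy i386 ABI guarantees only 4-byte alignment, so
   without -mstackrealign an aligned spill/reload could fault.  */
#include <emmintrin.h>

double sum2(const double *p)
{
  __m128d v  = _mm_loadu_pd(p);        /* load two adjacent doubles */
  __m128d hi = _mm_unpackhi_pd(v, v);  /* move the high lane down */
  return _mm_cvtsd_f64(_mm_add_sd(v, hi));  /* low + high */
}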


RISC-V ELF multilibs

2018-05-26 Thread Sebastian Huber
Hello,

I built a riscv64-rtems5 GCC (it uses gcc/config/riscv/t-elf-multilib). The 
following multilibs are built:

riscv64-rtems5-gcc -print-multi-lib
.;
rv32i/ilp32;@march=rv32i@mabi=ilp32
rv32im/ilp32;@march=rv32im@mabi=ilp32
rv32iac/ilp32;@march=rv32iac@mabi=ilp32
rv32imac/ilp32;@march=rv32imac@mabi=ilp32
rv32imafc/ilp32f;@march=rv32imafc@mabi=ilp32f
rv64imac/lp64;@march=rv64imac@mabi=lp64
rv64imafdc/lp64d;@march=rv64imafdc@mabi=lp64d

If I print out the builtin defines and search paths for the default settings 
and for -march=rv64imafdc, and compare the results, I get:

riscv64-rtems5-gcc -E -P -v -dD empty.c > def.txt 2>&1
riscv64-rtems5-gcc -E -P -v -dD empty.c -march=rv64imafdc > rv64imafdc.txt 2>&1
diff -u def.txt rv64imafdc.txt 
--- def.txt 2018-05-26 14:53:26.277760090 +0200
+++ rv64imafdc.txt  2018-05-26 14:53:47.705638409 +0200
@@ -4,8 +4,8 @@
 Configured with: ../gcc-7.3.0/configure --prefix=/opt/rtems/5 
--bindir=/opt/rtems/5/bin --exec_prefix=/opt/rtems/5 
--includedir=/opt/rtems/5/include --libdir=/opt/rtems/5/lib 
--libexecdir=/opt/rtems/5/libexec --mandir=/opt/rtems/5/share/man 
--infodir=/opt/rtems/5/share/info --datadir=/opt/rtems/5/share 
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=riscv64-rtems5 
--disable-libstdcxx-pch --with-gnu-as --with-gnu-ld --verbose --with-newlib 
--disable-nls --without-included-gettext --disable-win32-registry 
--enable-version-specific-runtime-libs --disable-lto 
--enable-newlib-io-c99-formats --enable-newlib-iconv 
--enable-newlib-iconv-encodings=big5,cp775,cp850,cp852,cp855,cp866,euc_jp,euc_kr,euc_tw,iso_8859_1,iso_8859_10,iso_8859_11,iso_8859_13,iso_8859_14,iso_8859_15,iso_8859_2,iso_8859_3,iso_8859_4,iso_8859_5,iso_8859_6,iso_8859_7,iso_8859_8,iso_8859_9,iso_ir_111,koi8_r,koi8_ru,koi8_u,koi8_uni,ucs_2,ucs_2_internal,ucs_2be,ucs_2le,ucs_4,ucs_4_internal,ucs_4be,ucs_4le,us_ascii,utf_16,utf_16be,utf_16le,utf_8,win_1250,win_1251,win_1252,win_1253,win_1254,win_1255,win_1256,win_1257,win_1258
 --enable-threads --disable-plugin --enable-libgomp --enable-languages=c,c++,ada
 Thread model: rtems
 gcc version 7.3.0 20180125 (RTEMS 5, RSB 
a3a6c34c150a357e57769a26a460c475e188438f, Newlib 3.0.0) (GCC) 
-COLLECT_GCC_OPTIONS='-E' '-P' '-v' '-dD' '-march=rv64gc' '-mabi=lp64d'
- /opt/rtems/5/libexec/gcc/riscv64-rtems5/7.3.0/cc1 -E -quiet -v -P -imultilib 
rv64imafdc/lp64d empty.c -march=rv64gc -mabi=lp64d -dD
+COLLECT_GCC_OPTIONS='-E' '-P' '-v' '-dD' '-march=rv64imafdc' '-mabi=lp64d'
+ /opt/rtems/5/libexec/gcc/riscv64-rtems5/7.3.0/cc1 -E -quiet -v -P -imultilib 
rv64imafdc/lp64d empty.c -march=rv64imafdc -mabi=lp64d -dD
 ignoring nonexistent directory 
"/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/../../../../riscv64-rtems5/sys-include"
 #include "..." search starts here:
 #include <...> search starts here:
@@ -338,4 +338,4 @@
 #define __ELF__ 1
 
COMPILER_PATH=/opt/rtems/5/libexec/gcc/riscv64-rtems5/7.3.0/:/opt/rtems/5/libexec/gcc/riscv64-rtems5/7.3.0/:/opt/rtems/5/libexec/gcc/riscv64-rtems5/:/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/:/opt/rtems/5/lib/gcc/riscv64-rtems5/:/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/../../../../riscv64-rtems5/bin/
 
LIBRARY_PATH=/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/rv64imafdc/lp64d/:/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/../../../../riscv64-rtems5/lib/rv64imafdc/lp64d/:/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/:/opt/rtems/5/lib/gcc/riscv64-rtems5/7.3.0/../../../../riscv64-rtems5/lib/:/lib/:/usr/lib/
-COLLECT_GCC_OPTIONS='-E' '-P' '-v' '-dD' '-march=rv64gc' '-mabi=lp64d'
+COLLECT_GCC_OPTIONS='-E' '-P' '-v' '-dD' '-march=rv64imafdc' '-mabi=lp64d'

This looks pretty much the same, and the documentation says that G == IMAFD.

Why are the default multilib and a variant identical?

Most variants include the C extension. Would it be possible to add -march=rv32g 
and -march=rv64g variants?

-- 
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax : +49 89 189 47 41-09

This message is not a commercial communication within the meaning of the German EHUG.


Re: PR80155: Code hoisting and register pressure

2018-05-26 Thread Bin.Cheng
On Fri, May 25, 2018 at 5:54 PM, Richard Biener wrote:
> On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law wrote:
>>On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni wrote:
 On 23 May 2018 at 18:37, Jeff Law wrote:
> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>> On 23 May 2018 at 13:58, Richard Biener wrote:
>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>
 Hi,
 I am trying to work on PR80155, which exposes a problem with code
 hoisting and register pressure on a leading embedded benchmark for ARM
 cortex-m7, where code-hoisting causes an extra register spill.

 I have attached two test-cases which (hopefully) are representative of
 the original test-case.
 The first one (trans_dfa.c) is bigger and somewhat similar to the
 original test-case, and trans_dfa_2.c is a hand-reduced version of
 trans_dfa.c. There are 2 spills caused with trans_dfa.c
 and one spill with trans_dfa_2.c due to the smaller number of cases.
 The test-cases in the PR are probably not relevant.

 Initially I thought the spill was happening because of "too many
 hoistings" taking place in the original test-case, thus increasing the
 register pressure, but it seems the spill is possibly caused because an
 expression gets hoisted out of a block that is on a loop exit.

 For example, the following hoistings take place with trans_dfa_2.c:

 (1) Inserting expression in block 4 for code hoisting:
 {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)

 (2) Inserting expression in block 4 for code hoisting:
 {plus_expr,_4,1} (0006)

 (3) Inserting expression in block 4 for code hoisting:
 {pointer_plus_expr,s_33,1} (0023)

 (4) Inserting expression in block 3 for code hoisting:
 {pointer_plus_expr,s_33,1} (0023)

 The issue seems to be the hoisting of (*tab + 1), which consists of the
 first two hoistings in block 4 from blocks 5 and 9, and causes the
 extra spill. I verified that by disabling hoisting into block 4, which
 resulted in no extra spills.

 I wonder if that's because the expression (*tab + 1) is getting hoisted
 from blocks 5 and 9, which are on a loop exit? So the expression that
 was previously computed in a block on a loop exit gets hoisted outside
 that block, which possibly makes the allocator more defensive?
 Similarly, disabling hoisting of expressions which appeared in blocks
 on a loop exit in the original test-case prevented the extra spill. The
 other hoistings didn't seem to matter.
>>>
>>> I think that's simply co-incidence.  The only thing that makes
>>> a block that also exits from the loop special is that an
>>> expression could be sunk out of the loop and hoisting (commoning
>>> with another path) could prevent that.  But that isn't what is
>>> happening here and it would be a pass ordering issue as
>>> the sinking pass runs only after hoisting (no idea why exactly
>>> but I guess there are cases where we want to prefer CSE over
>>> sinking).  So you could try if re-ordering PRE and sinking helps
>>> your testcase.
>> Thanks for the suggestions. Placing sink pass before PRE works for
>> both these test-cases! Sadly it still causes the spill for the
>> benchmark -:(
>> I will try to create a better approximation of the original test-case.
>>>
>>> What I do see is a missed opportunity to merge the successors
>>> of BB 4.  After PRE we have
>>>
>>>  [local count: 159303558]:
>>> :
>>> pretmp_123 = *tab_37(D);
>>> _87 = pretmp_123 + 1;
>>> if (c_36 == 65)
>>>   goto ; [34.00%]
>>> else
>>>   goto ; [66.00%]
>>>
>>>  [local count: 54163210]:
>>> *tab_37(D) = _87;
>>> _96 = MEM[(char *)s_57 + 1B];
>>> if (_96 != 0)
>>>   goto ; [89.00%]
>>> else
>>>   goto ; [11.00%]
>>>
>>>  [local count: 105140348]:
>>> *tab_37(D) = _87;
>>> _56 = MEM[(char *)s_57 + 1B];
>>> if (_56 != 0)
>>>   goto ; [89.00%]
>>> else
>>>   goto ; [11.00%]
>>>
>>> here at least the stores and loads can be hoisted.  Note this
>>> may also point at the real issue of the code hoisting which is
>>> tearing apart the RMW operation?
>> Indeed, this possibility seems much more likely than the block being
>> on loop exit.
>> I will try to "hardcode" the load/store hoists into block 4 for this
>> specific test-case to check if that prevents the spill.
> Even if it prevents the spill in this case, it's likely a good thing to
> do.  The statements prior to the conditional in bb5 and bb8 should be
> hoisted, leaving bb5 and bb8 with just
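
For reference, a hand-written reduction of the shape under discussion
(hypothetical code, not the PR's testcase): both arms perform the
read-modify-write (*tab)++, and hoisting commons the load and the add into
the shared predecessor, leaving only the stores behind.

extern void do_a(void);  /* hypothetical helpers, for illustration */
extern void do_b(void);

/* Illustrative only: code hoisting can common *tab and *tab + 1 from
   the two arms into the block before the branch, separating them from
   the stores -- the "torn" read-modify-write discussed above, which
   lengthens the live range of the incremented value.  */
void step(int *tab, const char *s, int c)
{
  if (c == 'A')
    {
      (*tab)++;
      if (s[1] != 0)
        do_a();
    }
  else
    {
      (*tab)++;
      if (s[1] != 0)
        do_b();
    }
}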

Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-26 Thread Segher Boessenkool
On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
> I brought this subject up earlier, and was told to suggest it again for
> gcc 9, so I have attached the preliminary changes.
> 
> My studies have shown that with generic x86-64 optimization it reduces
> binary size by around 0.5%, and when optimizing for x86-64 targets with
> SSE4 or better, it reduces binary size by 2-3% on average. The
> performance changes are negligible however*, and I haven't been able to
> detect changes in compile time big enough to rise above the general
> noise on my platform, but perhaps someone has a better setup for that?
> 
> * I believe that is because it currently works best on non-optimized
> code: it is better at big basic blocks doing all kinds of things than
> at tightly written inner loops.
> 
> Anything else I should test or report?

What does it do on other architectures?


Segher


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-26 Thread Segher Boessenkool
On Sat, May 26, 2018 at 11:09:24AM +0100, Richard Sandiford wrote:
> On the wider point about changing the way call clobber information
> is represented: I agree it would be good to generalise what we have
> now.  But if possible I think we should avoid target hooks that take
> a specific call, and instead make it an inherent part of the call insn
> itself, much like CALL_INSN_FUNCTION_USAGE is now.  E.g. we could add
> a field that points to an ABI description, with -fipa-ra effectively
> creating ad-hoc ABIs.  That ABI description could start out with
> whatever we think is relevant now and could grow over time.

Somewhat related: there still is PR68150 open for problems with
HARD_REGNO_CALL_PART_CLOBBERED in postreload-gcse (it ignores it).


Segher


Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-26 Thread Allan Sandfeld Jensen
On Sunday, 27 May 2018 00:05:32 CEST Segher Boessenkool wrote:
> On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
> > I brought this subject up earlier, and was told to suggest it again for
> > gcc 9, so I have attached the preliminary changes.
> > 
> > My studies have shown that with generic x86-64 optimization it reduces
> > binary size with around 0.5%, and when optimizing for x64 targets with
> > SSE4 or better, it reduces binary size by 2-3% on average. The
> > performance changes are negligible however*, and I haven't been able to
> > detect changes in compile time big enough to penetrate general noise on
> > my platform, but perhaps someone has a better setup for that?
> > 
> > * I believe that is because it currently works best on non-optimized code,
> > it is better at big basic blocks doing all kinds of things than tightly
> > written inner loops.
> > 
> > Anything else I should test or report?
> 
> What does it do on other architectures?
> 
> 
I believe NEON would do the same as SSE4, but I can do a check. For 
architectures without SIMD it essentially does nothing.

'Allan




Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-26 Thread Segher Boessenkool
On Sun, May 27, 2018 at 01:25:25AM +0200, Allan Sandfeld Jensen wrote:
> On Sunday, 27 May 2018 00:05:32 CEST Segher Boessenkool wrote:
> > On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
> > > I brought this subject up earlier, and was told to suggest it again for
> > > gcc 9, so I have attached the preliminary changes.
> > > 
> > > My studies have shown that with generic x86-64 optimization it reduces
> > > binary size with around 0.5%, and when optimizing for x64 targets with
> > > SSE4 or better, it reduces binary size by 2-3% on average. The
> > > performance changes are negligible however*, and I haven't been able to
> > > detect changes in compile time big enough to penetrate general noise on
> > > my platform, but perhaps someone has a better setup for that?
> > > 
> > > * I believe that is because it currently works best on non-optimized code,
> > > it is better at big basic blocks doing all kinds of things than tightly
> > > written inner loops.
> > > 
> > > Anything else I should test or report?
> > 
> > What does it do on other architectures?
> > 
> I believe NEON would do the same as SSE4, but I can do a check. For 
> architectures without SIMD it essentially does nothing.

Sorry, I wasn't clear.  What does it do to performance on other
architectures?  Is it (almost) always a win (or neutral)?  If not, it
doesn't belong in -O2, not for the generic options at least.

(We'll test it on Power soon, it's weekend now :-) ).


Segher


Re: Enabling -ftree-slp-vectorize on -O2/Os

2018-05-26 Thread Richard Biener
On May 27, 2018 1:25:25 AM GMT+02:00, Allan Sandfeld Jensen wrote:
>On Sunday, 27 May 2018 00:05:32 CEST Segher Boessenkool wrote:
>> On Sat, May 26, 2018 at 11:32:29AM +0200, Allan Sandfeld Jensen wrote:
>> > I brought this subject up earlier, and was told to suggest it again
>> > for gcc 9, so I have attached the preliminary changes.
>> > 
>> > My studies have shown that with generic x86-64 optimization it
>> > reduces binary size by around 0.5%, and when optimizing for x86-64
>> > targets with SSE4 or better, it reduces binary size by 2-3% on
>> > average. The performance changes are negligible however*, and I
>> > haven't been able to detect changes in compile time big enough to
>> > rise above the general noise on my platform, but perhaps someone has
>> > a better setup for that?
>> > 
>> > * I believe that is because it currently works best on non-optimized
>> > code: it is better at big basic blocks doing all kinds of things
>> > than at tightly written inner loops.
>> > 
>> > Anything else I should test or report?
>> 
>> What does it do on other architectures?
>> 
>I believe NEON would do the same as SSE4, but I can do a check. For
>architectures without SIMD it essentially does nothing.

By default it combines integer ops where possible into word_mode registers. So 
yes, almost nothing. 
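
A sketch of what that can look like (hypothetical example, assuming 16-bit
short and 32-bit word_mode):

/* Illustrative only: even on a target without SIMD, SLP can merge the
   two adjacent 16-bit stores into one 32-bit word_mode store.  */
void pack2(short *d, short a, short b)
{
  d[0] = a;
  d[1] = b;
}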

Richard. 

>'Allan