Frame pointer optimization issues

2014-08-20 Thread Wilco Dijkstra
Hi,

Various targets implement -momit-leaf-frame-pointer to avoid using a frame
pointer in leaf functions. Currently the GCC mid-end does not provide a way of
doing this, so targets have resorted to hacks. Typically this involves forcing
flag_omit_frame_pointer to be true in the TARGET_OPTION_OVERRIDE callback. The
issue is that this doesn't work, as it modifies the actual option variable. As
a result the callback is not idempotent, so option save/restore fails when
using function attributes because the callback is called multiple times on the
modified options. Note this bug exists on all targets which override options
in TARGET_OPTION_OVERRIDE (and despite claims to the contrary in BZ 60580, it
exists on all targets that implement -momit-leaf-frame-pointer).

One could hack this a bit further and set flag_omit_frame_pointer = 2 to 
differentiate between a
user setting and the override hack, but that's just making things even worse. 
So I see 3 possible
solutions:

1. Add a copy of flag_omit_frame_pointer, and only modify that in the override.
This is the generically correct solution that allows any kind of modification
on the copies. This could be done by making all flags separate variables and
automating the copy in the option parsing code. Any code that writes the
x_flag_ variables should eventually be fixed to stop doing this to avoid these
bugs (i386 does this 22 times and c6x twice).

2. Change the mid-end to call TARGET_FRAME_POINTER_REQUIRED even when
!flag_omit_frame_pointer. This is a generic solution which allows targets to
decide when exactly to optimize frame pointers. However it does mean all
implementations of TARGET_FRAME_POINTER_REQUIRED must be updated (the trivial
safe fix is to add "if (!flag_omit_frame_pointer) return true;" at the start).

3. Add a new target callback to avoid having to update all targets. This
replaces the existing TARGET_FRAME_POINTER_REQUIRED if implemented and avoids
having to update all targets in one go.


A second issue with frame pointers is that update_eliminables() in reload1.c
might set frame_pointer_needed to false without any checks. This can't be used
to implement -momit-leaf-frame-pointer as it doesn't always happen (i.e. when
TARGET_CAN_ELIMINATE always returns true). However assuming it does trigger in
some circumstances, the bug is that it does not check that the frame pointer
really isn't required. Even if the frame pointer is completely unused and thus
eliminable from the function, the frame pointer setup might still be required
by external agents for debugging, unwinding and/or profiling. I believe a more
elaborate check is needed, at a minimum a call to
TARGET_FRAME_POINTER_REQUIRED.

What do people think? My preference is for option 1 as it fixes all current and 
future issues with
option overrides, plus option 3 to make the frame pointer callback more generic.

Wilco





RE: Frame pointer optimization issues

2014-08-21 Thread Wilco Dijkstra
> Richard Henderson wrote:
> On 08/20/2014 08:22 AM, Wilco Dijkstra wrote:
> > 2. Change the mid-end to call TARGET_FRAME_POINTER_REQUIRED even when
> > !flag_omit_frame_pointer.
> 
> Um, it does that already.  At least as far as I can see from
> ira_setup_eliminable_regset and update_eliminables.

No, in ira_setup_eliminable_regset the frame pointer is always forced if 
!flag_omit_frame_pointer without allowing frame_pointer_required to override it:

  frame_pointer_needed
= (! flag_omit_frame_pointer
   ...
   || targetm.frame_pointer_required ());

This would allow targets to choose whether to do leaf frame pointer optimization:

  frame_pointer_needed
= ((! flag_omit_frame_pointer && targetm.frame_pointer_required ())

> It turns out to be much easier to re-enable a frame pointer for a given
> function than to disable a frame pointer.  Thus I believe that you should
> approach -momit_leaf_frame_pointer as setting flag_omit_frame_pointer, and 
> then
> re-enabling it in frame_pointer_required.  This requires more than one line in
> common/config/arch/arch.c, but it shouldn't be much more than ten.

As I explained it is not correct to force flag_omit_frame_pointer to be true. 
This is what is done today and it fails in various cases. So unless the way 
options
are handled is changed, this possibility is out.

> > A second issue with frame pointers is that update_eliminables() in 
> > reload1.c might set
> > frame_pointer_needed to false without any checks.
> 
> How?  I don't see that path, since the very first thing update_eliminables 
> does
> is call frame_pointer_required -- even before it calls can_eliminate.

update_eliminables() does indeed call frame_pointer_required at the start,
however this only blocks elimination *from* HARD_FRAME_POINTER_REGNUM, while
the code at the end clears frame_pointer_needed if FRAME_POINTER_REGNUM can be
eliminated into any register other than HARD_FRAME_POINTER_REGNUM. The middle
part of the function is not relevant, as HARD_FRAME_POINTER_REGNUM should only
be eliminable into SP (and even if it could be eliminable into some other
register X, that would only block eliminations of X to SP).

So frame_pointer_needed can be cleared even when frame_pointer_required is
true...

In principle, if this function worked reliably, we could implement leaf frame
pointer optimization using this mechanism. Unfortunately it doesn't:
update_eliminables is not called in trivial leaf functions even when
can_eliminate always returns true, so the frame pointer is never removed.
Additionally I'd be worried about compilation performance, as it would
introduce extra register allocation passes for ~50% of functions.

Wilco





Register allocation: caller-save vs spilling

2014-08-27 Thread Wilco Dijkstra
Hi,

I'm investigating various register allocation inefficiencies. The first thing
that stands out is that GCC supports both caller-saves and spilling. Spilling
seems to spill all definitions and all uses of a live range. This means you
often end up with multiple reloads close together, while it would be more
efficient to do a single load and then reuse the loaded value several times.
Caller-save does better in that case, but it is inefficient in that it
repeatedly stores registers across every call even if unchanged. If both were
fixed to minimise the number of loads/stores, I can't see how one could beat
the other, so you'd no longer need both.

Anyway due to the current implementation there are clearly cases where 
caller-save is best and cases
where spilling is best. However I do not see it making the correct decision 
despite trying to
account for the costs - some code is significantly faster with 
-fno-caller-saves, other code wins
with -fcaller-saves. As an example, I see code like this on AArch64:

ldr  s4, .LC20
fmul s0, s0, s4
str  s4, [x29, 104]
bl   f
ldr  s4, [x29, 104]
fmul s0, s0, s4

With -fno-caller-saves it spills and rematerializes the constant as you'd expect:

ldr  s1, .LC20
fmul s0, s0, s1
bl   f
ldr  s5, .LC20
fmul s0, s0, s5

So given this, is the cost calculation correct and does it include 
rematerialization? The spill code
understands how to rematerialize so it should take this into account in the 
costs. I did find some
code in ira-costs.c in scan_one_insn() that attempts something that looks like 
an adjustment for
rematerialization but it doesn't appear to handle all cases (simple immediates, 
2-instruction
immediates, address-constants, non-aliased loads such as literal pool and const 
data loads).

Also the hook CALLER_SAVE_PROFITABLE appears to have disappeared - overall 
performance improves
significantly if I add this (basically the default heuristic used on 
instruction frequencies):

--- a/gcc/ira-costs.c
+++ b/gcc/ira-costs.c
@@ -2230,6 +2230,8 @@ ira_tune_allocno_costs (void)
 		   * ALLOCNO_FREQ (a)
 		   * IRA_HARD_REGNO_ADD_COST_MULTIPLIER (regno) / 2);
 #endif
+	  if (ALLOCNO_FREQ (a) < 4 * ALLOCNO_CALL_FREQ (a))
+	    cost = INT_MAX;
 	}
       if (INT_MAX - cost < reg_costs[j])
 	reg_costs[j] = INT_MAX;

If such a simple heuristic can beat the costs, they can't be quite right. 

Is there anyone who understands the cost calculations?

Wilco




RE: Register allocation: caller-save vs spilling

2014-09-04 Thread Wilco Dijkstra
Hi Vlad,

I added you directly in case you hadn't spotted my original post.

A simple example for AArch64 trunk is as follows:

// Compile with: -O2 -fomit-frame-pointer -ffixed-d8 -ffixed-d9 -ffixed-d10
//   -ffixed-d11 -ffixed-d12 -ffixed-d13 -ffixed-d14 -ffixed-d15
//   -f(no-)caller-saves
void g(void);

float f(float x)
{
  x += 3.0;
  g();
  x *= 3.0;
  return x;
}

It seems that reload only ever considers rematerialization of spilled live
ranges, not caller-saved ones. That means the caller-save code should either
reject constants outright, or the memory spill cost for these should always be
lower than that of a caller-save (given memory_move_cost=4 and
register_move_cost=2 as commonly used by targets, anything that can be
rematerialized should have less than half the cost of being spilled or
caller-saved).

Wilco

> -Original Message-
> From: Wilco Dijkstra [mailto:wdijk...@arm.com]
> Sent: 27 August 2014 17:25
> To: 'gcc@gcc.gnu.org'
> Subject: Register allocation: caller-save vs spilling
> 
> Hi,
> 
> I'm investigating various register allocation inefficiencies. The first thing 
> that stands out
> is that GCC both supports caller-saves as well as spilling. Spilling seems to 
> spill all
> definitions and all uses of a liverange. This means you often end up with 
> multiple reloads
> close together, while it would be more efficient to do a single load and then 
> reuse the loaded
> value several times. Caller-save does better in that case, but it is 
> inefficient in that it
> repeatedly stores registers across every call even if unchanged. If both were 
> fixed to
> minimise the number of loads/stores I can't see how one could beat the other, 
> so you'd no
> longer need both.
> 
> Anyway due to the current implementation there are clearly cases where 
> caller-save is best and
> cases where spilling is best. However I do not see it making the correct 
> decision despite
> trying to account for the costs - some code is significantly faster with 
> -fno-caller-saves,
> other code wins with -fcaller-saves. As an example, I see code like this on 
> AArch64:
> 
> ldr  s4, .LC20
> fmul s0, s0, s4
> str  s4, [x29, 104]
> bl   f
> ldr  s4, [x29, 104]
> fmul s0, s0, s4
> 
> With -fno-caller-saves it spills and rematerializes the constant as you'd 
> expect:
> 
> ldr  s1, .LC20
> fmul s0, s0, s1
> bl   f
> ldr  s5, .LC20
> fmul s0, s0, s5
> 
> So given this, is the cost calculation correct and does it include 
> rematerialization? The
> spill code understands how to rematerialize so it should take this into 
> account in the costs.
> I did find some code in ira-costs.c in scan_one_insn() that attempts 
> something that looks like
> an adjustment for rematerialization but it doesn't appear to handle all cases 
> (simple
> immediates, 2-instruction immediates, address-constants, non-aliased loads 
> such as literal
> pool and const data loads).
> 
> Also the hook CALLER_SAVE_PROFITABLE appears to have disappeared - overall 
> performance
> improves significantly if I add this (basically the default heuristic used on 
> instruction
> frequencies):
> 
> --- a/gcc/ira-costs.c
> +++ b/gcc/ira-costs.c
> @@ -2230,6 +2230,8 @@ ira_tune_allocno_costs (void)
>  		   * ALLOCNO_FREQ (a)
>  		   * IRA_HARD_REGNO_ADD_COST_MULTIPLIER (regno) / 2);
>  #endif
> +	  if (ALLOCNO_FREQ (a) < 4 * ALLOCNO_CALL_FREQ (a))
> +	    cost = INT_MAX;
>  	}
>        if (INT_MAX - cost < reg_costs[j])
>  	reg_costs[j] = INT_MAX;
> 
> If such a simple heuristic can beat the costs, they can't be quite right.

Note: if (ALLOCNO_FREQ (a) < 2 * ALLOCNO_CALL_FREQ (a)) turns out to be best
overall.

> Is there anyone who understands the cost calculations?
> 
> Wilco




IRA preferencing issues

2015-04-17 Thread Wilco Dijkstra
Hi,

While investigating why the IRA preferencing algorithm often chooses incorrect 
preferences from the
costs, I noticed this thread: https://gcc.gnu.org/ml/gcc/2011-05/msg00186.html

I am seeing the exact same issue on AArch64 - during the final preference
selection, ira-costs takes the union of any register classes that happen to
have equal cost. As a result many registers get ALL_REGS as the preferred
class even though its cost is much higher than either GENERAL_REGS or
FP_REGS. So we end up with lots of scalar SIMD instructions and expensive
int<->FP moves in integer code when register pressure is high. When the
preference is computed correctly as in the proposed patch (choosing the first
class with the lowest cost, i.e. GENERAL_REGS), the resulting code is much
more efficient, and there are no spurious SIMD instructions.

Choosing a preferred class when it doesn't have the lowest cost is clearly 
incorrect. So is there a
good reason why the proposed patch should not be applied? I actually wonder why 
we'd ever need to do
a union - if there are 2 classes with equal cost, you'd use the 2nd as the 
alternative class.

The other question I had is whether there is a good way to improve the
preference in cases like this and avoid classes with equal cost altogether.
The costs are clearly not equal: scalar SIMD instructions have higher latency
and require extra int<->FP moves. It is possible to mark variants in the MD
patterns using '?' to discourage them, but that seems like a hack, just like
'*'. Is there a general way to say that GENERAL_REGS is preferred over
FP_REGS for SI/DI mode?

Wilco




RE: IRA preferencing issues

2015-04-17 Thread Wilco Dijkstra
> Matthew Fortune wrote:
> Wilco Dijkstra  writes:
> > While investigating why the IRA preferencing algorithm often chooses
> > incorrect preferences from the costs, I noticed this thread:
> > https://gcc.gnu.org/ml/gcc/2011-05/msg00186.html
> >
> > I am seeing the exact same issue on AArch64 - during the final
> > preference selection ira-costs takes the union of any register classes
> > that happen to have equal cost. As a result many registers get ALL_REGS
> > as the preferred register even though its cost is much higher than either
> > GENERAL_REGS or FP_REGS. So we end up with lots of scalar SIMD
> > instructions and expensive int<->FP moves in integer code when register
> > pressure is high. When the preference is computed correctly as in the
> > proposed patch (choosing the first class with lowest cost, ie.
> > GENERAL_REGS) the resulting code is much more efficient, and there are
> > no spurious SIMD instructions.
> >
> > Choosing a preferred class when it doesn't have the lowest cost is
> > clearly incorrect. So is there a good reason why the proposed patch
> > should not be applied? I actually wonder why we'd ever need to do a
> > union - if there are 2 classes with equal cost, you'd use the 2nd as the
> > alternative class.
> >
> > The other question I had is whether there is a good way to improve
> > the preference in cases like this and avoid classes with equal cost
> > altogether. The costs are clearly not equal: scalar SIMD instructions
> > have higher latency and require extra int<->FP moves. It is possible to
> > mark variants in the MD patterns using '?' to discourage them but that
> > seems like a hack, just like '*'. Is there a general way to say that
> > GENERAL_REGS is preferred over FP_REGS for SI/DI mode?
> 
> MIPS has the same problem here and we have been looking at ways to address
> it purely via costings rather than changing IRA. What we have done so
> far is to make the cost of a move from GENERAL_REGS to FP_REGS more
> expensive than memory if the move has an integer mode. The goal for MIPS
> is to never allocate an FP register to an integer mode unless it was
> absolutely necessary owing to an integer to fp conversion where the
> integer has to be put in an FP register. Ideally I'd like a guarantee
> that FP registers will never be used unless a floating point type is
> present in the source but I haven't found a way to do that given the
> FP-int conversion issue requiring SImode to be allowed in FP regs.

I adjusted the costs like that already on AArch64, but while this reduced
the crazy spilling of integer values to FP registers and vice versa, it
doesn't fix it completely. However it should not be necessary to lie about
the move cost and use an unrealistically high value to get decent code...

There are other issues in ira-costs that cause preferences to be incorrect:
you'll find that after you increase the move costs that explicit int<->fp
moves start to go via memory due to memory cost being hardcoded as 1 if an
instruction pattern contains 'm' somewhere - oops... I also posted a patch 
that fixes the preference for new registers created by live-range splitting.

> The patch for MIPS is not submitted yet but has eliminated the final
> two uses of FP registers when building the whole Linux kernel with
> hard-float enabled. I am however still not confident enough to say
> you can build integer only code with hard-float and never touch an FP
> register.

Correct, as long as the preference calculations are not correct and there
is no good way to influence the costs reliably, GCC will continue to use
FP registers in cases when it shouldn't. It's obvious that integer
operations should prefer integer registers and FP operations FP registers,
so why is there no easy way to tell GCC?!?

> Since there are multiple architectures suffering from this I guess we
> should look at properly addressing it in generic code.

Agreed.

Wilco




RE: IRA preferencing issues

2015-04-20 Thread Wilco Dijkstra

Interestingly even when the preferences are accurate, lra_constraints
completely ignores the preferred/allocno class. If the cost of 2 alternatives
is equal in every way (which will be the case if they are both legal matches
as the standard cost functions are not used at all), the wrong one may be 
chosen depending on the order in the MD file. This is particularly bad when the
spill optimization pass later removes some of the spills and correctly allocates
them to their preferred allocno class...

Forcing win = true if the register class of the alternative intersects with 
the preferred class generates significantly better spill code for cases where
the preference is accurate (ie. not just ALL_REGS), resulting in far less
confusion between integer and FP registers. 

So shouldn't get_reg_class return the preference/allocno class like below rather
than NO_REGS?

diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
index 0ddd842..f38914a 100644
--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -263,7 +263,10 @@ get_reg_class (int regno)
     }
   if (regno >= new_regno_start)
     return lra_get_allocno_class (regno);
-  return NO_REGS;
+  return reg_preferred_class (regno);
 }

Wilco





RFC: Creating a more efficient sincos interface

2018-09-13 Thread Wilco Dijkstra
Hi,

The existing sincos functions use 2 pointers to return the sine and cosine 
result. In
most cases 4 memory accesses are necessary per call. This is inefficient and 
often
significantly slower than returning values in registers. I ran a few 
experiments on the
new optimized sincosf implementation in GLIBC using the following interface:

__complex__ float sincosf2 (float);

This has 50% higher throughput and a 25% reduction in latency on Cortex-A72 for
random inputs in the range +-PI/4. Larger inputs take longer and thus have lower
gains, but there is still a 5% gain on the (rarely used) path with full range 
reduction.
Given sincos is used in various HPC applications, this can give a worthwhile
speedup.

LLVM already supports something similar for OSX using a struct of 2 floats.
Using complex float is better since not all targets may support returning 
structures in
floating point registers and GCC generates very inefficient code on targets 
that do
(PR86145).

What do people think? Ideally I'd like to support this in a generic way so all 
targets can
benefit, but it's also feasible to enable it on a per-target basis. Also since 
not all libraries
will support the new interface, there would have to be a flag or configure 
option to switch
the new interface off if not supported (maybe automatically based on the math.h 
header).

Wilco

Re: "multiple definition of symbols" when linking executables on ARM32 and AArch64

2020-01-06 Thread Wilco Dijkstra
On 06.01.20 11:03, Andrew Pinski wrote:
> +GCC
> 
> On Mon, Jan 6, 2020 at 1:52 AM Matthias Klose  wrote:
>>
>> In an archive test rebuild with binutils and GCC trunk, I see a lot of build
>> failures on both aarch64-linux-gnu and arm-linux-gnueabihf failing with
>> "multiple definition of symbols" when linking executables, e.g.
> 
> THIS IS NOT A BINUTILS OR GCC BUG.
> GCC changed the default to -fno-common.
> It seems like for some reason, your non-aarch64/arm builds had changed
> the default back to being with -fcommon turned on.

> what would that be?  I'm not aware of any active change doing that.  Packages
> build on x86, ppc64el and s390x at least.

Well if you want to build old archived code using latest GCC then you may need 
to
force -fcommon just like you need to add many warning disables. Maybe you were
using an older GCC for the other targets? As Andrew notes, this isn't 
Arm-specific.

Wilco


Re: "multiple definition of symbols" when linking executables on ARM32 and AArch64

2020-01-06 Thread Wilco Dijkstra
Hi,

> However, this is an undocumented change in the current NEWS, and seeing
> literally hundreds of package failures, I doubt that's the right thing to
> do, at least without any deprecation warning first.  Could that be handled,
> deprecating in GCC 10 first, and then changing that for GCC 11?

This change was first proposed for GCC8, and rejected because of failures in the
distros. Two years have passed, and there are still failures... Would this 
change if
we postpone it even longer? My feeling is that nobody is going to actively fix 
their
code if the default isn't changed first.

> It is hard to get a warning for things like this.

Could the linker warn whenever it merges common symbols or would that give
many false positives?

Wilco

Re: [ARM] LLVM's -arm-assume-misaligned-load-store equivalent in GCC?

2020-01-09 Thread Wilco Dijkstra
Hi Christophe,

> Actually I got a confirmation of what I suspected: the offending function 
> foo()
> is part of ARM CMSIS libraries, although the users are able to recompile them,
> they don't want to modify that source code. Having a compilation option to
> avoid generating problematic code sequences would be OK for them.

Well if LDRD instructions are incorrectly generated in those libraries for
unaligned accesses, then why not report it as a CMSIS bug? Adding a complex
new option like this (with high impact on code quality) to work around what
seems a simple bug is way overkill.

> So from the user's perspective, the wrong code is part of a 3rd party library
> which they can recompile but do not want to modify.

Would we expect every user of CMSIS to find this bug, find its cause, figure out
that a future GCC may have a new special option to avoid it, wait until that GCC
is released and then recompile?

Really, the best option is to report the bug and modify the source until it is fixed.

Cheers,
Wilco


Re: help with PR78809 - inline strcmp for small constant strings

2017-08-04 Thread Wilco Dijkstra
Richard Henderson wrote:    
> On 08/04/2017 05:59 AM, Prathamesh Kulkarni wrote:

> > For i386, it seems strcmp is expanded inline via cmpstr optab by 
> > expand_builtin_strcmp if one of the strings is constant. Could we similarly
> > define cmpstr pattern for AArch64?
> 
> Certainly that's possible.

I'd suggest doing this in a target-independent way: this is not a
target-specific optimization and shouldn't be done in the target unless there
are special strcmp instructions.

> For constant strings of small length (upto 3?), I was wondering if it'd be a
> good idea to manually unroll strcmp loop, similar to __strcmp_* macros in
> bits/string.h?>
> For eg in gimple-fold, transform
> x = __builtin_strcmp(s, "ab")
> to
> x = s[0] - 'a';
> if (x == 0)
> {
>   x = s[1] - 'b';
>   if (x == 0)
> x = s[2];
> }

If there is already code that does something similar (see comment #1 in 
PR78809),
it could be easily adapted to handle more cases.

>  if (memcmp(s, "ab", 3) != 0)
>
> to be implemented with cmp+ccmp+ccmp and one branch.

Even better would be wider loads if you know either the alignment of s or its
maximum size (although given the overhead of creating the return value, that
works best for equality comparisons).

Wilco

RFC: Improving GCC8 default option settings

2017-09-12 Thread Wilco Dijkstra
Hi all,

At the GNU Cauldron I was inspired by several interesting talks about improving
GCC in various ways. While GCC has many great optimizations, a common theme is
that its default settings are rather conservative. As a result users are 
required to enable several additional optimizations by hand to get good code.
Other compilers enable more optimizations at -O2 (loop unrolling in LLVM was
mentioned repeatedly) which GCC could/should do as well.

Here are a few concrete proposals to improve GCC's option settings which will
enable better code generation for most targets:

* Make -fno-math-errno the default - this mostly affects the code generated for
  sqrt, which should be treated just like floating point division and not set
  errno by default (unless you explicitly select C89 mode).

* Make -fno-trapping-math the default - another obvious one. From the docs:
  "Compile code assuming that floating-point operations cannot generate 
   user-visible traps."
  There isn't a lot of code that actually uses user-visible traps (if any -
  many CPUs don't even support user traps as it's an optional IEEE feature). 
  So assuming trapping math by default is way too conservative since there is
  no obvious benefit to users. 

* Make -fno-common the default - this was originally needed for pre-ANSI C, but
  is optional in C (not sure whether it is still in C99/C11). This can
  significantly improve code generation on targets that use anchors for globals
  (note the linker could report a more helpful message when ancient code that
  requires -fcommon fails to link).

* Make -fomit-frame-pointer the default - various targets already do this at
  higher optimization levels, but this could easily be done for all targets.
  Frame pointers haven't been needed for debugging for decades, however if
  there are still good reasons to keep them enabled at -O0 or -O1 (I can't
  think of any unless it is for a last-resort backtrace when there is no
  unwind info at a crash), we could just disable the frame pointer from -O2
  onwards.

These are just a few ideas to start. What do people think? I'd welcome 
discussion
and other proposals for similar improvements.

Wilco


Re: [RFC] type promotion pass

2017-09-15 Thread Wilco Dijkstra
Hi Prathamesh,

I've tried out the latest version and it works really well. It built and ran 
SPEC2017 without any issues or regressions (I didn't do a detailed comparison 
which would mean multiple runs, however a single run showed performance is 
pretty much the same on INT and 0.1% faster on FP). 

Codesize reduces in almost all cases (only xalancbmk increases by 600 bytes), 
sometimes by a huge amount. For example in gcc_r around 20% of all AND 
immediate instructions are removed, clear proof it removes many redundant 
zero/sign extensions.

So consider this a big +1 from me! GCC is behind other compilers with respect 
to this kind of optimization and it looks like this phase does a major catchup. 
Like I mentioned, it doesn't have to be 100% perfect, once it has been 
committed, we can fine tune it and add more optimizations.

Wilco


Re: [RFC] type promotion pass

2017-09-15 Thread Wilco Dijkstra
David Edelsohn wrote:

> Why does AArch64 define PROMOTE_MODE as SImode?  GCC ports for other
> RISC targets mostly seem to use a 64-bit mode.  Maybe SImode is the
> correct definition based on the current GCC optimization
> infrastructure, but this seems like a change that should be applied to
> all 64 bit RISC targets.

The reason is that AArch64 supports both 32-bit and 64-bit registers, so when
using char/short you want 32-bit operations. There is an issue in that
WORD_REGISTER_OPERATIONS isn't set on AArch64, but it should be. Maybe that
requires some cleanups to ensure it correctly interacts with PROMOTE_MODE.
There are way too many confusing target defines like this and no general
mechanism that just works like you'd expect. Promoting to an orthogonal set of
registers is not particularly unusual, so it's something GCC should support
well by default...

Wilco



Re: Possible gcc 4.8.5 bug about RELOC_HIDE marcro in latest kernel code

2017-09-21 Thread Wilco Dijkstra
Hi Justin,

> I tried centos 7.4 gcc 4.8.5-16, which seems to announce to fix this issue.
> And I checked the source code, the patch had been included in.
> But no luck, the bug is still there.
>
> Could you please please any advice to me? eg. Is there any ways to disable 
> such
> reload compilation procedure?

Reload is an intrinsic part of register allocation in GCC; it cannot be
disabled.

My advice would be to use GCC7 - there are many more issues in GCC4.8
which you will run into sooner or later. I've done many backports for AArch64 
and
generally stopped at GCC6, so please don't consider using anything older.

A recent GCC will also generate MUCH more efficient code for AArch64. The
same is true for GLIBC. So why use something as ancient as GCC4.8???

Wilco



Re: Possible gcc 4.8.5 bug about RELOC_HIDE marcro in latest kernel code

2017-09-21 Thread Wilco Dijkstra
Hi Justin,

> The 4.8.5 is default gcc version for centos 7.x

If there is no newer version available you should talk to your distro. 
It is worth reporting this bug to them as more of their users may be
affected by it.

Wilco


Re: "GOT" under aarch64

2017-09-22 Thread Wilco Dijkstra
Hi,

You'll get GOT relocations to globals when you use -fpic:

int x;
int f(void) { return x; }

> gcc -O2 -S -o- -fpic

f:
	adrp	x0, :got:x
	ldr	x0, [x0, #:got_lo12:x]
	ldr	w0, [x0]
	ret

So it doesn't depend on the compiler but on the options you compile with.
There may be an issue with your setup - -fpic shouldn't be on by default.
Use gcc -v -Q -c testfile.c to list all the default settings - there could be
more non-standard or inefficient options enabled.

Wilco

Re: Potential bug on Cortex-M due to used registers/interrupts.

2017-11-17 Thread Wilco Dijkstra
Hi,

> These other registers - r4 to r12 - are "callee saved".

To be precise, R4-R11 are callee-saved, R0-R3, R12, LR are caller-saves
and LR and PSR are clobbered by calls. LR is slightly odd in that it is
a callee-save in the prolog, but not in the epilog (since LR is assumed
clobbered after a call, it doesn't need to be restored, so you can use
pop {regs,PC} to return).

Cortex-M hardware will automatically save/restore R0-R3, R12, LR, PC, PSR
on interrupts. That perfectly matches the caller-saves and clobbered
registers, so there is no potential bug.

Wilco


Fortran array slices and -frepack-arrays

2018-04-13 Thread Wilco Dijkstra
Hi,

I looked at a few performance anomalies between gfortran and Flang - it appears 
array slices
are treated differently. Using -frepack-arrays fixed a performance issue in 
gfortran and didn't
cause any regressions. Making input array slices contiguous helps both locality 
and enables
more vectorization.

So I wonder whether it should be made the default (-O3 or just -Ofast)? 
Alternatively would
it be feasible in Fortran to version functions or loops if all arguments are 
contiguous slices?

Wilco

Re: Fortran array slices and -frepack-arrays

2018-04-13 Thread Wilco Dijkstra
Bin.Cheng wrote:
  
> I don't know the implementation of the option, so two questions:
> 1) When the repack is done during compilation?  Is new code
> manipulating data layout added
> by frontend?  If yes, better to do it during optimization thus is
> can be on demanding?  This
> looks like one case of data layout transformation.  Not sure if
> there is enough information
> to do that in optimizer.

Yes it adds a runtime check at function entry and packs array slices which
have a non-unity step. Currently it uses a call to _gfortran_internal_pack,
however this could be inlined and use alloca rather than malloc for small 
slices.

It might be possible to check which parameters are used a lot (or benefit
from vectorization) and only pack those.

> 2) For now, does this option force array repacking unconditionally?  I
> think it won't be too hard
> to model when such data layout transformation is beneficial by
> looking at loop (nest) accessing
> the array and comparing against the overhead.

Yes, it ensures all slices are packed, but that isn't strictly necessary.

>> it be feasible in Fortran to version functions or loops if all arguments are 
>> contiguous slices?
> I think a cost model is still needed for function/loop versioning.

Absolutely. If you statically know at the call site that all slices are
contiguous, you could compile a version of the function using the contiguous
attribute and skip all runtime checks. Such function versioning would require
LTO to work well.

Wilco

Re: How to get GCC on par with ICC?

2018-06-15 Thread Wilco Dijkstra
Martin wrote:

> Keep in mind that when discussing FP benchmarks, the used math library
> can be (almost) as important as the compiler.  In the case of 481.wrf,
> we found that the GCC 8 + glibc 2.26 (so the "out-of-the box" GNU)
> performance is about 70% of ICC's.  When we just linked against AMD's
> libm, we got to 83%. When we instructed GCC to generate calls to Intel's
> SVML library and linked against it, we got to 91%.  Using both SVML and
> AMD's libm, we achieved 93%.
>
> That means that there likely still is 7% to be gained from more clever
> optimizations in GCC but the real problem is in GNU libm.  And 481.wrf
> is perhaps the most extreme example but definitely not the only one.

You really should retry with GLIBC 2.27 since several key math functions were
rewritten from scratch by Szabolcs Nagy (all in generic C code), resulting in
huge performance gains on all targets (e.g. wrf improved by over 50%).

I fixed several double precision functions in current GLIBC to avoid extremely 
bad
performance which had been complained about for years. There are more math
functions on the way, so the GNU libm will not only catch up, but become the 
fastest
math library available.

Wilco

Missing optimization: mempcpy(3) vs memcpy(3)

2022-12-12 Thread Wilco Dijkstra via Gcc
Hi,

I don't believe there is a missing optimization here: compilers expand mempcpy
by default into memcpy since that is the standard library call. That means even
if your source code contains mempcpy, there will never be any calls to mempcpy.

The reason is obvious: most targets support optimized memcpy in the C library
while very few optimize mempcpy. The same is true for bzero, bcmp and bcopy.

Targets can do it differently; IIRC x86 is the only target that emits calls to
both memcpy and mempcpy.

Cheers,
Wilco