[Bug target/94121] ICE on aarch64-linux-gnu: in abs_hwi, at hwint.h:324

2020-03-10 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94121

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #3 from Wilco  ---
(In reply to Jakub Jelinek from comment #2)
> Comment on attachment 48010 [details]
> [PATCH PR94121] fix ICE on aarch64 in abs_hwi, at hwint.h:324
> 
> Can't you instead just use absu_hwi instead of abs_hwi and change moffset
> type to unsigned HOST_WIDE_INT?  With the latter, the moffset < 0x100
> comparison will DTRT and for HOST_WIDE_INT_MIN DImode it really doesn't
> matter if we use addition or subtraction.

Yes that works fine.

[Bug middle-end/94172] [arm-none-eabi] ICE in expand_debug_locations, at cfgexpand.c:5403

2020-03-16 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94172

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
It's a generic issue with -fshort-enums (which is the default for arm-none-eabi)
- it has failed since GCC6 on every target.

[Bug middle-end/94172] [arm-none-eabi] ICE in expand_debug_locations, at cfgexpand.c:5403

2020-03-16 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94172

--- Comment #6 from Wilco  ---
(In reply to Jakub Jelinek from comment #3)
> Can't reproduce on the trunk, neither on x86_64-linux with -Os -g3
> -fshort-enums, nor on arm-linux-gnueabi with -Os -g3 -fshort-enums
> -mcpu=cortex-m0 -mthumb

I tried -O2 -g -fshort-enums and this fails on AArch64:

extern enum my_enum_type extern_enum;
extern void bar(int a);

enum my_enum_type {
my_entry
};

void g(void);
void foo(int a)
{
  int local_enum = extern_enum;

  if (a) {
    g();
    local_enum = 0;
  }
  bar(local_enum);
}

The issue is the placement of the extern enum declaration. Move it after the
enum type definition and all is well - the assert in cfgexpand seems not to
allow the SI/QI mode combination.
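
For reference, the variant that avoids the ICE (same testcase with the extern
declaration moved after the enum type definition):

enum my_enum_type {
  my_entry
};

extern enum my_enum_type extern_enum;
extern void bar(int a);

void g(void);
void foo(int a)
{
  int local_enum = extern_enum;

  if (a) {
    g();
    local_enum = 0;
  }
  bar(local_enum);
}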

[Bug debug/94502] [aarch64] Missing LR register location in FDE

2020-04-08 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94502

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
(In reply to Luis Machado from comment #1)
> CC-ing ARM folks so they can assign this to whoever is more appropriate.

Can you list the assembly code? My understanding is that unless LR is saved
there is no entry needed as the default action is to return to LR.

[Bug debug/94502] [aarch64] Missing LR register location in FDE

2020-04-08 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94502

--- Comment #4 from Wilco  ---
(In reply to Luis Machado from comment #3)

> The lack of a rule for LR means GDB will assume the register is UNSPECIFIED.
> Is GCC assuming this register is considered to have the same value as an
> inner frame?

Right so it's a leaf function like I suspected. The default rule has always
been to use the return register LR if it isn't stored (and that doesn't change
if you adjust the stack). Leaf functions have always worked, so I'm surprised
you are seeing an issue.

[Bug tree-optimization/91322] [10 regression] g++.dg/lto/alias-4_0.C test failure

2020-04-09 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322

--- Comment #15 from Wilco  ---
(In reply to Richard Biener from comment #14)

> So I'm quite sure the missed optimization isn't a regression?  (can somebody
> quickly check GCC 9 whether the testcase is optimized there on ARM?)

It fails on both AArch64 and Arm all the way back to GCC6 (the oldest compiler I
tried). So it's not a regression, but this is all target-independent, so I
wouldn't expect it to fail.

[Bug target/94538] [9/10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data

2020-04-09 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538

Wilco  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 CC||wilco at gcc dot gnu.org
   Last reconfirmed||2020-04-09
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org

--- Comment #1 from Wilco  ---
Thanks for the concise testcase. -mslow-flash-data disables MOVW generation,
which is exactly the opposite of what we want. I'm testing a trivial fix.

[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data

2020-04-09 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538

Wilco  changed:

   What|Removed |Added

Summary|[9/10 Regression] ICE: in   |[10 Regression] ICE: in
   |extract_constrain_insn_cach |extract_constrain_insn_cach
   |ed, at recog.c:2223 (insn   |ed, at recog.c:2223 (insn
   |does not satisfy its|does not satisfy its
   |constraints) with   |constraints) with
   |-mcpu=cortex-m23|-mcpu=cortex-m23
   |-mslow-flash-data   |-mslow-flash-data

--- Comment #2 from Wilco  ---
This was introduced by commit e24f6408d, so it's only in GCC10.

[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data

2020-04-09 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538

--- Comment #4 from Wilco  ---
(In reply to Zdenek Sojka from comment #3)
> (In reply to Wilco from comment #2)
> > This was introduced by commit e24f6408d so only in GCC10.
> 
> Thank you for checking this!
> 
> I am quite sure this fails in gcc-9 as well:
...
> Perhaps the offending commit, or part of it, was backported to gcc-9 as well?

It's possible it was recently backported and our GCC9 builds don't yet have it.
But that whole patch is badly broken and introduces multiple issues...

[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data

2020-04-14 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538

--- Comment #7 from Wilco  ---
(In reply to Wilco from comment #4)
> (In reply to Zdenek Sojka from comment #3)
> > (In reply to Wilco from comment #2)
> > > This was introduced by commit e24f6408d so only in GCC10.
> > 
> > Thank you for checking this!
> > 
> > I am quite sure this fails in gcc-9 as well:
> ...
> > Perhaps the offending commit, or part of it, was backported to gcc-9 as 
> > well?
> 
> It's possible it was recently backported and our GCC9 builds don't yet have
> it. But that whole patch is badly broken and introduces multiple issues...

Adding Christophe. I'm thinking the best approach right now is to revert given
-mpure-code doesn't work at all on Thumb-1 targets - it still emits literal
pools, switch tables etc. That's not pure code!

[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data

2020-04-14 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538

--- Comment #10 from Wilco  ---
(In reply to Christophe Lyon from comment #8)
> > Adding Christophe. I'm thinking the best approach right now is to revert
> > given -mpure-code doesn't work at all on Thumb-1 targets - it still emits
> > literal pools, switch tables etc. That's not pure code!
> 
> Do you have testcases that show these failures?
> 
> I did check some of the problematic testcases in the GCC testsuite when I
> committed that patch. Did I miss some of them?
> 
> Can you point me to offending testcases and compiler options so that I can
> reproduce them?

For example:

int x;
int f1 (void) { return x; }

with eg. -O2 -mcpu=cortex-m0 -mpure-code I get:

movs    r3, #:upper8_15:#.LC1
lsls    r3, #8
adds    r3, #:upper0_7:#.LC1
lsls    r3, #8
adds    r3, #:lower8_15:#.LC1
lsls    r3, #8
adds    r3, #:lower0_7:#.LC1
@ sp needed
ldr     r3, [r3]
ldr     r0, [r3, #40]
bx      lr

That's an extra indirection through a literal... There should only be one ldr
to read x.

Big switch tables are produced for any Thumb-1 core; however, I would expect the
Cortex-M0/M23 versions to look almost identical to the Cortex-M3 one and use a
sequence of comparisons instead of tables.

int f2 (int x, int y)
{
  switch (x)
  {
    case 0: return y + 0;
    case 1: return y + 1;
    case 2: return y + 2;
    case 3: return y + 3;
    case 4: return y + 4;
    case 5: return y + 5;
  }
  return y;
}

Immediate generation for common cases seems to be screwed up:

int f3 (void) { return 0x11000000; }

-O2 -mcpu=cortex-m0 -mpure-code:

movs    r0, #17
lsls    r0, r0, #8
lsls    r0, r0, #8
lsls    r0, r0, #8
bx      lr

This also regressed Cortex-m23 which previously generated:

movs    r0, #136
lsls    r0, r0, #21
bx      lr

Similar regressions happen with other immediates:

int f3 (void) { return 0x12345678; }

-O2 -mcpu=cortex-m23 -mpure-code:

movs    r0, #86
lsls    r0, r0, #8
adds    r0, r0, #120
movt    r0, 4660
bx      lr

Previously it was:

movw    r0, #22136
movt    r0, 4660
bx      lr

Also relocations with a small offset should be handled within the relocation.
I'd expect this to never generate an extra addition, let alone an extra literal
pool entry:

int arr[10];
int *f4 (void) { return &arr[1]; }

-O2 -mcpu=cortex-m3 -mpure-code generates the expected:

movw    r0, #:lower16:.LANCHOR0+4
movt    r0, #:upper16:.LANCHOR0+4
bx      lr

-O2 -mcpu=cortex-m23 -mpure-code generates this:

movw    r0, #:lower16:.LANCHOR0
movt    r0, #:upper16:.LANCHOR0
adds    r0, r0, #4
bx      lr

And cortex-m0 again inserts an extra literal load:

movs    r3, #:upper8_15:#.LC0
lsls    r3, #8
adds    r3, #:upper0_7:#.LC0
lsls    r3, #8
adds    r3, #:lower8_15:#.LC0
lsls    r3, #8
adds    r3, #:lower0_7:#.LC0
ldr     r0, [r3]
adds    r0, r0, #4
bx      lr

[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data

2020-04-16 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538

--- Comment #13 from Wilco  ---
(In reply to Christophe Lyon from comment #12)
> I've posted a patch to fix the regression for your f3() examples:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543993.html

Yes, that improves some of the examples, but I still see regressions on m0 for
e.g. -1 or 511. The splitter must have the right priority or it will continue to
cause regressions. That means placing it around line 790, after the existing
movsi splitters.

[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data

2020-04-16 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538

--- Comment #14 from Wilco  ---
(In reply to Christophe Lyon from comment #11)
> (In reply to Wilco from comment #10)

> Right, but the code is functional.

It doesn't avoid the literal load from flash, which is exactly what pure-code
and slow-flash-data are all about.

That brings me to another question: why does -mslow-flash-data give an error
when used with Cortex-M0? If the goal was to make things more orthogonal, then
surely we should not give an error and should allow both options on all M-class
cores.

> I believe this is expected: as I wrote in my commit message
> "CASE_VECTOR_PC_RELATIVE is now false with -mpure-code, to avoid generating
> invalid assembly code with differences from symbols from two different
> sections (the difference cannot be computed by the assembler)."
> 
> Maybe there's a possibility to tune this to detect cases where we can do
> better?

The best option is to do the same as Cortex-M3: just switch off branch tables
altogether and fall back to compare+branch. That completely avoids loading data
from flash and is always smaller than emitting 32-bit tables.


> > Also relocations with a small offset should be handled within the
> > relocation. I'd expect this to never generate an extra addition, let alone
> > an extra literal pool entry:
> > 
> > int arr[10];
> > int *f4 (void) { return &arr[1]; }
> > 
> > -O2 -mcpu=cortex-m3 -mpure-code generates the expected:
> > 
> > movw    r0, #:lower16:.LANCHOR0+4
> > movt    r0, #:upper16:.LANCHOR0+4
> > bx      lr
> > 
> > -O2 -mcpu=cortex-m23 -mpure-code generates this:
> > 
> > movw    r0, #:lower16:.LANCHOR0
> > movt    r0, #:upper16:.LANCHOR0
> > adds    r0, r0, #4
> > bx      lr
> 
> For cortex-m23, I get the same code with and without -mpure-code.

GCC9 emits something different for Cortex-M0 and Cortex-M23 so this was changed
by the patch somehow even when -mpure-code is not enabled. So it is a
regression from what we used to generate.

Similarly I would not expect pure-code to change the decision of whether to
emit immediates as part of the relocation as long as they are within range. The
existing implementation for Cortex-M3 does this correctly.

[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data

2020-04-17 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538

--- Comment #16 from Wilco  ---
(In reply to Christophe Lyon from comment #15)
> (In reply to Wilco from comment #14)
> > (In reply to Christophe Lyon from comment #11)
> > > (In reply to Wilco from comment #10)
> > 
> > > Right, but the code is functional.
> > 
> > It doesn't avoid the literal load from flash which is exactly what pure-code
> > and slow-flash-data is all about.
> 
> For f1 on M0, I can see:
> .section        .rodata.cst4,"aM",%progbits,4
> .align  2
> .LC0:
> .word   .LANCHOR0
> .section .text,"0x2006",%progbits
> [...]
> f1:
> movs    r3, #:upper8_15:#.LC0
> lsls    r3, #8
> adds    r3, #:upper0_7:#.LC0
> lsls    r3, #8
> adds    r3, #:lower8_15:#.LC0
> lsls    r3, #8
> adds    r3, #:lower0_7:#.LC0
> ldr     r3, [r3]        @ 6 [c=10 l=2]  *thumb1_movsi_insn/8
> ldr     r0, [r3]        @ 7 [c=10 l=2]  *thumb1_movsi_insn/8
> bx      lr
> [...]
> .bss
> .align  2
> .set    .LANCHOR0, . + 0
> .type   x, %object
> .size   x, 4
> x:
> .space  4
> 
> So the 1st load is from .rodata.cst4 and the 2nd load is from bss, both of
> which do not have the purecode bit set (unlike .text). Isn't that OK?

No, it will create a lot of complaints and support queries due to the obvious
regressions. It goes against the definition of pure-code and slow-flash-data,
which is to remove the literal loads. And given the sequence is already
inefficient, we should do everything to remove the indirection, which increases
the codesize overhead by 75%...

Another aspect that needs to be checked is that GCC correctly spills addresses
and complex constants instead of rematerializing them. This is basic minimal
quality that one expects for a feature like this.

[Bug middle-end/94715] New: Squared multiplies are incorrectly sign-extended

2020-04-22 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94715

Bug ID: 94715
   Summary: Squared multiplies are incorrectly sign-extended
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

The following example generates incorrect code with -O2:

unsigned long long f (int x)
{
  unsigned int t = x * x;
  return t;
}

On AArch64 I get:

  mul w0, w0, w0
  sxtw x0, w0
  ret

It's correct if you do x * y or x * 100.

[Bug middle-end/94715] Squared multiplies are incorrectly sign-extended

2020-04-23 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94715

Wilco  changed:

   What|Removed |Added

   Last reconfirmed||2020-04-23
 Ever confirmed|0   |1
 Status|RESOLVED|REOPENED
 Resolution|INVALID |---

--- Comment #2 from Wilco  ---
(In reply to Richard Biener from comment #1)
> I think GCC is correct in assuming that x * x is positive since overflow
> with signed arithmetic is undefined.  Thus on GIMPLE we elide
> 
>   _1 = x_2(D) * x_2(D);
>   t_3 = (unsigned int) _1;
>   _4 = (long long unsigned int) t_3;
> 
> to
> 
>   _1 = x_2(D) * x_2(D);
>   _4 = (long long unsigned int) _1;

If we assume x * x is always positive, using a zero (unsigned) extension would
make more sense. It still adds an unnecessary extra instruction on most
targets, which cannot be removed in RTL.
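
As a workaround sketch (not the proposed compiler fix), making the multiply
itself unsigned removes the signed-overflow assumption, so the widening becomes
a zero extension, which is free on AArch64 because a 32-bit write already
clears the upper half of the register:

unsigned long long f (int x)
{
  /* Unsigned multiply: no undefined overflow, so only a zero extension
     (implicit in the 32-bit result) is needed.  */
  unsigned int t = (unsigned int) x * (unsigned int) x;
  return t;
}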

[Bug tree-optimization/94787] Failure to detect single bit popcount pattern

2020-04-29 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94787

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #3 from Wilco  ---
(In reply to Gabriel Ravier from comment #1)
> Inversely, I'd also suggest doing the opposite. That is, if there is no
> hardware popcount instruction, `__builtin_popcount(v) == 1` should be
> optimized to `v && !(v & (v - 1))`

I actually posted a patch for this and for popcount(x) > 1, given the reverse
transformation is faster on all targets - even if they have a popcount
instruction (since those are typically more expensive). This is true on x86 as
well, (x-1) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90693
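
For reference, the two forms being compared (illustrative helper names, not
from the patch):

#include <stdbool.h>

/* Single-bit test without a popcount instruction.  */
bool single_bit_p (unsigned int v)
{
  return v && !(v & (v - 1));
}

/* The equivalent written with the builtin.  */
bool single_bit_popcount_p (unsigned int v)
{
  return __builtin_popcount (v) == 1;
}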

[Bug target/94789] Failure to take advantage of shift operand semantics to turn subtraction into negate

2020-04-29 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94789

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #4 from Wilco  ---
(In reply to Gabriel Ravier from comment #0)
> int r(int x, unsigned b)
> {
> int const m = CHAR_BIT * sizeof(x) - b;
> return (x << m);
> }
> 
> `CHAR_BIT * sizeof(x) - b;` can be optimized to `-b`. LLVM does this
> transformation, not GCC.
> 
> Comparison here : https://godbolt.org/z/5byJ2E

AArch64 already generates:

  neg w1, w1
  lsl w0, w0, w1
  ret

[Bug target/94538] [9/10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data

2020-04-30 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538

--- Comment #19 from Wilco  ---
Yes, I have a GCC 9.3 build now; this fails too.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-26 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
(In reply to Bu Le from comment #0)
> Created attachment 48584 [details]
> proposed patch
> 
> I would like to propose an implementation of the medium code model in
> aarch64. A prototype is attached, passed bootstrap and the regression test.
> 
> mcmodel=medium is a code model that is missing on the aarch64 architecture but
> is supported on x86. This code model describes a situation where small data is
> relocated via the small code model while large data is relocated via the large
> code model. The official description of the medium code model is in the x86
> ABI document, page 34: https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
> 
> The key difference between x86 and aarch64 is that x86 can use lea+movabs
> instructions to implement a dynamically relocatable large code model.
> Currently, the large code model on AArch64 relocates symbols using an ldr
> instruction, which can only be statically linked. However, the small code
> model uses adrp + ldr instructions, which can be dynamically linked. Therefore,
> the medium code model cannot be implemented directly by simply setting a
> threshold. As a result, a dynamically relocatable large code model is needed
> first for a functional medium code model.
> 
> I met this problem when compiling CESM, a climate forecasting application
> widely used in the HPC field. In some configurations, when manipulating large
> arrays, a large code model with dynamic relocation is needed. The following
> case is abstracted from CESM for this scenario.
> 
> program main
>  common/baz/a,b,c
>  real a,b,c
>  b = 1.0
>  call foo()
>  print*, b
>  end
> 
>  subroutine foo()
>  common/baz/a,b,c
>  real a,b,c
> 
>  integer, parameter :: nx = 1024
>  integer, parameter :: ny = 1024
>  integer, parameter :: nz = 1024
>  integer, parameter :: nf = 1
>  real :: bar(nf,nx*ny*nz)
>  real :: bar1(nf,nx*ny*nz)
>  bar = 0.0
>  bar1 =0.0
>  b = bar(1,1024*1024*100)
>  b = bar1(1,1)
> 
>  return
>  end
> 
> Compiling with -mcmodel=small -fPIC gives the following error due to the
> access of the bar1 array:
> test.f90:(.text+0x28): relocation truncated to fit:
> R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
> test.f90:(.text+0x6c): relocation truncated to fit:
> R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
> 
> Compiling with -mcmodel=large -fPIC gives an unsupported error:
> f951: sorry, unimplemented: code model ‘large’ with ‘-fPIC’
> 
> As discussed in the beginning, to tackle this problem we have to solve the
> static large code model problem. My solution here is to use
> R_AARCH64_MOVW_PREL_Gx group relocation with instructions to calculate the
> current PC value.
> 
> Before change (mcmodel=small):
> adrp  x0, bar1.2782
> add   x0, x0, :lo12:bar1.2782
> 
> After change (mcmodel=medium, proposed):
> movz  x0, :prel_g3:bar1.2782
> movk  x0, :prel_g2_nc:bar1.2782
> movk  x0, :prel_g1_nc:bar1.2782
> movk  x0, :prel_g0_nc:bar1.2782
> adr   x1, .
> sub   x1, x1, 0x4
> add   x0, x0, x1
> 
> The first 4 mov instructions calculate the 64-bit offset between bar1 and the
> last movk instruction, which fulfils the requirement of the large code
> model (64-bit relocation).
> The adr+sub instructions calculate the PC address of the last movk
> instruction. By adding the offset to the PC address, bar1 can be
> dynamically located.
> 
> Because this relocation is time consuming, a threshold is set to classify
> the size of the data to be relocated, like x86. The default value of the
> threshold is 65536, which is the maximum relocation capability of the small
> code model.
> This implementation will also need to amend the linker in binutils so that
> the 4 mov instructions calculate the same pc-offset of the last movk
> instruction.
> 
> The good side of this implementation is that it can use existing relocation
> types to prototype a medium code model.
> 
> The drawbacks of this implementation also exist.
> For a start, these 4 mov instructions and the adr instruction must be
> combined in this order. No other instruction should be inserted in between
> the sequence, which would lead to a wrong symbol address. This might impede
> the insn scheduling optimizations.
> Secondly, the linker needs to make the corresponding change so that every
> mov instruction calculates the same pc-offset. For example, in my
> implementation, the first movz instruction will need to add 12 to the result
> of ":prel_g3:bar1.2782" to make up t

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-26 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #4 from Wilco  ---
(In reply to Bu Le from comment #3)
> (In reply to Wilco from comment #2)
> 
> > Is the main usage scenario huge arrays? If so, these could easily be
> > allocated via malloc at startup rather than using bss. It means an extra
> > indirection in some cases (to load the pointer), but it should be much more
> > efficient than using a large code model with all the overheads.
> 
> Thanks for the reply. 
> 
> The large array is just used to construct the test case. It is not a
> necessary condition for this scenario. The common scenario is that the
> symbol is too far away for the small code model to reach it, which could also
> result from a large number of small arrays, structures, etc. Meanwhile, the
> large code model is able to reach the symbol but cannot be position
> independent, which causes the problem.
> 
> Besides, the code in CESM is quite complicated to reconstruct with malloc,
> which is also not an acceptable option for my customer.
> 
> Is that clear enough for your concern?

Well, the question is whether we're talking about more than 4GB of code or more
than 4GB of data. With >4GB code you're indeed stuck with the large model. With
data it is feasible to automatically use malloc for arrays when larger than a
certain size, so there is no need to change the application at all. Something
like that could be the default in the small model so that you don't have any
extra overhead unless you have huge arrays. Making the threshold configurable
means you can tune it for a specific application.
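
In C terms the idea looks roughly like this (a sketch of the proposed lowering,
not an existing GCC option; names and sizes are illustrative):

#include <stdlib.h>

/* Instead of a huge object in .bss that the small code model may not reach,
   e.g.:
     static float bar1[1024UL * 1024 * 1024];
   keep only a pointer in .bss and allocate the storage at startup, so every
   symbol the code model has to address stays small.  */
static float *bar1;

__attribute__((constructor))
static void allocate_bar1 (void)
{
  bar1 = calloc (1024UL * 1024 * 1024, sizeof *bar1);
}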

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-26 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #5 from Wilco  ---
(In reply to Bu Le from comment #0)

Also it would be much more efficient to have a relocation like this if you
wanted a 48-bit PC-relative offset:

adrp  x0, bar1.2782
add   x0, x0, :lo12:bar1.2782
movk  x0, :high32_47:bar1.2782

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #8 from Wilco  ---
(In reply to Bu Le from comment #6)
> (In reply to Wilco from comment #4)
> > (In reply to Bu Le from comment #3)
> > > (In reply to Wilco from comment #2)
> 
> > Well the question is whether we're talking about more than 4GB of code or
> > more than 4GB of data. With >4GB code you're indeed stuck with the large
> > model. With data it is feasible to automatically use malloc for arrays when
> > larger than a certain size, so there is no need to change the application at
> > all. Something like that could be the default in the small model so that you
> > don't have any extra overhead unless you have huge arrays. Making the
> > threshold configurable means you can tune it for a specific application.
> 
> 
> Is this automatic malloc already available on some target? I haven't found an
> example that works in that way. Would you mind providing an example?

Fortran already has -fstack-arrays to decide between allocating arrays on the
heap or on the stack.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #9 from Wilco  ---
(In reply to Bu Le from comment #7)
> (In reply to Wilco from comment #5)
> > (In reply to Bu Le from comment #0)
> > 
> > Also it would be much more efficient to have a relocation like this if you
> > wanted a 48-bit PC-relative offset:
> > 
> > adrp  x0, bar1.2782
> > add   x0, x0, :lo12:bar1.2782
> > movk  x0, :high32_47:bar1.2782
> 
> I am afraid that putting the PC-relative offset into x0 is not correct,
> because x0 is supposed to be the final address of bar1 rather than a PC
> offset. Therefore an extra register is needed to hold the offset temporarily.

You're right, we need an extra add, so it's like this:

adrp  x0, bar1.2782
movk  x1, :high32_47:bar1.2782
add   x0, x0, x1
add   x0, x0, :lo12:bar1.2782

> (By the way, the high32_47 relocation you suggested is the prel_g2 in the
> official aarch64 ABI release)

It needs a new relocation because of the ADRP. ADR could be used so that the
existing R_AARCH64_MOVW_PREL_G0-3 relocations work, but then you need 5
instructions.

> And in terms of engineering, your idea saves the trouble of modifying the
> linker to calculate the offset for the 3 movks. But we still need to make a
> new relocation type for ADRP, because it currently checks the overflow of the
> address and gives the "relocation truncated to fit" error. Therefore, both
> ideas need work in binutils, which makes them equivalent.

There is relocation 276 (R_AARCH64_ADR_PREL_PG_HI21_NC).

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2020-05-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #29 from Wilco  ---
(In reply to Jiu Fu Guo from comment #28)
> (In reply to Jiu Fu Guo from comment #27)
> 
> > > 12: 1.2
> > > 13: 0.9
> > > 14: 0.8
> > > 15: 0.7
> > > 16: 2.1
> > > 
> > 
> > Find one interesting thing:
> > If using widen reading for the run which > 16 iterations, we can see the
> > performance is significantly improved(>18%) for xz_r in spec.
> > This means that the frequency is small for >16, while it still costs a big
> > part of the runtime.
> > 
> 
> Oh, Recheck frequency in my test, the frequency is big (99.8%) for >16
> iterations.

The frequency for >16 iterations is small, 2.1%. The limit is generally large,
but the actual number of iterations is what matters because of the early exit.

The key question remains whether it is legal to assume the limit implies the
memory is valid and use wider accesses.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #12 from Wilco  ---
(In reply to Bu Le from comment #10)
> > Fortran already has -fstack-arrays to decide between allocating arrays on
> > the heap or on the stack.
> 
> I tried the flag with my example. -fstack-arrays seems unable to move the
> array in the bss to the heap. The problem is still there.

It is an existing feature that chooses between malloc and stack. It would need
modification to do the same for large data/bss objects.

> Anyway, my point is that the size of a single datum doesn't affect the fact
> that the medium code model is missing in aarch64 and that aarch64 lacks a PIC
> large code model.

What is missing is efficient support for >4GB of data, right? How that is
implemented is a different question - my point is that it does not require a
new code model. It would be much better if it just worked without users even
needing to think about code models.

Also, what is the purpose of a large fpic model? Are there any applications
that use shared libraries larger than 4GB?

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #13 from Wilco  ---
(In reply to Bu Le from comment #11)
>  
> > You're right, we need an extra add, so it's like this:
> > 
> > adrp  x0, bar1.2782
> > movk  x1, :high32_47:bar1.2782
> > add   x0, x0, x1
> > add   x0, x0, :lo12:bar1.2782
> > 
> > > (By the way, the high32_47 relocation you suggested is the prel_g2 in the
> > > official aarch64 ABI release)
> > 
> > It needs a new relocation because of the ADRP. ADR could be used so that the
> > existing R_AARCH64_MOVW_PREL_G0-3 relocations work, but then you need 5
> > instructions.
> 
> So you suggest a new relocation type "high32_47" to calculate the offset
> between ADRP and bar1. Am I right?

Yes. It needs to have an offset to the adrp instruction so it can compute the
correct ADRP offset and then extract bits 32-47.

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2020-05-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #31 from Wilco  ---
(In reply to Jiu Fu Guo from comment #30)
> (In reply to Wilco from comment #29)

> > The key question remains whether it is legal to assume the limit implies the
> > memory is valid and use wider accesses.

> If unaligned access is supported, it would be valid to access the memory.
> Otherwise, a check like ((pb & 7) == (cur & 7)) would cost an additional
> test, and it may not be likely to be true.

If both pointers are aligned it would be safe indeed, but if unaligned they
will access up to 7 bytes after the first match. And that's not safe without
knowing the underlying buffer bounds. See eg. comment #3.

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2020-05-28 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #34 from Wilco  ---
(In reply to Jiu Fu Guo from comment #33)
> It would be relatively easy if the target supports unaligned access.  like
> read64ne in
> https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
> Then the alignment issue is relaxed.   It may be safer if we can
> prove/assume the underlying buffer is enough, like array accessing or
> pointer+index accessing in a loop.

Yes, without unaligned support you can't use a wider access.

If we can't prove the buffer bounds then we'd have to use a page cross check
before every unaligned access.
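
A minimal sketch of such a check, assuming a 4KB page size (illustrative only):

#include <stdint.h>

/* Non-zero if an 8-byte load starting at p cannot cross into the next page,
   so it cannot fault as long as the byte at p itself is valid.  */
static inline int within_page_8 (const void *p)
{
  return ((uintptr_t) p & 4095) <= 4096 - 8;
}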

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-28 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #15 from Wilco  ---
(In reply to Bu Le from comment #14)
> > > Anyway, my point is that the size of single data does't affact the fact 
> > > that
> > > medium code model is missing in aarch64 and aarch64 is lack of PIC large
> > > code model.
> > 
> > What is missing is efficient support for >4GB of data, right? How that is
> > implemented is a different question - my point is that it does not require a
> > new code model. It would be much better if it just worked without users even
> > needing to think about code models.
> > 
> > Also, what is the purpose of a large fpic model? Are there any applications
> > that use shared libraries larger than 4GB?
> 
> Yes, I understand, and I am grateful for your suggestion. I have to say it is
> not a critical problem. After all, most applications work fine with the
> current code models.
> 
> But there are some cases, like CESM with a certain configuration, or my test
> case, which cannot be compiled with the current gcc compiler on aarch64.
> Unfortunately, applications larger than 4GB are quite normal in the HPC
> field. In the meantime, x86 and llvm-aarch64 can compile it, with the medium
> or large-pic code model. That is why I am proposing it. By adding this
> feature, we can make a step forward for the aarch64 gcc compiler, making it
> more powerful and robust.
> 
> Is that clear enough for your concern?

Yes but such a feature needs to be defined in an ABI and well specified. This
is why I'm trying to get the underlying requirements first. Note that while
LLVM allows -fpic in large model, it doesn't correctly implement it. The large
model shouldn't ever be needed by actual applications.

> And as for the implementation you suggested, I believe it is a promising
> plan. I would like to try to implement it first. It might take weeks of
> development. I will see what I can get and will give you updates on progress.
> 
> Thanks for the suggestion again.

As discussed, there are many different ways of supporting the requirement of
>4GB of data, so I wouldn't start on the implementation before there is a good
specification. GCC and LLVM would need to implement it in the same way after
all.

[Bug target/94986] missing diagnostic on ARM thumb2 compilation with -pg when using r7 in inline asm

2020-06-03 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94986

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #1 from Wilco  ---
(In reply to Arnd Bergmann from comment #0)
> I reported a bug against clang for a Linux kernel failure, but 
>  it was suggested that the clang behavior is probably correct in this corner
> case while gcc gets it wrong, see https://bugs.llvm.org/show_bug.cgi?id=45826
> 
> echo 'void f(void) { asm("mov r7, #0" ::: "r7"); }' | arm-linux-gnueabi-gcc
> -march=armv7-a -O2  -mthumb -pg -S -xc -
> 
> silently accepts an inline asm statement that clobbers the frame pointer,
> but gcc rejects the same code if any of '-O0', '-fomit-frame-pointer' or
> 'fno-omit-frame-pointer' are used:
> 
> : In function 'f':
> :1:44: error: r7 cannot be used in 'asm' here
> 
> If using r7 in this case is indeed invalid, we need to ensure the kernel
> does not do this, and having gcc reject it would be helpful.

GCC will reject it if you explicitly enable the frame pointer. The logic seems
wrong in that it doesn't report an error if the frame pointer is implicitly
enabled via -pg. As a workaround for the kernel, just use -pg and
-fno-omit-frame-pointer together.

Corrupting a frame pointer loses the ability to follow the frame chain, similar
to a function built with -fomit-frame-pointer which will use r7 as a general
purpose register.

However this always reports an error since this corruption of the frame pointer
will cause a crash:

int *f(int x) { asm("mov r7, #0" ::: "r7"); return __builtin_alloca (x); }

[Bug target/94986] missing diagnostic on ARM thumb2 compilation with -pg when using r7 in inline asm

2020-06-03 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94986

Wilco  changed:

   What|Removed |Added

 Resolution|--- |INVALID
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from Wilco  ---
(In reply to nsz from comment #2)
> on arm the -pg abi is
> 
> func:
>   push {lr}
>   bl _gnu_mcount_nc
>   ...
> 
> so no frame pointer is involved, -pg implying
> -fno-omit-frame-pointer is a historical mistake i think
> (because some targets required fp for -pg, but most don't).

Right, so the claim that -pg implies -fno-omit-frame-pointer is wrong, and that
means there is no bug in GCC. Looking at the latest docs, there is no mention
of frame pointer for the -pg option, and neither does -fomit-frame-pointer
discuss -pg. So this dependency must have been removed some time ago.

> ideally r7 clobber would just work with -pg -fomit-frame-pointer.
> the alloca problem is a separate issue (that r7 clobber may not
> work with alloca).

GCC correctly reports the error for that. So this can be closed then.

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2020-06-08 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #40 from Wilco  ---
(In reply to Jiu Fu Guo from comment #39)
> I’m thinking to draft a patch for this optimization.  If any suggestions,
> please point out, thanks.

Which optimization to be precise? Besides unrolling I haven't seen a proposal
for an optimization which is both safe and generally applicable.

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2020-06-10 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #44 from Wilco  ---
(In reply to Jiu Fu Guo from comment #43)
> To handle vectorization for this kind of code, it needs to overcome the hard
> issue mentioned in comment #5: the loop has 2 exits.

Yes and that also implies vector loads are unsafe unless they are non-faulting.
Few ISAs have such support.

[Bug target/95650] aarch64: Missed optimization storing addition of two shorts

2020-06-12 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95650

Wilco  changed:

   What|Removed |Added

   Last reconfirmed||2020-06-12
 Ever confirmed|0   |1
 CC||wilco at gcc dot gnu.org
 Status|UNCONFIRMED |NEW

--- Comment #4 from Wilco  ---
(In reply to Alex Coplan from comment #3)
> I think clang's optimisation is sound here.
> 
> C says that we add two shorts as int and then truncate to short (i.e. reduce
> mod 2^16).
> 
> The question is whether the top bits being set (which the ABI allows) can
> influence the result. I don't think it can.
> 
> The observation is that the "top bits being set" are just extra multiples of
> 2^16 in the addition, which just disappear when we reduce mod 2^16. That is:
> 
> (x_1 + x_2 + y_1 + y_2) % 2^16 = (x_1 + x_2) % 2^16
> 
> where x_1,x_2 are arbitrary integers and y_1,y_2 are multiples of 2^16 (the
> top bits).

Confirmed. It works for signed as well and any operator except right shift and
division. Basically the store requires only the bottom 16 bits to be valid, and
a backwards dataflow can propagate this to remove unnecessary zero and sign
extends.
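
An illustrative testcase shape (my example, not necessarily the exact one from
the report):

void store_sum (short *p, short a, short b)
{
  /* Only the low 16 bits of a + b reach memory, so any sign or zero
     extension of the sum before the store is redundant.  */
  *p = a + b;
}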

[Bug target/96191] aarch64 stack_protect_test canary leak

2020-07-14 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96191

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
(In reply to Jim Wilson from comment #0)
> Given a simple testcase
> extern int sub (int);
> 
> int
> main (void)
> {
>   sub (10);
>   return 0;
> }
> compiling with -O -S -fstack-protector-all -mstack-protector-guard=global
> in the epilogue for the canary check I see
>   ldr x1, [sp, 40]
>   ldr x0, [x19, #:lo12:__stack_chk_guard]
>   eor x0, x1, x0
>   cbnzx0, .L4
> Both x0 and x1 have the stack protector canary loaded into them, and the eor
> clobbers x0, but x1 is left alone.  This means the value of the canary is
> leaking from the epilogue.  The canary value is never supposed to survive in
> a register outside the stack protector patterns.
> 
> A powerpc64-linux toolchain build with the same testcase and options
> generates
>   lwz 9,28(1)
>   lwz 10,0(31)
>   xor. 9,9,10
>   li 10,0
>   bne- 0,.L4
> and note that it clears the second register after the xor to prevent the
> canary leak.  The aarch64 stack_protect_test pattern should do the same
> thing.

The canary value is not a secret. What would be the purpose of clearing the
register, given the stack slot containing the canary is not cleared as well?
And a register could potentially contain the address of the canary or that of a
global nearby, making reading the canary value really easy.

[Bug tree-optimization/95731] Failure to optimize a >= 0 && b >= 0 to (a | b) >= 0

2020-08-04 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95731

Wilco  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2020-08-04
 CC||wilco at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #3 from Wilco  ---
(In reply to Gabriel Ravier from comment #0)
> bool f(int a, int b)
> {
> return a >= 0 && b >= 0;
> }
> 
> This can be optimized to `return (a | b) >= 0;`. LLVM does this
> transformation, but GCC does not.

For orthogonality you also want:

a < 0 && b < 0   -> (a & b) < 0
a >= 0 || b >= 0 -> (a & b) >= 0
a < 0 || b < 0   -> (a | b) < 0
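
Spelled out in C (illustrative functions showing the folds listed above):

#include <stdbool.h>

bool both_neg (int a, int b)      { return a < 0 && b < 0; }   /* -> (a & b) < 0  */
bool either_nonneg (int a, int b) { return a >= 0 || b >= 0; } /* -> (a & b) >= 0 */
bool either_neg (int a, int b)    { return a < 0 || b < 0; }   /* -> (a | b) < 0  */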

[Bug target/96768] -mpure-code produces switch tables for thumb-1

2020-08-28 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96768

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #5 from Wilco  ---
(In reply to Christophe Lyon from comment #4)
> That's what I replied in the original PR94538, but Wilco said the best
> option was to turn off switch tables:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538#c14
> 
> See also another comment from him:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538#c16 related to immediate
> loads from rodata/bss.

Yes, switching off switch tables is obviously the best option - just look at
the m0 vs m3 code above. It's the fastest and smallest, and is also "pure code"
since there are no extra literals generated anywhere.

In AArch32 switch tables are very inefficient. An easy improvement would be to
copy the AArch64 case_values_threshold to disable tables if there are fewer
than 16 cases.
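
A rough sketch of what copying that hook into the Arm backend could look like
(the hook macro and default hook exist; the function name, condition and
threshold here are my assumptions, not an actual patch):

/* Hypothetical sketch for the Arm backend: require at least 16 cases before
   a switch table is used, so small switches become compare+branch chains on
   Thumb-1, mirroring the AArch64 behaviour described above.  */
static unsigned int
arm_case_values_threshold (void)
{
  if (TARGET_THUMB1)
    return 16;
  return default_case_values_threshold ();
}

#undef  TARGET_CASE_VALUES_THRESHOLD
#define TARGET_CASE_VALUES_THRESHOLD arm_case_values_threshold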

[Bug c/92172] ARM Thumb2 frame pointers inconsistent with clang

2019-10-21 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92172

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #1 from Wilco  ---
Firstly it's important to be clear this is about adding support for a frame
chain for unwinding.

A frame pointer is something different since it is used for addressing local
variables. Historically Arm compilers only supported a frame pointer, not a
frame chain for unwinding. So different Arm backends use different frame
pointer registers and there is no defined layout since it is not designed for
unwinding.

Why does this matter? Well as your examples show, if you want to emit a frame
chain using standard push/pop, it typically ends up pointing to the top of the
frame. That is the worst possible position for a frame pointer on Thumb - while
Arm supports negative immediate offsets up to 4KB, Thumb-1 doesn't support
negative offsets at all, and Thumb-2 supports offsets up to -255 but only with
32-bit instructions. So the result of conflating the frame chain and frame
pointer implies a terrible codesize hit for Thumb.

There is also an issue with using r7 as the frame chain in that you're now
reserving a precious callee-saved low register and only use it once in a typical
function. So using r7 is a very bad idea for Thumb.

Your examples suggest LLVM suffers from both of these issues, and IIRC it still
uses r11 on Arm but r7 on Thumb. That is way too inefficient/incorrect to
consider a de facto standard.

[Bug c/92172] ARM Thumb2 frame pointers inconsistent with clang

2019-10-23 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92172

--- Comment #4 from Wilco  ---
(In reply to Seth LaForge from comment #2)
> Good point on frame pointers vs a frame chain for unwinding. I'm looking for
> the unwindable frame chain.
> 
> Wilco:
> > Why does this matter? Well as your examples show, if you want to emit a 
> > frame
> > chain using standard push/pop, it typically ends up pointing to the top of 
> > the
> > frame. That is the worst possible position for a frame pointer on Thumb - 
> > while
> > Arm supports negative immediate offsets up to 4KB, Thumb-1 doesn't support
> > negative offsets at all, and Thumb-2 supports offsets up to -255 but only 
> > with
> > 32-bit instructions. So the result of conflating the frame chain and frame
> > pointer implies a terrible codesize hit for Thumb.
> 
> Well, there's really no need for a frame pointer for efficiency, since the
> stack frame can be efficiently accessed with positive immediate accesses
> relative to the stack pointer. There are even special encodings for Thumb-2
> 16-bit LDR/STR which allow an immediate offset of 0 to 1020 when relative to
> SP - much larger than other registers. You're saying using a frame pointer
> implies a terrible codesize hit for Thumb, but I don't see how that can be -
> stack access will continue to go through SP, and the only code size hit
> should be pushing/popping R7 (~2 cycles), computing R7 as a frame pointer
> (~1 cycle), and potential register spills due to one less register
> available. That's a pretty small amount of overhead for a non-leaf function.

On GCC10 the codesize overhead of -fno-omit-frame-pointer is 4.1% for Arm and
4.8% for Thumb-2 (measured on SPEC2006). That's already a large overhead,
especially since this feature doesn't do anything useful besides adding
overhead...

The key is that GCC uses the frame pointer for every stack access, and thus the
placement of the frame pointer within a frame matters. Thumb compilers place
the frame pointer at the bottom of the frame so they can efficiently access
locals using positive offsets. Despite that the overhead is significant
already.

If GCC were to emit a frame chain like the LLVM sequence, this would mean
placing the frame pointer at the top of the stack. This forces negative frame
offsets for all frame accesses. Getting a 10% overhead is being lucky; I've seen
worse...

So this is something that needs to be properly designed and carefully
implemented.

> Baseline: With gcc 4.7, -fomit-frame-pointer, -mthumb: 384016 bytes, 110.943
> s.

Thanks for posting actual numbers, but GCC 4.7?!? It might be time to try
GCC9...

[Bug c/92172] ARM Thumb2 frame pointers inconsistent with clang

2019-10-23 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92172

--- Comment #6 from Wilco  ---
(In reply to Seth LaForge from comment #5)

> GCC 8:
> push    {r7, lr}
> sub     sp, sp, #8
> add     r7, sp, #0
> str     r0, [r7, #4]
> ...
> 
> Clang 9:
> push    {r7, lr}
> mov     r7, sp
> sub     sp, #8
> str     r0, [sp, #4]
> ...

Crazy, yes, but it's due to historical reasons. Originally GCC could only emit
code using a frame pointer. Later the frame pointer could be switched off
(hence -fomit-frame-pointer), but you still needed it for debug tables. Then
there was DWARF, which didn't need a frame pointer anymore. And today the frame
pointer is off by default globally in GCC.

> - GCC ARM and Clang ARM use R11 for frame pointer, pointing to the stacked 
> R11. Useful.

Well Clang does this:

    push    {r4, r10, r11, lr}
    add     r11, sp, #8

but GCC does something different:

    push    {r4, r5, fp, lr}
    add     fp, sp, #12

I.e. FP points to the saved LR with GCC but to the saved FP with Clang, so it's
not possible for a generic unwinder to follow the chain, even ignoring Arm/Thumb
interworking (which is a real issue when an application is Thumb-2 but various
library functions use Arm assembly).
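
To illustrate why the layout matters: a naive frame-chain walker assumes every
frame pointer points at a two-word {saved FP, saved LR} record, which matches
the Clang layout above but not the GCC one (sketch, assuming a NULL-terminated
chain):

#include <stdio.h>

struct frame_record { void *fp; void *lr; };

static void walk_frames (void *fp)
{
  while (fp)
    {
      struct frame_record *fr = fp;
      printf ("return address: %p\n", fr->lr);
      fp = fr->fp;  /* with GCC's layout this slot holds LR, not the next FP */
    }
}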

[Bug target/91766] -fvisibility=hidden during -fpic still uses GOT indirection on arm64

2019-10-24 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91766

--- Comment #12 from Wilco  ---
(In reply to Andrew Pinski from comment #10)

> This should be a global change and not just an aarch64 change.  The reason
> is because then aarch64 is the odd man out when it comes to this.

Agreed, see https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01549.html. It would
be great to sort that out so C and C++ finally address globals identically.

[Bug c/85678] -fno-common should be default

2019-10-25 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85678

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #6 from Wilco  ---
(In reply to Jonathan Wakely from comment #5)
> The other bug links to a patch to change the default:
> 
> https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01549.html

Updated patch: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01847.html

[Bug target/91766] -fvisibility=hidden during -fpic still uses GOT indirection on arm64

2019-10-25 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91766

--- Comment #13 from Wilco  ---
(In reply to Wilco from comment #12)
> (In reply to Andrew Pinski from comment #10)
> 
> > This should be a global change and not just an aarch64 change.  The reason
> > is because then aarch64 is the odd man out when it comes to this.
> 
> Agreed, see https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01549.html. It
> would be great to sort that out so C and C++ finally address globals
> identically.

Patch: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01847.html

[Bug target/91927] -mstrict-align doesn't prevent unaligned accesses at -O2 and -O3 on AARCH64 targets

2019-10-30 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91927

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #8 from Wilco  ---
Note gcc.target/aarch64/pr71727.c fails when compiled with -mstrict-align
-fno-common -O3:

adrp    x2, .LC0
adrp    x3, .LC1
adrp    x1, xarray
add     x0, x1, :lo12:xarray
ldr     q1, [x2, #:lo12:.LC0]
mov     x2, 5
ldr     q0, [x3, #:lo12:.LC1]
str     x2, [x0, 32]
str     q1, [x1, #:lo12:xarray]
str     q0, [x0, 16]
ret

.bss
.align  4
.type   xarray, %object
.size   xarray, 5120
xarray:
.zero   5120

[Bug rtl-optimization/92294] New: alias attribute generates incorrect code

2019-10-30 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92294

Bug ID: 92294
   Summary: alias attribute generates incorrect code
   Product: gcc
   Version: 4.8.4
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

The following example (from gcc.c-torture/execute/alias-2.c) always calls abort
on any AArch64 compiler with -O1 or -O2:

static int a[10];
extern int b[10] __attribute__ ((alias("a")));
int off = 0;
void f(void)
{
  b[off]=1;
  a[off]=2;
  if (b[off]!=2)
   __builtin_abort ();
}

Using extern linkage for 'a' avoids the problem, as does using off = 1 or static
int off = 0. It may only affect targets which use section anchors, since
-fno-section-anchors avoids the issue.

[Bug rtl-optimization/92294] alias attribute generates incorrect code

2019-10-30 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92294

Wilco  changed:

   What|Removed |Added

 Target||aarch64
   Target Milestone|--- |10.0

[Bug rtl-optimization/92294] alias attribute generates incorrect code

2019-10-31 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92294

Wilco  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-10-31
 Ever confirmed|0   |1

--- Comment #2 from Wilco  ---
Confirmed then

[Bug c++/92425] Incorrect logical AND on 64bit variable using 32bit register

2019-11-08 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92425

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #1 from Wilco  ---
Where do you see a problem? 'and eax, 4095' clears the top 32 bits of rax.
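
For illustration (my own example, not the reporter's testcase): on x86-64 any
write to a 32-bit register zero-extends into the full 64-bit register, so
masking through eax is still a correct 64-bit operation:

unsigned long long mask_low_bits (unsigned long long x)
{
  /* Typically compiled to 'and eax, 4095': writing eax zeroes the upper
     32 bits of rax, and the mask has no bits above bit 11 anyway.  */
  return x & 4095;
}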

[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out

2019-11-11 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #1 from Wilco  ---
(In reply to Aleksei Voitylov from comment #0)
> Created attachment 47212 [details]
> reduced testcase from the openjdk sources
> 
> While compiling openjdk sources I bumped into a bug: when -ftree-pre is
> enabled the optimizer hoists out reload of a variable which subsequently
> leads to the infinite loop situation.
> 
> Below is the relevant piece of code and "new_value" is the variable that
> gets hoisted out.
> 
> template<typename T>
> inline T Atomic::CmpxchgByteUsingInt::operator()(T exchange_value,
>                                                  T volatile* dest,
>                                                  T compare_value,
>                                                  atomic_memory_order order) const {
>   printf ("Atomic::CmpxchgByteUsingInt::operator: 1: %d, 2: %d\n",
>           exchange_value, compare_value);
>   uint8_t canon_exchange_value = exchange_value;
>   uint8_t canon_compare_value = compare_value;
>   volatile uint32_t* aligned_dest
>     = reinterpret_cast<volatile uint32_t*>(align_down(dest, sizeof(uint32_t)));
>   size_t offset = (uintptr_t)dest - (uintptr_t)aligned_dest;
>   uint32_t cur = *aligned_dest;
>   uint8_t* cur_as_bytes = reinterpret_cast<uint8_t*>(&cur);
>   cur_as_bytes[offset] = canon_compare_value;
>   do {
>     uint32_t new_value = cur;
>     reinterpret_cast<uint8_t*>(&new_value)[offset] = canon_exchange_value;
>     printf ("Atomic::CmpxchgByteUsingInt::operator2: 1: %d, 2: %d\n",
>             new_value, cur);
>     uint32_t res = cmpxchg(new_value, aligned_dest, cur, order);
>     if (res == cur) break;
>     cur = res;
>   } while (cur_as_bytes[offset] == canon_compare_value);
>   return PrimitiveConversions::cast<T>(cur_as_bytes[offset]);
> }
> 
> $ g++ -O1 -ftree-pre t.cpp
> $ ./a.out
> Atomic::CmpxchgByteUsingInt::operator: 1: 0, 2: 1
> Atomic::CmpxchgByteUsingInt::operator2: 1: 0, 2: 0
> 
> $ g++ -O1 t.cpp
> $ ./a.out
> Atomic::CmpxchgByteUsingInt::operator: 1: 0, 2: 1
> Atomic::CmpxchgByteUsingInt::operator2: 1: 0, 2: 256
> 
> Below is the assembler of the loop for the correct version:
> 
> .L7:
> ldr     r4, [sp]
> str     r4, [sp, #4]
> strb    r7, [r5, #-4]
> mov     r2, r4
> ldr     r1, [sp, #4]
> mov     r0, r6
> bl      printf
> cbz     r4, .L6
> movs    r3, #0
> str     r3, [sp]
> ldrb    r3, [r8]        @ zero_extendqisi2
> cmp     r3, #1
> beq     .L7
> 
> and for the incorrect one:
> 
> .L7:
> str     r4, [sp, #4]
> strb    r8, [r6]
> mov     r2, r4
> ldr     r1, [sp, #4]
> mov     r0, r5
> bl      printf
> cbz     r4, .L6
> movs    r4, #0
> str     r4, [sp]
> ldrb    r3, [r7]        @ zero_extendqisi2
> cmp     r3, #1
> beq     .L7

There are serious aliasing bugs in the source - GCC is quite correct in
assuming that cur and cur_as_bytes[offset] never alias (obviously) and even
optimizes away the cmpxchg (no idea why, that appears wrong).

Even if you fix the aliasing bugs, it won't emulate a byte-oriented cmpxchg
correctly; there are bugs in the logic too.

So I suggest going back to the drawing board - you can't hack your own atomic
operations and just hope for the best. GCC supports a standard set of atomic
operations for a good reason!

[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out

2019-11-12 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462

Wilco  changed:

   What|Removed |Added

 Resolution|INVALID |FIXED

--- Comment #7 from Wilco  ---
(In reply to Andrew Pinski from comment #2)
> (In reply to Wilco from comment #1)
> > Even if you fix the aliasing bugs, it won't emulate a byte-oriented cmpxchg
> > correctly, there are bugs in the logic too.
> 
> More than that, it will never be atomic.  You can't do a byte-wise cmpxchg for
> a word size and think it will be atomic.

And since both Arm and AArch64 have byte-wise __atomic_compare_exchange_n, I
can't see a reason to even try when the compiler already does it correctly and
efficiently.

So my advice is to remove this function altogether and all related code from
openjdk - it's completely incorrect and counterproductive.

(In reply to Aleksei Voitylov from comment #4)
> (In reply to Wilco from comment #1)
> > (In reply to Aleksei Voitylov from comment #0)
> > > Created attachment 47212 [details]
> > > reduced testcase from the openjdk sources
> > > 
> > 
> > There are serious aliasing bugs in the source - GCC is quite correct in
> > assuming that cur and cur_as_bytes[offset] never alias (obviously) and even
> > optimize away the cmpxchg (no idea why, that appears wrong).
> Isn't
> 
>uint32_t cur = *aligned_dest;
>uint8_t* cur_as_bytes = reinterpret_cast(&cur);
> 
> the very definition of the pointer aliasing? Regardless, if the function
> being called is doing atomic operations or not, the variable in question
> should not be hoisted? This is the essence of this bug.

The example as-is is completely incorrect - it doesn't even return and ends up
executing random code.

> Yes, openjdk code is doing nasty things furtheron (and the code predates
> builtin gcc operations and is compiled by other compilers as well which may
> not be aware of builtins), but the bug as it stands does not depend on that
> logic.

Arm and AArch64 compilers support byte-wise cmpxchg, so my advice above stands.

[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out

2019-11-14 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462

--- Comment #16 from Wilco  ---
(In reply to Richard Biener from comment #15)
> I can't find PRE doing anything wrong and on 32bit x86_64 the testcase
> executes
> correctly with GCC 7.3 and GCC 9 (when I add the missing return to
> Bar::cmpxchg).
> So -ftree-pre, if it triggers a bug, triggers it elsewhere, the bug isn't in
> PRE itself AFAICS.

Well the difference is that one of the loads is removed by dse1:


**scanning insn=26
cselib lookup (reg/f:SI 102 sfp) => 8:8
cselib lookup (reg/f:SI 126) => 9:4339
cselib lookup (reg/f:SI 102 sfp) => 8:8
cselib lookup (reg/f:SI 102 sfp) => 8:8
cselib lookup (plus:SI (reg/f:SI 102 sfp)
(const_int -8 [0xfff8])) => 9:4339
  mem: (plus:SI (reg/f:SI 102 sfp)
(const_int -8 [0xfff8]))

   after canon_rtx address: (plus:SI (reg/f:SI 102 sfp)
(const_int -8 [0xfff8]))
  gid=1 offset=-8
 processing const load gid=1[-8..-4)
trying to replace SImode load in insn 26 from SImode store in insn 16
deferring rescan insn with uid = 26.
deferring rescan insn with uid = 77.
 -- replaced the loaded MEM with (reg 144)
mems_found = 0, cannot_delete = true
cselib lookup (mem/c:SI (plus:SI (reg/f:SI 102 sfp)
(const_int -8 [0xfff8])) [1 curD.6314+0 S4 A64]) => 0:0

[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out

2019-11-15 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462

--- Comment #19 from Wilco  ---
(In reply to Richard Biener from comment #18)
> So I see before DSE1:
> 
> (insn 16 15 17 2 (set (mem/c:SI (plus:SI (reg/f:SI 102 sfp)
> (const_int -8 [0xfff8])) [1 cur+0 S4 A64])
> (reg:SI 119 [ _13 ])) "t.ii":46:14 241 {*arm_movsi_insn}
>  (expr_list:REG_DEAD (reg:SI 119 [ _13 ])
> (nil)))
> (insn 17 16 18 2 (set (reg/f:SI 125)
> (plus:SI (reg/f:SI 102 sfp)
> (const_int -8 [0xfff8]))) "t.ii":48:26 7
> {*arm_addsi3}
>  (nil))
> (insn 18 17 19 2 (set (reg/f:SI 120 [ _14 ])
> (plus:SI (reg/f:SI 125)
> (reg/v:SI 118 [ offset ]))) "t.ii":48:26 7 {*arm_addsi3}
>  (nil))
> (insn 19 18 21 2 (set (reg:SI 126)
> (const_int 1 [0x1])) "t.ii":48:26 241 {*arm_movsi_insn}
>  (nil))
> (insn 21 19 22 2 (set (mem:QI (plus:SI (reg/f:SI 125)
> (reg/v:SI 118 [ offset ])) [0 *_14+0 S1 A8])
> (subreg:QI (reg:SI 126) 0)) "t.ii":48:26 251 {*arm_movqi_insn}
>  (expr_list:REG_DEAD (reg:SI 126)
> (nil)))
> (insn 22 21 23 2 (set (reg:SI 113 [ prephitmp_6 ])
> (mem/c:SI (plus:SI (reg/f:SI 102 sfp)
> (const_int -8 [0xfff8])) [1 cur+0 S4 A64]))
> "t.ii":50:18 241 {*arm_movsi_insn}
>  (nil))
> 
> where DSE1 then replaces the load in insn 22 with (reg:SI 119) (via a new
> pseudo).  But clearly insn 21 clobbers it (using alias-set zero).  This
> is on GIMPLE
> 
>   cur = _13;
>   _14 = &cur + offset_12;
>   *_14 = 1;
>   pretmp_22 = cur;
> 
> not sure why DSE1 performs CSE (heh), but clearly what it does is wrong.
> And yeah, probably tree PRE "enables" this miscompilation.  DSE has
> some special case with frame pointer bases and constant offsets.
> 
> But the target dependency might come in via section anchors since I see
> 
> **scanning insn=21
>   mem: (plus:SI (reg/f:SI 125)
> (reg/v:SI 118 [ offset ]))
> 
>after canon_rtx address: (plus:SI (plus:SI (reg/f:SI 102 sfp)
> (reg/v:SI 118 [ offset ]))
> (const_int -8 [0xfff8]))
> 
>after cselib_expand address: (plus:SI (plus:SI (and:SI (symbol_ref:SI
> ("*.LANCHOR0") [flags 0x182])
> (const_int 3 [0x3]))
> (reg/f:SI 102 sfp))
> (const_int -8 [0xfff8]))
> 
>after canon_rtx address: (plus:SI (plus:SI (and:SI (symbol_ref:SI
> ("*.LANCHOR0") [flags 0x182])
> (const_int 3 [0x3]))
> (reg/f:SI 102 sfp))
> (const_int -8 [0xfff8]))
>   varying cselib base=11:1018534969 offset = -8
>  processing cselib store [-8..-7)
> mems_found = 1, cannot_delete = false
> 
> where it's odd that we end up with an address like this?  I suspect that
> base_alias_check (my special friend) disambiguates the stack-based
> access with one that now appears as a *.LANCHOR0 based one.
> 
> We call canon_true_dependence with
> 
> (plus:SI (value:SI 11:1018534969 @0x3089c60/0x30f5ed0)
> (const_int -8 [0xfff8]))
> 
> for this and get_addr turns it into
> 
> (plus:SI (plus:SI (and:SI (symbol_ref:SI ("*.LANCHOR0") [flags 0x182])
> (const_int 3 [0x3]))
> (value:SI 8:8 @0x3089c18/0x30f5e40))
> (const_int -8 [0xfff8]))
> 
> and indeed find_base_term returns
> 
> (symbol_ref:SI ("*.LANCHOR0") [flags 0x182])
> 
> for this.  Which "obviously" doesn't alias with the stack based address
> because that one is "unique_base_value_p" (address:SI -3) so we win here:
> 
> 2229  if (unique_base_value_p (x_base) || unique_base_value_p (y_base))
> 2230return 0;
> 
> now I wonder where that LANCHOR thing comes from.  arm folks?

Well when you do:

int *p = &local + ((intptr_t)&global & 3);

then you get the above expression when the global is addressed via an anchor.

But whether or not a target uses anchors doesn't matter - the example still
fails with -O1 -fno-section-anchors.

So I'm wondering whether the issue is in the special code that tries to
interpret an AND in an address, and mistakes the base as the symbol_ref
even though the real base is sfp.

[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out

2019-11-18 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462

--- Comment #23 from Wilco  ---
(In reply to Richard Biener from comment #22)
> Fixed on trunk.  Can arm people verify?  I checked the DSE dump only.  Bonus
> if you manage to create a testcase for the testsuite failing before, passing
> now.
> 
> The patch is simple enough to backport if it works.

I will have a look. But only checking the low bit is still way too dangerous.
Aliasing checks should be conservative and 100% accurate, not use random
heuristics which hope for the best. So if we want to keep the AND code then it
needs to look for a mask with all top bits set as the absolute minimum.
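
For reference, the conservative test I mean is essentially this (a sketch; the
function name and the stand-in typedef are mine):

#include <stdbool.h>

typedef unsigned long long hwi;   /* stands in for unsigned HOST_WIDE_INT */

/* Accept an AND mask as an alignment mask only when all top bits are set,
   i.e. the mask is -(1 << n) for some n.  Anything else must not be treated
   as a simple aligned version of the inner base.  */
static bool
alignment_mask_p (hwi mask)
{
  hwi low = ~mask;   /* must then be of the form (1 << n) - 1 */
  return mask != 0 && (low & (low + 1)) == 0;
}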

[Bug target/79262] [8/9/10 Regression] load gap with store gap causing performance regression in 462.libquantum

2019-11-19 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262

--- Comment #8 from Wilco  ---
Author: wilco
Date: Tue Nov 19 15:57:54 2019
New Revision: 278452

URL: https://gcc.gnu.org/viewcvs?rev=278452&root=gcc&view=rev
Log:
[AArch64] PR79262: Adjust vector cost

PR79262 has been fixed for almost all AArch64 cpus, however the example is
still vectorized in a few cases, resulting in lower performance.  Adjust the
vector cost slightly so that -mcpu=cortex-a53 now has identical performance to
-mcpu=cortex-a57 on libquantum.

gcc/
PR target/79262
* config/aarch64/aarch64.c (generic_vector_cost): Adjust
vec_to_scalar_cost.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/aarch64/aarch64.c

[Bug target/79262] [8/9/10 Regression] load gap with store gap causing performance regression in 462.libquantum

2019-11-19 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262

Wilco  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Wilco  ---
libquantum perf is now the same between Cortex-A53 and A57.

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2019-11-19 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 79262, which changed state.

Bug 79262 Summary: [8/9/10 Regression] load gap with store gap causing 
performance regression in 462.libquantum
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug c/92612] [10 Regression] Linker error in 525.x264_r after r278509

2019-11-21 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92612

--- Comment #2 from Wilco  ---
(In reply to Martin Liška from comment #1)
> Following patch fixes that:
> 
> diff --git a/benchspec/CPU/525.x264_r/src/ldecod_src/inc/configfile.h
> b/benchspec/CPU/525.x264_r/src/ldecod_src/inc/configfile.h
> index 12ed1cc8..450930e2 100644
> --- a/benchspec/CPU/525.x264_r/src/ldecod_src/inc/configfile.h
> +++ b/benchspec/CPU/525.x264_r/src/ldecod_src/inc/configfile.h
> @@ -18,9 +18,9 @@
>  //#define LEVEL_IDC   21
>  
>  
> -InputParameters cfgparams;
>  
>  #ifdef INCLUDED_BY_CONFIGFILE_C
> +InputParameters cfgparams;
>  // Mapping_Map Syntax:
>  // {NAMEinConfigFile,  &cfgparams.VariableName, Type, InitialValue,
> LimitType, MinLimit, MaxLimit, CharSize}
>  // Types : {0:int, 1:text, 2: double}
> @@ -59,6 +59,7 @@ extern Mapping Map[];
>  #endif
>  extern void JMDecHelpExit ();
>  extern void ParseCommand(InputParameters *p_Inp, int ac, char *av[]);
> +extern InputParameters cfgparams;
>  
>  #endif

(In reply to Martin Liška from comment #0)
> Since defaulting to -fno-common, one can't build the benchmark due to:

Yes, some SPEC benchmarks rely on tentative definitions. Given that it is a
portability issue and the source can't be fixed, the easiest way is to add
-fcommon to the config file for the failing cases. This is similar to other
workarounds like -fno-strict-aliasing.
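
The underlying pattern is the usual tentative-definition clash that -fcommon
papers over, e.g. (a sketch, not the actual x264 sources):

/* a.c */
int cfgparams_count;    /* tentative definition */

/* b.c */
int cfgparams_count;    /* merged silently with -fcommon, but a
                           "multiple definition" link error with -fno-common */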

[Bug target/91927] -mstrict-align doesn't prevent unaligned accesses at -O2 and -O3 on AARCH64 targets

2019-11-21 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91927

--- Comment #10 from Wilco  ---
(In reply to Andrew Pinski from comment #9)
> I think the following patch is the correct fix:
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index ad4676bc167..787323255cb 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -41,7 +41,7 @@
>  (define_expand "movmisalign"
>[(set (match_operand:VALL 0 "nonimmediate_operand")
>  (match_operand:VALL 1 "general_operand"))]
> -  "TARGET_SIMD"
> +  "TARGET_SIMD && !STRICT_ALIGNMENT"
>  {
>/* This pattern is not permitted to fail during expansion: if both
> arguments
>   are non-registers (e.g. memory := constant, which can be created by the
> 
> 
> Basically movmisalign should not be there if strict alignment as it is just
> emitting a set.

That fixes one of the examples but not pr71727. As the Arm movsi issues show,
there are many mid-end alignment bugs.

[Bug rtl-optimization/92637] runtime issue with -ftree-coalesce-vars

2019-11-23 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92637

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
(In reply to John Dong from comment #0)
> Created attachment 47338 [details]
> testsuite
> 
> hi, I compiled the attached with aarch64-linux-gnu-gcc -c -O2 
> -march=armv8.1-a testsuite.c -o testsuite.o, it had a runtime error, and I
> found x10 was overwritten at row 214.

This example doesn't show a problem. Do you have an example that can be run and
gives the runtime error?

[Bug c/85678] -fno-common should be default

2019-11-25 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85678

--- Comment #8 from Wilco  ---
(In reply to David Binderman from comment #7)
> (In reply to David Brown from comment #0)
> > Surely it is time to make "-fno-common" the default, at least when a modern
> > C standard is specified indicating that the code is modern?  People who need
> > the old behaviour can always get it with "-fcommon".
> 
> Interestingly, use of -std=c89 or -std=gnu89 doesn't also switch on -fcommon
> to get old behaviour.
> 
> So use of compiler flag indicating to compile code to old standards
> means implicit use of *new* standard for common.
> 
> Looks odd to me. Possible bug ?

If required, it would be feasible to keep the old behaviour for C89 indeed,
however -fno-common is not incompatible with C89 (embedded C compilers may not
even support -fcommon).

[Bug c/85678] -fno-common should be default

2019-11-25 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85678

--- Comment #12 from Wilco  ---
(In reply to David Brown from comment #11)

> Changing the default to "-fno-common" (and ideally
> "-Werror=strict-prototypes -Werror=old-style-declaration
> -Werror=missing-parameter-type") would have a lot smaller impact than
> changing the default standard.

Giving errors on old-style code by default sounds like a good idea. We could
add -std=legacy similar to Fortran to support building old K&R code (and that
would enable -fcommon by default).

[Bug target/92665] [AArch64] low lanes select not optimized out for vmlal intrinsics

2019-11-25 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92665

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #3 from Wilco  ---
(In reply to Andrew Pinski from comment #2)
> Created attachment 47356 [details]
> Patch which I wrote for GCC 7.3
> 
> I have to double check if it applies directly as I had other patches in this
> area but this is the patch which I had wrote for GCC 7.3.

I think it's because many intrinsics in arm_neon.h still use asm which inhibits
most optimizations.

[Bug target/92692] [9/10 Regression] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)

2019-11-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #3 from Wilco  ---
(In reply to Andrew Pinski from comment #2)
> I think this has been a latent bug since revision 243200:
> [AArch64] Separate shrink wrapping hooks implementation
> 
> I think aarch64_disqualify_components would be a location which should
> disqualify the Separate for the register 19.

What is the "exclusives reservation granule" size? It could only fail if the
granule is large and the spill happens to be in the same granule as the stxr.

I guess it's easy to fix by delaying the expansion or inserting a clobber of
x19 before the loop starts.

[Bug target/92692] [9/10 Regression] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)

2019-11-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692

--- Comment #5 from Wilco  ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Wilco from comment #3)
> > (In reply to Andrew Pinski from comment #2)
> > > I think this has been a latent bug since revision 243200:
> > > [AArch64] Separate shrink wrapping hooks implementation
> > > 
> > > I think aarch64_disqualify_components would be a location which should
> > > disqualify the Separate for the register 19.
> > 
> > What is the "exclusives reservation granule" size? It could only fail if the
> > granule is large and the spill happens to be in the same granule as the 
> > stxr.
> NO "exclusives reservation granule" does not matter here, please read the
> ARMv8 spec again copied below (B2-142):
> LoadExcl/StoreExcl loops are guaranteed to make forward progress only if,
> for any LoadExcl/StoreExcl loop
> within a single thread of execution, the software meets all of the following
> conditions:
> 1 Between the Load-Exclusive and the Store-Exclusive, there are no
> explicit memory accesses,
> preloads, direct or indirect System register writes, address translation
> instructions, cache or TLB
> maintenance instructions, exception generating instructions, exception
> returns, or indirect branches.
> --- CUT 
> 
> no explicit memory accesses
> Is a requirement so it does not matter what "exclusives reservation granule"
> size is really.
> We had gone through this beforehand with the ARM architectures and made sure
> that the specifications was worded correctly to the above effect.  The
> wording change happened in 2016.

Well I'm looking at the latest version
(https://static.docs.arm.com/ddi0487/ea/DDI0487E_a_armv8_arm.pdf) where in
figure B2-5 it explicitly states that a store that does not match the
reservation granule on the same CPU must not change the exclusive state.

However if a store does match the granule it is implementation defined, hence
the reason for the text you quote to guarantee forward progress - otherwise a
random store in the loop could accidentally match the exclusive granule and
block progress. However I don't see it saying anywhere that all stores must
clear the exclusive state.

[Bug driver/89014] Use-after-free in aarch64 -march=native

2019-11-29 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89014

--- Comment #9 from Wilco  ---
Author: wilco
Date: Fri Nov 29 17:22:30 2019
New Revision: 278854

URL: https://gcc.gnu.org/viewcvs?rev=278854&root=gcc&view=rev
Log:
aarch64: fix use-after-free in -march=native (PR driver/89014)

Running:
  $ valgrind ./xgcc -B. -c test.c -march=native
on aarch64 shows a use-after-free in host_detect_local_cpu due
to the std::string result of aarch64_get_extension_string_for_isa_flags
only living until immediately after a c_str call.

This leads to corrupt "-march=" values being passed to cc1.

This patch fixes the use-after-free, though it appears to also need
Tamar's patch here:
  https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01302.html
in order to generate valid values for cc1.  This may have worked by
accident in the past, if the corrupt "-march=" value happened to be
0-terminated in the "right" place; with this patch it now appears
to reliably break without Tamar's patch.

Backport from mainline
2019-01-23  David Malcolm  

PR driver/89014
* config/aarch64/driver-aarch64.c (host_detect_local_cpu): Fix
use-after-free of the result of
aarch64_get_extension_string_for_isa_flags.

Modified:
branches/gcc-8-branch/gcc/ChangeLog
branches/gcc-8-branch/gcc/config/aarch64/driver-aarch64.c

[Bug tree-optimization/92738] [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29

2019-12-02 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #7 from Wilco  ---
(In reply to Martin Liška from comment #6)
> So wrf grew starting with r271377, size (w/o debug info) goes from 20164464B
> to 23674792.

Also check the build time of wrf. Looking at my logs, trunk takes 2x as long to
build it since June.

[Bug tree-optimization/92738] [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29

2019-12-02 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738

--- Comment #9 from Wilco  ---
(In reply to Martin Liška from comment #8)
> (In reply to Wilco from comment #7)
> > (In reply to Martin Liška from comment #6)
> > > So wrf grew starting with r271377, size (w/o debug info) goes from 
> > > 20164464B
> > > to 23674792.
> > 
> > Also check the build time of wrf. Looking at my logs trunk takes 2x as long
> > to build it since June.
> 
> Maybe related to:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91509
> ?

I think not, this is plain -Ofast, no LTO or prefetching. The same slowdown
happens with -O2.

[Bug tree-optimization/92738] [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29

2019-12-02 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738

--- Comment #11 from Wilco  ---
(In reply to Thomas Koenig from comment #10)
> (In reply to Martin Liška from comment #6)
> > So wrf grew starting with r271377, size (w/o debug info) goes from 20164464B
> > to 23674792.
> 
> I think we've had this discussion before, although I cannot offhand
> recall the PR number.  PR91512 is closely related.
> 
> Since r271377, arguments which may be contiguous are now (conditionally)
> packed and unpacked inline (see PR88821).
> 
> This was done so that the middle end can look into these conversions
> and possibly eliminate them, if it can be determined via inlining
> or LTO that the argument is contiguous anyway). This can lead to an
> extremely large performance boost for some test cases (*10 or more),
> but will, in general, lead to a size increase.
> 
> Now, wrf has an extremely strange (and rare) programming style where they
> pass
> a ton of assumed shape arguments (where it is not clear, at compile-time,
> if they need packing/unpacking) to an old-style array argument. This
> causes considerable code size increase.
> 
> So, it's a tradeoff, which we discussed at the time. This is why this
> is not done at -Os.
> 
> Should we "fix" this? I think not, the style of wrf is just too horrid,
> and pessimizing other programs for the sake of one benchmark makes little
> sense to me.

Would using -frepack-arrays solve this issue? I proposed making that the
default a while back. It would do any repacking that is necessary at call sites
rather than creating multiple copies of all loops in every function.

[Bug tree-optimization/92822] [10 Regression] vfma_laneq_f32 and vmul_laneq_f32 are broken on aarch64 after r278938

2019-12-12 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92822

Wilco  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org,
   ||wilco at gcc dot gnu.org
  Component|target  |tree-optimization

--- Comment #4 from Wilco  ---
(In reply to nsz from comment #2)
> e.g.
> 
> #include 
> 
> float32x2_t
> foo (float32x2_t v0, float32x4_t v1)
> {
>   return vmulx_laneq_f32 (v0, v1, 0);
> }
> 
> used to get translated to
> 
> foo:
> fmulx   v0.2s, v0.2s, v1.s[0]
> ret
> 
> now it is
> 
> foo:
>   adrpx0, .LC0
>   ldr q2, [x0, #:lo12:.LC0]
>   tbl v1.16b, {v1.16b}, v2.16b
>   fmulx   v0.2s, v0.2s, v1.2s
>   ret
>   .size   foo, .-foo
>   .section.rodata.cst16,"aM",@progbits,16

Yes the change inserts a VEC_PERM_EXPR with random values for the upper lanes
which becomes a TBL instruction. It happens when you extract a lane from a
128-bit vector and then dup it to a 64-bit vector. Optimized tree before:

foo (float32x2_t v0, float32x4_t v1)
{
  float _4;
  __Float32x2_t _5;
  __Float32x2_t _6;

   [local count: 1073741824]:
  __builtin_aarch64_im_lane_boundsi (16, 4, 0);
  _4 = BIT_FIELD_REF ;
  _5 = {_4, _4};
  _6 = __builtin_aarch64_fmulxv2sf (v0_2(D), _5); [tail call]
  return _6;
}

And after r278938:

foo (float32x2_t v0, float32x4_t v1)
{
  __Float32x2_t _4;
  __Float32x2_t _7;
  __Float32x4_t _8;

   [local count: 1073741824]:
  __builtin_aarch64_im_lane_boundsi (16, 4, 0);
  _8 = VEC_PERM_EXPR ;
  _7 = BIT_FIELD_REF <_8, 64, 0>;
  _4 = __builtin_aarch64_fmulxv2sf (v0_2(D), _7); [tail call]
  return _4;
}

[Bug rtl-optimization/93007] New: [10 regression] pr77698.c testcase fails due to block commoning

2019-12-19 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93007

Bug ID: 93007
   Summary: [10 regression] pr77698.c testcase fails due to block
commoning
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

Since r276960 we see this failure on Arm:

FAIL: gcc.dg/tree-prof/pr77698.c scan-rtl-dump-times alignments "internal loop
alignment added" 1

The issue appears to be that basic block commoning works on an unrolled loop,
which is unlikely to be beneficial for performance:

.L17:
adds    r0, r0, #1
b   .L27
.L6:
ldr r4, [r2, #12]
adds    r0, r0, #4
ldr lr, [r1]
str lr, [r3, r4, lsl #2]
ldr r4, [r2, #12]
ldr lr, [r1]
str lr, [r3, r4, lsl #2]
ldr r4, [r2, #12]
ldr lr, [r1]
str lr, [r3, r4, lsl #2]
.L27:
ldr r4, [r2, #12]
cmp ip, r0
ldr lr, [r1]
str lr, [r3, r4, lsl #2]
bne .L6
pop {r4, pc}

The test could be easily fixed, but ensuring block commoning takes loops and
execution frequencies into account would be better overall.

[Bug tree-optimization/93023] give preference to address iv without offset in ivopts

2019-12-23 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93023

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #1 from Wilco  ---
(In reply to Feng Xue from comment #0)

> Analysis into ivopts shows that those address IVs have same in-loop cost,
> and IV w/o offset does have smaller pre-loop setup cost. But since the setup
> cost will be averaged to each iteration, the minor cost difference will go
> due to round-off by integer division. To fix this round-off error, cost can
> be represented in a more accurate way, such as adding a fraction part to
> make it a fixpoint number.

It would be easy to adjust the costs. There are a few places where address
costs (multiples of 1) are incorrectly mixed with rtx costs (multiples of 4),
so adjusting address costs should improve things.

However I suspect the problem is also caused by using overly high iteration
estimates and not taking codesize into account. If the loop costs are identical
then you obviously choose the variant with the fewest instructions.
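
One way to implement the fixed-point idea quoted above, as a sketch (the scale
and the function name are mine):

/* Keep a binary fraction in the averaged cost so a small pre-loop setup
   cost difference is not rounded away by the integer division.  */
#define COST_FRAC_BITS 6

static int
scaled_average_cost (int setup_cost, int iter_cost, int niters)
{
  return (iter_cost << COST_FRAC_BITS)
         + ((setup_cost << COST_FRAC_BITS) + niters / 2) / niters;
}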

[Bug tree-optimization/90838] Detect table-based ctz implementation

2020-01-10 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

--- Comment #8 from Wilco  ---
Author: wilco
Date: Fri Jan 10 19:32:53 2020
New Revision: 280132

URL: https://gcc.gnu.org/viewcvs?rev=280132&root=gcc&view=rev
Log:
PR90838: Support ctz idioms

Support common idioms for count trailing zeroes using an array lookup.
The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
constant which when multiplied by a power of 2 creates a unique value
in the top 5 or 6 bits.  This is then indexed into a table which maps it
to the number of trailing zeroes.  When the table is valid, we emit a
sequence using the target defined value for ctz (0):

int ctz1 (unsigned x)
{
  static const char table[32] =
{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}

Is optimized to:

rbit    w0, w0
clz w0, w0
and w0, w0, 31
ret

gcc/
PR tree-optimization/90838
* tree-ssa-forwprop.c (check_ctz_array): Add new function.
(check_ctz_string): Likewise.
(optimize_count_trailing_zeroes): Likewise.
(simplify_count_trailing_zeroes): Likewise.
(pass_forwprop::execute): Try ctz simplification.
* match.pd: Add matching for ctz idioms.

testsuite/
PR tree-optimization/90838
* testsuite/gcc.target/aarch64/pr90838.c: New test.

Added:
trunk/gcc/testsuite/gcc.target/aarch64/pr90838.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/match.pd
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-forwprop.c

[Bug bootstrap/93229] simplify_count_trailing_zeroes doesn't compile on x86_64-pc-linux-gnu

2020-01-10 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93229

--- Comment #2 from Wilco  ---
(In reply to David Malcolm from comment #0)
> A pristine checkout of r280132 doesn't build for me on x86_64-pc-linux-gnu:
> 
> ../../src/gcc/tree-ssa-forwprop.c: In function ‘bool
> simplify_count_trailing_zeroes(gimple_stmt_iterator*)’:
> ../../src/gcc/config/i386/i386.h:2886:30: error: cannot convert
> ‘poly_uint16’ {aka ‘poly_int<1, short unsigned int>’} to ‘long int’ in
> assignment
>  2886 |  ((VALUE) = GET_MODE_BITSIZE (MODE), TARGET_BMI ? 1 : 0)
>   | ~^~
>   |  |
>   |  poly_uint16 {aka poly_int<1, short
> unsigned int>}
> ../../src/gcc/tree-ssa-forwprop.c:1925:22: note: in expansion of macro
> ‘CTZ_DEFINED_VALUE_AT_ZERO’
>  1925 |   bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type),
> ctzval) == 2;
>   |  ^
> 
> 
> I'm assuming this was introduced in r280132, as that commit introduced
> simplify_count_trailing_zeroes.
> 
> Am using gcc-9.2.1-1.fc30.x86_64 to try to build stage 1.

That's odd, it shouldn't be using any poly types on x86... Machmode.h has:

#if ONLY_FIXED_SIZE_MODES
#define GET_MODE_BITSIZE(MODE) ((unsigned short) mode_to_bits (MODE).coeffs[0])
#else

That's doing the correct thing if ONLY_FIXED_SIZE_MODES is defined.

[Bug bootstrap/93229] simplify_count_trailing_zeroes doesn't compile on x86_64-pc-linux-gnu

2020-01-10 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93229

--- Comment #4 from Wilco  ---
(In reply to David Malcolm from comment #3)
> Apparently broken on other archs too, and for other people; from #gcc:
> 
>  nathan: I assume it's not just broken for me; I'm somewhat
> sleep-deprived here
>  dmalcolm: broke PPC
>  correct.  I guess it's somewhat target-specific, otherwise wilco
> would have hit it
>  Rhy0lite, nathan: thanks
>  x86_64 for me, in case not obvious
>  breaks: ~{ARM,AArch64}

Hmm, so maybe it's GET_MODE_BITSIZE that fails somehow? AArch64/Arm have always
used GET_MODE_UNIT_BITSIZE. Looking at the difference, that seems to do
something special for complex types.

[Bug tree-optimization/93231] [10 Regression] ICEs since r280132

2020-01-13 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93231

--- Comment #4 from Wilco  ---
(In reply to Jakub Jelinek from comment #0)
> int ctz2 (int x)
> {
>   static const char table[32] =
> {
>   0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
>   31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
> };
> 
>   return table[((int)((x & -x) * -0x077CB531)) >> 27];
> }
> 
> ICEs because
>unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
> is used without checking that the INTEGER_CST fits into uhwi.
> If we don't want to support signed x, we should start the function by
> verification that inner_type is INTEGRAL_TYPE_P which is TYPE_UNSIGNED, if
> we do want to support even signed values, it needs to be tweaked differently.

I guess TREE_INT_CST_LOW should fix that. The goal is to support signed and
unsigned types.

> Similarly,
> int ctz3 (unsigned x)
> {
>   static const char table[32] =
> {
>   0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
>   31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
> };
> 
>   return table[((unsigned)((x & -x) * 0x077CB531U)) >> -27];
> }
> ICEs because -27 doesn't fit into uhwi.  We should just punt if it doesn't.

I'm adding an extra tree_fits_uhwi for that.

> I'm also surprised by
>   /* Check the array is not wider than integer type and the input is a 32-bit
>  or 64-bit type.  */
>   if (TYPE_PRECISION (type) > 32)
> return false;
> because the comment doesn't match what the check is doing, either you want
> an array with 32-bit or smaller elts, then the comment should match that, or
> you care about integers and then you should compare against TYPE_PRECISION
> (integer_type_node).

I'll check but I think this check is no longer required.

> Also, there is no testcase for the string case, nor any non-target specific
> testcase that it at least compiles and perhaps with tree dump scan on 
> selected > targets that it recognizes the ctz.

I can add a generic testcase as well.

> And I don't see a check that if it is a STRING_CST, the array elements must 
> be 
> bytes and not wider, which the function assumes (e.g. if it is u"..."[(x & 
> -x) > * ...) >> 27]).

Right now "abc"[] can't match - the constructor always returns an error for
this case. And this doesn't seem a common idiom so adding support isn't useful.

[Bug tree-optimization/93231] [10 Regression] ICEs since r280132

2020-01-13 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93231

Wilco  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org

--- Comment #6 from Wilco  ---
Patch at https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00731.html

[Bug target/92692] [9/10 Regression] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)

2020-01-15 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692

--- Comment #10 from Wilco  ---
(In reply to Jakub Jelinek from comment #9)
> Any -march= or similar?  Can't reproduce with current trunk, nor
> with even Oct 10 GCC snapshot (crosses in both cases).
> grep -B1 stxr pr92692.s
> doesn't show any stores before stxr.

It's going to be extremely sensitive to register allocation, so it's not clear
it's worth trying to reproduce.

The easiest option is to see whether replacing "reload_completed" with
"epilogue_completed" in aarch64/atomics.md works fine.

[Bug target/92692] [9/10 Regression] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)

2020-01-15 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692

Wilco  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org

--- Comment #13 from Wilco  ---
(In reply to Wilco from comment #10)

> The easiest option is to see whether replacing "reload_completed" with
> "epilogue_completed" in aarch64/atomics.md works fine.

That works indeed, I'll post a patch.

[Bug tree-optimization/93231] [10 Regression] ICEs since r280132

2020-01-16 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93231

Wilco  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Wilco  ---
Fixed.

[Bug target/71727] -O3 -mstrict-align produces code which assumes unaligned vector accesses work

2020-01-19 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71727

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #8 from Wilco  ---
(In reply to Martin Liška from comment #7)
> Christophe: Can the bug be marked as resolved?

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91927 - it's still failing
since the underlying issues haven't been resolved.

[Bug target/92692] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)

2020-01-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692

Wilco  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED
   Target Milestone|9.3 |8.4
Summary|[9 Regression] Saving off   |Saving off the callee saved
   |the callee saved register   |register between ldxr/stxr
   |between ldxr/stxr (caused   |(caused by shrink wrapping
   |by shrink wrapping  |improvements)
   |improvements)   |

--- Comment #20 from Wilco  ---
Backported to GCC8 and GCC9, fixed on all active branches.

[Bug middle-end/64242] Longjmp expansion incorrect

2020-01-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242

--- Comment #37 from Wilco  ---
(In reply to Andrew Pinski from comment #36)
> MIPS is still broken.  I might look into MIPS brokenness next week.

Yes it seems builtin_longjmp has the exact same fp corruption issue:

move    $fp,$17
lw  $sp,8($fp)
jr  $16
lw  $28,12($fp)

[Bug rtl-optimization/93565] New: Combine duplicates count trailing zero instructions

2020-02-04 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Bug ID: 93565
   Summary: Combine duplicates count trailing zero instructions
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

Created attachment 4
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=4&action=edit
ctz_duplication

The attached example causes Combine to duplicate count trailing zero
instructions on targets which have CTZ_DEFINED_VALUE_AT_ZERO = 2:

f:
cbz x0, .L2
rbit    x2, x0
rbit    x0, x0
clz x2, x2
clz x0, x0
ldr w2, [x1, x2, lsl 2]
orr w0, w2, w0
str w0, [x1]
.L2:
mov x0, 0
ret

The cause is Combine deciding to merge CTZ into a sign-extend:

(insn 10 9 12 3 (set (reg:DI 100)
(ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":17:15 689 {ctzdi2}
 (expr_list:REG_DEAD (reg/v:DI 98 [ x ])
(nil)))
(insn 12 10 14 3 (set (reg:DI 101 [ _9 ])
(sign_extend:DI (subreg:SI (reg:DI 100) 0))) "ctz2.c":25:15 104
{*extendsidi2_aarch64}
 (nil))

allowing combination of insns 10 and 12
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i210: r100:DI=ctz(r98:DI)
deferring rescan insn with uid = 10.
modifying insn i312: r101:DI=ctz(r98:DI)
  REG_DEAD r98:DI

(insn 10 9 12 3 (set (reg:DI 100)
(ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":17:15 689 {ctzdi2}
 (nil))
(insn 12 10 14 3 (set (reg:DI 101 [ _9 ])
(ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":25:15 689 {ctzdi2}
 (expr_list:REG_DEAD (reg/v:DI 98 [ x ])
(nil)))

Later passes then seem unable to CSE the two identical CTZ instructions...

This doesn't seem right - if Combine can optimize the sign-extend away then
there is no point in duplicating the CTZ.
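
A source along these lines reproduces it (my reconstruction from the assembly,
not the actual attachment):

long
f (long x, int *p)
{
  if (x)
    {
      /* The int result of __builtin_ctzl is sign-extended for the array
         index, and combine merges that extension back into the ctz.  */
      int i = __builtin_ctzl (x);
      p[0] = p[i] | i;
    }
  return 0;
}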

[Bug rtl-optimization/93565] Combine duplicates count trailing zero instructions

2020-02-05 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #3 from Wilco  ---
(In reply to Segher Boessenkool from comment #2)
> Of course it first tried to do
> 
> Failed to match this instruction:
> (parallel [
> (set (reg:DI 101 [ _9 ])
> (ctz:DI (reg/v:DI 98 [ x ])))
> (set (reg:DI 100)
> (ctz:DI (reg/v:DI 98 [ x ])))
> ])
> 
> so we could try to do that as just the ctz and then a register move,
> and hope that move can be optimised away.  But this is more expensive
> if it can *not* be optimised (higher latency).  Hrm.

Yes if a sign/zero-extend is proven to be redundant, it should be replaced with
a move - it's unlikely it could not be removed either by Combine or during
register allocation.

It seems to me this could happen with any instruction pair where it decides to
forward substitute, but keep the original instruction. If the costs are
identical, it's better to replace the 2nd instruction with a move. Would it
already do this if say we counted moves as somewhat lower cost than ALU
instructions?

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-11 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Wilco  changed:

   What|Removed |Added

 CC||segher at kernel dot crashing.org
Summary|Combine duplicates count|[9/10 regression] Combine
   |trailing zero instructions  |duplicates instructions

--- Comment #8 from Wilco  ---
Here is a much simpler example:

void f (int *p, int y)
{
  int a = y & 14;
  *p = a | p[a];
}

Trunk and GCC9.1 for x64:
mov eax, esi
and esi, 14
and eax, 14
or  eax, DWORD PTR [rdi+rsi*4]
mov DWORD PTR [rdi], eax
ret

and AArch64:
and x2, x1, 14
and w1, w1, 14
ldr w2, [x0, x2, lsl 2]
orr w1, w2, w1
str w1, [x0]
ret

However GCC8.2 does:
and w1, w1, 14
ldr w2, [x0, w1, sxtw 2]
orr w2, w2, w1
str w2, [x0]
ret

So it is a 9 regression...

[Bug target/92692] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)

2020-02-28 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692

--- Comment #22 from Wilco  ---
(In reply to Sebastian Pop from comment #21)
> It looks like this hunk from the trunk version of the patch is missing on
> gcc-9 branch:
> 
> diff --git a/gcc/config/aarch64/atomics.md b/gcc/config/aarch64/atomics.md
> index cabcc58f1a0..1458bc00095 100644
> --- a/gcc/config/aarch64/atomics.md
> +++ b/gcc/config/aarch64/atomics.md
> @@ -104,7 +104,7 @@
> (clobber (match_scratch:SI 7 "=&r"))]
>""
>"#"
> -  "&& reload_completed"
> +  "&& epilogue_completed"
>[(const_int 0)]
>{
>  aarch64_split_compare_and_swap (operands);
> 
> 
> 
> With this hunk applied my bootstrap passes on the gcc-9 branch on an
> aarch64-linux graviton2.
> 
> Without this hunk I see an error in thread sanitizers.
> 
> I also have checked gcc-8 release branch and it seems that the patch is not
> missing any hunks in that branch.
> 
> Could somebody apply the missing hunk to the gcc-9 release branch?  Thanks!

I don't see anything like that on the gcc-9 branch - are you sure you don't
have an outstanding change somehow?

[Bug target/91598] [8/9/10 regression] 60% speed drop on neon intrinsic loop

2020-03-03 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598

--- Comment #4 from Wilco  ---
Fixing vmull_lane_s16 and vmlal_lane_s16 to avoid inline assembler gives this
schedule which runs 63% faster on Cortex-A53:

ldr d2, [x6, x0]
ldr d4, [x6, x3]
ldr d3, [x6, x2]
smull   v2.4s, v2.4h, v0.4h[0]
ldr d1, [x6, x1]
smull   v4.4s, v4.4h, v0.4h[1]
ldr d16, [x7, x3]
smull   v3.4s, v3.4h, v0.4h[3]
ldr d7, [x7, x2]
smull   v1.4s, v1.4h, v0.4h[0]
ldr d6, [x7, x1]
ldr d5, [x7, x0]
smlal   v4.4s, v16.4h, v0.4h[3]
ldr d16, [x4, x3]
smlal   v3.4s, v7.4h, v0.4h[2]
ldr d7, [x4, x2]
smlal   v1.4s, v6.4h, v0.4h[1]
ldr d6, [x4, x1]
smlal   v2.4s, v5.4h, v0.4h[1]
ldr d5, [x4, x0]
smlal   v4.4s, v16.4h, v0.4h[3]
ldr d16, [x5, x3]
smlal   v3.4s, v7.4h, v0.4h[0]
ldr d7, [x5, x2]
smlal   v1.4s, v6.4h, v0.4h[2]
ldr d6, [x5, x1]
smlal   v2.4s, v5.4h, v0.4h[2]
ldr d5, [x5, x0]
smlal   v4.4s, v16.4h, v0.4h[3]
smlal   v3.4s, v7.4h, v0.4h[3]
smlal   v1.4s, v6.4h, v0.4h[2]
smlal   v2.4s, v5.4h, v0.4h[0]
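
For reference, the non-asm form of the lane intrinsics is just a dup plus a
widening multiply, e.g. (a sketch of the idea, not the actual arm_neon.h
change; an AArch64 target is assumed):

#include <arm_neon.h>

/* Equivalent to vmull_lane_s16 (a, b, 0): widening multiply of a by
   lane 0 of b.  Written with other intrinsics so the compiler can
   schedule it freely.  */
static inline int32x4_t
vmull_lane0_s16_sketch (int16x4_t a, int16x4_t b)
{
  return vmull_s16 (a, vdup_lane_s16 (b, 0));
}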

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2018-12-07 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Wilco  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #8 from Wilco  ---
(In reply to Jiangning Liu from comment #0)
> For the small case below, GCC -O3 can't vectorize the small loop to do byte
> comparison in func2.
> 
> void *malloc(long unsigned int);
> typedef struct {
> unsigned char *buffer;
> } data;
> 
> static unsigned char *func1(data *d)
> {
> return d->buffer;
> }
> 
> static int func2(int max, int pos, unsigned char *cur)
> {
> unsigned char *p = cur + pos;
> int len = 0;
> while (++len != max)
> if (p[len] != cur[len])
> break;
> return cur[len];
> }
> 
> int main (int argc) {
> data d;
> d.buffer = malloc(2*argc);
> return func2(argc, argc, func1(&d));
> }
> 
> At the moment, the following code is generated for this loop,
> 
>   4004d4:   38616862    ldrb    w2, [x3,x1]
>   4004d8:   6b5f        cmp     w2, w0
>   4004dc:   54a1        b.ne    4004f0 
>   4004e0:   38616880    ldrb    w0, [x4,x1]
>   4004e4:   6b01027f    cmp     w19, w1
>   4004e8:   91000421    add     x1, x1, #0x1
>   4004ec:   5441        b.ne    4004d4 
> 
> In fact, this loop can be vectorized by checking if the comparison size is
> aligned to SIMD register length. It may introduce run time overhead, but
> cost model could make decision on doing it or not.

The only optimization that can be done here is unrolling. For this kind of
string matching the number of bytes that match will be small on average, so
even if vectorization were feasible, the startup overhead alone would kill
performance. With unrolling you can remove the comparison with max each
iteration and do 4 bytes per iteration like this:

loop4:
ldrb    w2, [x3,4]!
ldrb    w0, [x4,4]!
cmp     w0, w2
bne     exitloop1
ldrb    w2, [x3,1]
ldrb    w0, [x4,1]
cmp     w0, w2
bne     exitloop2
ldrb    w2, [x3,2]
ldrb    w0, [x4,2]
cmp     w0, w2
bne     exitloop3
ldrb    w2, [x3,3]
ldrb    w0, [x4,3]
cmp     w0, w2
bne     exitloop4
add     x1, x1, 4
cmp     x1, w19
blt     loop4
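
In C terms the same 4x unrolling looks roughly like this (a sketch; the tail
handling is simplified and the names are taken from the original report):

static int
func2_unrolled (int max, int pos, unsigned char *cur)
{
  unsigned char *p = cur + pos;
  int len = 0;

  /* Four byte compares per iteration; max is checked once per chunk.  */
  while (len + 4 < max)
    {
      if (p[len + 1] != cur[len + 1]) return cur[len + 1];
      if (p[len + 2] != cur[len + 2]) return cur[len + 2];
      if (p[len + 3] != cur[len + 3]) return cur[len + 3];
      if (p[len + 4] != cur[len + 4]) return cur[len + 4];
      len += 4;
    }

  /* Remaining tail, as in the original loop.  */
  while (++len != max)
    if (p[len] != cur[len])
      break;
  return cur[len];
}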

[Bug middle-end/64242] Longjmp expansion incorrect

2018-12-07 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242

--- Comment #21 from Wilco  ---
(In reply to Rainer Orth from comment #20)
> The new testcase also FAILs on sparc-sun-solaris2.11 (both 32 and 64-bit):
> 
> +FAIL: gcc.c-torture/execute/pr64242.c   -O2  execution test
> +FAIL: gcc.c-torture/execute/pr64242.c   -O2 -flto  execution test
> +FAIL: gcc.c-torture/execute/pr64242.c   -O2 -flto -flto-partition=none 
> execution test
> +FAIL: gcc.c-torture/execute/pr64242.c   -O3 -g  execution test
> +FAIL: gcc.c-torture/execute/pr64242.c   -Os  execution test
> 
> Thread 2 received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1 (LWP 1)]
> 0x0008 in ?? ()
> (gdb) where
> #0  0x0008 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> Single-stepping, I find that this happens at the very end of main:
> 
> 1: x/i $pc
> => 0x10de4 :  return  %i7 + 8
> (gdb) 
> 0x00010de8 in main ()
> at
> /vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.c-torture/execute/pr64242.c:50
> 50return 0;
> 1: x/i $pc
> => 0x10de8 :  nop 
> (gdb) 
> 0x0008 in ?? ()
> 1: x/i $pc
> => 0x8: 
> 
> Obviously the stack is corrupted beyond repair.  I tried to avoid this by
> replacing the return 0 with exit (0) to no avail.

My latest patch detects this stack corruption with 100% certainty again, see
https://gcc.gnu.org/ml/gcc-patches/2018-12/msg00459.html. However sparc has a
custom nonlocal_goto MD pattern which would need fixing too.

[Bug middle-end/88560] [9 Regression] armv8_2-fp16-move-1.c and related regressions after r260385

2018-12-20 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88560

Wilco  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2018-12-20
 CC||wilco at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #2 from Wilco  ---
E.g. before:

test_load_store_1:
ldrh    r3, [r2, r1, lsl #1]    @ __fp16
strh    r3, [r0, r1, lsl #1]    @ __fp16
bx  lr

after:

test_load_store_1:
vmov.f16    s0, r3              @ __fp16
ldrh    r3, [r2, r1, lsl #1]    @ __fp16
strh    r3, [r0, r1, lsl #1]    @ __fp16
bx  lr

Inserting spurious extra moves certainly doesn't look correct.

[Bug target/86891] [9 Regression] wrong code with -O -frerun-cse-after-loop -fno-tree-dominator-opts -fno-tree-fre

2018-12-20 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86891

Wilco  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2018-12-20
 CC||wilco at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #3 from Wilco  ---
(In reply to Jakub Jelinek from comment #1)

> Now, looking at what aarch64 does for add with carry, there are separate
> patterns like add3_carryinC which set CC_C mode and use zero_extend
> and add3_carryinV which sets CC_V mode and uses sign_extend.
> So, shouldn't sub3_carryin{C,V} be split similarly and if we check
> carry flag, we should use subdi3_carryinC?

Yes it looks like the pattern confuses signed and unsigned underflow. Changing
it to zero_extend and using minus for the compare fixes the reported issue, but
it's not possible to support signed and unsigned in a single pattern.

[Bug target/86891] [9 Regression] __builtin_sub_overflow incorrect for unsigned types

2019-01-03 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86891

--- Comment #5 from Wilco  ---
(In reply to Richard Earnshaw from comment #4)
> Yes, the extension should be zero-extend, not sign extend.  The plus
> operation is correct, however, since decrementing the first operand could
> lead to underflow if it was zero.  So the correct rtl would be 
> 
>   (compare ((zero_x(a)) (plus (zero_x(b) (ltu(cc, 0)
>   (minus (...))

Agreed. The issue is more widespread though: signed underflow doesn't work
either, and subv4 simply uses gen_sub3_compare1 which does a normal
compare, so the RTL does not compute the overflow flag even though the actual
compare does.

It seems like a bug that it constant folds this RTL at all, given it should
have been folded before expand. Using UNSPEC for these complex flag uses may
be best to be safe.
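
For completeness, a minimal check of the unsigned borrow case (not the
testcase from the PR):

#include <stdio.h>

int
main (void)
{
  unsigned long long a = 0, b = 1, r;
  int o = __builtin_sub_overflow (a, b, &r);
  /* Expected: 1 18446744073709551615 - subtracting 1 from 0 must report
     overflow for an unsigned type.  */
  printf ("%d %llu\n", o, r);
  return 0;
}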

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2019-01-04 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #11 from Wilco  ---
(In reply to Jakub Jelinek from comment #10)
> If the compiler knew say from PGO that pos is usually a multiple of certain
> power of two and that the loop usually iterates many times (I guess the
> latter can be determined from comparing the bb count of the loop itself and
> its header), it could emit something like:
> static int func2(int max, int pos, unsigned char *cur)
> {
>   unsigned char *p = cur + pos;
>   int len = 0;
>   if (max > 32 && (pos & 7) == 0)
> {
>   int l = ((1 - ((uintptr_t) cur)) & 7) + 1;
>   while (++len != l)
> if (p[len] != cur[len])
>   goto end;
>   unsigned long long __attribute__((may_alias)) *p2 = (unsigned long
> long *) &p[len];
>   unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long
> long *) &cur[len];
>   while (len + 8 < max)
> {
>   if (*p2++ != *cur2++)
> break;
>   len += 8;
> }
>   --len;
> }
>   while (++len != max)
> if (p[len] != cur[len])
>   break;
> end:
>   return cur[len];
> }
> 
> or so (untested).  Of course, it could be done using SIMD too if there is a
> way to terminate the loop if any of the elts is different and could be done
> in that case at 16 or 32 or 64 characters at a time etc.
> But, without knowing that pos is typically some power of two this would just
> waste code size, dealing with the unaligned cases would be more complicated
> (one can't read the next elt until proving that the current one is all
> equal), so it would need to involve some rotations (or permutes for SIMD).

Given it is compressing data, both pointers will typically be misaligned. Pos
would be fairly random and len does not usually start from zero either. And
it's highly unlikely it will iterate more than a few bytes. Compression matches
are typically just a few bytes long. So unrolling is the only useful
optimization for this kind of code.

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2019-01-07 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #13 from Wilco  ---
So to add some real numbers to the discussion, the average number of iterations
is 4.31. Frequency stats (16 includes all iterations > 16 too):

1: 29.0
2: 4.2
3: 1.0
4: 36.7
5: 8.7
6: 3.4
7: 3.0
8: 2.6
9: 2.1
10: 1.9
11: 1.6
12: 1.2
13: 0.9
14: 0.8
15: 0.7
16: 2.1

So unrolling 4x is perfect for this loop. Note the official xz version has
optimized this loop since 2014(!) using unaligned accesses:
https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
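
The technique is simple enough to sketch here (not the actual memcmplen.h
code; the function name is mine and little-endian byte order is assumed):

#include <stdint.h>
#include <string.h>

/* Compare eight bytes at a time and use ctz on the XOR to locate the
   first differing byte.  */
static unsigned
first_mismatch (const unsigned char *a, const unsigned char *b, unsigned limit)
{
  unsigned i = 0;
  while (i + 8 <= limit)
    {
      uint64_t x, y;
      memcpy (&x, a + i, 8);
      memcpy (&y, b + i, 8);
      if (x != y)
        return i + (__builtin_ctzll (x ^ y) >> 3);
      i += 8;
    }
  while (i < limit && a[i] == b[i])
    i++;
  return i;
}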

[Bug target/86891] [9 Regression] __builtin_sub_overflow incorrect for unsigned types

2019-01-07 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86891

--- Comment #10 from Wilco  ---
(In reply to Richard Earnshaw from comment #9)
> Fixed.

Yes looking at git blame all of the addv/subv support was added for GCC9, so no
backporting is needed.

[Bug middle-end/88739] Big-endian union bug

2019-01-07 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739

Wilco  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-01-07
 CC||wilco at gcc dot gnu.org
  Component|target  |middle-end
Summary|union bug on ARM64  |Big-endian union bug
 Ever confirmed|0   |1

--- Comment #1 from Wilco  ---
Confirmed - this is a generic mid-end bug that happens in fre1 when it replaces
the bitfield store and load with the wrong bitfield extract:

Test_func (U32 ulAddr)
{
  union HEAD_REQ_DW4_UNION unData;
  unsigned int _1;
   _2;
  int _4;
  unsigned int _7;
  int _9;
  short unsigned int _10;
  int _11;
  short unsigned int _23;

   :
  _1 = ulAddr_12(D) >> 2;
  _2 = () _1;
  unData.strMemHead.b30AddrL = _2;
  unData.strMemHead.b2AddrType = 0;
  _4 = (int) _2;
  printf ("unData.strMemHead.b30AddrL=0x%x\r\n", _4);
  printf ("unData.strMemHead.b2AddrType=0x%x\r\n", 0);
  _7 = unData.aulValue[3];
  printf ("unData.aulValue[3]=0x%x\r\n", _7);
  _23 = BIT_FIELD_REF <_2, 16, 0>;// WRONG: should be _2, 14, 0
  _9 = (int) _23;
  printf ("unData.ausValue[6]=0x%x\r\n", _9);
  _10 = unData.ausValue[7];
  _11 = (int) _10;
  printf ("unData.ausValue[7]=0x%x\r\n", _11);
  unData ={v} {CLOBBER};
  return 0;

}

[Bug middle-end/88739] Big-endian union bug

2019-01-07 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739

--- Comment #3 from Wilco  ---
(In reply to Richard Earnshaw from comment #2)
> >   _23 = BIT_FIELD_REF <_2, 16, 0>;// WRONG: should be _2, 14, 0
> 
> _2 is declared as a 30-bit integer, so perhaps the statement is right, but
> expand needs to understand that the shift extract of the top 16 bits comes
> from a different location in big-endian.

So the question becomes what format is this in?

   _2;

Is it big-endian memory format (so value is in top 30 bits) or simply a 30-bit
value in a virtual register?

[Bug tree-optimization/88739] [7/8/9 Regression] Big-endian union bug

2019-01-08 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739

--- Comment #16 from Wilco  ---
(In reply to Richard Biener from comment #8)
> So I think part of a fix would be the following.  Not sure if
> REG_WORDS_BIG_ENDIAN or FLOAT_WORDS_BIG_ENDIAN come into play.
> With the fix we no longer simplify this for aarch64 since
> BITS_BIG_ENDIAN is 0 even in BE mode.

A conservative fix to only do this for little endian would be fine given one
shouldn't really do this kind of low-level hacking frequently. I guess
FLOAT_WORDS_BIG_ENDIAN will matter too since you could store a double and read
back part of it as a bitfield (or store a __int128 bitfield and read a double
back). No idea why it would use BITS_BIG_ENDIAN.

> That is, not sure if BIT_FIELD_REF <_2, 16, 14> would be correct as suggested
> for aarch64_be because of that BITS_BIG_ENDIAN setting.

Possibly - but then <_2, 16, 0> doing a right shift by 16 would be an incorrect
expansion for big-endian. So something is wrong there...

I think we need to simplify the many BIG_ENDIAN macros so it is feasible to get
big-endian to work reliably on all targets. There seem to be far too many
options which affect too many unrelated things. Big-endian is fundamentally
about memory byte ordering, so allowing to different byte/bit orderings in
registers just makes things overly complex without any benefit.

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

2019-01-08 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #15 from Wilco  ---
(In reply to rguent...@suse.de from comment #14)
> On Mon, 7 Jan 2019, wilco at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
> > 
> > --- Comment #13 from Wilco  ---
> > So to add some real numbers to the discussion, the average number of 
> > iterations
> > is 4.31. Frequency stats (16 includes all iterations > 16 too):
> > 
> > 1: 29.0
> > 2: 4.2
> > 3: 1.0
> > 4: 36.7
> > 5: 8.7
> > 6: 3.4
> > 7: 3.0
> > 8: 2.6
> > 9: 2.1
> > 10: 1.9
> > 11: 1.6
> > 12: 1.2
> > 13: 0.9
> > 14: 0.8
> > 15: 0.7
> > 16: 2.1
> > 
> > So unrolling 4x is perfect for this loop. Note the official xz version has
> > optimized this loop since 2014(!) using unaligned accesses:
> > https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
> 
> I guess if we'd have some data to guide then classical unrolling using
> duffs device would be best here?  Because peeling will increase the
> number of dynamic branches and likely the actual distribution of
> #iterations isn't so that they will be well predicted?

Duff's device is very slow on any CPU with branch prediction. It can't be used
here given you don't know the number of iterations in advance (only the
maximum).

Unrolling reduces the number of branches given the loop branch is only taken
once every N iterations. The loop overhead is significant here: 3 out of 7
instructions, with 4x unrolling that reduces to 3 in 19.
