[Bug target/94121] ICE on aarch64-linux-gnu: in abs_hwi, at hwint.h:324
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94121 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #3 from Wilco ---
(In reply to Jakub Jelinek from comment #2)
> Comment on attachment 48010 [details]
> [PATCH PR94121] fix ICE on aarch64 in abs_hwi, at hwint.h:324
>
> Can't you instead just use absu_hwi instead of abs_hwi and change moffset
> type to unsigned HOST_WIDE_INT? With the latter, the moffset < 0x100
> comparison will DTRT and for HOST_WIDE_INT_MIN DImode it really doesn't
> matter if we use addition or subtraction.

Yes that works fine.
[Bug middle-end/94172] [arm-none-eabi] ICE in expand_debug_locations, at cfgexpand.c:5403
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94172 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #2 from Wilco --- It's a generic issue with -fshort-enums (which is the default in arm-none-eabi) - it fails since GCC6 on every target.
[Bug middle-end/94172] [arm-none-eabi] ICE in expand_debug_locations, at cfgexpand.c:5403
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94172 --- Comment #6 from Wilco ---
(In reply to Jakub Jelinek from comment #3)
> Can't reproduce on the trunk, neither on x86_64-linux with -Os -g3
> -fshort-enums, nor on arm-linux-gnueabi with -Os -g3 -fshort-enums
> -mcpu=cortex-m0 -mthumb

I tried -O2 -g -fshort-enums and this fails on AArch64:

extern enum my_enum_type extern_enum;
extern void bar(int a);
enum my_enum_type { my_entry };
void g(void);

void foo(int a)
{
  int local_enum = extern_enum;
  if (a)
    {
      g();
      local_enum = 0;
    }
  bar(local_enum);
}

The issue is the placement of the extern enum declaration. Move it after the
enum type and all is well - the assert in cfgexpand seems to not allow the
SI/QI combination.
[Bug debug/94502] [aarch64] Missing LR register location in FDE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94502 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #2 from Wilco --- (In reply to Luis Machado from comment #1) > CC-ing ARM folks so they can assign this to whoever is more appropriate. Can you list the assembly code? My understanding is that unless LR is saved there is no entry needed as the default action is to return to LR.
[Bug debug/94502] [aarch64] Missing LR register location in FDE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94502 --- Comment #4 from Wilco --- (In reply to Luis Machado from comment #3) > The lack of a rule for LR means GDB will assume the register is UNSPECIFIED. > Is GCC assuming this register is considered to have the same value as an > inner frame? Right so it's a leaf function like I suspected. The default rule has always been to use the return register LR if it isn't stored (and that doesn't change if you adjust the stack). Leaf functions have always worked, so I'm surprised you are seeing an issue.
[Bug tree-optimization/91322] [10 regression] g++.dg/lto/alias-4_0.C test failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322 --- Comment #15 from Wilco --- (In reply to Richard Biener from comment #14) > So I'm quite sure the missed optimization isn't a regression? (can somebody > quickly check GCC 9 whether the testcase is optimized there on ARM?) It fails on both AArch64 and Arm all the way back to GCC6 (oldest compiler I tried). So it's not a regression but this is all target independent so I wouldn't expect this to fail.
[Bug target/94538] [9/10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538 Wilco changed: What|Removed |Added Ever confirmed|0 |1 CC||wilco at gcc dot gnu.org Last reconfirmed||2020-04-09 Status|UNCONFIRMED |ASSIGNED Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #1 from Wilco --- Thanks for the concise testcase. -mslow-flash-data disables MOVW generation which is exactly the opposite of what we want. I'm testing a trivial fix.
[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538 Wilco changed: Summary changed from "[9/10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data" to "[10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data" --- Comment #2 from Wilco --- This was introduced by commit e24f6408d so only in GCC10.
[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538 --- Comment #4 from Wilco --- (In reply to Zdenek Sojka from comment #3) > (In reply to Wilco from comment #2) > > This was introduced by commit e24f6408d so only in GCC10. > > Thank you for checking this! > > I am quite sure this fails in gcc-9 as well: ... > Perhaps the offending commit, or part of it, was backported to gcc-9 as well? It's possible it was recently backported and our GCC9 builds don't yet have it. But that whole patch is badly broken and introduces multiple issues...
[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538 --- Comment #7 from Wilco --- (In reply to Wilco from comment #4) > (In reply to Zdenek Sojka from comment #3) > > (In reply to Wilco from comment #2) > > > This was introduced by commit e24f6408d so only in GCC10. > > > > Thank you for checking this! > > > > I am quite sure this fails in gcc-9 as well: > ... > > Perhaps the offending commit, or part of it, was backported to gcc-9 as > > well? > > It's possible it was recently backported and our GCC9 builds don't yet have > it. But that whole patch is badly broken and introduces multiple issues... Adding Christophe. I'm thinking the best approach right now is to revert given -mpure-code doesn't work at all on Thumb-1 targets - it still emits literal pools, switch tables etc. That's not pure code!
[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538 --- Comment #10 from Wilco ---
(In reply to Christophe Lyon from comment #8)
> > Adding Christophe. I'm thinking the best approach right now is to revert
> > given -mpure-code doesn't work at all on Thumb-1 targets - it still emits
> > literal pools, switch tables etc. That's not pure code!
>
> Do you have testcases that show these failures?
>
> I did check some of the problematic testcases in the GCC testsuite when I
> committed that patch. Did I miss some of them?
>
> Can you point me to offending testcases and compiler options so that I can
> reproduce them?

For example:

int x;
int f1 (void) { return x; }

with eg. -O2 -mcpu=cortex-m0 -mpure-code I get:

        movs    r3, #:upper8_15:#.LC1
        lsls    r3, #8
        adds    r3, #:upper0_7:#.LC1
        lsls    r3, #8
        adds    r3, #:lower8_15:#.LC1
        lsls    r3, #8
        adds    r3, #:lower0_7:#.LC1
        @ sp needed
        ldr     r3, [r3]
        ldr     r0, [r3, #40]
        bx      lr

That's an extra indirection through a literal... There should only be one ldr
to read x.

Big switch tables are produced for any Thumb-1 core, however I would expect
Cortex-m0/m23 versions to look almost identical to the Cortex-m3 one, and use
a sequence of comparisons instead of tables.

int f2 (int x, int y)
{
  switch (x)
    {
    case 0: return y + 0;
    case 1: return y + 1;
    case 2: return y + 2;
    case 3: return y + 3;
    case 4: return y + 4;
    case 5: return y + 5;
    }
  return y;
}

Immediate generation for common cases seems to be screwed up:

int f3 (void) { return 0x1100; }

-O2 -mcpu=cortex-m0 -mpure-code:

        movs    r0, #17
        lsls    r0, r0, #8
        lsls    r0, r0, #8
        lsls    r0, r0, #8
        bx      lr

This also regressed Cortex-m23 which previously generated:

        movs    r0, #136
        lsls    r0, r0, #21
        bx      lr

Similar regressions happen with other immediates:

int f3 (void) { return 0x12345678; }

-O2 -mcpu=cortex-m23 -mpure-code:

        movs    r0, #86
        lsls    r0, r0, #8
        adds    r0, r0, #120
        movt    r0, 4660
        bx      lr

Previously it was:

        movw    r0, #22136
        movt    r0, 4660
        bx      lr

Also relocations with a small offset should be handled within the relocation.
I'd expect this to never generate an extra addition, let alone an extra
literal pool entry:

int arr[10];
int *f4 (void) { return &arr[1]; }

-O2 -mcpu=cortex-m3 -mpure-code generates the expected:

        movw    r0, #:lower16:.LANCHOR0+4
        movt    r0, #:upper16:.LANCHOR0+4
        bx      lr

-O2 -mcpu=cortex-m23 -mpure-code generates this:

        movw    r0, #:lower16:.LANCHOR0
        movt    r0, #:upper16:.LANCHOR0
        adds    r0, r0, #4
        bx      lr

And cortex-m0 again inserts an extra literal load:

        movs    r3, #:upper8_15:#.LC0
        lsls    r3, #8
        adds    r3, #:upper0_7:#.LC0
        lsls    r3, #8
        adds    r3, #:lower8_15:#.LC0
        lsls    r3, #8
        adds    r3, #:lower0_7:#.LC0
        ldr     r0, [r3]
        adds    r0, r0, #4
        bx      lr
[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538 --- Comment #13 from Wilco --- (In reply to Christophe Lyon from comment #12) > I've posted a patch to fix the regression for your f3() examples: > https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543993.html Yes that improves some of the examples, but I still see regressions on m0 for eg. -1 or 511. The splitter must have the right priority or it will continue to cause regressions. That means placing it around line 790 after the existing movsi splitters.
[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538 --- Comment #14 from Wilco ---
(In reply to Christophe Lyon from comment #11)
> (In reply to Wilco from comment #10)
>
> Right, but the code is functional.

It doesn't avoid the literal load from flash which is exactly what pure-code
and slow-flash-data is all about.

That brings me to another question, why does -mslow-flash-data give an error
when used with Cortex-M0? If the goal was to make things more orthogonal then
surely we should not give an error and allow both options on all M-class
cores.

> I believe this is expected: as I wrote in my commit message
> "CASE_VECTOR_PC_RELATIVE is now false with -mpure-code, to avoid generating
> invalid assembly code with differences from symbols from two different
> sections (the difference cannot be computed by the assembler)."
>
> Maybe there's a possibility to tune this to detect cases where we can do
> better?

The best option is to do the same as Cortex-M3: just switch off branch tables
altogether and fall back to compare+branch. That completely avoids loading
data from flash and is always smaller than emitting 32-bit tables.

> > Also relocations with a small offset should be handled within the
> > relocation. I'd expect this to never generate an extra addition, let alone
> > an extra literal pool entry:
> >
> > int arr[10];
> > int *f4 (void) { return &arr[1]; }
> >
> > -O2 -mcpu=cortex-m3 -mpure-code generates the expected:
> >
> >         movw    r0, #:lower16:.LANCHOR0+4
> >         movt    r0, #:upper16:.LANCHOR0+4
> >         bx      lr
> >
> > -O2 -mcpu=cortex-m23 -mpure-code generates this:
> >
> >         movw    r0, #:lower16:.LANCHOR0
> >         movt    r0, #:upper16:.LANCHOR0
> >         adds    r0, r0, #4
> >         bx      lr
>
> For cortex-m23, I get the same code with and without -mpure-code.

GCC9 emits something different for Cortex-M0 and Cortex-M23 so this was
changed by the patch somehow even when -mpure-code is not enabled. So it is a
regression from what we used to generate. Similarly I would not expect
pure-code to change the decision of whether to emit immediates as part of the
relocation as long as they are within range. The existing implementation for
Cortex-M3 does this correctly.
[Bug target/94538] [10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538 --- Comment #16 from Wilco ---
(In reply to Christophe Lyon from comment #15)
> (In reply to Wilco from comment #14)
> > (In reply to Christophe Lyon from comment #11)
> > > (In reply to Wilco from comment #10)
> > >
> > > Right, but the code is functional.
> >
> > It doesn't avoid the literal load from flash which is exactly what pure-code
> > and slow-flash-data is all about.
>
> For f1 on M0, I can see:
>         .section        .rodata.cst4,"aM",%progbits,4
>         .align  2
> .LC0:
>         .word   .LANCHOR0
>         .section        .text,"0x2006",%progbits
> [...]
> f1:
>         movs    r3, #:upper8_15:#.LC0
>         lsls    r3, #8
>         adds    r3, #:upper0_7:#.LC0
>         lsls    r3, #8
>         adds    r3, #:lower8_15:#.LC0
>         lsls    r3, #8
>         adds    r3, #:lower0_7:#.LC0
>         ldr     r3, [r3]        @ 6 [c=10 l=2] *thumb1_movsi_insn/8
>         ldr     r0, [r3]        @ 7 [c=10 l=2] *thumb1_movsi_insn/8
>         bx      lr
> [...]
>         .bss
>         .align  2
>         .set    .LANCHOR0, . + 0
>         .type   x, %object
>         .size   x, 4
> x:
>         .space  4
>
> So the 1st load is from .rodata.cst4 and the 2nd load is from bss, both of
> which do not have the purecode bit set (unlike .text). Isn't that OK?

No, it will create a lot of complaints and support queries due to the obvious
regressions. It goes against the definition of pure-code and slow-flash-data
which is to remove the literal loads. And given the sequence is already
inefficient, we should do everything to remove the indirection which
increases the codesize overhead by 75%...

Another aspect that needs to be checked is that GCC correctly spills
addresses and complex constants instead of rematerializing them. This is
basic minimal quality that one expects for a feature like this.
[Bug middle-end/94715] New: Squared multiplies are incorrectly signextended
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94715
Bug ID: 94715
Summary: Squared multiplies are incorrectly signextended
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: wilco at gcc dot gnu.org
Target Milestone: ---

The following example generates incorrect code with -O2:

unsigned long long f (int x)
{
  unsigned int t = x * x;
  return t;
}

On AArch64 I get:

        mul     w0, w0, w0
        sxtw    x0, w0
        ret

It's correct if you do x * y or x * 100.
[Bug middle-end/94715] Squared multiplies are incorrectly signextended
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94715 Wilco changed: Last reconfirmed||2020-04-23 Ever confirmed|0 |1 Status|RESOLVED|REOPENED Resolution|INVALID |--- --- Comment #2 from Wilco ---
(In reply to Richard Biener from comment #1)
> I think GCC is correct in assuming that x * x is positive since overflow
> with signed arithmetic is undefined. Thus on GIMPLE we elide
>
>   _1 = x_2(D) * x_2(D);
>   t_3 = (unsigned int) _1;
>   _4 = (long long unsigned int) t_3;
>
> to
>
>   _1 = x_2(D) * x_2(D);
>   _4 = (long long unsigned int) _1;

If we assume x * x is always positive, using unsigned extension would make
more sense. It still adds an unnecessary extra instruction on most targets
which cannot be removed in RTL.
[Bug tree-optimization/94787] Failure to detect single bit popcount pattern
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94787 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #3 from Wilco ---
(In reply to Gabriel Ravier from comment #1)
> Inversely, I'd also suggest doing the opposite. That is, if there is no
> hardware popcount instruction, `__builtin_popcount(v) == 1` should be
> optimized to `v && !(v & (v - 1))`

I actually posted a patch for this and for popcount(x) > 1, given the reverse
transformation is faster on all targets - even if they have a popcount
instruction (since those are typically more expensive). This is true on x86
as well, (x-1) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90693
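For reference, a minimal C sketch of the equivalence under discussion
(illustrative only, not the patch or the PR's testcase):

#include <stdio.h>

static int is_single_bit_popcount (unsigned v)
{
  return __builtin_popcount (v) == 1;
}

static int is_single_bit_bittrick (unsigned v)
{
  /* Clearing the lowest set bit leaves zero exactly when one bit was set.  */
  return v && !(v & (v - 1));
}

int main (void)
{
  for (unsigned v = 0; v < 1000; v++)
    if (is_single_bit_popcount (v) != is_single_bit_bittrick (v))
      printf ("mismatch at %u\n", v);
  return 0;
}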
[Bug target/94789] Failure to take advantage of shift operand semantics to turn subtraction into negate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94789 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #4 from Wilco ---
(In reply to Gabriel Ravier from comment #0)
> int r(int x, unsigned b)
> {
>     int const m = CHAR_BIT * sizeof(x) - b;
>     return (x << m);
> }
>
> `CHAR_BIT * sizeof(x) - b;` can be optimized to `-b`. LLVM does this
> transformation, not GCC.
>
> Comparison here : https://godbolt.org/z/5byJ2E

AArch64 already generates:

        neg     w1, w1
        lsl     w0, w0, w1
        ret
[Bug target/94538] [9/10 Regression] ICE: in extract_constrain_insn_cached, at recog.c:2223 (insn does not satisfy its constraints) with -mcpu=cortex-m23 -mslow-flash-data
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538 --- Comment #19 from Wilco --- Yes I have a GCC9.3 build now, this fails too.
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #2 from Wilco ---
(In reply to Bu Le from comment #0)
> Created attachment 48584 [details]
> proposed patch
>
> I would like to propose an implementation of the medium code model in
> aarch64. A prototype is attached, passed bootstrap and the regression test.
>
> Mcmodel=medium is a missing code model in the aarch64 architecture, which is
> supported on x86. This code model describes a situation where small data is
> relocated by the small code model while large data is relocated by the large
> code model. The official statement about the medium code model is in the x86
> ABI document, page 34: https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
>
> The key difference between x86 and aarch64 is that x86 can use lea+movabs
> instructions to implement a dynamically relocatable large code model.
> Currently, the large code model in AArch64 relocates the symbol using an ldr
> instruction, which can only be statically linked. However, the small code
> model uses adrp + ldr instructions, which can be dynamically linked.
> Therefore, the medium code model cannot be implemented directly by simply
> setting a threshold. As a result a dynamically relocatable large code model
> is needed first for a functional medium code model.
>
> I met this problem when compiling CESM, which is climate forecast software
> that is widely used in the HPC field. In some configurations, when
> manipulating large arrays, the large code model with dynamic relocation is
> needed. The following case is abstracted from CESM for this scenario.
>
>       program main
>       common/baz/a,b,c
>       real a,b,c
>       b = 1.0
>       call foo()
>       print*, b
>       end
>
>       subroutine foo()
>       common/baz/a,b,c
>       real a,b,c
>
>       integer, parameter :: nx = 1024
>       integer, parameter :: ny = 1024
>       integer, parameter :: nz = 1024
>       integer, parameter :: nf = 1
>       real :: bar(nf,nx*ny*nz)
>       real :: bar1(nf,nx*ny*nz)
>       bar = 0.0
>       bar1 = 0.0
>       b = bar(1,1024*1024*100)
>       b = bar1(1,1)
>
>       return
>       end
>
> Compiling with -mcmodel=small -fPIC gives the following error due to the
> access of the bar1 array:
>
> test.f90:(.text+0x28): relocation truncated to fit:
> R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
> test.f90:(.text+0x6c): relocation truncated to fit:
> R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
>
> Compiling with -mcmodel=large -fPIC gives an unsupported error:
>
> f951: sorry, unimplemented: code model ‘large’ with ‘-fPIC’
>
> As discussed in the beginning, to tackle this problem we have to solve the
> static large code model problem. My solution here is to use the
> R_AARCH64_MOVW_PREL_Gx group relocations with instructions to calculate the
> current PC value.
>
> Before change (mcmodel=small):
>
>         adrp    x0, bar1.2782
>         add     x0, x0, :lo12:bar1.2782
>
> After change (mcmodel=medium proposed):
>
>         movz    x0, :prel_g3:bar1.2782
>         movk    x0, :prel_g2_nc:bar1.2782
>         movk    x0, :prel_g1_nc:bar1.2782
>         movk    x0, :prel_g0_nc:bar1.2782
>         adr     x1, .
>         sub     x1, x1, 0x4
>         add     x0, x0, x1
>
> The first 4 movk instructions calculate the offset between bar1 and the last
> movk instruction in 64 bits, which fulfils the requirement of the large code
> model (64-bit relocation). The adr+sub instructions calculate the PC address
> of the last movk instruction. By adding the offset to the PC address, bar1
> can be dynamically located.
>
> Because this relocation is time consuming, a threshold is set to classify
> the size of the data to be relocated, like x86. The default value of the
> threshold is set to 65536, which is the max relocation capability of the
> small code model. This implementation will also need to amend the linker in
> binutils so that the 4 movk instructions calculate the same PC offset as the
> last movk instruction.
>
> The good side of this implementation is that it can use existing relocation
> types to prototype a medium code model.
>
> The drawbacks of this implementation also exist. For a start, these 4 movk
> instructions and the adr instruction must be combined in this order. No
> other instruction should be inserted in between the sequence, which would
> lead to a mistaken symbol address. This might impede the insn scheduling
> optimizations. Secondly, the linker needs to make the corresponding change
> so that every mov instruction calculates the same PC offset. For example, in
> my implementation, the first movz instruction will need to add 12 to the
> result of ":prel_g3:bar1.2782" to make up t
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #4 from Wilco ---
(In reply to Bu Le from comment #3)
> (In reply to Wilco from comment #2)
>
> > Is the main usage scenario huge arrays? If so, these could easily be
> > allocated via malloc at startup rather than using bss. It means an extra
> > indirection in some cases (to load the pointer), but it should be much more
> > efficient than using a large code model with all the overheads.
>
> Thanks for the reply.
>
> The large array is just used to construct the test case. It is not a
> necessary condition for this scenario. The common scenario is that the
> symbol is too far away for the small code model to reach it, which could
> also result from a large amount of small arrays, structures, etc. Meanwhile,
> the large code model is able to reach the symbol but cannot be position
> independent, which causes the problem.
>
> Besides, the code in CESM is quite complicated to reconstruct with malloc,
> which is also not an acceptable option for my customer.
>
> Clear enough for your concern?

Well the question is whether we're talking about more than 4GB of code or
more than 4GB of data. With >4GB code you're indeed stuck with the large
model. With data it is feasible to automatically use malloc for arrays when
larger than a certain size, so there is no need to change the application at
all. Something like that could be the default in the small model so that you
don't have any extra overhead unless you have huge arrays. Making the
threshold configurable means you can tune it for a specific application.
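A rough source-level picture of that suggestion (purely illustrative - the
size, names and constructor mechanism are made up, and a real implementation
would do this inside the compiler for objects above the threshold):

#include <stdlib.h>

/* Before: a huge object in .bss that adrp/add addressing cannot reach.
     static double big[1ULL << 30];    // ~8GB
   After (conceptually): only an 8-byte pointer stays in .bss and the
   storage is obtained at startup, so all code references stay within
   the range of the small code model.  */
static double *big;

__attribute__((constructor))
static void big_init (void)
{
  big = calloc ((size_t) 1 << 30, sizeof (double));
}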
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #5 from Wilco ---
(In reply to Bu Le from comment #0)

Also it would be much more efficient to have a relocation like this if you
wanted a 48-bit PC-relative offset:

        adrp    x0, bar1.2782
        add     x0, x0, :lo12:bar1.2782
        movk    x0, :high32_47:bar1.2782
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #8 from Wilco ---
(In reply to Bu Le from comment #6)
> (In reply to Wilco from comment #4)
> > (In reply to Bu Le from comment #3)
> > > (In reply to Wilco from comment #2)
> >
> > Well the question is whether we're talking about more than 4GB of code or
> > more than 4GB of data. With >4GB code you're indeed stuck with the large
> > model. With data it is feasible to automatically use malloc for arrays when
> > larger than a certain size, so there is no need to change the application at
> > all. Something like that could be the default in the small model so that you
> > don't have any extra overhead unless you have huge arrays. Making the
> > threshold configurable means you can tune it for a specific application.
>
> Is this automatic malloc already available on some target? I haven't found
> an example that works in that way. Would you mind providing an example?

Fortran already has -fstack-arrays to decide between allocating arrays on the
heap or on the stack.
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #9 from Wilco ---
(In reply to Bu Le from comment #7)
> (In reply to Wilco from comment #5)
> > (In reply to Bu Le from comment #0)
> >
> > Also it would be much more efficient to have a relocation like this if you
> > wanted a 48-bit PC-relative offset:
> >
> >         adrp    x0, bar1.2782
> >         add     x0, x0, :lo12:bar1.2782
> >         movk    x0, :high32_47:bar1.2782
>
> I am afraid that putting the PC-relative offset into x0 is not correct,
> because x0 is supposed to be the final address of bar1 rather than a PC
> offset. Therefore an extra register is needed to hold the offset
> temporarily.

You're right, we need an extra add, so it's like this:

        adrp    x0, bar1.2782
        movk    x1, :high32_47:bar1.2782
        add     x0, x0, x1
        add     x0, x0, :lo12:bar1.2782

> (By the way, the high32_47 relocation you suggested is the prel_g2 in the
> official aarch64 ABI release)

It needs a new relocation because of the ADRP. ADR could be used so the
existing R_AARCH64_MOVW_PREL_G0-3 work, but then you need 5 instructions.

> And in terms of engineering, your idea can save the trouble of modifying the
> linker to calculate the offset for the 3 movks. But we still need to make a
> new relocation type for ADRP, because it currently checks the overflow of
> the address and gives the "relocation truncated to fit" error. Therefore,
> both ideas need work in binutils, which makes them equivalent.

There is relocation 276 (R_AARCH64_ADR_PREL_PG_HI21_NC).
[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #29 from Wilco ---
(In reply to Jiu Fu Guo from comment #28)
> (In reply to Jiu Fu Guo from comment #27)
> > > 12: 1.2
> > > 13: 0.9
> > > 14: 0.8
> > > 15: 0.7
> > > 16: 2.1
> >
> > Find one interesting thing:
> > If using widen reading for the run which > 16 iterations, we can see the
> > performance is significantly improved (>18%) for xz_r in spec.
> > This means that the frequency is small for >16, while it still costs a big
> > part of the runtime.
>
> Oh, rechecking the frequency in my test, the frequency is big (99.8%) for
> >16 iterations.

The frequency for >16 iterations is small, 2.1%. The limit is generally
large, but the actual number of iterations is what matters because of the
early exit. The key question remains whether it is legal to assume the limit
implies the memory is valid and use wider accesses.
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #12 from Wilco ---
(In reply to Bu Le from comment #10)
> > Fortran already has -fstack-arrays to decide between allocating arrays on
> > the heap or on the stack.
>
> I tried the flag with my example. -fstack-arrays seems unable to move the
> array in the bss to the heap. The problem is still there.

It is an existing feature that chooses between malloc and stack. It would
need modification to do the same for large data/bss objects.

> Anyway, my point is that the size of a single data object doesn't affect the
> fact that the medium code model is missing in aarch64 and that aarch64 lacks
> a PIC large code model.

What is missing is efficient support for >4GB of data, right? How that is
implemented is a different question - my point is that it does not require a
new code model. It would be much better if it just worked without users even
needing to think about code models.

Also, what is the purpose of a large fpic model? Are there any applications
that use shared libraries larger than 4GB?
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #13 from Wilco ---
(In reply to Bu Le from comment #11)
> > You're right, we need an extra add, so it's like this:
> >
> >         adrp    x0, bar1.2782
> >         movk    x1, :high32_47:bar1.2782
> >         add     x0, x0, x1
> >         add     x0, x0, :lo12:bar1.2782
> >
> > > (By the way, the high32_47 relocation you suggested is the prel_g2 in the
> > > official aarch64 ABI release)
> >
> > It needs a new relocation because of the ADRP. ADR could be used so the
> > existing R_AARCH64_MOVW_PREL_G0-3 work, but then you need 5 instructions.
>
> So you suggest a new relocation type "high32_47" to calculate the offset
> between ADRP and bar1. Am I right?

Yes. It needs to have an offset to the adrp instruction so it can compute the
correct ADRP offset and then extract bits 32-47.
[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 --- Comment #31 from Wilco ---
(In reply to Jiu Fu Guo from comment #30)
> (In reply to Wilco from comment #29)
> > The key question remains whether it is legal to assume the limit implies
> > the memory is valid and use wider accesses.
>
> If unaligned access is supported, it would be valid to access the memory.
> Otherwise, a check like ((pb & 7) == (cur & 7)) would cost an additional
> test, and it may not be likely to be true.

If both pointers are aligned it would be safe indeed, but if unaligned they
will access up to 7 bytes after the first match. And that's not safe without
knowing the underlying buffer bounds. See eg. comment #3.
[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 --- Comment #34 from Wilco --- (In reply to Jiu Fu Guo from comment #33) > It would be relatively easy if the target supports unaligned access. like > read64ne in > https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h > Then the alignment issue is relaxed. It may be safer if we can > prove/assume the underlying buffer is enough, like array accessing or > pointer+index accessing in a loop. Yes, without unaligned support you can't use a wider access. If we can't prove the buffer bounds then we'd have to use a page cross check before every unaligned access.
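A minimal C sketch of such a page-cross guard (assuming 4KB pages; the
function names are illustrative):

#include <stdint.h>
#include <string.h>

static int fits_in_page_8 (const void *p)
{
  /* Bytes p..p+7 stay within one page iff the in-page offset is <= 4088.  */
  return ((uintptr_t) p & 4095) <= 4096 - 8;
}

static uint64_t load64_unaligned (const void *p)
{
  uint64_t v;
  memcpy (&v, p, sizeof v);   /* compilers lower this to one unaligned load */
  return v;
}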
[Bug target/95285] AArch64:aarch64 medium code model proposal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285 --- Comment #15 from Wilco ---
(In reply to Bu Le from comment #14)
> > > Anyway, my point is that the size of a single data object doesn't affect
> > > the fact that the medium code model is missing in aarch64 and that
> > > aarch64 lacks a PIC large code model.
> >
> > What is missing is efficient support for >4GB of data, right? How that is
> > implemented is a different question - my point is that it does not require a
> > new code model. It would be much better if it just worked without users even
> > needing to think about code models.
> >
> > Also, what is the purpose of a large fpic model? Are there any applications
> > that use shared libraries larger than 4GB?
>
> Yes, I understand, and I am grateful for your suggestion. I have to say it
> is not a critical problem. After all, most applications work fine with the
> current code models.
>
> But there are some cases, like CESM with a certain configuration, or my test
> case, which cannot be compiled with the current gcc compiler on aarch64.
> Unfortunately, applications larger than 4GB are quite normal in the HPC
> field. In the meantime, x86 and llvm-aarch64 can compile it, with the medium
> or large-pic code model. That is the purpose of my proposing it. By adding
> this feature, we can make a step forward for the aarch64 gcc compiler,
> making it more powerful and robust.
>
> Clear enough for your concern?

Yes but such a feature needs to be defined in an ABI and well specified. This
is why I'm trying to get the underlying requirements first. Note that while
LLVM allows -fpic in the large model, it doesn't correctly implement it. The
large model shouldn't ever be needed by actual applications.

> And for the implementation you suggested, I believe it is a promising plan.
> I would like to try to implement it first. It might take weeks of
> development. I will see what I can get. I will give you updates on progress.
>
> Thanks for the suggestion again.

As discussed, there are many different ways of supporting the requirement of
>4GB of data, so I wouldn't start on the implementation before there is a
good specification. GCC and LLVM would need to implement it in the same way
after all.
[Bug target/94986] missing diagnostic on ARM thumb2 compilation with -pg when using r7 in inline asm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94986 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #1 from Wilco ---
(In reply to Arnd Bergmann from comment #0)
> I reported a bug against clang for a Linux kernel failure, but
> it was suggested that the clang behavior is probably correct in this corner
> case while gcc gets it wrong, see https://bugs.llvm.org/show_bug.cgi?id=45826
>
> echo 'void f(void) { asm("mov r7, #0" ::: "r7"); }' | arm-linux-gnueabi-gcc
> -march=armv7-a -O2 -mthumb -pg -S -xc -
>
> silently accepts an inline asm statement that clobbers the frame pointer,
> but gcc rejects the same code if any of '-O0', '-fomit-frame-pointer' or
> '-fno-omit-frame-pointer' are used:
>
> : In function 'f':
> :1:44: error: r7 cannot be used in 'asm' here
>
> If using r7 in this case is indeed invalid, we need to ensure the kernel
> does not do this, and having gcc reject it would be helpful.

GCC will reject it if you explicitly enable the frame pointer. The logic
seems wrong in that it doesn't report an error if the frame pointer is
implicitly enabled via -pg. As a workaround for the kernel, just use -pg and
-fno-omit-frame-pointer together.

Corrupting a frame pointer loses the ability to follow the frame chain,
similar to a function built with -fomit-frame-pointer which will use r7 as a
general purpose register. However this always reports an error since this
corruption of the frame pointer will cause a crash:

int *f(int x)
{
  asm("mov r7, #0" ::: "r7");
  return __builtin_alloca (x);
}
[Bug target/94986] missing diagnostic on ARM thumb2 compilation with -pg when using r7 in inline asm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94986 Wilco changed: Resolution|--- |INVALID Status|UNCONFIRMED |RESOLVED --- Comment #3 from Wilco ---
(In reply to nsz from comment #2)
> on arm the -pg abi is
>
> func:
>         push    {lr}
>         bl      _gnu_mcount_nc
>         ...
>
> so no frame pointer is involved, -pg implying
> -fno-omit-frame-pointer is a historical mistake i think
> (because some targets required fp for -pg, but most don't).

Right, so the claim that -pg implies -fno-omit-frame-pointer is wrong, and
that means there is no bug in GCC. Looking at the latest docs, there is no
mention of frame pointer for the -pg option, and neither does
-fomit-frame-pointer discuss -pg. So this dependency must have been removed
some time ago.

> ideally r7 clobber would just work with -pg -fomit-frame-pointer.
> the alloca problem is a separate issue (that r7 clobber may not
> work with alloca).

GCC correctly reports the error for that. So this can be closed then.
[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 --- Comment #40 from Wilco --- (In reply to Jiu Fu Guo from comment #39) > I’m thinking to draft a patch for this optimization. If any suggestions, > please point out, thanks. Which optimization to be precise? Besides unrolling I haven't seen a proposal for an optimization which is both safe and generally applicable.
[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 --- Comment #44 from Wilco --- (In reply to Jiu Fu Guo from comment #43) > To handle vectorization for this kind of code, it needs to overcome the hard > issue mentioned in comment #5: the loop has 2 exits. Yes and that also implies vector loads are unsafe unless they are non-faulting. Few ISAs have such support.
[Bug target/95650] aarch64: Missed optimization storing addition of two shorts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95650 Wilco changed: Last reconfirmed||2020-06-12 Ever confirmed|0 |1 CC||wilco at gcc dot gnu.org Status|UNCONFIRMED |NEW --- Comment #4 from Wilco ---
(In reply to Alex Coplan from comment #3)
> I think clang's optimisation is sound here.
>
> C says that we add two shorts as int and then truncate to short (i.e. reduce
> mod 2^16).
>
> The question is whether the top bits being set (which the ABI allows) can
> influence the result. I don't think it can.
>
> The observation is that the "top bits being set" are just extra multiples of
> 2^16 in the addition, which just disappear when we reduce mod 2^16. That is:
>
> (x_1 + x_2 + y_1 + y_2) % 2^16 = (x_1 + x_2) % 2^16
>
> where x_1, x_2 are arbitrary integers and y_1, y_2 are multiples of 2^16
> (the top bits).

Confirmed. It works for signed as well and any operator except right shift
and division. Basically the store requires only the bottom 16 bits to be
valid, and a backwards dataflow can propagate this to remove unnecessary zero
and sign extends.
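A minimal sketch of the kind of code this applies to (an assumed shape, not
necessarily the testcase from this PR):

#include <stdint.h>

/* Only the low 16 bits of the sum reach memory, so any zero/sign extension
   of a or b before the addition is dead; ideally this compiles to just an
   add of the incoming registers followed by a strh on AArch64.  */
void store_sum (uint16_t *p, uint16_t a, uint16_t b)
{
  *p = a + b;
}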
[Bug target/96191] aarch64 stack_protect_test canary leak
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96191 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #2 from Wilco ---
(In reply to Jim Wilson from comment #0)
> Given a simple testcase
>
> extern int sub (int);
>
> int
> main (void)
> {
>   sub (10);
>   return 0;
> }
>
> compiling with -O -S -fstack-protector-all -mstack-protector-guard=global
> in the epilogue for the canary check I see
>
>         ldr     x1, [sp, 40]
>         ldr     x0, [x19, #:lo12:__stack_chk_guard]
>         eor     x0, x1, x0
>         cbnz    x0, .L4
>
> Both x0 and x1 have the stack protector canary loaded into them, and the eor
> clobbers x0, but x1 is left alone. This means the value of the canary is
> leaking from the epilogue. The canary value is never supposed to survive in
> a register outside the stack protector patterns.
>
> A powerpc64-linux toolchain build with the same testcase and options
> generates
>
>         lwz 9,28(1)
>         lwz 10,0(31)
>         xor. 9,9,10
>         li 10,0
>         bne- 0,.L4
>
> and note that it clears the second register after the xor to prevent the
> canary leak. The aarch64 stack_protect_test pattern should do the same
> thing.

The canary value is not a secret. What would the purpose of clearing the
register be given the stack slot containing the canary is not cleared as
well? And a register could potentially contain the address of the canary or
that of a global nearby, making reading the canary value really easy.
[Bug tree-optimization/95731] Faiilure to optimize a >= 0 && b >= 0 to (a | b) >= 0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95731 Wilco changed: Status|UNCONFIRMED |NEW Last reconfirmed||2020-08-04 CC||wilco at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #3 from Wilco ---
(In reply to Gabriel Ravier from comment #0)
> bool f(int a, int b)
> {
>     return a >= 0 && b >= 0;
> }
>
> This can be optimized to `return (a | b) >= 0;`. LLVM does this
> transformation, but GCC does not.

For orthogonality you also want:

a < 0 && b < 0    ->  (a & b) < 0
a >= 0 || b >= 0  ->  (a & b) >= 0
a < 0 || b < 0    ->  (a | b) < 0
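A small C sketch collecting all four forms (function names are illustrative;
each body can be lowered to the single test in the trailing comment):

int both_nonneg (int a, int b) { return a >= 0 && b >= 0; }  /* (a | b) >= 0 */
int both_neg    (int a, int b) { return a <  0 && b <  0; }  /* (a & b) <  0 */
int any_nonneg  (int a, int b) { return a >= 0 || b >= 0; }  /* (a & b) >= 0 */
int any_neg     (int a, int b) { return a <  0 || b <  0; }  /* (a | b) <  0 */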
[Bug target/96768] -mpure-code produces switch tables for thumb-1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96768 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #5 from Wilco --- (In reply to Christophe Lyon from comment #4) > That's what I replied in the original PR94538, but Wilco said the best > option was to turn off switch tables: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538#c14 > > See also another comment from him: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94538#c16 related to immediate > loads from rodata/bss. Yes, switching off switch tables is obviously the best option - just look at the m0 vs m3 code above. It's the fastest and smallest, and is also "pure code" since there are no extra literals generated anywhere. In AArch32 switch tables are very inefficient. An easy improvement would be to copy the AArch64 case_values_threshold to disable tables if there are fewer than 16 cases.
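A rough sketch of what that could look like as a target hook. This is
hypothetical: only TARGET_CASE_VALUES_THRESHOLD and
default_case_values_threshold are known generic GCC interfaces; the function
name and the exact arm option flags below are assumptions.

static unsigned int
arm_thumb1_case_values_threshold (void)
{
  /* Mirror AArch64: avoid switch tables below 16 cases when tables would
     force data loads from the code section (flag names are assumed).  */
  if (target_pure_code && TARGET_THUMB1)
    return 16;
  return default_case_values_threshold ();
}

#undef TARGET_CASE_VALUES_THRESHOLD
#define TARGET_CASE_VALUES_THRESHOLD arm_thumb1_case_values_threshold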
[Bug c/92172] ARM Thumb2 frame pointers inconsistent with clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92172 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #1 from Wilco --- Firstly it's important to be clear this is about adding support for a frame chain for unwinding. A frame pointer is something different since it is used for addressing local variables. Historically Arm compilers only supported a frame pointer, not a frame chain for unwinding. So different Arm backends use different frame pointer registers and there is no defined layout since it is not designed for unwinding. Why does this matter? Well as your examples show, if you want to emit a frame chain using standard push/pop, it typically ends up pointing to the top of the frame. That is the worst possible position for a frame pointer on Thumb - while Arm supports negative immediate offsets up to 4KB, Thumb-1 doesn't support negative offsets at all, and Thumb-2 supports offsets up to -255 but only with 32-bit instructions. So the result of conflating the frame chain and frame pointer implies a terrible codesize hit for Thumb. There is also an issue with using r7 as the frame chain in that you're now reserving a precious low register callee-save and just use it once in a typical function. So using r7 is a very bad idea for Thumb. Your examples suggest LLVM suffers from both of these issues, and IIRC it still uses r11 on Arm but r7 on Thumb. That is way too inefficient/incorrect to consider a defacto standard.
[Bug c/92172] ARM Thumb2 frame pointers inconsistent with clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92172 --- Comment #4 from Wilco --- (In reply to Seth LaForge from comment #2) > Good point on frame pointers vs a frame chain for unwinding. I'm looking for > the unwindable frame chain. > > Wilco: > > Why does this matter? Well as your examples show, if you want to emit a > > frame > > chain using standard push/pop, it typically ends up pointing to the top of > > the > > frame. That is the worst possible position for a frame pointer on Thumb - > > while > > Arm supports negative immediate offsets up to 4KB, Thumb-1 doesn't support > > negative offsets at all, and Thumb-2 supports offsets up to -255 but only > > with > > 32-bit instructions. So the result of conflating the frame chain and frame > > pointer implies a terrible codesize hit for Thumb. > > Well, there's really no need for a frame pointer for efficiency, since the > stack frame can be efficiently accessed with positive immediate accesses > relative to the stack pointer. There are even special encodings for Thumb-2 > 16-bit LDR/STR which allow an immediate offset of 0 to 1020 when relative to > SP - much larger than other registers. You're saying using a frame pointer > implies a terrible codesize hit for Thumb, but I don't see how that can be - > stack access will continue to go through SP, and the only code size hit > should be pushing/popping R7 (~2 cycles), computing R7 as a frame pointer > (~1 cycle), and potential register spills due to one less register > available. That's a pretty small amount of overhead for a non-leaf function. On GCC10 the codesize overhead of -fno-omit-frame-pointer is 4.1% for Arm and 4.8% for Thumb-2 (measured on SPEC2006). That's already a large overhead, especially since this feature doesn't do anything useful besides adding overhead... The key is that GCC uses the frame pointer for every stack access, and thus the placement of the frame pointer within a frame matters. Thumb compilers place the frame pointer at the bottom of the frame so they can efficiently access locals using positive offsets. Despite that the overhead is significant already. If GCC would emit a frame chain like the LLVM sequence this means placing the frame pointer at the top of the stack. This forces negative frame offsets for all frame accesses. Getting a 10% overhead is being lucky, I've seen worse... So this is something that needs to be properly designed and carefully implemented. > Baseline: With gcc 4.7, -fomit-frame-pointer, -mthumb: 384016 bytes, 110.943 > s. Thanks for posting actual numbers, but GCC 4.7?!? It might be time to try GCC9...
[Bug c/92172] ARM Thumb2 frame pointers inconsistent with clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92172 --- Comment #6 from Wilco ---
(In reply to Seth LaForge from comment #5)
> GCC 8:
>         push    {r7, lr}
>         sub     sp, sp, #8
>         add     r7, sp, #0
>         str     r0, [r7, #4]
>         ...
>
> Clang 9:
>         push    {r7, lr}
>         mov     r7, sp
>         sub     sp, #8
>         str     r0, [sp, #4]
>         ...

Crazy yes, but it's due to historical reasons. Originally GCC could only emit
code using a frame pointer. Later the frame pointer could be switched off
(hence -fomit-frame-pointer), but you still needed it for debug tables. Then
there was Dwarf which didn't need a frame pointer anymore. And today the
frame pointer is off by default globally in GCC.

> - GCC ARM and Clang ARM use R11 for frame pointer, pointing to the stacked
> R11. Useful.

Well Clang does this:

        push    {r4, r10, r11, lr}
        add     r11, sp, #8

but GCC does something different:

        push    {r4, r5, fp, lr}
        add     fp, sp, #12

Ie. FP points to saved LR with GCC but saved FP with Clang, so it's not
possible for a generic unwinder to follow the chain, even ignoring Arm/Thumb
interworking (which is a real issue when an application is Thumb-2 but
various library functions use Arm assembly).
[Bug target/91766] -fvisibility=hidden during -fpic still uses GOT indirection on arm64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91766 --- Comment #12 from Wilco --- (In reply to Andrew Pinski from comment #10) > This should be a global change and not just an aarch64 change. The reason > is because then aarch64 is the odd man out when it comes to this. Agreed, see https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01549.html. It would be great to sort that out so C and C++ finally address globals identically.
[Bug c/85678] -fno-common should be default
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85678 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #6 from Wilco --- (In reply to Jonathan Wakely from comment #5) > The other bug links to a patch to change the default: > > https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01549.html Updated patch: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01847.html
[Bug target/91766] -fvisibility=hidden during -fpic still uses GOT indirection on arm64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91766 --- Comment #13 from Wilco --- (In reply to Wilco from comment #12) > (In reply to Andrew Pinski from comment #10) > > > This should be a global change and not just an aarch64 change. The reason > > is because then aarch64 is the odd man out when it comes to this. > > Agreed, see https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01549.html. It > would be great to sort that out so C and C++ finally address globals > identically. Patch: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01847.html
[Bug target/91927] -mstrict-align doesn't prevent unaligned accesses at -O2 and -O3 on AARCH64 targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91927 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #8 from Wilco ---
Note gcc.target/aarch64/pr71727.c fails when compiled with -mstrict-align
-fno-common -O3:

        adrp    x2, .LC0
        adrp    x3, .LC1
        adrp    x1, xarray
        add     x0, x1, :lo12:xarray
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x2, 5
        ldr     q0, [x3, #:lo12:.LC1]
        str     x2, [x0, 32]
        str     q1, [x1, #:lo12:xarray]
        str     q0, [x0, 16]
        ret

        .bss
        .align  4
        .type   xarray, %object
        .size   xarray, 5120
xarray:
        .zero   5120
[Bug rtl-optimization/92294] New: alias attribute generates incorrect code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92294
Bug ID: 92294
Summary: alias attribute generates incorrect code
Product: gcc
Version: 4.8.4
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: wilco at gcc dot gnu.org
Target Milestone: ---

The following example (from gcc.c-torture/execute/alias-2.c) always calls
abort on any AArch64 compiler with -O1 or -O2:

static int a[10];
extern int b[10] __attribute__ ((alias("a")));
int off = 0;

void f(void)
{
  b[off] = 1;
  a[off] = 2;
  if (b[off] != 2)
    __builtin_abort ();
}

Using extern linkage for 'a' avoids the problem, as does using off = 1 or
static int off = 0. It may only affect targets which use section anchors
since -fno-section-anchors avoids the issue.
[Bug rtl-optimization/92294] alias attribute generates incorrect code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92294 Wilco changed: What|Removed |Added Target||aarch64 Target Milestone|--- |10.0
[Bug rtl-optimization/92294] alias attribute generates incorrect code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92294 Wilco changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-10-31 Ever confirmed|0 |1 --- Comment #2 from Wilco --- Confirmed then
[Bug c++/92425] Incorrect logical AND on 64bit variable using 32bit register
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92425 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #1 from Wilco --- Where do you see a problem? and eax, 4095 clears the top 32 bits of rax.
[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #1 from Wilco ---
(In reply to Aleksei Voitylov from comment #0)
> Created attachment 47212 [details]
> reduced testcase from the openjdk sources
>
> While compiling openjdk sources I bumped into a bug: when -ftree-pre is
> enabled the optimizer hoists out reload of a variable which subsequently
> leads to the infinite loop situation.
>
> Below is the relevant piece of code and "new_value" is the variable that
> gets hoisted out.
>
> template<typename T>
> inline T Atomic::CmpxchgByteUsingInt::operator()(T exchange_value,
>                                                  T volatile* dest,
>                                                  T compare_value,
>                                                  atomic_memory_order order) const {
>   printf ("Atomic::CmpxchgByteUsingInt::operator: 1: %d, 2: %d\n",
>           exchange_value, compare_value);
>   uint8_t canon_exchange_value = exchange_value;
>   uint8_t canon_compare_value = compare_value;
>   volatile uint32_t* aligned_dest
>     = reinterpret_cast<volatile uint32_t*>(align_down(dest, sizeof(uint32_t)));
>   size_t offset = (uintptr_t)dest - (uintptr_t)aligned_dest;
>   uint32_t cur = *aligned_dest;
>   uint8_t* cur_as_bytes = reinterpret_cast<uint8_t*>(&cur);
>   cur_as_bytes[offset] = canon_compare_value;
>   do {
>     uint32_t new_value = cur;
>     reinterpret_cast<uint8_t*>(&new_value)[offset] = canon_exchange_value;
>     printf ("Atomic::CmpxchgByteUsingInt::operator2: 1: %d, 2: %d\n",
>             new_value, cur);
>     uint32_t res = cmpxchg(new_value, aligned_dest, cur, order);
>     if (res == cur) break;
>     cur = res;
>   } while (cur_as_bytes[offset] == canon_compare_value);
>   return PrimitiveConversions::cast<T>(cur_as_bytes[offset]);
> }
>
> $ g++ -O1 -ftree-pre t.cpp
> $ ./a.out
> Atomic::CmpxchgByteUsingInt::operator: 1: 0, 2: 1
> Atomic::CmpxchgByteUsingInt::operator2: 1: 0, 2: 0
>
> $ g++ -O1 t.cpp
> $ ./a.out
> Atomic::CmpxchgByteUsingInt::operator: 1: 0, 2: 1
> Atomic::CmpxchgByteUsingInt::operator2: 1: 0, 2: 256
>
> Below is the assembler of the loop for the correct version:
>
> .L7:
>         ldr     r4, [sp]
>         str     r4, [sp, #4]
>         strb    r7, [r5, #-4]
>         mov     r2, r4
>         ldr     r1, [sp, #4]
>         mov     r0, r6
>         bl      printf
>         cbz     r4, .L6
>         movs    r3, #0
>         str     r3, [sp]
>         ldrb    r3, [r8]        @ zero_extendqisi2
>         cmp     r3, #1
>         beq     .L7
>
> and for the incorrect one:
>
> .L7:
>         str     r4, [sp, #4]
>         strb    r8, [r6]
>         mov     r2, r4
>         ldr     r1, [sp, #4]
>         mov     r0, r5
>         bl      printf
>         cbz     r4, .L6
>         movs    r4, #0
>         str     r4, [sp]
>         ldrb    r3, [r7]        @ zero_extendqisi2
>         cmp     r3, #1
>         beq     .L7

There are serious aliasing bugs in the source - GCC is quite correct in
assuming that cur and cur_as_bytes[offset] never alias (obviously) and even
optimizes away the cmpxchg (no idea why, that appears wrong).

Even if you fix the aliasing bugs, it won't emulate a byte-oriented cmpxchg
correctly, there are bugs in the logic too. So I suggest to go back to the
drawing board - you can't hack your own atomic operations and just hope for
the best. GCC supports a standard set of atomic operations for a good reason!
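For reference, a minimal sketch of the standard builtin that handles the byte
case directly (illustrative, not the openjdk code):

#include <stdint.h>
#include <stdbool.h>

/* Byte-sized compare-and-swap via the GCC builtin.  On Arm cores with byte
   exclusives and on AArch64 this expands to the native LDREXB/STREXB or
   LDXRB/STXRB (or CASB) sequences; names here are illustrative.  */
static bool cmpxchg_byte (volatile uint8_t *dest, uint8_t expected, uint8_t desired)
{
  return __atomic_compare_exchange_n (dest, &expected, desired,
                                      false /* strong */,
                                      __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}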
[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462 Wilco changed: Resolution|INVALID |FIXED --- Comment #7 from Wilco ---
(In reply to Andrew Pinski from comment #2)
> (In reply to Wilco from comment #1)
> > Even if you fix the aliasing bugs, it won't emulate a byte-oriented cmpxchg
> > correctly, there are bugs in the logic too.
>
> More than that, it will never be atomic. You can't do byte wise cmpxchg for
> a word size and think it will be atomic.

And since both Arm and AArch64 have byte-wise __atomic_compare_exchange_n, I
can't see a reason to even try when the compiler already does it correctly
and efficiently. So my advice is to remove this function altogether and all
related code from openjdk - it's completely incorrect and counterproductive.

(In reply to Aleksei Voitylov from comment #4)
> (In reply to Wilco from comment #1)
> > (In reply to Aleksei Voitylov from comment #0)
> > > [...]
> >
> > There are serious aliasing bugs in the source - GCC is quite correct in
> > assuming that cur and cur_as_bytes[offset] never alias (obviously) and even
> > optimize away the cmpxchg (no idea why, that appears wrong).
>
> Isn't
>
>   uint32_t cur = *aligned_dest;
>   uint8_t* cur_as_bytes = reinterpret_cast<uint8_t*>(&cur);
>
> the very definition of the pointer aliasing? Regardless, if the function
> being called is doing atomic operations or not, the variable in question
> should not be hoisted? This is the essence of this bug.

The example as is is completely incorrect - it doesn't even return and ends
up executing random code.

> Yes, openjdk code is doing nasty things further on (and the code predates
> builtin gcc operations and is compiled by other compilers as well which may
> not be aware of builtins), but the bug as it stands does not depend on that
> logic.

Arm and AArch64 compilers support byte-wise cmpxchg, so my advi
[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462 --- Comment #16 from Wilco --- (In reply to Richard Biener from comment #15) > I can't find PRE doing anything wrong and on 32bit x86_64 the testcase > executes > correctly with GCC 7.3 and GCC 9 (when I add the missing return to > Bar::cmpxchg). > So -ftree-pre, if it triggers a bug, triggers it elsewhere, the bug isn't in > PRE itself AFAICS. Well the difference is that one of the loads is removed by dse1: **scanning insn=26 cselib lookup (reg/f:SI 102 sfp) => 8:8 cselib lookup (reg/f:SI 126) => 9:4339 cselib lookup (reg/f:SI 102 sfp) => 8:8 cselib lookup (reg/f:SI 102 sfp) => 8:8 cselib lookup (plus:SI (reg/f:SI 102 sfp) (const_int -8 [0xfff8])) => 9:4339 mem: (plus:SI (reg/f:SI 102 sfp) (const_int -8 [0xfff8])) after canon_rtx address: (plus:SI (reg/f:SI 102 sfp) (const_int -8 [0xfff8])) gid=1 offset=-8 processing const load gid=1[-8..-4) trying to replace SImode load in insn 26 from SImode store in insn 16 deferring rescan insn with uid = 26. deferring rescan insn with uid = 77. -- replaced the loaded MEM with (reg 144) mems_found = 0, cannot_delete = true cselib lookup (mem/c:SI (plus:SI (reg/f:SI 102 sfp) (const_int -8 [0xfff8])) [1 curD.6314+0 S4 A64]) => 0:0
[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462 --- Comment #19 from Wilco --- (In reply to Richard Biener from comment #18) > So I see before DSE1: > > (insn 16 15 17 2 (set (mem/c:SI (plus:SI (reg/f:SI 102 sfp) > (const_int -8 [0xfff8])) [1 cur+0 S4 A64]) > (reg:SI 119 [ _13 ])) "t.ii":46:14 241 {*arm_movsi_insn} > (expr_list:REG_DEAD (reg:SI 119 [ _13 ]) > (nil))) > (insn 17 16 18 2 (set (reg/f:SI 125) > (plus:SI (reg/f:SI 102 sfp) > (const_int -8 [0xfff8]))) "t.ii":48:26 7 > {*arm_addsi3} > (nil)) > (insn 18 17 19 2 (set (reg/f:SI 120 [ _14 ]) > (plus:SI (reg/f:SI 125) > (reg/v:SI 118 [ offset ]))) "t.ii":48:26 7 {*arm_addsi3} > (nil)) > (insn 19 18 21 2 (set (reg:SI 126) > (const_int 1 [0x1])) "t.ii":48:26 241 {*arm_movsi_insn} > (nil)) > (insn 21 19 22 2 (set (mem:QI (plus:SI (reg/f:SI 125) > (reg/v:SI 118 [ offset ])) [0 *_14+0 S1 A8]) > (subreg:QI (reg:SI 126) 0)) "t.ii":48:26 251 {*arm_movqi_insn} > (expr_list:REG_DEAD (reg:SI 126) > (nil))) > (insn 22 21 23 2 (set (reg:SI 113 [ prephitmp_6 ]) > (mem/c:SI (plus:SI (reg/f:SI 102 sfp) > (const_int -8 [0xfff8])) [1 cur+0 S4 A64])) > "t.ii":50:18 241 {*arm_movsi_insn} > (nil)) > > where DSE1 then replaces the load in insn 22 with (reg:SI 119) (via a new > pseudo). But clearly insn 21 clobbers it (using alias-set zero). This > is on GIMPLE > > cur = _13; > _14 = &cur + offset_12; > *_14 = 1; > pretmp_22 = cur; > > not sure why DSE1 performs CSE (heh), but clearly what it does is wrong. > And yeah, probably tree PRE "enables" this miscompilation. DSE has > some special case with frame pointer bases and constant offsets. > > But the target dependency might come in via section anchors since I see > > **scanning insn=21 > mem: (plus:SI (reg/f:SI 125) > (reg/v:SI 118 [ offset ])) > >after canon_rtx address: (plus:SI (plus:SI (reg/f:SI 102 sfp) > (reg/v:SI 118 [ offset ])) > (const_int -8 [0xfff8])) > >after cselib_expand address: (plus:SI (plus:SI (and:SI (symbol_ref:SI > ("*.LANCHOR0") [flags 0x182]) > (const_int 3 [0x3])) > (reg/f:SI 102 sfp)) > (const_int -8 [0xfff8])) > >after canon_rtx address: (plus:SI (plus:SI (and:SI (symbol_ref:SI > ("*.LANCHOR0") [flags 0x182]) > (const_int 3 [0x3])) > (reg/f:SI 102 sfp)) > (const_int -8 [0xfff8])) > varying cselib base=11:1018534969 offset = -8 > processing cselib store [-8..-7) > mems_found = 1, cannot_delete = false > > where it's odd that we end up with an address like this? I suspect that > base_alias_check (my special friend) disambiguates the stack-based > access with one that now appears as a *.LANCHOR0 based one. > > We call canon_true_dependence with > > (plus:SI (value:SI 11:1018534969 @0x3089c60/0x30f5ed0) > (const_int -8 [0xfff8])) > > for this and get_addr turns it into > > (plus:SI (plus:SI (and:SI (symbol_ref:SI ("*.LANCHOR0") [flags 0x182]) > (const_int 3 [0x3])) > (value:SI 8:8 @0x3089c18/0x30f5e40)) > (const_int -8 [0xfff8])) > > and indeed find_base_term returns > > (symbol_ref:SI ("*.LANCHOR0") [flags 0x182]) > > for this. Which "obviously" doesn't alias with the stack based address > because that one is "unique_base_value_p" (address:SI -3) so we win here: > > 2229 if (unique_base_value_p (x_base) || unique_base_value_p (y_base)) > 2230return 0; > > now I wonder where that LANCHOR thing comes from. arm folks? Well when you do: int *p = &local + ((intptr_t)&global & 3); then you get the above expression when the global is addressed via an anchor. But whether or not a target uses anchors doesn't matter - the example still fails with -O1 -fno-section-anchors. 
So I'm wondering whether the issue is in the special code that tries to interpret an AND in an address, which mistakes the base for the symbol_ref even though the real base is sfp.
[Bug target/92462] [arm32] -ftree-pre makes a variable to be wrongly hoisted out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92462 --- Comment #23 from Wilco --- (In reply to Richard Biener from comment #22) > Fixed on trunk. Can arm people verify? I checked the DSE dump only. Bonus > if you manage to create a testcase for the testsuite failing before, passing > now. > > The patch is simple enough to backport if it works. I will have a look. But only checking the low bit is still way too dangerous. Aliasing checks should be conservative and 100% accurate, not use random heuristics which hope for the best. So if we want to keep the AND code then it needs to look for a mask with all top bits set as the absolute minimum.
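A minimal sketch of the kind of check meant by "a mask with all top bits set" (a hypothetical helper, not the actual alias.c fix): only masks of the form ~(align - 1), i.e. alignment masks, should be treated as preserving the base; the (and ... (const_int 3)) from this PR would be rejected.

#include <stdbool.h>

/* Return true if MASK looks like an alignment mask such as ~7 (...11111000):
   its complement plus one must be a power of two.  */
static bool
alignment_mask_p (unsigned long mask)
{
  unsigned long align = ~mask + 1;
  return mask != 0 && align != 0 && (align & (align - 1)) == 0;
}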
[Bug target/79262] [8/9/10 Regression] load gap with store gap causing performance regression in 462.libquantum
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262 --- Comment #8 from Wilco --- Author: wilco Date: Tue Nov 19 15:57:54 2019 New Revision: 278452 URL: https://gcc.gnu.org/viewcvs?rev=278452&root=gcc&view=rev Log: [AArch64] PR79262: Adjust vector cost PR79262 has been fixed for almost all AArch64 cpus, however the example is still vectorized in a few cases, resulting in lower performance. Adjust the vector cost slightly so that -mcpu=cortex-a53 now has identical performance to -mcpu=cortex-a57 on libquantum. gcc/ PR target/79262 * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost. Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/aarch64.c
[Bug target/79262] [8/9/10 Regression] load gap with store gap causing performance regression in 462.libquantum
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262 Wilco changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #9 from Wilco --- libquantum perf is now the same between Cortex-A53 and A57.
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 79262, which changed state. Bug 79262 Summary: [8/9/10 Regression] load gap with store gap causing performance regression in 462.libquantum https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug c/92612] [10 Regression] Linker error in 525.x264_r after r278509
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92612 --- Comment #2 from Wilco --- (In reply to Martin Liška from comment #1) > Following patch fixes that: > > diff --git a/benchspec/CPU/525.x264_r/src/ldecod_src/inc/configfile.h > b/benchspec/CPU/525.x264_r/src/ldecod_src/inc/configfile.h > index 12ed1cc8..450930e2 100644 > --- a/benchspec/CPU/525.x264_r/src/ldecod_src/inc/configfile.h > +++ b/benchspec/CPU/525.x264_r/src/ldecod_src/inc/configfile.h > @@ -18,9 +18,9 @@ > //#define LEVEL_IDC 21 > > > -InputParameters cfgparams; > > #ifdef INCLUDED_BY_CONFIGFILE_C > +InputParameters cfgparams; > // Mapping_Map Syntax: > // {NAMEinConfigFile, &cfgparams.VariableName, Type, InitialValue, > LimitType, MinLimit, MaxLimit, CharSize} > // Types : {0:int, 1:text, 2: double} > @@ -59,6 +59,7 @@ extern Mapping Map[]; > #endif > extern void JMDecHelpExit (); > extern void ParseCommand(InputParameters *p_Inp, int ac, char *av[]); > +extern InputParameters cfgparams; > > #endif (In reply to Martin Liška from comment #0) > Since defaulting to -fno-common, one can't build the benchmark due to: Yes some SPEC benchmarks rely on tentative definitions. Given it is a portability issue, but the source can't be fixed, the easiest way is to add -fcommon in the config file for the failing cases. This is similar to the other workarounds like -fno-strict-aliasing.
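For illustration (not from the benchmark sources), the pattern that breaks under the new default is a tentative definition repeated in several translation units:

/* a.c */
int counter;        /* tentative definition, no initializer */

/* b.c */
int counter;        /* another tentative definition of the same object */

/* With -fcommon both definitions are merged into one common symbol at link
   time; with -fno-common (the GCC 10 default) linking a.o and b.o fails with
   a multiple-definition error.  The portable fix is 'extern int counter;' in
   a header plus exactly one definition, or -fcommon as a workaround.  */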
[Bug target/91927] -mstrict-align doesn't prevent unaligned accesses at -O2 and -O3 on AARCH64 targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91927 --- Comment #10 from Wilco --- (In reply to Andrew Pinski from comment #9) > I think the following patch is the correct fix: > diff --git a/gcc/config/aarch64/aarch64-simd.md > b/gcc/config/aarch64/aarch64-simd.md > index ad4676bc167..787323255cb 100644 > --- a/gcc/config/aarch64/aarch64-simd.md > +++ b/gcc/config/aarch64/aarch64-simd.md > @@ -41,7 +41,7 @@ > (define_expand "movmisalign" >[(set (match_operand:VALL 0 "nonimmediate_operand") > (match_operand:VALL 1 "general_operand"))] > - "TARGET_SIMD" > + "TARGET_SIMD && !STRICT_ALIGNMENT" > { >/* This pattern is not permitted to fail during expansion: if both > arguments > are non-registers (e.g. memory := constant, which can be created by the > > > Basically movmisalign should not be there if strict alignment as it is just > emitting a set. That fixes one of the examples but not pr71727. As the Arm movsi issues show, there are many mid-end alignment bugs.
[Bug rtl-optimization/92637] runtime issue with -ftree-coalesce-vars
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92637 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #2 from Wilco --- (In reply to John Dong from comment #0) > Created attachment 47338 [details] > testsuite > > hi, I compiled the attached with aarch64-linux-gnu-gcc -c -O2 > -march=armv8.1-a testsuite.c -o testsuite.o, it had a runtime error, and I > found x10 was overwritten at row 214. This example doesn't show a problem. Do you have an example that can be run and gives the runtime error?
[Bug c/85678] -fno-common should be default
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85678 --- Comment #8 from Wilco --- (In reply to David Binderman from comment #7) > (In reply to David Brown from comment #0) > > Surely it is time to make "-fno-common" the default, at least when a modern > > C standard is specified indicating that the code is modern? People who need > > the old behaviour can always get it with "-fcommon". > > Interestingly, use of -std=c89 or -std=gnu89 doesn't also switch on -fcommon > to get old behaviour. > > So use of compiler flag indicating to compile code to old standards > means implicit use of *new* standard for common. > > Looks odd to me. Possible bug ? If required, it would be feasible to keep the old behaviour for C89 indeed, however -fno-common is not incompatible with C89 (embedded C compilers may not even support -fcommon).
[Bug c/85678] -fno-common should be default
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85678 --- Comment #12 from Wilco --- (In reply to David Brown from comment #11) > Changing the default to "-fno-common" (and ideally > "-Werror=strict-prototypes -Werror=old-style-declaration > -Werror=missing-parameter-type") would have a lot smaller impact than > changing the default standard. Giving errors on old-style code by default sounds like a good idea. We could add -std=legacy similar to Fortran to support building old K&R code (and that would enable -fcommon by default).
[Bug target/92665] [AArch64] low lanes select not optimized out for vmlal intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92665 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #3 from Wilco --- (In reply to Andrew Pinski from comment #2) > Created attachment 47356 [details] > Patch which I wrote for GCC 7.3 > > I have to double check if it applies directly as I had other patches in this > area but this is the patch which I had wrote for GCC 7.3. I think it's because many intrinsics in arm_neon.h still use asm which inhibits most optimizations.
[Bug target/92692] [9/10 Regression] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #3 from Wilco --- (In reply to Andrew Pinski from comment #2) > I think this has been a latent bug since revision 243200: > [AArch64] Separate shrink wrapping hooks implementation > > I think aarch64_disqualify_components would be a location which should > disqualify the Separate for the register 19. What is the "exclusives reservation granule" size? It could only fail if the granule is large and the spill happens to be in the same granule as the stxr. I guess it's easy to fix by delaying the expansion or inserting a clobber of x19 before the loop starts.
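For context, a sketch of the kind of source that expands to such a loop (illustrative, not the PR's testcase); the point above is that no store, including a register spill of x19, may be scheduled between the exclusive load and store:

#include <stdatomic.h>
#include <stdbool.h>

bool
try_update (_Atomic long *p, long expected, long desired)
{
  /* Without LSE atomics this becomes a load-exclusive/store-exclusive
     (ldxr/stxr) loop; a memory access in between can clear the exclusive
     monitor and, in the worst case, prevent forward progress.  */
  return atomic_compare_exchange_strong (p, &expected, desired);
}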
[Bug target/92692] [9/10 Regression] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692 --- Comment #5 from Wilco --- (In reply to Andrew Pinski from comment #4) > (In reply to Wilco from comment #3) > > (In reply to Andrew Pinski from comment #2) > > > I think this has been a latent bug since revision 243200: > > > [AArch64] Separate shrink wrapping hooks implementation > > > > > > I think aarch64_disqualify_components would be a location which should > > > disqualify the Separate for the register 19. > > > > What is the "exclusives reservation granule" size? It could only fail if the > > granule is large and the spill happens to be in the same granule as the > > stxr. > NO "exclusives reservation granule" does not matter here, please read the > ARMv8 spec again copied below (B2-142): > LoadExcl/StoreExcl loops are guaranteed to make forward progress only if, > for any LoadExcl/StoreExcl loop > within a single thread of execution, the software meets all of the following > conditions: > 1 Between the Load-Exclusive and the Store-Exclusive, there are no > explicit memory accesses, > preloads, direct or indirect System register writes, address translation > instructions, cache or TLB > maintenance instructions, exception generating instructions, exception > returns, or indirect branches. > --- CUT > > no explicit memory accesses > Is a requirement so it does not matter what "exclusives reservation granule" > size is really. > We had gone through this beforehand with the ARM architectures and made sure > that the specifications was worded correctly to the above effect. The > wording change happened in 2016. Well I'm looking at the latest version (https://static.docs.arm.com/ddi0487/ea/DDI0487E_a_armv8_arm.pdf) where in figure B2-5 it explicitly states that a store that does not match the reservation granule on the same CPU must not change the exclusive state. However if a store does match the granule it is implementation defined, hence the reason for the text you quote to guarantee forward progress - otherwise a random store in the loop could accidentally match the exclusive granule and block progress. However I don't see it saying anywhere that all stores must clear the exclusive state.
[Bug driver/89014] Use-after-free in aarch64 -march=native
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89014 --- Comment #9 from Wilco --- Author: wilco Date: Fri Nov 29 17:22:30 2019 New Revision: 278854 URL: https://gcc.gnu.org/viewcvs?rev=278854&root=gcc&view=rev Log: aarch64: fix use-after-free in -march=native (PR driver/89014) Running: $ valgrind ./xgcc -B. -c test.c -march=native on aarch64 shows a use-after-free in host_detect_local_cpu due to the std::string result of aarch64_get_extension_string_for_isa_flags only living until immediately after a c_str call. This leads to corrupt "-march=" values being passed to cc1. This patch fixes the use-after-free, though it appears to also need Tamar's patch here: https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01302.html in order to generate valid values for cc1. This may have worked by accident in the past, if the corrupt "-march=" value happened to be 0-terminated in the "right" place; with this patch it now appears to reliably break without Tamar's patch. Backport from mainline 2019-01-23 David Malcolm PR driver/89014 * config/aarch64/driver-aarch64.c (host_detect_local_cpu): Fix use-after-free of the result of aarch64_get_extension_string_for_isa_flags. Modified: branches/gcc-8-branch/gcc/ChangeLog branches/gcc-8-branch/gcc/config/aarch64/driver-aarch64.c
[Bug tree-optimization/92738] [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #7 from Wilco --- (In reply to Martin Liška from comment #6) > So wrf grew starting with r271377, size (w/o debug info) goes from 20164464B > to 23674792. Also check the build time of wrf. Looking at my logs trunk takes 2x as long to build it since June.
[Bug tree-optimization/92738] [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738 --- Comment #9 from Wilco --- (In reply to Martin Liška from comment #8) > (In reply to Wilco from comment #7) > > (In reply to Martin Liška from comment #6) > > > So wrf grew starting with r271377, size (w/o debug info) goes from > > > 20164464B > > > to 23674792. > > > > Also check the build time of wrf. Looking at my logs trunk takes 2x as long > > to build it since June. > > Maybe related to: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91509 > ? I think not, this is plain -Ofast, no LTO or prefetching. The same slowdown happens with -O2.
[Bug tree-optimization/92738] [10 regression] Large code size growth for -O2 binaries between 2019-05-19...2019-05-29
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92738 --- Comment #11 from Wilco --- (In reply to Thomas Koenig from comment #10) > (In reply to Martin Liška from comment #6) > > So wrf grew starting with r271377, size (w/o debug info) goes from 20164464B > > to 23674792. > > I think we've had this discussion before, although I cannot offhand > recall the PR number. PR91512 is closely related. > > Since r271377, arguments which may be contiguous are now (conditionally) > packed and unpacked inline (see PR88821). > > This was done so that the middle end can look into these conversions > and possibly eliminate them, if it can be determined via inlining > or LTO that the argument is contiguous anyway). This can lead to an > extremely large performance boost for some test cases (*10 or more), > but will, in general, lead to a size increase. > > Now, wrf has an extremely strange (and rare) programming style where they > pass > a ton of assumed shape arguments (where it is not clear, at compile-time, > if they need packing/unpacking) to an old-style array argument. This > causes considerable code size increase. > > So, it's a tradeoff, which we discussed at the time. This is why this > is not done at -Os. > > Should we "fix" this? I think not, the style of wrf is just too horrid, > and pessimizing other programs for the sake of one benchmark makes little > sense to me. Would using -frepack-arrays solve this issue? I proposed making that the default a while back. It would do any repacking that is necessary at call sites rather than creating multiple copies of all loops in every function.
[Bug tree-optimization/92822] [10 Regression] vfma_laneq_f32 and vmul_laneq_f32 are broken on aarch64 after r278938
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92822 Wilco changed: What|Removed |Added CC||rguenth at gcc dot gnu.org, ||wilco at gcc dot gnu.org Component|target |tree-optimization --- Comment #4 from Wilco --- (In reply to nsz from comment #2) > e.g. > > #include > > float32x2_t > foo (float32x2_t v0, float32x4_t v1) > { > return vmulx_laneq_f32 (v0, v1, 0); > } > > used to get translated to > > foo: > fmulx v0.2s, v0.2s, v1.s[0] > ret > > now it is > > foo: > adrpx0, .LC0 > ldr q2, [x0, #:lo12:.LC0] > tbl v1.16b, {v1.16b}, v2.16b > fmulx v0.2s, v0.2s, v1.2s > ret > .size foo, .-foo > .section.rodata.cst16,"aM",@progbits,16 Yes the change inserts a VEC_PERM_EXPR with random values for the upper lanes which becomes a TBL instruction. It happens when you extract a lane from a 128-bit vector and then dup it to a 64-bit vector. Optimized tree before: foo (float32x2_t v0, float32x4_t v1) { float _4; __Float32x2_t _5; __Float32x2_t _6; [local count: 1073741824]: __builtin_aarch64_im_lane_boundsi (16, 4, 0); _4 = BIT_FIELD_REF ; _5 = {_4, _4}; _6 = __builtin_aarch64_fmulxv2sf (v0_2(D), _5); [tail call] return _6; } And after r278938: foo (float32x2_t v0, float32x4_t v1) { __Float32x2_t _4; __Float32x2_t _7; __Float32x4_t _8; [local count: 1073741824]: __builtin_aarch64_im_lane_boundsi (16, 4, 0); _8 = VEC_PERM_EXPR ; _7 = BIT_FIELD_REF <_8, 64, 0>; _4 = __builtin_aarch64_fmulxv2sf (v0_2(D), _7); [tail call] return _4; }
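The pattern written out with plain intrinsics (a sketch of what vmulx_laneq_f32 amounts to, not the arm_neon.h definition): a lane is extracted from the 128-bit operand and duplicated into a 64-bit vector, and it is that dup which is now lowered via VEC_PERM_EXPR/tbl instead of a lane-indexed fmulx.

#include <arm_neon.h>

float32x2_t
mulx_lane0 (float32x2_t v0, float32x4_t v1)
{
  /* Extract lane 0 of the 128-bit vector and broadcast it to a 64-bit
     vector, then multiply - ideally this stays fmulx v0.2s, v0.2s, v1.s[0].  */
  float32x2_t dup = vdup_n_f32 (vgetq_lane_f32 (v1, 0));
  return vmulx_f32 (v0, dup);
}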
[Bug rtl-optimization/93007] New: [10 regression] pr77698.c testcase fails due to block commoning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93007 Bug ID: 93007 Summary: [10 regression] pr77698.c testcase fails due to block commoning Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- Since r276960 we see this failure on Arm: FAIL: gcc.dg/tree-prof/pr77698.c scan-rtl-dump-times alignments "internal loop alignment added" 1 The issue appears to be that basic block commoning works on an unrolled loop, which is unlikely to be beneficial for performance: .L17: addsr0, r0, #1 b .L27 .L6: ldr r4, [r2, #12] addsr0, r0, #4 ldr lr, [r1] str lr, [r3, r4, lsl #2] ldr r4, [r2, #12] ldr lr, [r1] str lr, [r3, r4, lsl #2] ldr r4, [r2, #12] ldr lr, [r1] str lr, [r3, r4, lsl #2] .L27: ldr r4, [r2, #12] cmp ip, r0 ldr lr, [r1] str lr, [r3, r4, lsl #2] bne .L6 pop {r4, pc} The test could be easily fixed, but ensuring block commoning takes loops and execution frequencies into account would be better overall.
[Bug tree-optimization/93023] give preference to address iv without offset in ivopts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93023 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #1 from Wilco --- (In reply to Feng Xue from comment #0) > Analysis into ivopts shows that those address IVs have same in-loop cost, > and IV w/o offset does have smaller pre-loop setup cost. But since the setup > cost will be averaged to each iteration, the minor cost difference will go > due to round-off by integer division. To fix this round-off error, cost can > be represented in a more accurate way, such as adding a fraction part to > make it a fixpoint number. It would be easy to adjust the costs. There are a few places where address costs (multiples of 1) are incorrectly mixed with rtx costs (multiples of 4), so adjusting address costs should improve things. However I suspect the problem is also caused by using overly high iteration estimates and not taking codesize into account. If the loop costs are identical then you obviously choose the variant with the fewest instructions.
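A minimal sketch of the fixed-point idea from the report (hypothetical code, not the ivopts implementation): scale costs by a constant so that averaging the setup cost over the iteration count keeps a fractional part instead of rounding small differences away.

#define COST_SCALE 16   /* 4 fractional bits */

/* Cost of one candidate: per-iteration cost plus setup cost averaged over
   the expected iteration count, kept in scaled (fixed-point) form so two
   candidates differing only in setup cost still compare differently.  */
static int
scaled_total_cost (int per_iter_cost, int setup_cost, int niters)
{
  return per_iter_cost * COST_SCALE + (setup_cost * COST_SCALE) / niters;
}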
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838 --- Comment #8 from Wilco --- Author: wilco Date: Fri Jan 10 19:32:53 2020 New Revision: 280132 URL: https://gcc.gnu.org/viewcvs?rev=280132&root=gcc&view=rev Log: PR90838: Support ctz idioms Support common idioms for count trailing zeroes using an array lookup. The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic constant which when multiplied by a power of 2 creates a unique value in the top 5 or 6 bits. This is then indexed into a table which maps it to the number of trailing zeroes. When the table is valid, we emit a sequence using the target defined value for ctz (0): int ctz1 (unsigned x) { static const char table[32] = { 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9 }; return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27]; } Is optimized to: rbitw0, w0 clz w0, w0 and w0, w0, 31 ret gcc/ PR tree-optimization/90838 * tree-ssa-forwprop.c (check_ctz_array): Add new function. (check_ctz_string): Likewise. (optimize_count_trailing_zeroes): Likewise. (simplify_count_trailing_zeroes): Likewise. (pass_forwprop::execute): Try ctz simplification. * match.pd: Add matching for ctz idioms. testsuite/ PR tree-optimization/90838 * testsuite/gcc.target/aarch64/pr90838.c: New test. Added: trunk/gcc/testsuite/gcc.target/aarch64/pr90838.c Modified: trunk/gcc/ChangeLog trunk/gcc/match.pd trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-ssa-forwprop.c
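A quick self-check of the idiom (ordinary C, not part of the committed testsuite; assumes the ctz1 function from the log above is visible in the same file): for every power of two the table lookup agrees with __builtin_ctz.

#include <assert.h>

int
main (void)
{
  for (unsigned x = 1; x != 0; x <<= 1)
    assert (ctz1 (x) == __builtin_ctz (x));
  return 0;
}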
[Bug bootstrap/93229] simplify_count_trailing_zeroes doesn't compile on x86_64-pc-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93229 --- Comment #2 from Wilco --- (In reply to David Malcolm from comment #0) > A pristine checkout of r280132 doesn't build for me on x86_64-pc-linux-gnu: > > ../../src/gcc/tree-ssa-forwprop.c: In function ‘bool > simplify_count_trailing_zeroes(gimple_stmt_iterator*)’: > ../../src/gcc/config/i386/i386.h:2886:30: error: cannot convert > ‘poly_uint16’ {aka ‘poly_int<1, short unsigned int>’} to ‘long int’ in > assignment > 2886 | ((VALUE) = GET_MODE_BITSIZE (MODE), TARGET_BMI ? 1 : 0) > | ~^~ > | | > | poly_uint16 {aka poly_int<1, short > unsigned int>} > ../../src/gcc/tree-ssa-forwprop.c:1925:22: note: in expansion of macro > ‘CTZ_DEFINED_VALUE_AT_ZERO’ > 1925 | bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), > ctzval) == 2; > | ^ > > > I'm assuming this was introduced in r280132, as that commit introduced > simplify_count_trailing_zeroes. > > Am using gcc-9.2.1-1.fc30.x86_64 to try to build stage 1. That's odd, it shouldn't be using any poly types on x86... Machmode.h has: #if ONLY_FIXED_SIZE_MODES #define GET_MODE_BITSIZE(MODE) ((unsigned short) mode_to_bits (MODE).coeffs[0]) #else That's doing the correct thing if ONLY_FIXED_SIZE_MODES is defined.
[Bug bootstrap/93229] simplify_count_trailing_zeroes doesn't compile on x86_64-pc-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93229 --- Comment #4 from Wilco --- (In reply to David Malcolm from comment #3) > Apparently broken on other archs too, and for other people; from #gcc: > > nathan: I assume it's not just broken for me; I'm somewhat > sleep-deprived here > dmalcolm: broke PPC > correct. I guess it's somewhat target-specific, otherwise wilco > would have hit it > Rhy0lite, nathan: thanks > x86_64 for me, in case not obvious > breaks: ~{ARM,AArch64} Hmm, so maybe it's GET_MODE_BITSIZE that fails somehow? AArch64/Arm have always used GET_MODE_UNIT_BITSIZE. Looking at the difference, that seems to do something special for complex types.
[Bug tree-optimization/93231] [10 Regression] ICEs since r280132
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93231 --- Comment #4 from Wilco --- (In reply to Jakub Jelinek from comment #0) > int ctz2 (int x) > { > static const char table[32] = > { > 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, > 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9 > }; > > return table[((int)((x & -x) * -0x077CB531)) >> 27]; > } > > ICEs because >unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc); > is used without checking that the INTEGER_CST fits into uhwi. > If we don't want to support signed x, we should start the function by > verification that inner_type is INTEGRAL_TYPE_P which is TYPE_UNSIGNED, if > we do want to support even signed values, it needs to be tweaked differently. I guess TREE_INT_CST_LOW should fix that. The goal is to support signed and unsigned types. > Similarly, > int ctz3 (unsigned x) > { > static const char table[32] = > { > 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, > 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9 > }; > > return table[((unsigned)((x & -x) * 0x077CB531U)) >> -27]; > } > ICEs because -27 doesn't fit into uhwi. We should just punt if it doesn't. I'm adding an extra tree_fits_uhwi for that. > I'm also surprised by > /* Check the array is not wider than integer type and the input is a 32-bit > or 64-bit type. */ > if (TYPE_PRECISION (type) > 32) > return false; > because the comment doesn't match what the check is doing, either you want > an array with 32-bit or smaller elts, then the comment should match that, or > you care about integers and then you should compare against TYPE_PRECISION > (integer_type_node). I'll check but I think this check is no longer required. > Also, there is no testcase for the string case, nor any non-target specific > testcase that it at least compiles and perhaps with tree dump scan on > selected > targets that it recognizes the ctz. I can add a generic testcase as well. > And I don't see a check that if it is a STRING_CST, the array elements must > be > bytes and not wider, which the function assumes (e.g. if it is u"..."[(x & > -x) > * ...) >> 27]). Right now "abc"[] can't match - the constructor always returns an error for this case. And this doesn't seem a common idiom so adding support isn't useful.
[Bug tree-optimization/93231] [10 Regression] ICEs since r280132
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93231 Wilco changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #6 from Wilco --- Patch at https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00731.html
[Bug target/92692] [9/10 Regression] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692 --- Comment #10 from Wilco --- (In reply to Jakub Jelinek from comment #9) > Any -march= or similar? Can't reproduce with current trunk, nor > with even Oct 10 GCC snapshot (crosses in both cases). > grep -B1 stxr pr92692.s > doesn't show any stores before stxr. It's going to be extremely sensitive to register allocation, so it's not clear it's worth trying to reproduce. The easiest option is to see whether replacing "reload_completed" with "epilogue_completed" in aarch64/atomics.md works fine.
[Bug target/92692] [9/10 Regression] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692 Wilco changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #13 from Wilco --- (In reply to Wilco from comment #10) > The easiest option is to see whether replacing "reload_completed" with > "epilogue_completed" in aarch64/atomics.md works fine. That works indeed, I'll post a patch.
[Bug tree-optimization/93231] [10 Regression] ICEs since r280132
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93231 Wilco changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #8 from Wilco --- Fixed.
[Bug target/71727] -O3 -mstrict-align produces code which assumes unaligned vector accesses work
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71727 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #8 from Wilco --- (In reply to Martin Liška from comment #7) > Christophe: Can the bug be marked as resolved? See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91927 - it's still failing since the underlying issues haven't been resolved.
[Bug target/92692] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692 Wilco changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED Target Milestone|9.3 |8.4 Summary|[9 Regression] Saving off |Saving off the callee saved |the callee saved register |register between ldxr/stxr |between ldxr/stxr (caused |(caused by shrink wrapping |by shrink wrapping |improvements) |improvements) | --- Comment #20 from Wilco --- Backported to GCC8 and GCC9, fixed on all active branches.
[Bug middle-end/64242] Longjmp expansion incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242 --- Comment #37 from Wilco --- (In reply to Andrew Pinski from comment #36) > MIPS is still broken. I might look into MIPS brokenness next week. Yes it seems builtin_longjmp has the exact same fp corruption issue: move $fp,$17 lw $sp,8($fp) jr $16 lw $28,12($fp)
[Bug rtl-optimization/93565] New: Combine duplicates count trailing zero instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 Bug ID: 93565 Summary: Combine duplicates count trailing zero instructions Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- Created attachment 4 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=4&action=edit ctz_duplication The attached example causes Combine to duplicate count trailing zero instructions on targets which have CTZ_DEFINED_VALUE = 2: f: cbz x0, .L2 rbitx2, x0 rbitx0, x0 clz x2, x2 clz x0, x0 ldr w2, [x1, x2, lsl 2] orr w0, w2, w0 str w0, [x1] .L2: mov x0, 0 ret The cause is Combine deciding to merge CTZ into a sign-extend: (insn 10 9 12 3 (set (reg:DI 100) (ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":17:15 689 {ctzdi2} (expr_list:REG_DEAD (reg/v:DI 98 [ x ]) (nil))) (insn 12 10 14 3 (set (reg:DI 101 [ _9 ]) (sign_extend:DI (subreg:SI (reg:DI 100) 0))) "ctz2.c":25:15 104 {*extendsidi2_aarch64} (nil)) allowing combination of insns 10 and 12 original costs 4 + 4 = 8 replacement costs 4 + 4 = 8 modifying insn i210: r100:DI=ctz(r98:DI) deferring rescan insn with uid = 10. modifying insn i312: r101:DI=ctz(r98:DI) REG_DEAD r98:DI (insn 10 9 12 3 (set (reg:DI 100) (ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":17:15 689 {ctzdi2} (nil)) (insn 12 10 14 3 (set (reg:DI 101 [ _9 ]) (ctz:DI (reg/v:DI 98 [ x ]))) "ctz2.c":25:15 689 {ctzdi2} (expr_list:REG_DEAD (reg/v:DI 98 [ x ]) (nil))) Later passes then seem unable to CSE the two identical CTZ instructions... This doesn't seem right - if Combine can optimize the sign-extend away then there is no point in duplicating the CTZ.
[Bug rtl-optimization/93565] Combine duplicates count trailing zero instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 --- Comment #3 from Wilco --- (In reply to Segher Boessenkool from comment #2) > Of course it first tried to do > > Failed to match this instruction: > (parallel [ > (set (reg:DI 101 [ _9 ]) > (ctz:DI (reg/v:DI 98 [ x ]))) > (set (reg:DI 100) > (ctz:DI (reg/v:DI 98 [ x ]))) > ]) > > so we could try to do that as just the ctz and then a register move, > and hope that move can be optimised away. But this is more expensive > if it can *not* be optimised (higher latency). Hrm. Yes if a sign/zero-extend is proven to be redundant, it should be replaced with a move - it's unlikely it could not be removed either by Combine or during register allocation. It seems to me this could happen with any instruction pair where it decides to forward substitute, but keep the original instruction. If the costs are identical, it's better to replace the 2nd instruction with a move. Would it already do this if say we counted moves as somewhat lower cost than ALU instructions?
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 Wilco changed: What|Removed |Added CC||segher at kernel dot crashing.org Summary|Combine duplicates count|[9/10 regression] Combine |trailing zero instructions |duplicates instructions --- Comment #8 from Wilco --- Here is a much simpler example: void f (int *p, int y) { int a = y & 14; *p = a | p[a]; } Trunk and GCC9.1 for x64: mov eax, esi and esi, 14 and eax, 14 or eax, DWORD PTR [rdi+rsi*4] mov DWORD PTR [rdi], eax ret and AArch64: and x2, x1, 14 and w1, w1, 14 ldr w2, [x0, x2, lsl 2] orr w1, w2, w1 str w1, [x0] ret However GCC8.2 does: and w1, w1, 14 ldr w2, [x0, w1, sxtw 2] orr w2, w2, w1 str w2, [x0] ret So it is a 9 regression...
[Bug target/92692] Saving off the callee saved register between ldxr/stxr (caused by shrink wrapping improvements)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92692 --- Comment #22 from Wilco --- (In reply to Sebastian Pop from comment #21) > It looks like this hunk from the trunk version of the patch is missing on > gcc-9 branch: > > diff --git a/gcc/config/aarch64/atomics.md b/gcc/config/aarch64/atomics.md > index cabcc58f1a0..1458bc00095 100644 > --- a/gcc/config/aarch64/atomics.md > +++ b/gcc/config/aarch64/atomics.md > @@ -104,7 +104,7 @@ > (clobber (match_scratch:SI 7 "=&r"))] >"" >"#" > - "&& reload_completed" > + "&& epilogue_completed" >[(const_int 0)] >{ > aarch64_split_compare_and_swap (operands); > > > > With this hunk applied my bootstrap passes on the gcc-9 branch on an > aarch64-linux graviton2. > > Without this hunk I see an error in thread sanitizers. > > I also have checked gcc-8 release branch and it seems that the patch is not > missing any hunks in that branch. > > Could somebody apply the missing hunk to the gcc-9 release branch? Thanks! I don't see anything like that on the gcc-9 branch - are you sure you don't have an outstanding change somehow?
[Bug target/91598] [8/9/10 regression] 60% speed drop on neon intrinsic loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598 --- Comment #4 from Wilco --- Fixing vmull_lane_s16 and vmlal_lane_s16 to avoid inline assembler gives this schedule which runs 63% faster on Cortex-A53: ldr d2, [x6, x0] ldr d4, [x6, x3] ldr d3, [x6, x2] smull v2.4s, v2.4h, v0.4h[0] ldr d1, [x6, x1] smull v4.4s, v4.4h, v0.4h[1] ldr d16, [x7, x3] smull v3.4s, v3.4h, v0.4h[3] ldr d7, [x7, x2] smull v1.4s, v1.4h, v0.4h[0] ldr d6, [x7, x1] ldr d5, [x7, x0] smlal v4.4s, v16.4h, v0.4h[3] ldr d16, [x4, x3] smlal v3.4s, v7.4h, v0.4h[2] ldr d7, [x4, x2] smlal v1.4s, v6.4h, v0.4h[1] ldr d6, [x4, x1] smlal v2.4s, v5.4h, v0.4h[1] ldr d5, [x4, x0] smlal v4.4s, v16.4h, v0.4h[3] ldr d16, [x5, x3] smlal v3.4s, v7.4h, v0.4h[0] ldr d7, [x5, x2] smlal v1.4s, v6.4h, v0.4h[2] ldr d6, [x5, x1] smlal v2.4s, v5.4h, v0.4h[2] ldr d5, [x5, x0] smlal v4.4s, v16.4h, v0.4h[3] smlal v3.4s, v7.4h, v0.4h[3] smlal v1.4s, v6.4h, v0.4h[2] smlal v2.4s, v5.4h, v0.4h[0]
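A sketch of the kind of fix meant here (hypothetical macros assuming a compile-time constant lane; the actual arm_neon.h change may use builtins directly): express the lane multiply through existing intrinsics instead of inline asm, so the compiler is free to schedule the smull/smlal instructions as shown above.

#include <arm_neon.h>

/* result = a * b[lane]  and  result = acc + a * b[lane]  */
#define my_vmull_lane_s16(a, b, lane) \
  vmull_s16 ((a), vdup_lane_s16 ((b), (lane)))
#define my_vmlal_lane_s16(acc, a, b, lane) \
  vmlal_s16 ((acc), (a), vdup_lane_s16 ((b), (lane)))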
[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #8 from Wilco --- (In reply to Jiangning Liu from comment #0) > For the small case below, GCC -O3 can't vectorize the small loop to do byte > comparison in func2. > > void *malloc(long unsigned int); > typedef struct { > unsigned char *buffer; > } data; > > static unsigned char *func1(data *d) > { > return d->buffer; > } > > static int func2(int max, int pos, unsigned char *cur) > { > unsigned char *p = cur + pos; > int len = 0; > while (++len != max) > if (p[len] != cur[len]) > break; > return cur[len]; > } > > int main (int argc) { > data d; > d.buffer = malloc(2*argc); > return func2(argc, argc, func1(&d)); > } > > At the moment, the following code is generated for this loop, > > 4004d4: 38616862ldrbw2, [x3,x1] > 4004d8: 6b5fcmp w2, w0 > 4004dc: 54a1b.ne4004f0 > 4004e0: 38616880ldrbw0, [x4,x1] > 4004e4: 6b01027fcmp w19, w1 > 4004e8: 91000421add x1, x1, #0x1 > 4004ec: 5441b.ne4004d4 > > In fact, this loop can be vectorized by checking if the comparison size is > aligned to SIMD register length. It may introduce run time overhead, but > cost model could make decision on doing it or not. The only optimization that can be done here is unrolling. For this kind of string matching the number of bytes that match is will be small on average, so even if vectorization was feasible, the startup overhead alone would kill performance. With unrolling you can remove the comparison with max each iteration and do 4 bytes per iteration like this: loop4: ldrbw2, [x3,4]! ldrbw0, [x4,4]! cmp w0, w2 bne exitloop1 ldrbw2, [x3,1] ldrbw0, [x4,1] cmp w0, w2 bne exitloop2 ldrbw2, [x3,2] ldrbw0, [x4,2] cmp w0, w2 bne exitloop3 ldrbw2, [x3,3] ldrbw0, [x4,3] cmp w0, w2 bne exitloop4 add x1, x1, 4 cmp x1, w19 blt loop4
[Bug middle-end/64242] Longjmp expansion incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242 --- Comment #21 from Wilco --- (In reply to Rainer Orth from comment #20) > The new testcase also FAILs on sparc-sun-solaris2.11 (both 32 and 64-bit): > > +FAIL: gcc.c-torture/execute/pr64242.c -O2 execution test > +FAIL: gcc.c-torture/execute/pr64242.c -O2 -flto execution test > +FAIL: gcc.c-torture/execute/pr64242.c -O2 -flto -flto-partition=none > execution test > +FAIL: gcc.c-torture/execute/pr64242.c -O3 -g execution test > +FAIL: gcc.c-torture/execute/pr64242.c -Os execution test > > Thread 2 received signal SIGSEGV, Segmentation fault. > [Switching to Thread 1 (LWP 1)] > 0x0008 in ?? () > (gdb) where > #0 0x0008 in ?? () > Backtrace stopped: previous frame identical to this frame (corrupt stack?) > > Single-stepping, I find that this happens at the very end of main: > > 1: x/i $pc > => 0x10de4 : return %i7 + 8 > (gdb) > 0x00010de8 in main () > at > /vol/gcc/src/hg/trunk/local/gcc/testsuite/gcc.c-torture/execute/pr64242.c:50 > 50return 0; > 1: x/i $pc > => 0x10de8 : nop > (gdb) > 0x0008 in ?? () > 1: x/i $pc > => 0x8: > > Obviously the stack is corrupted beyond repair. I tried to avoid this by > replacing the return 0 with exit (0) to no avail. My latest patch detects this stack corruption with 100% certainty again, see https://gcc.gnu.org/ml/gcc-patches/2018-12/msg00459.html. However sparc has a custom nonlocal_goto MD pattern which would need fixing too.
[Bug middle-end/88560] [9 Regression] armv8_2-fp16-move-1.c and related regressions after r260385
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88560 Wilco changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2018-12-20 CC||wilco at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #2 from Wilco --- E.g. before: test_load_store_1: ldrh r3, [r2, r1, lsl #1] @ __fp16 strh r3, [r0, r1, lsl #1] @ __fp16 bx lr after: test_load_store_1: vmov.f16 s0, r3 @ __fp16 ldrh r3, [r2, r1, lsl #1] @ __fp16 strh r3, [r0, r1, lsl #1] @ __fp16 bx lr Inserting spurious extra moves certainly doesn't look correct.
[Bug target/86891] [9 Regression] wrong code with -O -frerun-cse-after-loop -fno-tree-dominator-opts -fno-tree-fre
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86891 Wilco changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2018-12-20 CC||wilco at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #3 from Wilco --- (In reply to Jakub Jelinek from comment #1) > Now, looking at what aarch64 does for add with carry, there are separate > patterns like add3_carryinC which set CC_C mode and use zero_extend > and add3_carryinV which sets CC_V mode and uses sign_extend. > So, shouldn't sub3_carryin{C,V} be split similarly and if we check > carry flag, we should use subdi3_carryinC? Yes it looks like the pattern confuses signed and unsigned underflow. Changing it to zero_extend and using minus for the compare fixes the reported issue, but it's not possible to support signed and unsigned in a single pattern.
[Bug target/86891] [9 Regression] __builtin_sub_overflow incorrect for unsigned types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86891 --- Comment #5 from Wilco --- (In reply to Richard Earnshaw from comment #4) > Yes, the extension should be zero-extend, not sign extend. The plus > operation is correct, however, since decrementing the first operand could > lead to underflow if it was zero. So the correct rtl would be > > (compare ((zero_x(a)) (plus (zero_x(b) (ltu(cc, 0) > (minus (...)) Agreed. The issue is more widespread though, signed underflow doesn't work either, and subv4 simply uses gen_sub3_compare1 which does do a normal compare, so the RTL does not compute the overflow flag even though the actual compare does. It seems like a bug that it constant folds this RTL at all, given it should have been folded before expand. Using UNSPEC for these complex flag uses may be best to be safe.
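For reference, the builtin whose expansion is being discussed (an illustration, not the PR's original testcase):

#include <stdbool.h>

/* True if a - b wraps, i.e. a < b; it is the modelling of this overflow
   flag in RTL (zero_extend vs sign_extend of the borrow) that the
   discussion above is about.  */
bool
sub_would_wrap (unsigned a, unsigned b, unsigned *res)
{
  return __builtin_sub_overflow (a, b, res);
}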
[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 --- Comment #11 from Wilco --- (In reply to Jakub Jelinek from comment #10) > If the compiler knew say from PGO that pos is usually a multiple of certain > power of two and that the loop usually iterates many times (I guess the > latter can be determined from comparing the bb count of the loop itself and > its header), it could emit something like: > static int func2(int max, int pos, unsigned char *cur) > { > unsigned char *p = cur + pos; > int len = 0; > if (max > 32 && (pos & 7) == 0) > { > int l = ((1 - ((uintptr_t) cur)) & 7) + 1; > while (++len != l) > if (p[len] != cur[len]) > goto end; > unsigned long long __attribute__((may_alias)) *p2 = (unsigned long > long *) &p[len]; > unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long > long *) &cur[len]; > while (len + 8 < max) > { > if (*p2++ != *cur2++) > break; > len += 8; > } > --len; > } > while (++len != max) > if (p[len] != cur[len]) > break; > end: > return cur[len]; > } > > or so (untested). Of course, it could be done using SIMD too if there is a > way to terminate the loop if any of the elts is different and could be done > in that case at 16 or 32 or 64 characters at a time etc. > But, without knowing that pos is typically some power of two this would just > waste code size, dealing with the unaligned cases would be more complicated > (one can't read the next elt until proving that the current one is all > equal), so it would need to involve some rotations (or permutes for SIMD). Given it is compressing data both pointers will typically be misaligned. Pos would be fairly random and len does not usually start from zero either. And it's highly unlikely it will iterate more than a few bytes. Compression matches are typically just a few bytes long. So unrolling is the only useful optimization for this kind of code.
[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 --- Comment #13 from Wilco --- So to add some real numbers to the discussion, the average number of iterations is 4.31. Frequency stats in % (16 includes all iterations > 16 too):

 1: 29.0
 2:  4.2
 3:  1.0
 4: 36.7
 5:  8.7
 6:  3.4
 7:  3.0
 8:  2.6
 9:  2.1
10:  1.9
11:  1.6
12:  1.2
13:  0.9
14:  0.8
15:  0.7
16:  2.1

So unrolling 4x is perfect for this loop. Note the official xz version has optimized this loop since 2014(!) using unaligned accesses: https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
[Bug target/86891] [9 Regression] __builtin_sub_overflow incorrect for unsigned types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86891 --- Comment #10 from Wilco --- (In reply to Richard Earnshaw from comment #9) > Fixed. Yes, looking at git blame, all of the addv/subv support was added for GCC9, so no backporting is needed.
[Bug middle-end/88739] Big-endian union bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739 Wilco changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-01-07 CC||wilco at gcc dot gnu.org Component|target |middle-end Summary|union bug on ARM64 |Big-endian union bug Ever confirmed|0 |1 --- Comment #1 from Wilco --- Confirmed - this is a generic mid-end bug that happens in fre1 when it replaces the bitfield store and load with the wrong bitfield extract: Test_func (U32 ulAddr) { union HEAD_REQ_DW4_UNION unData; unsigned int _1; _2; int _4; unsigned int _7; int _9; short unsigned int _10; int _11; short unsigned int _23; : _1 = ulAddr_12(D) >> 2; _2 = () _1; unData.strMemHead.b30AddrL = _2; unData.strMemHead.b2AddrType = 0; _4 = (int) _2; printf ("unData.strMemHead.b30AddrL=0x%x\r\n", _4); printf ("unData.strMemHead.b2AddrType=0x%x\r\n", 0); _7 = unData.aulValue[3]; printf ("unData.aulValue[3]=0x%x\r\n", _7); _23 = BIT_FIELD_REF <_2, 16, 0>;// WRONG: should be _2, 14, 0 _9 = (int) _23; printf ("unData.ausValue[6]=0x%x\r\n", _9); _10 = unData.ausValue[7]; _11 = (int) _10; printf ("unData.ausValue[7]=0x%x\r\n", _11); unData ={v} {CLOBBER}; return 0; }
[Bug middle-end/88739] Big-endian union bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739 --- Comment #3 from Wilco --- (In reply to Richard Earnshaw from comment #2) > > _23 = BIT_FIELD_REF <_2, 16, 0>;// WRONG: should be _2, 14, 0 > > _2 is declared as a 30-bit integer, so perhaps the statement is right, but > expand needs to understand that the shift extract of the top 16 bits comes > from a different location in big-endian. So the question becomes what format is this in? _2; Is it big-endian memory format (so value is in top 30 bits) or simply a 30-bit value in a virtual register?
[Bug tree-optimization/88739] [7/8/9 Regression] Big-endian union bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739 --- Comment #16 from Wilco --- (In reply to Richard Biener from comment #8) > So I think part of a fix would be the following. Not sure if > REG_WORDS_BIG_ENDIAN or FLOAT_WORDS_BIG_ENDIAN come into play. > With the fix we no longer simplify this for aarch64 since > BITS_BIG_ENDIAN is 0 even in BE mode. A conservative fix to only do this for little endian would be fine given one shouldn't really do this kind of low-level hacking frequently. I guess FLOAT_WORDS_BIG_ENDIAN will matter too since you could store a double and read back part of it as a bitfield (or store a __int128 bitfield and read a double back). No idea why it would use BITS_BIG_ENDIAN. > That is, not sure if BIT_FIELD_REF <_2, 16, 14> would be correct as suggested > for aarch64_be because of that BITS_BIG_ENDIAN setting. Possibly - but then <_2, 16, 0> doing a right shift by 16 would be an incorrect expansion for big-endian. So something is wrong there... I think we need to simplify the many BIG_ENDIAN macros so it is feasible to get big-endian to work reliably on all targets. There seem to be far too many options which affect too many unrelated things. Big-endian is fundamentally about memory byte ordering, so allowing different byte/bit orderings in registers just makes things overly complex without any benefit.
[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 --- Comment #15 from Wilco --- (In reply to rguent...@suse.de from comment #14) > On Mon, 7 Jan 2019, wilco at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398 > > > > --- Comment #13 from Wilco --- > > So to add some real numbers to the discussion, the average number of > > iterations > > is 4.31. Frequency stats (16 includes all iterations > 16 too): > > > > 1: 29.0 > > 2: 4.2 > > 3: 1.0 > > 4: 36.7 > > 5: 8.7 > > 6: 3.4 > > 7: 3.0 > > 8: 2.6 > > 9: 2.1 > > 10: 1.9 > > 11: 1.6 > > 12: 1.2 > > 13: 0.9 > > 14: 0.8 > > 15: 0.7 > > 16: 2.1 > > > > So unrolling 4x is perfect for this loop. Note the official xz version has > > optimized this loop since 2014(!) using unaligned accesses: > > https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h > > I guess if we'd have some data to guide then classical unrolling using > duffs device would be best here? Because peeling will increase the > number of dynamic branches and likely the actual distribution of > #iterations isn't so that they will be well predicted? Duff's device is very slow on any CPU with branch prediction. It can't be used here given you don't know the number of iterations in advance (only the maximum). Unrolling reduces the number of branches given the loop branch is only taken once every N iterations. The loop overhead is significant here: 3 out of 7 instructions, with 4x unrolling that reduces to 3 in 19.
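A rough C version of the 4x unrolling described above (a sketch, not compiler output or the xz code): the comparison against max is done once per group of four bytes, so the loop overhead drops roughly as described.

static int
func2_unrolled (int max, int pos, unsigned char *cur)
{
  unsigned char *p = cur + pos;
  int len = 0;

  /* Compare four bytes per iteration while at least four remain.  */
  while (len + 4 < max)
    {
      if (p[len + 1] != cur[len + 1]) { len += 1; goto done; }
      if (p[len + 2] != cur[len + 2]) { len += 2; goto done; }
      if (p[len + 3] != cur[len + 3]) { len += 3; goto done; }
      if (p[len + 4] != cur[len + 4]) { len += 4; goto done; }
      len += 4;
    }

  /* Remainder, identical to the original loop.  */
  while (++len != max)
    if (p[len] != cur[len])
      break;
done:
  return cur[len];
}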