[Bug c++/35669] NULL (__null) not considered different from 0 with C++

2009-02-12 Thread peter at cordes dot ca


--- Comment #8 from peter at cordes dot ca  2009-02-12 17:56 ---
Would it cause any problems for g++ to behave more like a C compiler when it
comes to NULL?  e.g. I found this bug report after finding that kscope 1.9.1
didn't compile, because it expected NULL to match the void* version of an
overloaded function.

locationlistmodel.cpp:204: error: call of overloaded ‘createIndex(int&, int&,
NULL)’ is ambiguous

.../qabstractitemmodel.h:288: note: candidates are: QModelIndex
QAbstractItemModel::createIndex(int, int, void*) const
.../qabstractitemmodel.h:290: note: QModelIndex
QAbstractItemModel::createIndex(int, int, int) const
.../qabstractitemmodel.h:299: note: QModelIndex
QAbstractItemModel::createIndex(int, int, quint32) const

This was in released alpha code
(http://qt-apps.org/content/show.php?content=96992) so presumably it built ok
on some compiler.  (Although maybe it used to just pick one of the int
overloads, if that's what Lubos was talking about having to debug.)

 As a mostly C programmer, this just seems like something stupid in the
standard, and the sort of behaviour you should only get with -std=c++0x, but
not -std=gnu++0x.  As everyone else is saying, who in their right mind actually
wants this behaviour?  And more importantly, would changing it ever make g++
actually mis-compile anything?  (not counting compiling stuff like kscope, or
test.c below, where the result "should" be an error message, not a binary)

 Anyway, NULL should be a void*, damn it.  Yes, I'm a C programmer.


-- 

peter at cordes dot ca changed:

   What|Removed |Added

 CC|                |peter at cordes dot ca


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35669



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2020-04-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #53 from Peter Cordes  ---
I think we can close this as fixed at some point.  The last activity on this
bug was some patches that sound like they were supposed to fix it, and the MCVEs
from the comments that I tested no longer have the problem.

GCC9.3 -O3 -march=core2 -fomit-frame-pointer only uses a `.p2align` to align
the top of the loop, not between leave and ret or between cmp/jcc.

#include <stdio.h>

void wait_for_enter()
{
    volatile int foo = 0;  // to get a LEAVE instruction emitted at all
    int u = getchar();
    while (!u)
        u = getchar() - 13;
}

https://godbolt.org/z/RvxzZv

(Note that Godbolt normally filters .p2align so you have to either compile to
binary or not filter directives in the asm source.  Otherwise you'll never see
NOPs except in the unusual case where GCC actually emits a nop mnemonic.)

[Bug tree-optimization/92243] New: Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243

Bug ID: 92243
   Summary: Missing "auto-vectorization" of char array reversal
using x86 scalar bswap when SIMD pshufb isn't
available
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

We could use integer bswap to speed up an in-place byte-reverse loop by a
factor of probably 8, the same way we use SIMD shuffles.

Consider this loop which reverses an explicit-length char array:
https://godbolt.org/z/ujXq_J

typedef char swapt; // int can auto-vectorize with just SSE2
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

gcc -O3 (including current trunk) targeting x86-64 makes naive scalar
byte-at-a-time code, even though bswap r64 is available to byte-reverse a
uint64 in 1 or 2 uops (AMD and Intel, respectively).

With -mssse3, we do see auto-vectorization using SIMD pshufb (after checking
lengths and calculating how many 16-byte chunks can be done before bloated
fully-unrolled cleanup).  Doing the same thing with 64-bit integer registers
would be very much worth it (for code where a loop like this was a bottleneck).
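
For illustration, here's a hand-written sketch of roughly the transformation being requested (my code, not from the PR): swap 8 bytes from each end per iteration using __builtin_bswap64, keeping the original scalar loop as cleanup.  Unaligned access goes through memcpy, which GCC compiles to plain loads/stores on x86.

#include <stdint.h>
#include <string.h>

void strrev_bswap(char *head, long len)
{
    char *tail = head + len;                 // one past the last element
    while (tail - head >= 16) {              // at least 8 bytes from each end
        uint64_t h, t;
        memcpy(&h, head, 8);
        memcpy(&t, tail - 8, 8);
        h = __builtin_bswap64(h);            // reverse the byte order of each chunk
        t = __builtin_bswap64(t);
        memcpy(head, &t, 8);                 // store the chunks swapped end-for-end
        memcpy(tail - 8, &h, 8);
        head += 8;
        tail -= 8;
    }
    for (--tail; head < tail; ++head, --tail) {   // scalar cleanup for the middle
        char h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}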



With `swapt = short`, vectorizing with SSE2 pshuflw / pshufhw / pshufd is
probably worth it, but GCC chooses not to do that either.  Or working in 8-byte
chunks just using movq + pshuflw, so we only have 1 shuffle per 8-byte
load/store instead of 3 per 16-byte store.  That's a good balance for modern
Intel (Haswell, Skylake, and I think IceLake), although some AMD and earlier
Intel with more integer shuffle throughput (e.g. Sandybridge) might do better
with 3x shuffles per 16-byte load/store.

[Bug tree-optimization/92243] Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243

--- Comment #1 from Peter Cordes  ---
Forgot to mention, this probably applies to other ISAs with GP-integer
byte-reverse instructions and efficient unaligned loads.

[Bug tree-optimization/92244] New: extra sub inside vectorized loop instead of calculating end-pointer

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

Bug ID: 92244
   Summary: extra sub inside vectorized loop instead of
calculating end-pointer
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

We get a redundant instruction inside the vectorized loop here.  But it's not a
separate *counter*, it's a duplicate of the tail pointer.

It goes away if we find tail with while(*tail++); instead of calculating it
from head+length.

Only happens with vectorization, not pure scalar (bug 92243 is about the fact
that -O3 fails to use bswap as a GP-integer shuffle to auto-vectorize without
x86 SSSE3).

typedef char swapt;
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}
https://godbolt.org/z/wdGv4S

compiled with g++ -O3 -march=sandybridge gives us a main loop of

...
        movq    %rcx, %rsi       # RSI = RCX before entering the loop
        addq    %rdi, %r8
.L4:
        vmovdqu (%rcx), %xmm3    # tail load from RCX
        addq    $16, %rax        # head
        subq    $16, %rcx        # tail
        subq    $16, %rsi        # 2nd tail?
        vmovdqu -16(%rax), %xmm0
        vpshufb %xmm2, %xmm3, %xmm1
        vmovups %xmm1, -16(%rax)
        vpshufb %xmm2, %xmm0, %xmm0
        vmovups %xmm0, 16(%rsi)  # tail store to RSI
        cmpq    %r8, %rax        # } while(head != end_head)
        jne     .L4

RSI = RCX before and after the loop.  This is obviously pointless.
head uses the same register for loads and stores.

 Then we have bloated fully-unrolled scalar cleanup, instead of using the
shuffle control for 8-byte vectors -> movhps.  Or scalar bswap.  Ideally we'd
do something clever at the overlap, like one load + shuffle + store, but we
might have to load the next vector before storing the current one to make this
work at the overlap.  That would presumably require more special-casing for
this kind of meet-in-the-middle loop.




The implicit-length version doesn't have this extra sub in the main loop.

void strrev_implicit(swapt *head)
{
  swapt *tail = head;
  while(*tail) ++tail;  // find the 0 terminator, like head+strlen
  --tail;               // tail points to the last real char
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

.L22:
        vmovdqu (%rcx), %xmm3
        addq    $16, %rdx        # head
        subq    $16, %rcx        # tail
        vmovdqu -16(%rdx), %xmm0
        vpshufb %xmm2, %xmm3, %xmm1
        vmovups %xmm1, -16(%rdx)
        vpshufb %xmm2, %xmm0, %xmm0
        vmovups %xmm0, 16(%rcx)
        cmpq    %rsi, %rdx       # } while(head != end_head)
        jne     .L22

[Bug tree-optimization/92244] extra sub inside vectorized loop instead of calculating end-pointer

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

--- Comment #1 from Peter Cordes  ---
On AArch64 (with gcc8.2), we see a similar effect, more instructions in the
loop.  And an indexed addressing mode.

https://godbolt.org/z/6ZVWY_


# strrev_explicit   -O3 -mcpu=cortex-a53
   ...
.L4:
ldr q1, [x4, x2]# tail
ldr q0, [x3]# head
tbl v1.16b, {v1.16b}, v2.16b# byte shuffle
tbl v0.16b, {v0.16b}, v2.16b
str q1, [x3], 16# post-increment store to head
cmp x3, x1
str q0, [x4, x2]
sub x2, x2, #16   # doesn't update flags, not SUBS
bne .L4 # }while( head != end_head )



# strrev_implicit   -O3 -mcpu=cortex-a53
...
.L19:
ldr q1, [x3]
ldr q0, [x2]
tbl v1.16b, {v1.16b}, v2.16b
tbl v0.16b, {v0.16b}, v2.16b
str q1, [x2], 16   # post-increment addressing mode 
cmp x2, x4
str q0, [x3], -16  # post-decrement addressing mode 
bne .L19   # }while( head != end_head )

[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

Peter Cordes  changed:

   What|Removed |Added

Summary|extra sub inside vectorized |vectorized loop updating 2
   |loop instead of calculating |copies of the same pointer
   |end-pointer |(for in-place reversal
   ||cross in the middle)

--- Comment #2 from Peter Cordes  ---
Forgot to update title after looking more carefully at the asm.

[Bug target/92246] New: Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246

Bug ID: 92246
   Summary: Byte or short array reverse loop auto-vectorized with
3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

typedef short swapt;
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

g++ -O3 -march=skylake-avx512
  (Compiler-Explorer-Build) 10.0.0 20191022 (experimental)

https://godbolt.org/z/LS34w9

...
.L4:
        vmovdqu16   (%rdx), %ymm1
        vmovdqu16   (%rax), %ymm0
        vmovdqa64   %ymm1, %ymm3        # useless copy
        vpermt2w    %ymm1, %ymm2, %ymm3
        vmovdqu16   %ymm3, (%rax)
        vpermt2w    %ymm0, %ymm2, %ymm0
        addq        $32, %rax
        vmovdqu16   %ymm0, (%rcx)
        subq        $32, %rdx
        subq        $32, %rcx           # two tail pointers, PR 92244 is unrelated to this
        cmpq        %rsi, %rax
        jne         .L4

vpermt2w ymm is 3 uops on SKX and CannonLake:  2p5 + p015
(https://www.uops.info/table.html)

Obviously better would be  vpermw (%rax), %ymm2, %ymm0.
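
As a sketch of the desired codegen (my own intrinsics rewrite, not from the PR; assumes AVX512BW+VL, e.g. -mavx512bw -mavx512vl), the inner step only needs the single-source shuffle _mm256_permutexvar_epi16 (vpermw) on each load:

#include <immintrin.h>

// Hypothetical hand-vectorized inner step: reverse 16 shorts from each end
// with single-source vpermw instead of the 2-source vpermt2w GCC picks.
static inline void reverse_step_16w(short *head, short *tail_start /* tail - 15 */)
{
    const __m256i rev = _mm256_set_epi16(0, 1, 2, 3, 4, 5, 6, 7,
                                         8, 9, 10, 11, 12, 13, 14, 15);
    __m256i h = _mm256_loadu_si256((const __m256i *)head);
    __m256i t = _mm256_loadu_si256((const __m256i *)tail_start);
    _mm256_storeu_si256((__m256i *)head,       _mm256_permutexvar_epi16(rev, t));
    _mm256_storeu_si256((__m256i *)tail_start, _mm256_permutexvar_epi16(rev, h));
}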

vpermw apparently can't micro-fuse a load, but it's only 2 ALU uops plus
a load if we use a memory source.  SKX still bottlenecks on 2p5 for vpermw,
losing only the p015 uop, but in general fewer uops is better.

But on CannonLake it runs on p01 + p5 (plus p23 with a memory source).

uops.info doesn't have IceLake-client data yet but vpermw throughput on IceLake
is 1/clock, vs 1 / 2 clocks for vpermt2w, so this could double throughput on
CNL and ICL.

We have exactly the same problem (vpermt2b instead of vpermb) with AVX512VBMI on ICL:
g++ -O3 -march=icelake-client -mprefer-vector-width=512.

[Bug target/92246] Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246

--- Comment #1 from Peter Cordes  ---
And BTW, GCC *does* use vpermd (not vpermt2d) for swapt = int or long.  This
problem only applies to char and short.  Possibly because AVX2 includes vpermd
ymm.



Apparently CannonLake has 1 uop vpermb but 2 uop vpermw, according to real
testing on real hardware by https://uops.info/.  Their automated test methods
are generally reliable.

That seems to be true for Ice Lake, too, so when AVX512VBMI is available we
should be using vpermb any time we might have used vpermw with a
compile-time-constant control vector.


(vpermw requires AVX512BW, e.g. SKX and Cascade Lake.  vpermb requires
AVX512VBMI, only Ice Lake and the mostly aborted CannonLake.)

Instlat provides some confirmation:
https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel00706E5_IceLakeY_InstLatX64.txt
 shows vpermb at 3 cycle latency, but vpermw at 4 cycle latency (presumably a
chain of 2 uops, 1c and 3c being the standard latencies that exist in recent
Intel CPUs).  InstLat doesn't document which input the dep chain goes through,
so it's not 100% confirmation of only 1 uop.  But it's likely that ICL has 1
uop vpermb given that CNL definitely does.

uops.info lists latencies separately from each input to the result, sometimes
letting us figure out that e.g. one of the inputs isn't needed until the 2nd
uop.  Seems to be the case for CannonLake vpermw: latency from one of the
inputs is only 3 cycles, the other is 4. 
https://www.uops.info/html-lat/CNL/VPERMW_YMM_YMM_YMM-Measurements.html

[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)

2019-10-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

--- Comment #4 from Peter Cordes  ---
(In reply to Andrew Pinski from comment #3)
> (In reply to Peter Cordes from comment #1)
> > On AArch64 (with gcc8.2), we see a similar effect, more instructions in the
> > loop.  And an indexed addressing mode.

That was an overstatement; the generic tuning I showed isn't using 2 separate
pointers or indices like we get on x86.

Your thunderx2t99 output is like that, but write-back addressing modes mean it
doesn't cost extra instructions.

> I am not shocked that IV-OPTS can chose these widly differences.
> I have not looked at the cost differences to understand why
> -mcpu=thunderx2t99 chose what close might be the best (we could use one less
> IV by replacing the first ldr by using the same IV as the last str).

I don't know ARM tuning; the x86 version is clearly worse with an extra uop
inside the loop.  And an extra instruction to copy the register before the
loop, wasting code-size if nothing else.

On Skylake for example, the loop is 10 uops and bottlenecks on front-end
throughput (4 uops / clock) if the back-end can keep up with a bit less than 1
store per clock.  (Easy if pointers are aligned and data is hot in L1d). 
Reducing it to 9 uops should help in practice.  Getting it down to 8 uops would
be really nice, but we can't do that unless we could use a shuffle that
micro-fuses with a load.  (For int elements, AVX2 VPERMD can micro-fuse a
memory source, so can SSE2 PSHUFD.  pshufb's xmm/memory operand is the control
vector which doesn't help us.  AVX512 vpermb can't micro-fuse)

[Bug target/82459] AVX512BW instruction costs: vpmovwb is 2 uops on Skylake and not always worth using vs. vpack + vpermq lane-crossing fixup

2019-10-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

Peter Cordes  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=89346
Summary|AVX512F instruction costs:  |AVX512BW instruction costs:
   |vmovdqu8 stores may be an   |vpmovwb is 2 uops on
   |extra uop, and vpmovwb is 2 |Skylake and not always
   |uops on Skylake and not |worth using vs. vpack +
   |always worth using  |vpermq lane-crossing fixup

--- Comment #5 from Peter Cordes  ---
Turns out vmovdqu8 with no masking doesn't cost an extra uop.  IACA was wrong,
and Agner Fog's results were *only* for the masked case.  The only downside of
that is the code-size cost of using EVEX load/store instructions instead of
AVX2 VEX.  That's bug 89346.


https://www.uops.info/table.html confirms that SKX non-masked vmovdqu8 load and
store are both single uop.  (Or the usual micro-fused store-address +
store-data).
 https://www.uops.info/html-tp/SKX/VMOVDQU8_ZMM_M512-Measurements.html
 https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_ZMM-Measurements.html

And between registers it can be eliminated if there's no masking.

But *with* masking, as a load it's a micro-fused load+ALU uop, and as a masked
store it's just a normal store uop for xmm and ymm.  But zmm masked store is 5
uops (micro-fused to 4 front-end uops)! (Unlike vmovdqu16 or 32 masked stores
which are efficient even for zmm).

https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_K_ZMM-Measurements.html

uops.info's table also shows us that IACA3.0 is wrong about vmovdqu8 as an
*unmasked* ZMM store: IACA thinks that's also 5 uops.

Retitling this bug report since that part was based on Intel's bogus data, not
real testing.

vpmovwb is still 2 uops, and current trunk gcc still uses  2x vpmovwb +
vinserti64x4 for ZMM auto-vec.  -mprefer-vector-width=512 is not the default,
but people may enable it in code that heavily uses 512-bit vectors.

YMM auto-vec is unchanged since previous comments: we do get vpackuswb +
vpermq, but an indexed addressing mode defeats micro-fusion.  And we have a
redundant VPAND after shifting.
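
For context, a hypothetical loop of the kind being discussed (my reduction; the actual testcase in earlier comments may differ): a shift-and-narrow from 16-bit to 8-bit elements, where the vectorizer has to choose between vpmovwb and vpackuswb + vpermq.

// Hypothetical shift-and-narrow loop; with -O3 -march=skylake-avx512
// (optionally -mprefer-vector-width=512) the vpmovwb vs. vpackuswb + vpermq
// choice described above shows up here.
void narrow16to8(unsigned char *dst, const unsigned short *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] >> 1;    // shift, then truncate each element to 8 bits
}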

---

For icelake-client/server (AVX512VBMI) GCC is using vpermt2b, but it doesn't
fold the shifts into the 2-source byte shuffle.   (vpermt2b has 5c latency and
2c throughput on ICL, so probably its uop count is the same as uops.info
measured for CannonLake: 1*p05 + 2*p5.  Possibly 2x 1-uop vpermb with
merge-masking for the 2nd into the first would work better.)

IceLake vpmovwb ymm,zmm is still 2-cycle throughput, 4-cycle latency, so
probably still 2 uops.

[Bug target/89346] Unnecessary EVEX encoding

2019-10-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
Still present in pre10.0.0 trunk 20191022.  We pessimize vmovdqu/a in AVX2
intrinsics and autovectorization with -march=skylake-avx512 (and arch=native on
such machines)

It seems only VMOVDQU/A load/store/register-copy instructions are affected; we
get AVX2 VEX vpxor instead of AVX512VL EVEX vpxord for xor-zeroing, and
non-zeroing XOR.  (And most other instructions have the same mnemonic for VEX
and EVEX, like vpaddd.  This includes FP moves like VMOVUPS/PD)

(https://godbolt.org/z/TEvWiU for example)

The good options are: 

* use VEX whenever possible instead of AVX512VL to save code-size.  (2 or 3
byte prefix instead of 4-byte EVEX)

* Avoid the need for vzeroupper by using only x/y/zmm16..31.  (Still has a
max-turbo penalty so -mprefer-vector-width=256 is still appropriate for code
that doesn't spend a lot of time in vectorized loops.)

 This might be appropriate for very simple functions / blocks that only have a
few SIMD instructions before the next vzeroupper would be needed.  (e.g.
copying or zeroing some memory); could be competitive on code-size as well as
saving the 4-uop instruction.

 VEX instructions can't access x/y/zmm16..31 so this forces an EVEX encoding
for everything involving the vector (and rules out using AVX2 and earlier
instructions, which may be a problem for KNL without AVX512VL unless we narrow
to 128-bit in an XMM reg)



(citation for not needing vzeroupper if y/zmm0..15 aren't written explicitly:
https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
- it's even safe to do

vpxor xmm0,xmm0,xmm0
vpcmpeqb  k0, zmm0, [rdi]

without vzeroupper.  Although that will reduce max turbo *temporarily* because
it's a 512-bit uop.)

Or more frequently useful: to zero some memory with vpxor xmm zeroing and YMM
stores.

[Bug target/40838] gcc shouldn't assume that the stack is aligned

2019-10-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #91 from Peter Cordes  ---
This bug should be closed as "resolved fixed".  The "fix" was to change the ABI
doc and break existing hand-written asm, and old binaries.  This was
intentional and resulted in some pain, but at this point it's a done deal.



My attempt at a summary of the current state of affairs for 32-bit x86 calling
conventions (on Linux and elsewhere):

Yes, the version of the i386 System V ABI used on Linux really did change
between gcc2.8 and gcc8.  Those compilers are not ABI-compatible with each
other.  This is a known fact.  Hand-written asm that makes function calls with
misaligned stack pointers is violating the (updated) ABI, and was also
knowingly broken by this change.


(Perhaps unintentionally at first, with stack alignment intended to just
provide a performance benefit, not a correctness issue.  But the resolution
ended up being to standardize on 16-byte alignment matching x86-64 System V,
instead of reverting to the old ABI and breaking compat with new binaries that
had started to rely on 16-byte incoming alignment, or adding significant
overhead to every function that didn't know how both its caller and callee were
compiled, i.e. most functions.  Using MOVUPS instead of MOVAPS everywhere
wouldn't work well because it would mean no folding of memory operands into ALU
instructions: without AVX's VEX encoding, paddd xmm0, [mem] requires aligned
mem.  And existing binaries that rely on incoming 16-byte alignment weren't
doing that.)


An earlier comment also mentioned common arrays: the ABI also requires arrays
larger than 16 bytes to have 16-byte alignment.



Perhaps unnecessary pain for little real benefit: i386 on Linux has been mostly
obsolete for a long time, and the inefficient stack-args calling convention was
never changed.  It's ironic that Linux broke ABI compat for i386 in the name of
more efficient SSE-usage despite not caring to introduce anything like Windows
fastcall or vectorcall (efficient register-args calling conventions).

(GCC does have ABI-changing -mregparm=3 and -msseregparm to pass integers in
regs, and pass/return FP values in XMM registers (instead of passing on the
stack / returning in x87 st0).  But no distros have switched over to using that
calling convention for i386 binaries, AFAIK.  The Linux kernel does use regparm
for 32-bit kernel builds.)

Even more ironic, probably a lot of 32-bit code is compiled without -msse2
(because one of the main reasons for using 32-bit code is CPUs too old for
x86-64, which is about the same vintage as SSE2).  SSE usage can still happen
with runtime dispatching in binaries that are compatible with old machines
while still being able to take advantage of new ones.


But in most cases, if you want performance you use x86-64 kernel + user-space,
or maybe x32 user-space (ILP32 in 64-bit mode) to get modern calling
conventions and the benefit of twice as many registers.  x86-64 System V has
mandated 16-byte stack alignment from the start.  (I don't know the history,
but perhaps i386 code-gen started assuming / depending on it for correctness,
not just performance, by accident because of devs being used to x86-64?)

The 32-bit ABI on some other OSes, including i386 *BSD and 32-bit Windows, has
*not* changed; presumably gcc there doesn't rely on incoming stack alignment. 
(It might try to propagate 16-byte alignment for performance benefits, though.)

My understanding is that i386 MacOS still uses a version of i386 System V that
doesn't include the 16-byte stack alignment update, like other *BSDs.


(In reply to Harald van Dijk from comment #90)
> compile
> 
>   void exit(int);
>   int main(void) { exit(0); }
> 
> with GCC 2.8, compile current glibc with GCC 8, and there will be a segfault
> in glibc's __run_exit_handlers because GCC 2.8 never kept the stack
> 16-byte-aligned, but GCC 8 does now generate code which assumes it.
>
> For the moment, I've rebuilt glibc with -mincoming-stack-boundary=2 to handle 
> the problem well enough for my current needs, but it's not a complete 
> solution.

Yes, you need workarounds like this to change modern GCC's ABI back to legacy
4-byte.

Note that you might break atomicity of C11 _Atomic 8-byte objects even outside
structs by doing this, if they end up split across a cache-line (Intel) or possibly
narrower (AMD) boundary.  But only if they're stack-allocated.

[Bug target/93141] Missed optimization : Use of adc when checking overflow

2020-01-03 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93141

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #2 from Peter Cordes  ---
gcc doesn't actually *branch* unless you use an if(); it just uses cmp/sbb to
do a 128-bit compare.  CMP is like a SUB that only sets flags.  The CF result
of SBB is used as an input for ADC.

https://godbolt.org/z/64C4R- of a testcase

GCC also wastes a varying number of MOV instructions beyond the minimum one to
make cmp/sbb work, depending on whether BMI2 MULX is used and on how the sum is written.

u128 prod = a[i] * (unsigned __int128) b[i];
#if 1
sum += prod;
//if(sum
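
As a standalone illustration of the branchless compare described above (my example, not the testcase from the Godbolt link):

// Comparing two unsigned __int128 values: GCC compares the low halves with
// CMP, the high halves with SBB (consuming the borrow in CF), then reads CF --
// no branch unless the result feeds an if().
int below_u128(unsigned __int128 a, unsigned __int128 b)
{
    return a < b;
}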

[Bug target/89063] [x86] lack of support for BEXTR from BMI extension

2019-01-25 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
Unfortunately Intel Haswell/Skylake implement BEXTR as 2 uops with 2c latency. 
Presumably those uops are a shift + bzhi, so 1p06 + 1p15 would explain Agner
Fog's experimental result of 2p0156 for BEXTR, with 0.5c throughput.

On AMD Excavator/Ryzen, it's 1 uop with 1c latency.  On Steamroller and
earlier, it's 2 uops but 1c latency.  (I assume that's latency from the
non-control input to the output.  So maybe one of the uops pre-processes the
control input, otherwise you'd expect 2c latency from either operand.)  Ryzen
dropped support for AMD TBM, so only Excavator (bdver4) has 1-uop bextr imm16
which would avoid the need for mov reg,imm32 with the control operand.  But
mov-imm + bextr can still be a win on Ryzen, lower latency than RORX+AND

BMI2 RORX is single-uop on all CPUs that support it.  If we already need a 2nd
uop to mask anyway, we can use RORX+AND-immediate to duplicate the
functionality and performance of BEXTR-immediate, with the smaller code-size if
the AND-mask fits in an imm8.  (5+5 vs. 6+3  or 6+4 if the AND needs a REX)

Without an immediate-source BEXTR (like AMD TBM has/had), the only advantage
mov-immediate+bextr has (on Intel) over mov-reg+shift+and is that it can deal with
wide bitfields using a count instead of an immediate AND mask.  (Especially if
it doesn't fit in 32 bits).

If you can reuse the same control-register in a loop, BEXTR is good-ish for
copy-and-extract.
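
For concreteness, here's a hypothetical 6-bit field starting at bit 22 (my example; the function shown further down may differ), written both ways.  Compile with -mbmi (and -mbmi2 for the RORX option):

#include <immintrin.h>

// Plain shift+and: compiles to mov+shr+and, or RORX+AND with BMI2.
unsigned long long field_shift_and(unsigned long long x)
{
    return (x >> 22) & 0x3f;
}

// BMI1 BEXTR: the control (start | len<<8) has to be materialized in a
// register with a mov-immediate, but can be reused across a loop.
unsigned long long field_bextr(unsigned long long x)
{
    return _bextr_u64(x, 22, 6);
}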

PEXT is 1 uop on Intel CPUs even though the simpler-looking BEXTR is 2.  But
PEXT is extremely slow on Ryzen (7 uops, 18c lat and tput).  So for 32-bit
constants at least, mov r32,imm32 + PEXT to copy-and-extract is better than
BEXTR on Intel.  movabs imm64 is too big and can cause front-end problems
(slower to read from the uop cache, if that effect from Sandybridge is still
present on Haswell/Skylake), and has no advantage vs. RORX + AND unless the
bitfield you're extracting is wider than 32 bits.

PEXT has 3 cycle latency, though, and can only run on port 1 on SnB-family. 
(All integer uops with latency > 1 are p1-only).  It's potentially good for
throughput, but worse than RORX+AND for latency.

Unfortunately x86 bitfield instructions are pretty weak compared to ARM /
AArch64 ubfx or PowerPC rlwinm and friends, where the bit-positions are simply
specified as immediates.  Only AMD's immediate version of BEXTR (1 uop on
Excavator) matched them.  Having a bunch of different control operands for
BEXTR or PEXT in registers might be usable in a loop, but that's much more rarely
useful than immediate controls.




 :
   0:   c4 e3 fb f0 c7 2a     rorx   $0x2a,%rdi,%rax   # $(64-22)
   6:   c4 e3 fb f0 d7 35     rorx   $0x35,%rdi,%rdx   # $(64-11)
   c:   83 e7 3f              and    $0x3f,%edi
   f:   83 e0 3f              and    $0x3f,%eax
  12:   83 e2 3f              and    $0x3f,%edx
  15:   01 f8                 add    %edi,%eax         # 32-bit operand-size because we can prove it can't overflow
  17:   01 d0                 add    %edx,%eax         # missed optimization in both gcc's versions.
  19:   c3                    retq

Not counting the ret, this is 7 uops for Skylake and Ryzen.  **I'm pretty sure
this is our best bet for -march=skylake, and for tune=generic -mbmi2**

The BEXTR intrinsics version is 9 uops for SKL, 7 for Ryzen, but is 2 bytes
larger.  (not counting the savings from avoiding a REX prefix on the ADD
instructions; that missed optimization applies equally to both.)  OTOH, the
critical path latency for BEXTR on Ryzen is better by 1 cycle, so we could
still consider it for -march=znver1.  Or for tune=generic -mbmi without BMI2.

The legacy mov+shr+and version is 10 uops because gcc wasted a `mov %rdi,%rax`
instruction; it *should* be 9 uops for all normal CPUs.

---

With only BMI1 but not BMI2 enabled, we should probably use the mov-imm + BEXTR
version.  It's not worse than the mov+shr+and version on SnB-family or bd/zn,
and it's better on some AMD.  And it's probably smaller code-size.

And in future if Intel designs CPUs that can handle BEXTR as a single uop with
1c latency, mov+bextr will become good-ish everywhere.


For code-size, BEXTR has a definite advantage for bitfields wider than 1 byte,
because AND $imm32, %r32 is 6 bytes long instead of 3.

[Bug target/89071] New: AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double

2019-01-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

Bug ID: 89071
   Summary: AVX vcvtsd2ss lets us avoid PXOR dependency breaking
for scalar float<->double
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

float cvt(double unused, double xmm1) { return xmm1; }

g++ (GCC-Explorer-Build) 9.0.0 20190120 (experimental):

vxorps  %xmm0, %xmm0, %xmm0
vcvtsd2ss   %xmm1, %xmm0, %xmm0# merge into XMM0

clang7.0
vcvtsd2ss   %xmm1, %xmm1, %xmm0# both sources are from XMM1, no
false dep

gcc already uses this trick for SQRTSS/SD, but not for float<->double
conversion.  I haven't checked all the other scalar instructions, but roundss
for floor() does neither and has a false dependency.  (i.e. it chooses the
output register as the merge-target, not the actual input.)

 return floorf(x);  ->   vroundss    $9, %xmm1, %xmm0, %xmm0

Some testcases:

https://godbolt.org/z/-rqUVZ


---

In SSE, one-input scalar instructions like CVT* and SQRTSS/SD have an output
dependency because of Intel's short-sighted ISA design optimizing for
Pentium-III's 64-bit SIMD: zero-extending to fill the destination XMM register
would have cost an extra uop to write the upper half of the destination.

For consistency(?), SSE2 scalar instructions (new with Pentium 4 which had
128-bit SIMD execution units / register file) have the same behaviour of
merging into the low 64 bits of the destination, even conversion between double
and float between two xmm registers, which didn't exist before SSE2. 
(Previously conversion instructions were only between float in XMM and integers
in scalar or MMX regs, or packed-integer <-> ps which filled the whole XMM reg
and thus avoided a false dependency).

(Fortunately this isn't a problem for 2-input instructions like ADDSS: the
operation already depends on both registers.)

---

The VEX encoding makes the merge-target separate from the actual destination,
so we can finally avoid false dependencies without wasting an instruction
breaking it.  (When the source is already in an XMM register).


For instructions where the source isn't an XMM register (e.g. memory or integer
reg for int->FP conversions), one zeroed register can be used as a read-only
merge target by any number of scalar AVX instructions, including in a loop. 
That's bug 80571.


(It's unfortunate that Intel didn't take the opportunity to give the AVX
versions subtly different semantics, and zero-extend into the target register. 
That would probably have enabled vcvtsd2ss to be single-uop instead of 2 on
Sandybridge-family.  IDK if they didn't think of that, or if they wanted strict
consistency with the semantics of the SSE version, or if they thought decoding
/ internals would be easier if they didn't have to omit the
merge-into-destination part of the scalar operation.  At least they made the
extra dependency an explicit input, so we can choose a register other than the
destination, but it's so rarely useful to actually merge into the low 64 or 32
of another reg that it's just long-term harmful to gimp the ISA with an extra
dependency for these instructions, especially integer->FP.)



(I suspect that most of the dep-breaking gcc does isn't gaining any speed, but
the trick is figuring out when we can omit it while being sure that we don't
couple things into one big loop-carried chain, or serialize some things that
OoO exec could otherwise benefit from hiding.  Within one function with no
calls, we might be able to prove that a false dep isn't serializing anything
important (e.g. if there's already enough ILP and something else breaks a dep
on that register between loop iterations), but in general it's hard if we can't
pick a register that was already part of the dep chain that led to the input
for this operation, and thus is harmless to introduce a dep on.)



Relevant instructions that can exist in scalar xmm,xmm form:

VROUNDSS/SD  (gcc leaves a false dep, clang gets it right)

VSQRTSS/SD  (gcc already gets this right)
VRCPSS
VRSQRTSS  haven't checked

[V]CVTSS2SD xmm,xmm  (Skylake: SRC1/output dependency is a separate 1c latency
32-bit merge uop)
  The memory-source version is still 2 uops.

[V]CVTSD2SS xmm,xmm  (Skylake: SRC1/output dependency is the main 4c conversion
uop, the extra uop is first, maybe extracting 32 bits from the src?)
 The memory-source version of [V]CVTSD2SS is only 1 uop!

So avoiding a false dep by loading with MOVSS/MOVSD and then using the reg-reg
version is a bad idea for CVTSD2SS.  It's actually much better to PXOR and then
CVTSD2SS (me

[Bug target/80586] vsqrtss with AVX should avoid a dependency on the destination register.

2019-01-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80586

Peter Cordes  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #1 from Peter Cordes  ---
Fixed for vsqrtss/sd somewhere in 9.0, but not 8.2.  
https://godbolt.org/z/0Gxf05.

The general case of one-input scalar xmm,xmm instructions like vcvtss2sd is
still all over the place, with false deps or wasted xor-zeroing.  Reported that
as bug 89071

It seems only VSQRTsd/ss itself was fixed for this; sorry I didn't think of
checking for other one-input instructions when I reported this.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #2 from Peter Cordes  ---
(In reply to H.J. Lu from comment #1)
> But
> 
>   vxorps  %xmm0, %xmm0, %xmm0
>   vcvtsd2ss   %xmm1, %xmm0, %xmm0
> 
> are faster than both.

On Skylake-client (i7-6700k), I can't reproduce this result in a hand-written
asm loop.  (I was using NASM to make a static executable that runs a 100M
iteration loop so I could measure with perf).  Can you show some asm where this
performs better?

vcvtsd2ss src-reg,dst,dst is always 2 uops, regardless of the merge destination
being an xor-zeroed register.  (Either zeroed outside the loop, or inside, or
once per 4 converts with an unrolled loop.)

I can't construct a case where  vcvtsd2ss %xmm1, %xmm1, %xmm0  is worse in any
way (dependencies, uops, latency, throughput) than VXORPS + vcvtsd2ss with dst
= middle source.  I wasn't mixing it with other instructions other than VXORPS,
but I don't think anything is going to get rid of its 2nd uop, and choosing
both inputs = the same source removes any benefit from dep-breaking the output.

If adding a VXORPS helped, it's probably due to some other side-effect.

Could the effect you saw have been due to code-gen changes for memory sources,
maybe  vxorps + vcvtsd2ss (mem), %xmm0, %xmm0   vs.  vmovsd + vcvtsd2ss %xmm1,
%xmm1, %xmm0?  (Those should be about equal, but memory-source SD2SS is
cheaper, no port5 uop.)



BTW, the false-dependency effect is much more obvious with SS2SD, where the
latency from src1 to output is 4 cycles, vs. 1 cycle for SD2SS.

Even without dependency-breaking, repeated

 vcvtsd2ss  %xmm1, %xmm0, %xmm0

can run at 1 per clock (same as with dep breaking), because the port-5 uop that
merges into the low 32 bits of xmm0 with 1 cycle latency is 2nd.  So latency
from xmm0 -> xmm0 for that [v]cvtsd2ss %xmm1, %xmm0 is 1 cycle.

With dep-breaking, they both still bottleneck on the port5 uop if you're doing
nothing else.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #3 from Peter Cordes  ---
(In reply to H.J. Lu from comment #1)
> I have a patch for PR 87007:
> 
> https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html
> 
> which inserts a vxorps at the last possible position.  vxorps
> will be executed only once in a function.

That's talking about the mem,reg case, which like I said is different.  I
reported Bug 80571 a while ago about the mem,reg case (or gp-reg for si2ss/d),
so it's great that you have a fix for that, doing one xor-zeroing and reusing
that as a merge target for a whole function / loop.

But this bug is about the reg,reg case, where I'm pretty sure there's nothing
to be gained from xor-zeroing anything.  We can fully avoid any false dep just
by choosing both source registers = src, making the destination properly
write-only.

If you *have* an xor-zeroed register, there's no apparent harm in using it as
the merge-target for a reg-reg vcvt, vsqrt, vround, or whatever, but there's no
benefit either vs. just setting both source registers the same.  So whichever
is easier to implement, but ideally we want to avoid introducing a vxorps into
functions / blocks that don't need it at all.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #5 from Peter Cordes  ---
(In reply to H.J. Lu from comment #4)
> (In reply to Peter Cordes from comment #2)

> >  Can you show some
> > asm where this performs better?
> 
> Please try cvtsd2ss branch at:
> 
> https://github.com/hjl-tools/microbenchmark/
> 
> On Intel Core i7-6700K, I got

I have the same CPU.

> [hjl@gnu-skl-2 microbenchmark]$ make
> gcc -g -I.-c -o test.o test.c
> gcc -g   -c -o sse.o sse.S
> gcc -g   -c -o sse-clear.o sse-clear.S
> gcc -g   -c -o avx.o avx.S
> gcc -g   -c -o avx2.o avx2.S
> gcc -g   -c -o avx-clear.o avx-clear.S
> gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
> ./test
> sse  : 24533145
> sse_clear: 24286462
> avx  : 64117779
> avx2 : 62186716
> avx_clear: 58684727
> [hjl@gnu-skl-2 microbenchmark]$

You forgot the RET at the end of the AVX functions (but not the SSE ones); the
AVX functions fall through into each other, then into __libc_csu_init before
jumping around and eventually returning.  That's why they're much slower. 
Single-step through the loop in GDB...

   │0x5660            vcvtsd2ss xmm0,xmm0,xmm1
  >│0x5664            nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x566e            xchg   ax,ax
   │0x5670            vcvtsd2ss xmm0,xmm1,xmm1
   │0x5674            nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x567e            xchg   ax,ax
   │0x5680            vxorps xmm0,xmm0,xmm0
   │0x5684            vcvtsd2ss xmm0,xmm0,xmm1
   │0x5688            nop    DWORD PTR [rax+rax*1+0x0]
   │0x5690 <__libc_csu_init>    endbr64
   │0x5694 <__libc_csu_init+4>  push   r15
   │0x5696 <__libc_csu_init+6>  mov    r15,rdx

And BTW, SSE vs. SSE_clear are about the same speed because your loop
bottlenecks on the store/reload latency of keeping a loop counter in memory
(because you compiled the C without optimization).  Plus, the C caller loads
write-only into XMM0 and XMM1 every iteration, breaking any loop-carried
dependency the false dep would create.

I'm not sure why it makes a measurable difference to run the extra NOPS, and 3x
vcvtsd2ss instead of 1 for avx() vs. avx_clear(), because the C caller should
still be breaking dependencies for the AVX-128 instructions.

But whatever the effect is, it's totally unrelated to what you were *trying* to
test. :/

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #6 from Peter Cordes  ---
(In reply to Peter Cordes from comment #5)
> But whatever the effect is, it's totally unrelated to what you were *trying*
> to test. :/

After adding a `ret` to each AVX function, all 5 are basically the same speed
(compiling the C with `-O2` or -O2 -march=native), with just noise making it
hard to see anything clearly.  sse_clear tends to be faster than sse in a group
of runs, but if there are differences it's more likely due to weird front-end
effects and all the loads of inputs + store/reload of the return address by
call/ret.

I did  while ./test; do : ;done   to factor out CPU clock-speed ramp up and maybe
some cache warmup stuff, but it's still noisy from run to run.  Making
printf/write system calls between tests will cause TLB / branch-prediction
effects because of kernel spectre mitigation, so I guess every test is in the
same boat, running right after a system call.

Adding loads and stores into the mix makes microbenchmarking a lot harder.

Also notice that since `xmm0` and `xmm1` pointers are global, those pointers
are reloaded every time through the loop even with optimization.  I guess
you're not trying to minimize the amount of work outside of the asm functions,
to measure them as part of a messy loop.  So for the versions that have a false
dependency, you're making that dependency on the result of this:

mov     rax,QWORD PTR [rip+0x2ebd]     # reload xmm1
vmovapd xmm1,XMMWORD PTR [rax+rbx*1]   # index xmm1

Anyway, I think there's too much noise in the data, and lots of reason to
expect that vcvtsd2ss %xmm0, %xmm0, %xmm1 is strictly better than
VPXOR+convert, except in cases where adding an extra uop actually helps, or
where code-alignment effects matter.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #8 from Peter Cordes  ---
Created attachment 45544
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit
testloop-cvtss2sd.asm

(In reply to H.J. Lu from comment #7)
> I fixed assembly codes and run it on different AVX machines.
> I got similar results:
> 
> ./test
> sse  : 28346518
> sse_clear: 28046302
> avx  : 28214775
> avx2 : 28251195
> avx_clear: 28092687
> 
> avx_clear:
>   vxorps  %xmm0, %xmm0, %xmm0
>   vcvtsd2ss   %xmm1, %xmm0, %xmm0
>   ret
> 
> is slightly faster.


I'm pretty sure that's a coincidence, or an unrelated microarchitectural effect
where adding any extra uop makes a difference.  Or just chance of code
alignment for the uop-cache (32-byte or maybe 64-byte boundaries).

You're still testing with the caller compiled without optimization.  The loop
is a mess of sign-extension and reloads, of course, but most importantly
keeping the loop counter in memory creates a dependency chain involving
store-forwarding latency.

Attempting a load later can make it succeed more quickly in store-forwarding
cases, on Intel Sandybridge-family, so perhaps an extra xor-zeroing uop is
reducing the average latency of the store/reloads for the loop counter (which
is probably the real bottleneck.)

https://stackoverflow.com/questions/49189685/adding-a-redundant-assignment-speeds-up-code-when-compiled-without-optimization

Loads are weird in general: the scheduler anticipates their latency and
dispatches uops that will consume their results in the cycle when it expects a
load will put the result on the forwarding network.  But if the load *isn't*
ready when expected, it may have to replay the uops that wanted that input. 
See
https://stackoverflow.com/questions/54084992/weird-performance-effects-from-nearby-dependent-stores-in-a-pointer-chasing-loop
for a detailed analysis of this effect on IvyBridge.  (Skylake doesn't have the
same restrictions on stores next to loads, but other effects can cause
replays.)

https://stackoverflow.com/questions/52351397/is-there-a-penalty-when-baseoffset-is-in-a-different-page-than-the-base/52358810#52358810
is an interesting case for pointer-chasing where the load port speculates that
it can use the base pointer for TLB lookups, instead of the base+offset. 
https://stackoverflow.com/questions/52527325/why-does-the-number-of-uops-per-iteration-increase-with-the-stride-of-streaming
shows load replays on cache misses.

So there's a huge amount of complicating factors from using a calling loop that
keeps its loop counter in memory, because SnB-family doesn't have a simple
fixed latency for store forwarding.





If I put the tests in a different order, I sometimes get results like:

./test
sse  : 26882815
sse_clear: 26207589
avx_clear: 25968108
avx  : 25920897
avx2 : 25956683

Often avx (with the false dep on the load result into XMM1) is slower than
avx_clear or avx2, but there's a ton of noise.



Adding vxorps  %xmm2, %xmm2, %xmm2  to avx.S also seems to have sped it up; now
it's the same speed as the others, even though I'm *not* breaking the
dependency chain anymore.  XMM2 is unrelated, nothing touches it.

This basically proves that your benchmark is sensitive to extra instructions,
whether they interact with vcvtsd2ss or not.


We know that in the general case, throwing in extra NOPs or xor-zeroing
instructions on unused registers does not make code faster, so we should
definitely distrust the result of this microbenchmark.




I've attached my NASM loop.  It has various commented-out loop bodies, and
notes in comments on results I found with performance counters.  I don't know
if it will be useful (because it's a bit messy), but it's what I use for
testing snippets of asm in a static binary with near-zero startup overhead.  I
just run perf stat on the whole executable and look at cycles / uops.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #10 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #9)
> There was similar patch for sqrt [1], I think that the approach is
> straightforward, and could be applied to other reg->reg scalar insns as
> well, independently of PR87007 patch.
> 
> [1] https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00202.html

Yeah, that looks good.  So I think it's just vcvtss2sd and sd2ss, and
VROUNDSS/SD that aren't done yet.

That patch covers VSQRTSS/SD, VRCPSS, and VRSQRTSS.

It also bizarrely uses it for VMOVSS, which gcc should only emit if it actually
wants to merge (right?).  *If* this part of the patch isn't a bug

-   return "vmovss\t{%1, %0, %0|%0, %0, %1}";
+   return "vmovss\t{%d1, %0|%0, %d1}";

then even better would be vmovaps %1, %0 (which can benefit from
mov-elimination, and doesn't need a port-5-only ALU uop.)  Same for vmovsd of
course.

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #4 from Peter Cordes  ---
I suspect dep-chains are the problem, and branching to skip work is a Good
Thing when it's predictable.

(In reply to Richard Biener from comment #2)
> On Skylake it's better (1uop, 1 cycle latency) while on Ryzen even better.
> On Bulldozer it also isn't that bad (comparable to Skylake I guess).

SKL: AVX VBLENDVPS x,x,x,x  is 2 uops, 2c latency, ~1c throughput.  (Same for
ymm)
SKL: SSE4 BLENDVPS x,x,xmm0 is 1 uop,  1c latency, ~0.36c throughput in my
testing, or maybe 0.333c with breaking dep chains.  (IDK how Agner got 1c. 
Maybe that was an editing mistake, and he copied the 1c from the VEX
version.)


[V](P)BLENDV(B|PS|PD) is funny: the SSE versions are 1 uop on SKL, I assume
because they only have 3 register operands (including implicit XMM0).  But the
VEX encoding has 4 operands: 1 output and 3 inputs.  I think this is too many
for 1 uop to encode, and that's why VBLENDVPS is 2 uops even on Skylake.

(The blend-control register encoded by an imm8 in the VEX version instead of
implicit xmm0, but I don't think that's what stops the decoders from making it
1 uop.  I think it's simply having 4 total operands.)

On Skylake, the uop(s) for [V]BLENDVPS/D and [V]PBLENDVB can run on any of p015
(instead of only p5 on BDW and earlier), but the 2-uop VEX version is still 2
cycle latency.  The VEX version has a bias towards port 5, but less than half
the total uops run on p5 so it's not p015 + p5.  The SSE version seems equally
distributed to all of p015.



On SKL, the optimal choice might be to use the SSE encoding, if we can deal
with a destructive destination and having the blend control in xmm0.

The SSE/AVX penalty on SKL is output dependencies for write-only SSE
instructions (like movaps or cvtps2dq) writing to an XMM register that has a
dirty upper 128.  It's a per-register thing, not like Haswell where it
triggers a slow state change. 
(https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake)

---

Footnote: VBLENDVPS throughput is only 1c for a big block of it back-to-back,
even though it's only 2 uops that can run on any of 3 ports.  So why isn't it
0.66c throughput?

VBLENDVPS throughput (for back-to-back vblendvps) seems to be limited by some
front-end effect.  In an unrolled loop with 20 vblendvps (with no loop-carried
dependencies), there are a negligible amount of cycles where the front-end
delivered the full 4 uops.  Most cycles only 2 are issued.

This is not a general problem for 2-uop instructions or anything: 9x bextr +
dec/jnz = 19 uops total runs at 5.00c / iter, or 3.8 uops / clock, with the
only cycle to not issue 4 uops being (I think) the group of 3 including the
loop branch.  Playing around with other 2-uop instructions, I didn't see
front-end bottlenecks.  I saw some back-end bottlenecks because other 2-uop
instructions aren't so nicely distributed over ports, but perf counts for 
idq_uops_not_delivered.cycles_fe_was_ok:u generally equaled total cycles. 
 (It counts when either the FE delivers 4 uops, or the back end was stalled and
thus not the front-end's fault.)

A 1 uop instruction following a vblendvps can issue with it in the same cycle,
so this effect is probably not horrible for normal cases where we're using
vblendvps mixed with normal instructions.

I haven't investigated further, whether this is a front-end effect (uop cache
fetch problem?) or whether it's an allocation bottleneck.  Possibly being a
4-operand instruction has something to do with it, although I don't think each
uop can have that many operands.

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #5 from Peter Cordes  ---
   IF ( xij.GT.+HALf ) xij = xij - PBCx
   IF ( xij.LT.-HALf ) xij = xij + PBCx

For code like this, *if we can prove only one of the IF() conditions will be
true*, we can implement it more efficiently, I think, by checking the magnitude
of xij to see if a SUB is needed, and if so figuring out the sign to apply to
PBCx.

if(abs(xij) > HALF) {
xij -= PBCx XOR sign_bit( xij )
}


# xij  in  xmm0
# PBCx in  xmm7
# HALF in  xmm6
# set1( -0.0f ) in xmm5 (i.e. 1U<<31, a sign-bit mask)
vandnps    %xmm0, %xmm5, %xmm1    # abs(xij)
vcmpltps   %xmm1, %xmm6, %xmm1    # HALF < abs(xij)

vandps     %xmm5, %xmm0, %xmm2    # signbit(xij)
vxorps     %xmm7, %xmm2, %xmm2    # PBCx (xij>=0) or -PBCx (xij<0)

vandps     %xmm2, %xmm1, %xmm1    # +-PBCx, or 0.0 if abs(xij) is between -+HALF
vsubps     %xmm1, %xmm0, %xmm0    # xij -= PBCx, -PBCx, or 0.0

There's a good amount of ILP here, but the critical path is ANDPS + CMPPS +
ANDPS + SUBPS = 10 cycles on Skylake.

We might want to use VPAND for some of this on Haswell, to avoid a port 5
bottleneck at least on the critical path.  (Skylake runs FP booleans on any
port.  BDW and earlier restrict them to port 5 where they can't compete with
FMA, and where bypass latency is always optimal.  On SKL they can introduce
extra bypass latency if they pick p0 or p1.)
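
The same idea as SSE intrinsics (my sketch of the asm above, written with ps like the asm even though the original Fortran data is double):

#include <xmmintrin.h>

// Branchless range fixup: subtract +-PBCx only where abs(xij) > HALF.
__m128 fixup(__m128 xij, __m128 pbcx, __m128 half)
{
    const __m128 signbit = _mm_set1_ps(-0.0f);                  // 1U<<31 in each lane
    __m128 absx = _mm_andnot_ps(signbit, xij);                  // abs(xij)
    __m128 need = _mm_cmplt_ps(half, absx);                     // HALF < abs(xij)
    __m128 spbc = _mm_xor_ps(_mm_and_ps(signbit, xij), pbcx);   // PBCx with xij's sign applied
    return _mm_sub_ps(xij, _mm_and_ps(need, spbc));             // xij -= +-PBCx or 0.0
}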



vandps     %xmm5, %xmm0, %xmm2    # signbit(xij)
vxorps     %xmm7, %xmm2, %xmm2    # PBCx (xij>=0) or -PBCx (xij<0)

could be replaced with a (v)blendvps using the original xij to select between
PBCx and -PBCx.  With the SSE encoding, that saves a uop and a cycle of latency
(but only off the critical path).  And I think it would cost us a vmovaps to
set up for it.

---

I think this is better than IF-conversion of both IFs separately, but I haven't
really looked.  It should be much better for *latency*.  But it's only
equivalent if subtracting PBCx can't possibly make xij negative and the next IF
condition also true.

---

I was looking at a similar case of applying a fixup if the abs value of an
input is outside a range in
https://stackoverflow.com/questions/54364694/how-to-convert-scalar-code-of-the-double-version-of-vdts-pade-exp-fast-ex-app/54377840#54377840.
 I don't think I came up with anything there that's not already obvious or
covered by the example above, though.

Except if we had needed to square xij at some point, we could have checked  xij
* xij < HALF*HALF as the bound condition to save the ANDNPS.  But then the
mulps latency is part of the input to cmpps.

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #6 from Peter Cordes  ---
Oops, these were SD not SS.  Getting sleepy >.<.  Still, my optimization
suggestion for doing both compares in one masked SUB of +-PBCx applies equally.

And I think my testing with VBLENDVPS should apply equally to VBLENDVPD.

Since this is `double`, if we're going branchless we should definitely be
vectorizing for a pair of doubles, like doing 

xij = X0(1,i) - X0(1,j)   and 
yij = X0(2,i) - X0(2,j)

together with a vmovupd, and a vector of PBCx, PBCy.

Even if we later need both x and y separately (if those FMAs in the asm are
multiplying components of one vector), we might still come out ahead from doing
the expensive input processing with PD; then it's only one `vunpckhpd` to get
the Y element ready, and that can run in parallel with any x * z stuff.

Or if we can unroll by 3 SIMD vectors over contiguous memory, we can get
{X0,Y0} {Z0,X1} {Y1,Z1}.  We get twice the work for a cost of only 3 extra
unpacks, doing 2 i and j values at once.



If this was 3 floats, using a SIMD load would be tricky (maybe vmaskmovps if we
need to avoid going off the end), unless we again unroll by 3 = LCM(vec_len,
width)

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #15 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #13)
> I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> currently don't emit XOR clear in front of these instrucitons, when they
> operate with memory input.

They *do* have an output dependency.  It might or might not actually be a
problem and be worth clogging the front-end with extra uops to avoid,
depending on surrounding code. >.<

e.g. ROUNDSD:  DEST[127:64] remains unchanged
Thanks, Intel.  You'd think by SSE4.1 they would have learned that false
dependencies suck, and that it's extremely rare to actually take advantage of
this merge behaviour, but no.

For register-source ROUNDSD / ROUNDSS, we can use ROUNDPD / ROUNDPS which write
the full destination register and have identical performance on all CPUs that
support them.  (Except Silvermont, where roundps/pd have 5c latency vs. 4c for
roundss/sd.  Goldmont makes them equal.)  KNL has faster (V)ROUNDPS/D than
ROUNDSS/SD, maybe only because of the SSE encoding?  Agner Fog isn't clear, and
doesn't have an entry that would match vroundss/sd.

Copy-and-round is good for avoiding extra MOVAPS instructions which can make
SSE code front-end bound, and reduce the effective size of the out-of-order
window.
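
A minimal intrinsics sketch of the copy-and-round idea (my example, assuming
SSE4.1 and that suppressing the precision exception is acceptable):

    #include <smmintrin.h>

    /* The packed form writes all four elements, so rounding into a fresh
       register needs no MOVAPS and carries no false dependency on the
       destination's old upper elements. */
    __m128 floor_copy(__m128 x) {
        return _mm_round_ps(x, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
    }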

Preserving FP exception semantics for packed instead of scalar register-source:

* if the upper element(s) of the source is/are known 0, we can always do this
with sqrt and round, and convert: they won't produce any FP exceptions, not
even inexact.  (But not rsqrt / rcpps, of course.)
  This will be the case after a scalar load, so if we need the original value
in memory *and* the result of one of these instructions, we're all set.

* with rounding, the immediate can control masking of precision exceptions, but
not Invalid which is always raised by SRC = SNaN.  If we can rule out SNaN in
the upper elements of the input, we can use ROUNDPS / ROUNDPD

roundps/d can't produce a denormal output.  I don't think denormal inputs slow
it down on any CPUs, but worth checking for cases where we don't care about
preserving exception semantics and want to use it with potentially-arbitrary
garbage in high elements.


rsqrtps can't produce a denormal output because sqrt makes the output closer to
1.0 (reducing the magnitude of the exponent).  (And thus neither can sqrtps.) 
SQRTPS/PD is the same performance as SQRTSS/SD on new CPUs, but old CPUs that
crack 128-bit ops into 64-bit are slower: Pentium III, Pentium M, and Bobcat. 
And Jaguar for sqrt.  Also Silvermont is *MUCH* slower for SQRTPD/PS then
SD/SS, and even Goldmont Plus has slower packed SQRT, RSQRT, and RCP than
scalar.

But RCPPS can produce a denormal.  (double)1.0/FLT_MAX = 2.938736e-39, which is
smaller than FLT_MIN = 1.175494e-38
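
A quick standalone check of that claim (not from the original comment):

    #include <float.h>
    #include <stdio.h>

    /* 1/FLT_MAX (about 2.94e-39) is below FLT_MIN (about 1.18e-38), so a
       float-precision reciprocal of FLT_MAX is subnormal. */
    int main(void) {
        printf("1/FLT_MAX = %g, FLT_MIN = %g\n", 1.0 / FLT_MAX, (double)FLT_MIN);
        return 0;
    }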



So according to Agner's tables:

* ROUNDPS/PD is never slower than ROUNDSS/SD on any CPU that support them.
* SQRTPS/PD *are* slower than scalar on Silvermont through Goldmont Plus, and
Bobcat, Nano 3000, and P4 Prescott/Nocona.  By about a factor of 2, enough that
we should probably care about it for tune=generic.  For ss/ps only (not double),
also K10 and Jaguar have slower sqrtps than ss.  Also in 32-bit mode, P4,
Pentium M and earlier Intel, and Atom, are much slower for packed than scalar
sqrt.
  SQRTPD is *faster* than SQRTSD on KNL.  (But hopefully we're never tuning for
KNL without AVX available.)

* RSQRT / RCP: packed is slower on Atom, Silvermont, and Goldmont (multi-uop so
a big decode stall).  Somewhat slower on Goldmont Plus (1 uop but half
throughput).  Also slower on Nano3000, and slightly slower on Pentium 4 (before
and after Prescott/Nocona), and KNL.  (But hopefully KNL can always use
VRSQRT28PS/PD or scalar)
  Pentium M and older again decode as at least 2 uops for packed, same as
Bobcat and K8.
  Same performance for packed vs. scalar on Jaguar, K10, bdver1-4, ryzen, Core2
and later, and SnB-family.

* CVTSS2SD vs. PD, and SD2SS vs. PD2PS
  packed is slower on k8, bdver1-4 (scalar avoids the shuffle uop), Nano3000,
KNL.  On Silvermont by just 1 cycle latency (so  even a MOVAPS on the critical
path would make it equal.)  Similar on Atom.  Slower on CPUs that do 128-bit
vectors as two 64-bit uops, like Bobcat, and Pentium M / K8 and older.

  packed is *faster* on K10, Goldmont/GDM Plus (same latency, 1c vs. 2c
throughput), Prescott, P4.  Much faster on Jaguar (1c vs. 8c throughput, and 1
uop vs. 2).

  same speed (but without the false dep) for SnB-family (mostly), Core 2,
Ryzen.

  Odd stuff: Agner reports:
Nehalem: ps2pd = 2 uops / 2c, ss2sd = 1 uop / 1c.  (I guess just
zero-padding the significand, no rounding required).  pd2ps and sd2ss are equal
at 2 uops / 4c latency.
SnB: cvtpd2ps is 1c higher latency than sd2ss.
IvB: ps2pd on IvB is 1c vs. 2c for ss2sd
On HSW and later things have settled down to e

[Bug target/85366] New: Failure to use both div and mod results of one IDIV in a prime-factor loop while(n%i==0) { n/=i; }

2018-04-11 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85366

Bug ID: 85366
   Summary: Failure to use both div and mod results of one IDIV in
a prime-factor loop while(n%i==0) { n/=i; }
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

From
https://codereview.stackexchange.com/questions/191792/find-prime-factors-in-c/191801#191801,
simplified to use a pointer instead of returning std::vector. 
Interestingly, the version with std::vector can be more easily coaxed to use
both results of one idiv, see the Godbolt link.

void find_prime_factors_ptr(int n, int *p)
{
    // inefficient to test even numbers > 2, but that's a separate missed optimization.
    for (int i = 2; i <= n; i++) {
        while (n % i == 0) {
            *p++ = i;
            n /= i;   // reordering the loop body doesn't help
        }
    }
}

https://godbolt.org/g/ogyZW8

g++ 8.0.1 20180411 -O3 -march=haswell gives us this inner loop:

 ...
 # outer loop
        movl    %edi, %eax
# idiv to test if inner loop should even run once, leaving n/i in eax
.L4:
        movl    %edi, %eax        # but instead we discard it
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        cltd
        idivl   %ecx
        cltd                      # then modulo that division result to see if the next iteration should run
        movl    %eax, %edi
        idivl   %ecx              # leaves n/i in eax, ready for next iteration...
        testl   %edx, %edx
        je      .L4
 ...

So both ways to get to .L4 (fall in or loop) have n/i in EAX from an idiv
already!  The loop doesn't need to be re-structured to take advantage, gcc just
needs to keep track of what it's doing.

## Hand optimized version of the whole function:
        cmpl    $1, %edi
        jle     .L9
        movl    $2, %ecx
.L5:
        movl    %edi, %eax
        cltd
        idivl   %ecx              # eax = tmp = n/i
        testl   %edx, %edx
        jne     .L3
.L4:
        movl    %ecx, (%rsi)
        addq    $4, %rsi          # we're tuning for Haswell: no register-read stalls, so increment after reading and save a byte in the addressing mode
        movl    %eax, %edi        # n = tmp
        cltd
        idivl   %ecx              # eax = tmp = n/i
        testl   %edx, %edx
        je      .L4
.L3:
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L5
.L9:
        ret


I didn't make *any* changes to the code outside the inner loop.  I ended up
just removing movl %edi, %eax / cltd / idiv %ecx.

Changing the inner loop to

int tmp;
while (tmp = n/i, n % i == 0) {
*p++ = i;
n = tmp;
}

gives us asm that's almost as good (one extra mov inside the loop), but we get a
jmp into the loop instead of peeling the while condition off before the first
iteration:


# gcc8.0.1 -O3 -march=haswell output, commented but unmodified
find_prime_factors_ptr_opt(int, int*):
        cmpl    $1, %edi
        jle     .L18
        movl    $2, %ecx
        jmp     .L19
.L16:                             # top of inner loop
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        movl    %eax, %edi        # extra mov puts this and the next mov on the critical path
.L19:                             # inner loop entry point
        movl    %edi, %eax
        cltd
        idivl   %ecx
        testl   %edx, %edx
        je      .L16              # bottom of inner
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L19              # bottom of outer
.L18:
        ret

Saving code-size here with the dependent chain of movl %eax, %edi / movl %edi,
%eax is pretty minor even on CPUs like original Sandybridge, or Bulldozer,
without mov-elimination, because idiv's latency dominates.  But it could easily
be taken out of the inner loop by duplicating it outside the outer loop, then
moving it to the outer-only part of the loop body, like this:

        cmpl    $1, %edi
        jle     .L18
        movl    $2, %ecx
        movl    %edi, %eax        # eax = n  added here
        jmp     .L19
.L16:                             # top of inner loop
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        movl    %eax, %edi        # n = tmp  still here
.L19:                             # inner loop entry point
      # movl    %edi, %eax        # eax = n  removed from here in the inner/outer loop
        cltd
        idivl   %ecx
        testl   %edx, %edx
        je      .L16              # bottom of inner

        movl    %edi, %eax        # eax = n  also added here, in the outer-only part
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L19              # bottom of outer

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-15 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
This LEA stuff is part of what gcc does to align the stack by 32 for spilling
AVX locals.

Gcc's stack-align sequence is over-complicated and ties up an extra register
for the whole function (add  volatile  to the local and see the -O3 code).  Or
at least it was; it seems gcc8 trunk just makes a stack frame with EBP / RBP
but references 32-byte aligned locals from aligned RSP instead of unaligned
RBP.

It used to copy the address of the return address to make a full copy of
ret-addr / saved-RBP for the aligned stack frame, which was super weird.

https://godbolt.org/g/RLJNtd.  (With an alloca or something, gcc8 does the same
crazy stack-frame stuff as gcc7, otherwise it's much cleaner, like clang)



The actual bug here is that it's not fully optimized away when it turns out
that no 32-byte spills / reloads from locals are left in the function.

gcc for x86-64 sometimes has a few leftover instructions like that in more
complex functions using __m256; this is not exclusively an i386 problem, but
it's happens more easily for 32-bit it seems.

[Bug c++/69560] x86_64: alignof(uint64_t) produces incorrect results with -m32

2018-04-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69560

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #23 from Peter Cordes  ---
Just to recap the current situation (gcc/g++ 8.0.1 20180425):

I ported David Marillat's testcase to work as C or C++
https://godbolt.org/g/QdG2V6.  (And changed it to set global variables instead
of calling printf, so you can see the results from looking at the asm output
instead of running it).

C++11 alignof() now agrees with C11 alignof() (which didn't change) that
alignof(int64_t) is 4 when targeting the i386 System V ABI.

Previously G++'s alignof() reported 8, while gcc's C11 alignof (stdalign.h)
reported 4.  That was the only change: struct-member alignof results are
unchanged, and already matched between C11 and C++11.


4 is the minimum alignment that *any* int64_t, or pointer to int64_t, is
assumed to have when generating code for i386 SysV.  gcc / g++ are allowed to
generate code that breaks if passed a pointer to int64_t that wasn't 4-byte
aligned.  (Auto-vectorization is one case where that can happen on x86:
https://stackoverflow.com/q/47510783/224132).

They're *not* allowed to assume that it's 8-byte aligned unless they can see
the definition and know that a particular int64_t object is over-aligned, e.g.
to its natural alignment of 8, like gcc chooses to do whenever possible (i.e.
outside structs).

So in both C++ and C (and in g++/gcc after this patch), alignof(int64_t) is the
minimum that any allocator must give an int64_t for correctness (in this funky
32-bit ABI), not the recommended alignment that gcc and g++ both already used
whenever ABI struct-packing rules didn't constrain them.

It's also the guaranteed minimum that code can *assume*.  e.g. a
manually-vectorized library function might check alignof(T) == sizeof(T) before
assuming that using 16-byte aligned loads/stores can line up with element
boundaries.  (An array inside a struct { int foo; int64_t arr[10]; } would
violate this for i386 SysV).

Anyway, I think use-cases like these are why the standard is worded the way it
is, and why it makes sense for alignof() to report the guaranteed/required
minimum.  The recommended or actual alignment is useful, too, though, for other
cases, so it's nice that GNU __alignof() is also available to report that.



Semi-related: gcc depends on 8-byte alignment for C11 _Atomic int64_t but still
fails to provide it inside structs on the i386 SysV ABI (Bug 65146), using the
same alignment rules as regular int64_t.

C++11 std::atomic is fine, getting the required natural alignment even
on i386 SysV so SSE2 movq is atomic and lock add is efficient.

This change to what alignof() reports in C++ had no effect on C at all, or on
any alignment choices made by the compiler in either C or C++.  I only mention
it as another interesting case where i386 SysV's under-alignment of 64-bit
types requiring special care, but that one will require an ABI change of some
sort to fix.

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

--- Comment #2 from Peter Cordes  ---
The stray LEA bug seems to be fixed in current trunk (9.0.0 20180429), at least
for this testcase.  Gcc's stack-alignment strategy seems to be improved overall
(not copying the return address when not needed), so probably it's really
fixed.

It's still present in 7.3.

[Bug tree-optimization/85585] New: switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a st

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585

Bug ID: 85585
   Summary: switch to select a string based on an enum can
profitably optimize away the table of pointers/offsets
into fixed-length char[] blocks.  Or use byte offsets
into a string table
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

Bug 84011 shows some really silly code-gen for PIC code and discussion
suggested using a table of offsets instead of a table of actual pointers, so
you just need one base address.

A further optimization is possible when the strings are all similar length,
and/or the longest one isn't much longer than a pointer:

Pad all strings to the same length with trailing 0 bytes, and calculate a
pointer instead of loading it from an array.  This removes the possibility of
multiple entries sharing the same suffix (which is a missed optimization gcc
wasn't already doing), but avoids needing any space for storing pointers in
memory at all.

In the case discussed in bug 84011 (Linux's phy.h const char
*phy_modes(phy_interface_t interface)), the longest strings are 11 bytes
(including the \0), and there are 23 of them.  So it takes 253 bytes of char
data to store everything (not counting the "unknown" for the default: special
case) with all strings padded to 11 bytes.



The current strings + pointer-table implementation doesn't merge string
literals where one string is a suffix of another; this is another a
missed-optimization that would save many bytes here.  (e.g. instead of .string
"mii" and .string "gmii", just have .LC4 .byte 's'; .LC3: .byte 'g'; .LC2:
.string "mii".)

That optimization plus byte or 16-bit offsets into the table would be nice and
compact, and most CPUs have efficient zero-extending narrow loads.  So for
cases where the other optimization I'm suggesting isn't good, that would
probably be best.



The current packed string-data takes 158 bytes, so with 4-byte offsets it
takes 158+23*4 = 250 bytes.  Or with 8-byte pointers/offsets, it takes 158 +
23*8 = 342 bytes.  Or with 1-byte offsets, 158 + 23*1 = 181 bytes: load with
movzbl.  (If you can't use the offset directly as an 8-byte memory source
operand for ADD to a pointer, there's no point making it 32 bits instead of 8.)

The code for *using* such a table is quite simple.  This C source compiles to
what I'm suggesting:

https://godbolt.org/g/E8J3iS

struct foo {
    char str[11];
} const table[23] = {};

const char *lookup(unsigned long idx) {
    if(idx > 23) {
        return "unknown";
        //idx=23;
    }
    return table[idx].str;
}

Multiply by 11 only takes 2 LEA instructions on x86, so for PIC code with a
RIP-relative LEA we end up with 4 ALU instructions total to get a string
address, after checking the if condition:

   # gcc7.3 -march=haswell -O3 -fPIE output:  https://godbolt.org/g/qMzaY8
        leaq    .LC0(%rip), %rax      # "unknown"
        cmpq    $23, %rdi
        ja      .L4                   # branchless is also an option
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rip), %rdx     # RIP-relative table base address
        leaq    (%rdi,%rax,2), %rax
        addq    %rdx, %rax            # table + 11*idx
.L4:
        ret

This is even better in no-PIE mode where a static address is usable as a signed
32-bit immediate:

lookup(unsigned long):
        movl    $.LC0, %eax
        cmpq    $23, %rdi
        ja      .L4
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rdi,%rax,2), %rax   # 3 cycle latency for 3-component LEA on SnB-family
.L4:
        ret

So this has extremely low code-size cost on x86-64, for the benefit of removing
a table load in the dependency chain from enum to string data.  It does cost
significant data size vs. a byte-offset table with suffix-merging, but it's 
better than what gcc is doing now in non-PIE (table of qword pointers), and
*much* better in PIE (insane jump table).

-

The byte-index version is equivalent to transforming the C source like this:

const char packedstrings[158] = {};
const unsigned char offsets[23] = {};
const char *lookup_byteidx(unsigned long idx) {
    if(idx>23)
        return "unknown";
    return &packedstrings[offsets[idx]];
}

        leaq    .LC0(%rip), %rax      # "unknown"
        cmpq    $23, %rdi
        ja      .L9
        leaq    offsets(%rip), %rax
        leaq    packedstrings(%rip), %rdx
        movzbl  (%rax,%rdi), %eax
        addq    %rdx, %rax
.L9:
        ret

We can save an instruction here by making the relativ

[Bug tree-optimization/85585] switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a string

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585

--- Comment #1 from Peter Cordes  ---
By comparison, the no-PIE table of pointers only needs one instruction:

        movq    CSWTCH.4(,%rdi,8), %rax

So all my suggestions cost 1 extra instruction on x86 in no-PIE mode, but at a
massive savings in data size.

clang -fPIE compiles the plain switch to the obvious / sane 2 instruction
sequence which should be our baseline for normal cases.

# clang6.0 -fPIE -O3  (switch compilers on the Godbolt link)
        leaq    .Lswitch.table.phy_modes(%rip), %rcx
        movq    (%rcx,%rax,8), %rax

Clang is willing to make a table that needs relocations for the entries.  (My
suggestions all avoid that because they're based on offsets, not a table of
pointers.  Avoiding rodata relocations that dirty a page and prevent sharing
has some non-zero value, although it's low on many architectures where memory
is cheap.)

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #9 from Peter Cordes  ---
(In reply to rguent...@suse.de from comment #4)
> An optimization would be to
> add an indirection by, say, only recording the constant offset
> into an "array of strings" in the table, thus effectively
> 
>   "case1\0case2\0..."[CSWITCH[i]]
> 
> which would require only a relocation to access the single string
> constant.  But it would prohibit cases of string merging within
> those strings unless we implement that as well for this optimization.

gcc already totally misses optimizations here where one string is a suffix of
another.  "mii" could just be a pointer to the 3rd byte of "sgmii", but we
instead duplicate all the characters.  That's where major savings are possible
for this function.

> Note this might be profitable unconditionally, not just with -fpie/pic
> as the CSWITCH table would be smaller (dependent on the total
> size of the merged string).

Indeed, I wrote up bug 85585 with ideas for optimizing this.  A table of byte
or uint16_t offsets into a static buffer of packed strings looks good for PIC
and for position-dependent.

To avoid any runtime relocations, all you need is the ability to get a static
address into a register (e.g. RIP-relative LEA) and do an indexed load relative
to it, just like using a normal static char[].  Then add the load result to
that address.  Runtime relocation is nice to avoid even if you don't *need* to
avoid it.

Also possible is padding each string out to a constant length and calculating
an index into that, removing a level of indirection.  (Good when strings are
similar length and/or all short, and there aren't many strings that are
duplicates or suffixes of others.)  Again you just need to get a static address
into a register, and add it to 11*enum_value.  This is all ADD + LEA (with one
of them being RIP-relative).

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011

--- Comment #12 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #10)
> (In reply to Peter Cordes from comment #9)
> > gcc already totally misses optimizations here where one string is a suffix
> > of another.  "mii" could just be a pointer to the 3rd byte of "sgmii", but
> > we instead duplicate all the characters.  That's where major savings are
> > possible for this function.
> 
> ??  That is the task for the linker SHF_MERGE|SHF_STRINGS handling.
> Why should gcc duplicate that?

Oops, right I was only looking at gcc's asm output, didn't check an actual
linked binary.

Will the linker currently catch a case like this?

.LC_base:
.LC2: .string "mii"
.LC3: .string "gmii"

table:
.byte  .LC2 - .LC_base,  .LC3 - .LC_base

and drop .string "mii" entirely + rewrite the table to
.byte  .LC3+1 - .LC_base,  .LC3 - .LC_base

(This discussion should probably be happening on bug 85585.)

Sorry I don't know the actual mechanism by which gcc signals to the linker that
it can / can't merge.  I guess only in some sections?  Because gcc couldn't
allow it if was emitting an array like this, where dropping a string would
change the offsets for later data and break offset calculations:

const struct { char str[11]; } table[] = { {"mii"}, {"gmii"} };

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011

--- Comment #13 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #10)
> ??  That is the task for the linker SHF_MERGE|SHF_STRINGS handling.
> Why should gcc duplicate that?

Because gcc would benefit from knowing if merging makes the total block of
strings for a switch() table short enough to use a uint8_t offset[] instead of
uint16_t.

If we don't know at compile time, we'd have to be conservative and potentially
use a wider offset table.  (Although as Joseph points out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585#c2, without more linker
support for this we could end up missing out on literal merging across
compilation units.  So perhaps a first step in applying this idea would be to
use 32-bit offsets from the start of the .rodata.str1.1 section, so we can
still let the linker merge strings and end up with them non-contiguous without
having to force the one that gets kept to be the one that's part of our block
of strings.)

[Bug tree-optimization/69615] 0 to limit signed range checks don't always use unsigned compare

2018-06-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615

--- Comment #5 from Peter Cordes  ---
Update: https://godbolt.org/g/ZQDY1G

gcc7/8 optimizes this to and / cmp / jb, while gcc6.3 doesn't.

void rangecheck_var(int64_t x, int64_t lim2) {
  //lim2 >>= 60;
  lim2 &= 0xf;  // let the compiler figure out the limited range of limit
  if (x>=0 && x<lim2) ext();
}

// with a constant limit:  if (x>=0 && x<=(INT_MAX-1)) ext();   // clang and gcc use 2 branches
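
The transformation written out by hand looks like this (a sketch; ext() is the
same external function as above, and the function name is just for
illustration):

    #include <stdint.h>

    void ext(void);

    void rangecheck_by_hand(int64_t x, int64_t lim)
    {
        lim &= 0xf;                       /* lim is provably in [0, 15] */
        if ((uint64_t)x < (uint64_t)lim)  /* same as  x >= 0 && x < lim  when lim >= 0 */
            ext();
    }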

[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

2018-06-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #14 from Peter Cordes  ---
I happened to look at this old bug again recently.

re: extracting the low two 32-bit elements:

(In reply to Uroš Bizjak from comment #11)
> > Or without SSE4 -mtune=sandybridge (anything that excluded Nehalem and other
> > CPUs where an FP shuffle has bypass delay between integer ops)
> > 
> > movd %xmm0, %eax
> > movshdup %xmm0, %xmm0  # saves 1B of code-size vs. psrldq, I think.
> > movd %xmm0, %edx
> > 
> > Or without SSE3,
> > 
> > movd %xmm0, %eax
> > psrldq   $4,  %xmm0# 1 m-op cheaper than pshufd on K8
> > movd %xmm0, %edx
> 
> The above two proposals are not suitable for generic moves. We should not
> clobber input value, and we are not allowed to use temporary.

SSE3 movshdup broadcasts the high element within each pair of 32-bit elements
so 

   movshdup  %xmm0, %xmm1
   movd  %xmm1, %eax

saves a byte of code vs  pshufd / movd, and saves a uop on Merom and avoids a
flt->int.  (According to Agner Fog's tables, pshufd is flt->int domain, i.e. it
wants input in the float domain.  While movshdup ironically is only an integer
shuffle.)

Probably not worth looking for that optimization, though, because it's not
worth using universally (Nehalem has worse latency for float shuffles between
int instructions).


With just SSE2, PSHUFLW is the same size as PSHUFD and faster on Merom / K8
(slowshuffle CPUs where PSHUFD is multiple uops).  It's not slower on any
current CPUs.  I could imagine some future CPU having better throughput for
32-bit element size shuffles than 16-bit, though.  That's already the case for
wider lane-crossing shuffles (VPERMW YMM is multiple uops on Skylake-AVX512). 
This would be a definite win for tune=core2 or k8, and Pentium M, but those are
so old it's probably not worth adding extra code to look for it.

I think it's pretty future-proof, though, unless Intel or AMD add an extra
shuffle unit for element sizes of 32-bit or wider on another port.

[Bug target/80820] _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it.

2018-06-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820

--- Comment #5 from Peter Cordes  ---
AVX512F with merge-masking for integer->vector broadcasts gives us a single-uop
replacement for vpinsrq/d, which is 2 uops on Intel/AMD.

See my answer on
https://stackoverflow.com/questions/50779309/loading-an-xmm-from-gp-regs.  I
don't have access to real hardware, but according to reported uop counts, this
should be very good: 1 uop per instruction on Skylake-avx512 or KNL

vmovq         xmm0, rax                       1 uop p5   2c latency
vpbroadcastq  xmm0{k1}, rdx   ; k1 = 0b0010   1 uop p5   3c latency
vpbroadcastq  ymm0{k2}, rdi   ; k2 = 0b0100   1 uop p5   3c latency
vpbroadcastq  ymm0{k3}, rsi   ; k3 = 0b1000   1 uop p5   3c latency

xmm vs. ymm vs. zmm makes no difference to latency, according to InstLatx64

(For a full ZMM vector, maybe start a 2nd dep chain and vinsert to combine
256-bit halves.  Also means only 3 k registers instead of 7)

vpbroadcastq  zmm0{k4}, rcx   ; k4 = 0b10000  1 uop p5   3c latency
... filling up the ZMM reg


Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT:

    mov      eax, 0b0010      # = 2
    kmovw    k1, eax
    KSHIFTLW k2, k1, 1
    KSHIFTLW k3, k1, 2

  #  KSHIFTLW k4, k1, 3
 ...

KSHIFT runs only on port 5 (SKX), but so does KMOV; moving from integer
registers would just cost extra instructions to set up integer regs first.

It's actually ok if the upper bytes of the vector are filled with broadcasts,
not zeros, so we could use 0b1110 / 0b1100 etc. for the masks.  We could start
with kxnor to generate a -1 and left-shift that, but that's 2 port5 uops vs.
mov eax,2 / kmovw k1, eax being p0156 + p5.

Loading k registers from memory is not helpful: according to IACA, it costs 3
uops.  (But that includes p237, and a store-AGU uop makes no sense, so it might
be wrong.)
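
For reference, a C intrinsics sketch of the same merge-masked-broadcast idea
(my own example, not from the linked answer; needs AVX512VL, and the function
name and mask constants are arbitrary):

    #include <immintrin.h>

    __m256i set4_epi64_from_gprs(long long a0, long long a1, long long a2, long long a3)
    {
        __m128i lo = _mm_cvtsi64_si128(a0);         /* vmovq                         */
        lo = _mm_mask_set1_epi64(lo, 0x2, a1);      /* merge-masked vpbroadcastq xmm */
        __m256i v = _mm256_castsi128_si256(lo);     /* upper half is filled below    */
        v = _mm256_mask_set1_epi64(v, 0x4, a2);     /* merge-masked vpbroadcastq ymm */
        v = _mm256_mask_set1_epi64(v, 0x8, a3);
        return v;
    }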

[Bug rtl-optimization/86352] New: setc/movzx introduced into loop to provide a constant 0 value for a later rep stos

2018-06-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352

Bug ID: 86352
   Summary: setc/movzx introduced into loop to provide a constant
0 value for a later rep stos
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

The wrong-code bug 86314 also revealed some very weird code-gen decisions,
which the fix didn't improve.

(I think the lock bts peephole is seen pretty late, and that's one necessary
factor for this problem.  But even without it, an unnecessary data dependency
between the lock bts loop and clearing memory is silly.)

This ended up being about 5 separate bugs, but IDK which belong together or are
already reported:

* useless mov %rsi, %rcx and useless mov %rdx, %rdi
* using setc/movzx instead of xor %eax,%eax to get a constant 0; slower and
creating a data dependency
* Doing that inside the loop instead of after
* Not adjusting register allocation to allow xor / set-flags / setc
* rep stos vs. vector stores as a zeroing strategy vs. any other repeated
value.



The reproducer test-case for bug 86314  loops until it finds and claims a zero
bit in a uint64_t, then returns a Bucket() object (with a constructor that
zero-initializes it) with no data dependency on anything.

But gcc decides to introduce a flag -> integer 0/1 inside the acquire() loop
instead of just using  xor eax,eax  before rep stosq.  The loop can only exit
when CF = 0, so RAX = 0, so it's not a correctness problem.

The loop is branching on CF as set by BTS, so there's no need to have the 0/1
in a register at all inside the loop, and setc/movzx from a known-zero CF is
more expensive that xor-zeroing.  (Plus it gives the STOSQ a data dependency on
the LOCK BTS flag result which it wouldn't have otherwise.  The stores can't
commit until after the lock memory barrier, but they can execute.)

This is the actual code-gen from (GCC-Explorer-Build) 9.0.0 20180627
https://godbolt.org/g/XGF5tR


BucketMap::acquireBucket():
        movq    %rdi, %rdx
        movq    %rsi, %rcx        # useless, lock bts can use (%rsi)
.L2:
        movq    (%rsi), %rax
        andl    $1, %eax          # source is simplified to only check positions 0 or 1
        lock btsq   %rax, (%rcx)  # Why not (%rsi)?
        setc    %al
        movzbl  %al, %eax         # xor / bts / setc would have been possible with a different reg
        jc      .L2
        # rax = 0 because the loop can only exit when CF=0

        # should use  xor %eax,%eax  here instead

        movq    %rdx, %rdi        # Useless, RDI still == RDX
        movl    $16, %ecx
        rep stosq
        movq    %rdx, %rax        # can't be done before rep stosq: RAX needs to be 0
        ret



With -m32, where 64-bit lock bts isn't available, we have lock cmpxchg8b ending
with an OR.  So there is a zero in an integer register from that, but it's not
in EAX, so the code gen includes an extra `mov %esi, %eax`, which is not
cheaper than xor %eax,%eax especially with -march=haswell.  Sandybridge-family
has xor-zeroing as cheap as a NOP, but mov-elimination isn't always perfect and
SnB itself doesn't have it.

And of course mov still has a data dependency on the source of the zero, so it
defeats the effect of branch prediction + speculative breaking (control)
dependencies.  This last applies on any out-of-order x86.

I guess the lock bts peephole is seen too late to notice that it can't recycle
the 0 from the loop condition anymore, and ends up generating code to
materialize it.  But why inside the loop?

--


Even if we *did* need an integer 0/1 in a register inside the loop, we could
still use the xor / set-flags / setcc optimization: Simply use a register other
than RAX for the load / AND $1 / bts source.  And you can hoist the xor-zeroing
out of the loop.


        xor     %eax, %eax
.L2:
        movq    (%rsi), %rcx
        andl    $1, %ecx
        lock btsq   %rcx, (%rsi)
        setc    %al
        # use %rax
        jc      .L2


---

Separately:

If the initializer is non-zero, it uses SSE or AVX stores.  That makes no sense
either: if rep stosq is optimal, use  mov eax, 1 for the all-ones case.  (See
the ifdef in the Godbolt link to try it)

If it's not optimal, use xorps xmm0,xmm0 to create an all-zero vector.

I guess gcc is checking for all-zeros as a common special case, but doesn't
check for repeats of any other value, except for repeated bytes recognized as
memset.

So it makes sense that gcc uses a different strategy, but I think for only 16x
8 bytes (128 bytes) that vector stores beat rep stos on current CPUs.  (That
may change when IceLake introduces fast

[Bug tree-optimization/91026] switch expansion produces a jump table with trivial entries

2019-07-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #3 from Peter Cordes  ---
(In reply to Martin Liška from comment #2)
> Switch conversion bails out because it knowns that a jump table (or a bit
> test can) be used for this snippet. Then we prefer to use a jump table then
> a bit test. With -fno-jump-tables we generate the same code.
> That said, I confirm it's a small limitation.

This regression appeared in GCC9 for this test-case, and is present in GCC9.1
on Godbolt: https://godbolt.org/z/fDjTxN

bool is_vowel(char c) {
switch (c) {
case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
  return 1;
default:
  return 0;
}
}


But simplifying it

 case 'a': case 'e': case 'i':

to those 3 cases gets gcc9 and trunk to use an immediate bitmap.

With gcc8 and earlier, the x86-64 asm for the 2 versions is identical except
for the immediate used with TEST EAX, imm32.



(And BTW, there's a missed optimization here of using  mask & (1<<n)  instead
of  (mask>>n) & 1.  Or better, looking for that conversion in user source code /
logic, because people often write tests that way, requiring the creation of an
actual 1 in a register.
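
For example, the bitmap test written in the shift-the-mask form (a sketch; the
0x1104111 constant is mine: bits 0, 4, 8, 14, 20, 24 for 'a','e','i','o','u','y'):

    #include <stdbool.h>

    bool is_vowel_bitmap(char c)
    {
        unsigned idx = (unsigned char)c - 'a';
        if (idx > 'y' - 'a')
            return false;
        return (0x1104111u >> idx) & 1;   /* no need to materialize a 1<<idx */
    }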

Or for ISAs with flags, have the mask already right-shifted by 1 so the bit
shifted out is the one we want.  Then CF = result with no extra test.

Also an x86 missed optimization: BT reg,reg is very efficient (single uop) on
Intel and Ryzen, and avoids needing a 3-uop-on-Intel shift-by-CL or a mov reg,1

I'll report these ideas separately if/when I get around to it.

[Bug c/91398] Possible missed optimization: Can a pointer be passed as hidden pointer in x86-64 System V ABI

2019-08-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91398

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #4 from Peter Cordes  ---
EAD neglected to link previous discussion about this in the initial bug report.

https://stackoverflow.com/a/57377890/224132 points out that the SysV ABI
wording is 

> If the type has class MEMORY, then **the caller provides space** for the 
> return value and passes the address of this storage in  %rdi

We can argue semantics, but in my answer on the same question, I argued that
the implication is that that space won't alias any other space.  (Because the
return-value object exists in the C abstract machine, so the default assumption
should be that it exists for real in the calling convention.)



Whether it's practical to look for this optimization or not, I'm still curious
about the point that @M.M made about the semantics of  restrict  

https://stackoverflow.com/questions/57377314/what-prevents-the-usage-of-a-function-argument-as-hidden-pointer/57436765#comment101288442_57403379

Does the callee do_something() reading a global count as happening inside the
block scope of use(Vec3 *restrict out) { ... }?  The ISO C standard wording
talks about reaching the end of a block, which hasn't happened even though
`out` is not in scope inside the other function.

If so, then calling use(&global) creates UB when *out = do_something();
executes because it writes the pointed-to memory via a restrict-pointer in the
same block where it reads it from a pointer that's not derived from out.

If so, restrict would make this optimization safe if we can prove that
do_something is "noexcept" and doesn't longjmp.

[Bug middle-end/91515] missed optimization: no tailcall for types of class MEMORY

2019-08-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91515

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
The real missed optimization is that GCC is returning its own incoming arg
instead of returning the copy of it that create() will return in RAX.

This is what blocks tailcall optimization; it doesn't "trust" the callee to
return what it's passing as RDI.

See https://stackoverflow.com/a/57597039/224132 for my analysis (the OP asked
the same thing on SO before reporting this, but forgot to link it in the bug
report.)

The RAX return value tends to rarely be used, but probably it should be; it's
less likely to have just been reloaded recently.

RAX is more likely to be ready sooner than R12 for out-of-order exec.  Either
reloaded earlier (still in the callee somewhere if it's complex and/or
non-leaf) or never spilled/reloaded.

So we're not even gaining a benefit from saving/restoring R12 to hold our
incoming RDI.  Thus it's not worth the extra cost (in code-size and
instructions executed), IMO.  Trust the callee to return the pointer in RAX.

[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64

2019-10-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #4 from Peter Cordes  ---
Since some code is apparently still avoiding this because of old broken GCC
(e.g.
https://github.com/pmem/pmdk/blob/a6031710f7c102c6b8b6b19dc9708a3b7d43e87b/src/libpmem/x86_64/memset/memset_nt_avx512f.h#L193
)

Perhaps a workaround of  _mm512_castsi512_si256 would be useful?  Or does that
ICE as well?  I can't repro the bug on Godbolt so IDK.

Doing _mm512_set1_epi8(c) and a separate _mm256_set1_epi8(c) doesn't CSE with
GCC, only clang.  https://godbolt.org/z/uZ4lv-   And if you leave out 
-march=skylake-avx512 you get even worse asm from GCC.

[Bug tree-optimization/92080] New: Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2019-10-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

Bug ID: 92080
   Summary: Missed CSE of _mm512_set1_epi8(c) with
_mm256_set1_epi8(c)
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

As a workaround for PR 82887 some code (e.g. a memset) uses

__m512i zmm = _mm512_set1_epi8((char)c);
__m256i ymm = _mm256_set1_epi8((char)c);

instead of 

  ymm = _mm512_castsi512_si256(zmm);

(found in the persistent-memory library
https://github.com/pmem/pmdk/blob/a6031710f7c102c6b8b6b19dc9708a3b7d43e87b/src/libpmem/x86_64/memset/memset_nt_avx512f.h#L193
)

Obviously we'd like to CSE that instead of actually broadcasting twice.  MVCE:

#include <immintrin.h>

__m512i sinkz;
__m256i sinky;
void foo(char c) {
    sinkz = _mm512_set1_epi8(c);
    sinky = _mm256_set1_epi8(c);
}

https://godbolt.org/z/CeXhi8  g++ (Compiler-Explorer-Build) 10.0.0 20191012

# g++ -O3 -march=skylake-avx512  (AVX512BW + AVX512VL are the relevant ones)
foo(char):
        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vpbroadcastb    %edi, %ymm0          # wasted insn
        vmovdqa64       %ymm0, sinky(%rip)   # wasted EVEX prefix
        vzeroupper
        ret

Without AVX512VL it wastes even more instructions (vmovd + AVX2 vpbroadcastb
xmm,ymm), even though AVX512BW vpbroadcastb zmm does set the YMM register. 
(There are no CPUs with AVX512BW but not AVX512VL; if people compile that way
it's their own fault.  But this might be relevant for set1_epi32() on KNL).

Clang finds this optimization, and uses a shorter vmovdqa for the YMM store
saving another 2 bytes of code size:

        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vmovdqa         %ymm0, sinky(%rip)
        vzeroupper
        ret

[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64

2019-10-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887

--- Comment #5 from Peter Cordes  ---
Reported bug 92080 for the missed CSE

[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

2017-12-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837

--- Comment #5 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #4)
> Can't reproduce.  It is true that we now emit the __atomic_load_16 call, but
> that was intentional change

Yup.

>, and it can't be easily tail call, because the
> tailcall pass doesn't understand that the low 8 bytes of the 16 byte
> structure are returned the same as the whole structure

Ok that's disappointing, but hopefully is very rare after inlining.

> But I certainly can't reproduce any significant value masking etc., tried
> r235002 (+- gcc 6 branchpoint), r247000 (+- gcc 7 branchpoint) as well as
> current trunk.
> Unless it is something that has been broken on the 7 branch and later fixed.
> 
> But have just tried gcc 7.1.0 release and can't reproduce even there.

I can't repro it locally with gcc7.1.1 either.  This is the version info from
-fverbose-asm on the godbolt.org link (which does still repro it)

# GNU C++11 (GCC-Explorer-Build) version 7.1.0 (x86_64-linux-gnu)
#   compiled by GNU C version 5.4.0 20160609, GMP version 6.1.0, MPFR
version 3.1.4, MPC version 1.0.3, isl version isl-0.16.1-GMP

It's not present in the gcc7.2 build on Godbolt.org either.

I asked Matt Godbolt what exact version the compiler explorer site is using for
the gcc7.1.0 dropdown
(https://github.com/mattgodbolt/compiler-explorer/issues/684).  Hopefully he
can help us track down a gcc SVN revision to repro it, or confirm that it was a
misconfigured or buggy gcc or something.  Just to rule out the possibility of a
now-dormant bug lurking somewhere.

[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

2017-12-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837

--- Comment #6 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #4)
> But have just tried gcc 7.1.0 release and can't reproduce even there.

Matt says the Compiler Explorer backend uses upstream release tarballs like
`URL=ftp://ftp.gnu.org/gnu/gcc/gcc-${VERSION}/${TARBALL}`.  (where TARBALL is
`gcc-${VERSION}.tar.xz` for recent gcc where .xz is available).

The compiler config used is
https://github.com/mattgodbolt/compiler-explorer-image/blob/master/gcc/build/build.sh#L78:

CONFIG=""
CONFIG+=" --build=x86_64-linux-gnu"
CONFIG+=" --host=x86_64-linux-gnu"
CONFIG+=" --target=x86_64-linux-gnu"
CONFIG+=" --disable-bootstrap"
CONFIG+=" --enable-multiarch"
CONFIG+=" --with-abi=m64"
CONFIG+=" --with-multilib-list=m32,m64,mx32"
CONFIG+=" --enable-multilib"
CONFIG+=" --enable-clocale=gnu"
CONFIG+=" --enable-languages=c,c++,fortran" # used to have go, but is
incompatible with m32/mx32
CONFIG+=" --enable-ld=yes"
CONFIG+=" --enable-gold=yes"
CONFIG+=" --enable-libstdcxx-debug"
CONFIG+=" --enable-libstdcxx-time=yes"
CONFIG+=" --enable-linker-build-id" 
CONFIG+=" --enable-lto"
CONFIG+=" --enable-plugins"
CONFIG+=" --enable-threads=posix"
CONFIG+=" --with-pkgversion=GCC-Explorer-Build"
BINUTILS_VERSION=2.29.1


Does that help figure out how to build a gcc7.1.0 that can repro this?

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 80846, which changed state.

Bug 80846 Summary: auto-vectorized AVX2 horizontal sum should narrow to 128b 
right away, to be more efficient for Ryzen and Intel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

Peter Cordes  changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

--- Comment #21 from Peter Cordes  ---
(In reply to Richard Biener from comment #20)
> Fixed.

Unfortunately only fixed for integer, not FP.  The OpenMP and vanilla float
array sum functions from the godbolt link in the initial bug report still use
256b shuffles, including a gratuitous vperm2f128 when the upper half isn't
used, so vextractf128 would have done the same job in 1 uop on Ryzen instead of
8.

Even on Intel CPUs, they're optimized for code-size, not performance (vhaddps
instead of shuffle / vaddps).  Remember that Intel CPUs with AVX only have one
FP shuffle unit.  (Including Sandy/Ivybridge, which has 2 integer-128 shuffle
units)

float sumfloat_autovec(const float arr[]) {
  arr = __builtin_assume_aligned(arr, 64);
  float sum=0;
  for (int i=0 ; i<1024 ; i++)
    sum = sum + arr[i];
  return sum;
}

# gcc 20180113 -mavx2 -ffast-math -O3
#  (tune=generic, and even with arch=znver1 no-prefer-avx128)
...
vhaddps %ymm0, %ymm0, %ymm0
vhaddps %ymm0, %ymm0, %ymm1
vperm2f128  $1, %ymm1, %ymm1, %ymm0   # why not vextract?
vaddps  %ymm1, %ymm0, %ymm0   # gratuitous 256b
vzeroupper

This bug is still present for FP code: it narrows from 256b to scalar only in
the last step.

Every VHADDPS is 2 shuffles + 1 add on Intel.  They're in-lane shuffles, but
it's still 2 uops for port5 vs. VSHUFPS + VADDPS.  (Costing an extra cycle of
latency because with only 1 shuffle port, the 2 interleave-shuffles that feed a
vertical-add uop can't run in the same cycle.)  (V)HADDPS with the same input
twice is almost never the best choice for performance.

On Ryzen it's an even bigger penalty: HADDPS xmm is 4 uops (vs. 3 on Intel). 
It's also 7c latency (vs. 3 for ADDPS).  256b VHADDPS ymm is 8 uops, one per 3
cycle throughput, and Agner Fog reports that it's "mixed domain", i.e. some
kind of penalty for ivec / fp domain crossing.  I guess the shuffles it uses
internally are ivec domain?

With multiple threads on the same core, or even with ILP with surrounding code,
uop throughput matters as well as latency, so more uops is worse even if it
didn't have latency costs.

The sequence I'd recommend (for both Intel and AMD) is:
(See also
http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026)


        vextractf128  $1, %ymm0, %xmm1
        vaddps        %xmm1, %xmm0, %xmm0    # narrow to 128b

        vmovshdup     %xmm0, %xmm1           # copy high->low in each pair
        vaddps        %xmm1, %xmm0, %xmm0

        vmovhlps      %xmm0, %xmm0, %xmm1    # duplicate high 64b
        vaddps        %xmm1, %xmm0, %xmm0

The MOVSHDUP / MOVHLPS sequence is also what you want without VEX, so you can
do a 128b hsum with 4 instructions, with no MOVAPS.
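
In C intrinsics the same sequence looks like this (adapted from the SO answer
linked above; the function name is mine):

    #include <immintrin.h>

    float hsum256_ps(__m256 v)
    {
        __m128 lo   = _mm256_castps256_ps128(v);
        __m128 hi   = _mm256_extractf128_ps(v, 1);
        lo          = _mm_add_ps(lo, hi);         /* narrow to 128b           */
        __m128 shuf = _mm_movehdup_ps(lo);        /* copy high->low in pairs  */
        __m128 sums = _mm_add_ps(lo, shuf);
        shuf        = _mm_movehl_ps(shuf, sums);  /* high 64b -> low          */
        sums        = _mm_add_ss(sums, shuf);
        return _mm_cvtss_f32(sums);
    }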

Intel: 6 uops total, 3 shuffles.  vs. 8 total, 5 shuffles

AMD Ryzen: 6 uops, 3 shuffles.  vs. 26 total uops, 20 of them shuffles.  And
much worse latency, too.

Even just fixing this specific bug without fixing the rest of the sequence
would help AMD *significantly*, because vextractf128 is very cheap, and vhaddps
xmm is only half the uops of ymm.  (But the latency still sucks).

-

Even for integer, this patch didn't fix the MOVDQA + PSRLDQ that we get without
AVX.  PSHUFD or PSHUFLW to copy+shuffle is cheaper.  I guess I need to report
that bug separately, because it probably won't get fixed soon: if I understand
correctly, there's no mechanism for the back-end to tell the auto-vectorizer
what shuffles it can do efficiently!

It usually won't make too much difference, but for very small arrays (like 8
`int` elements) the hsum is a big part of the cost, although it's probably
still worth auto-vectorizing *if* you can do it efficiently.

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

--- Comment #22 from Peter Cordes  ---
Forgot the Godbolt link with updated cmdline options:
https://godbolt.org/g/FCZAEj.

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

--- Comment #25 from Peter Cordes  ---
We're getting a spill/reload inside the loop with AVX512:

.L2:
        vmovdqa64   (%esp), %zmm3
        vpaddd      (%eax), %zmm3, %zmm2
        addl        $64, %eax
        vmovdqa64   %zmm2, (%esp)
        cmpl        %eax, %edx
        jne         .L2

Loop finishes with the accumulator in memory *and* in ZMM2.  The copy in ZMM2
is ignored, and we get

# narrow to 32 bytes using memory indexing instead of VEXTRACTI32X8 or VEXTRACTI64X4
        vmovdqa       32(%esp), %ymm5
        vpaddd        (%esp), %ymm5, %ymm0

# braindead: vextracti128 can write a new reg instead of destroying xmm0
        vmovdqa       %xmm0, %xmm1
        vextracti128  $1, %ymm0, %xmm0
        vpaddd        %xmm0, %xmm1, %xmm0

... then a sane 128b hsum as expected, so at least that part went
right.

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-16 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

--- Comment #28 from Peter Cordes  ---
(In reply to Richard Biener from comment #27)
> Note that this is deliberately left as-is because the target advertises
> (cheap) support for horizontal reduction.  The vectorizer simply generates
> a single statement for the reduction epilogue:
>  [...]
> so either the target shouldn't tell the vectorizer it supports this or
> it simply needs to expand to better code.  Which means - can you open
> a separate bug for this?

Yes; I was incorrectly assuming the inefficient asm had the same cause as
before.  I agree *this* is fixed, thanks for the explanation of how gcc was
arriving at this sequence.

I'll have a look at the backend canned sequence defs and see if there are any
other sub-optimal ones, or if it was only AVX.

Having canned sequences for different target instruction sets instead of
leaving it to arch-independent code seems like it should be an improvement over
the old design.

[Bug target/38959] Additional switches to disallow processor supplementary instructions

2019-02-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #3 from Peter Cordes  ---
We can maybe close this as fixed (if -march=i386 didn't exist/work at the time)
or invalid.  Or maybe we want to add some CPU-level awareness to code-gen for
__builtin_ia32_rdtsc / rdpmc / rdtscp.

The cmov / fcmov / fcomi proposed switches are already supported as part of
-march=pentium -mtune=generic or lower, e.g. -march=i386.  (The 32-bit default
is something like arch=i686 and tune=generic, with it being possible to
configure gcc so SSE2 is on by default in 32-bit code.)

Those are the important ones, because they're emitted automatically by the
compiler's back-end.  The other options would just be trying to save you from
yourself, e.g. rejecting source that contains __rdtsc() /
__builtin_ia32_rdtsc()



I'm not sure what the situation is with long NOPs.  GCC doesn't (normally?)
emit them, just using .p2align directives for the assembler.  In 32-bit mode,
GAS appears to avoid long NOPs, using either 2-byte xchg ax,ax or pseudo-nops
like   LEA esi,[esi+eiz*1+0x0] that add a cycle of latency to the dep chain
involving ESI.

Even with -march=haswell, gcc+gas fail to use more efficient long NOPs for
padding between functions.


---

I'm not sure if CPUID is ever emitted by gcc's back-end directly, only from
inline asm.  i386/cpuid.h uses inline asm.  But __get_cpuid_max() checks if
CPUID is even supported in a 386-compatible way, checking if a bit in EFLAGS is
sticky or not.  If your source code is written safely, you won't have a problem
unless possibly __builtin_cpu_init runs CPUID without checking, in programs
that use __builtin_cpu_supports() or _is().


__builtin_ia32_rdpmc() and __rdtsc() do *not* check -march= before emitting
rdpmc and rdtsc.  Neither does __rdtscp(), which is interesting because that
instruction is new enough that some still-relevant CPUs don't support it.

__rdpmc() isn't "volatile", though, so stop-start optimizes to 0.  (I found
this bug looking for existing reports of that issue.)
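
A sketch of that problem (counter number and function name are arbitrary):

    #include <x86intrin.h>

    unsigned long long measure(void)
    {
        unsigned long long start = __rdpmc(0);
        /* ... work to be measured ... */
        unsigned long long stop  = __rdpmc(0);
        return stop - start;   /* with __rdpmc treated as pure, this folds to 0 */
    }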



Test cases:  https://godbolt.org/z/hqPdza

FCMOV and CMOV are also handled correctly, but I didn't write functions for
them.

int fcomi(double x, double y) {
    return x<y;
}

> Proposed switches:
> 
> --nocpuid  This option causes the compiler to not generate cpuid opcodes
> --nocmov   This option causes the compiler to not generate cmov opcodes
> --nofcmov  This option causes the compiler to not generate fcmov opcodes
> --nofcomi  This option causes the compiler to not generate fcomi opcodes
> --nonopl   This option causes the compiler to not generate fcomi opcodes
> --nordpmc  This option causes the compiler to not generate rdpmc opcodes
> --nordtsc  This option causes the compiler to not generate rdtsc opcodes
> 
> Possibly a general switch that is equivalent to all of the above
> 
> --nosupplementaryinstructions
> 
> Rationale
> 
> It is possible that a developer still wants to compile for a particular
> architecture (for example the i486), but does not wish to generate code with
> supplementary instructions (such as cpuid), that may be present on that
> architecture.

[Bug target/38959] Additional switches to disallow processor supplementary instructions

2019-02-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959

--- Comment #4 from Peter Cordes  ---
The __builtin_ia32_rdpmc being a pure function bug I mentioned in my previous
comment is already reported and fixed (in gcc9 only): bug 87550

It was present since at least gcc 5.0
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214

[Bug target/80571] AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops

2019-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571

--- Comment #2 from Peter Cordes  ---
I think hjl's patch for PR 89071 / PR 87007 fixes (most of?) this, at least for
AVX.

If register pressure is an issue, using a reg holding a arbitrary constant
(instead of xor-zeroed) is a valid option, as this bug points out.  So I'm not
sure we should close this as a duplicate of those fixed bugs.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #22 from Peter Cordes  ---
Nice, that's exactly the kind of thing I suggested in bug 80571.  If this
covers 

* vsqrtss/sd  (mem), %merge_into, %xmm
* vpcmpeqd    %same, %same, %dest     # false dep on KNL / Silvermont
* vcmptrueps  %same, %same, %ymm      # splat -1 without AVX2.  false dep on all known uarches

as well as int->FP conversions, then we could probably close that as fixed by
this as well.

bug 80571 does suggest that we could look for any cold reg, like a non-zero
constant, instead of requiring an xor-zeroed vector, so it might go slightly
beyond what this patch does.

And looking for known-to-be-ready dead regs from earlier in the same dep chain
could certainly be useful for non-AVX code-gen, allowing us to copy-and-sqrt
without introducing a dependency on anything that's not already ready.

(In reply to h...@gcc.gnu.org from comment #21)
> Author: hjl
> Date: Fri Feb 22 15:54:08 2019
> New Revision: 269119

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-04-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #4 from Peter Cordes  ---
Yes, rep scasb is abysmal, and gcc -O3's 4-byte-at-a-time scalar loop is not
very good either.

With 16-byte alignment, (which we have from calloc on x86-64 System V), we can
inline a *much* better SSE2 loop.  See
https://stackoverflow.com/a/55589634/224132 for more details and
microbenchmarks; 

On Skylake it's about 4 to 5x faster than the current 4-byte loop for large
strings, 3x faster for short strings.  For short strings (strlen=33), it's
about 1.5x faster than calling strlen.  For very large strings (too big for L2
cache), it's ~1.7x slower than glibc's AVX2 strlen.

The lack of VEX encoding for pxor and pmovmskb is just me being lazy; let gcc
emit them all with VEX if AVX is enabled.

   # at this point gcc has `s` in RDX, `i` in ECX

pxor   %xmm0, %xmm0 # zeroed vector to compare against
.p2align 4
.Lstrlen16: # do {
#ifdef __AVX__
vpcmpeqb   (%rdx), %xmm0, %xmm1
#else
movdqa (%rdx), %xmm1
pcmpeqb%xmm0, %xmm1   # xmm1 = -1 where there was a 0 in memory
#endif

add $16, %rdx # ptr++
pmovmskb  %xmm1, %eax # extract high bit of each byte to a
16-bit mask
test   %eax, %eax
jz.Lstrlen16# }while(mask==0);
# RDX points at the 16-byte chunk *after* the one containing the terminator
# EAX = bit-mask of the 0 bytes, and is known to be non-zero
bsf%eax, %eax   # EAX = bit-index of the lowest set bit

# terminator is at rdx+rax - 16
#  movb   $'A', -16(%rdx, %rax)  // for a microbench that used
s[strlen(s)]='A'
sub%rbp, %rdx   # p -= start
lea   -16(%rdx, %rax), %rax   # p += byte_within_vector - 16

We should actually use  REP BSF  because that's faster on AMD (tzcnt), and same
speed on Intel.


Also an inline-asm implementation of it with a microbenchmark adapted from the
SO question.  (Compile with -DUSE_ASM -DREAD_ONLY to benchmark a fixed length
repeatedly)
https://godbolt.org/z/9tuVE5

It uses clock() for timing, which I didn't bother updating.  I made it possible
to run it for lots of iterations for consistent timing.  (And so the real work
portion dominates the runtime so we can use perf stat to measure it.)




If we only have 4-byte alignment, maybe check the first 4B, then do (p+4) & ~7
to either overlap that 4B again or not when we start 8B chunks.  But probably
it's good to get to 16-byte alignment and do whole SSE2 vectors, because
repeating an aligned 16-byte test that overlaps an 8-byte test costs the same
as doing another 8-byte test.  (Except on CPUs like Bobcat that split 128-bit
vectors into 64-bit halves).  The extra AND to round down to an alignment
boundary is all it takes, plus the code-size cost of peeling 1 iteration each
of 4B and 8B before a 16-byte loop.

We can use 4B / 8B with movd / movq instead of movdqa.  For pmovmskb, we can
ignore the compare-true results for the upper 8 bytes by testing the result
with `test %al,%al`, or in general with `test $0x0F, %al` to check only the low
4 bits of EAX for the 4-byte case.



The scalar bithack version can use BSF instead of CMOV binary search for the
byte with a set high bit.  That should be a win if we ever wanted to do scalar
on some x86 target especially with 8-byte registers, or on AArch64.  AArch64
can rbit / clz to emulate bsf and find the position of the first set bit.

(Without efficient SIMD compare result -> integer_mask, or efficient SIMD ->
integer at all on some ARM / AArch64 chips, SIMD compares for search loops
aren't always (ever?) a win.  IIRC, glibc strlen and memchr don't use vectors
on ARM / AArch64, just scalar bithacks.)
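
For reference, the scalar-bithack loop I'm talking about looks something like
this in C (my sketch, not from glibc; assumes little-endian, an 8-byte-aligned
start, and that over-reading within the aligned chunk holding the terminator is
safe, which is how the real implementations arrange things):

#include <stddef.h>
#include <stdint.h>

static size_t strlen_bithack(const char *s)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    const uint64_t *p = (const uint64_t *)s;       // assumes 8-byte alignment
    for (;;) {
        uint64_t v = *p;
        uint64_t zmask = (v - ones) & ~v & highs;  // high bit set in each byte that was 0
        if (zmask)  // lowest set bit = first zero byte (little-endian)
            return (const char *)p - s + (__builtin_ctzll(zmask) >> 3);
        p++;
    }
}

__builtin_ctzll compiles to BSF/TZCNT on x86 and rbit+clz on AArch64, which is
the point: no CMOV binary search to locate the byte.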

[Bug target/90568] New: stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

Bug ID: 90568
   Summary: stack protector should use cmp or sub, not xor, to
allow macro-fusion on x86
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

cmp/jne is always at least as efficient as xor/jne, and more efficient on CPUs
that support macro-fusion of compare and branch.  Most support cmp/jne fusion
(including all mainstream Intel and AMD, not low-power), but none support
xor/jne fusion.

void foo() {
volatile int buf[4];
buf[1] = 2;
}

gcc trunk on Godbolt, but same code-gen all the way back to gcc4.9

foo:
subq$40, %rsp
movq%fs:40, %rax
movq%rax, 24(%rsp)
xorl%eax, %eax
movl$2, 4(%rsp)
movq24(%rsp), %rax
xorq%fs:40, %rax  ## This insn should be CMP
jne .L5
addq$40, %rsp
ret
.L5:
call__stack_chk_fail

As far as I can tell, the actual XOR result value in RAX is not an input to
__stack_chk_fail because gcc sometimes uses a different register.

Therefore we don't need it, and can use any other way to check for equality.

If we need to avoid "leaking" the canary value in a register, we can use SUB,
otherwise CMP is even better and can macro-fuse on more CPUs.

Only Sandybridge-family can fuse SUB/JCC.  (And yes, it can fuse even with a
memory-source and a segment override prefix.  SUB %fs:40(%rsp), %rax / JNE  is
a single uop on Skylake; I checked this with perf counters in an asm loop.)

AMD can fuse any TEST or CMP/JCC, but only those instructions (so SUB is as bad
as XOR for AMD).  See Agner Fog's microarch PDF.



Linux test program (NASM) that runs  sub (mem), %reg with an FS prefix to prove
that it does macro-fuse and stays micro-fused as a single uop:


default rel
%use smartalign
alignmode p6, 64

global _start
_start:

cookie equ 12345
mov  eax, 158   ; __NR_arch_prctl
mov  edi, 0x1002; ARCH_SET_FS
lea  rsi, [buf]
syscall
   ;  wrfsbase   rsi; not enabled by the kernel
mov  qword [fs: 0x28], cookie

mov ebp, 10

align 64
.loop:
mov   eax, cookie
sub   rax, [fs: 0x28]
jne   _start
and   ecx, edx

dec ebp
jnz .loop
.end:

xor edi,edi
mov eax,231   ; __NR_exit_group
syscall   ; sys_exit_group(0)


section .bss
align 4096
buf:resb 4096



nasm -felf64  branch-fuse-mem.asm &&
ld -o branch-fuse-mem  branch-fuse-mem.o
to make a static executable

taskset -c 3 perf stat
-etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u
-r2 ./branch-fuse-mem

On my i7-6700k

 Performance counter stats for './branch-fuse-mem' (2 runs):

240.78 msec task-clock:u  #0.999 CPUs utilized 
  ( +-  0.23% )
 2  context-switches  #0.010 K/sec 
  ( +- 20.00% )
 0  cpu-migrations#0.000 K/sec  
 3  page-faults   #0.012 K/sec  
 1,000,764,258  cycles:u  #4.156 GHz   
  ( +-  0.00% )
 2,000,000,076  branches:u# 8306.384 M/sec 
  ( +-  0.00% )
 6,000,000,088  instructions:u#6.00  insn per cycle
  ( +-  0.00% )
 4,000,109,615  uops_issued.any:u # 16613.222 M/sec
  ( +-  0.00% )
 5,000,098,334  uops_executed.thread:u# 20766.367 M/sec
  ( +-  0.00% )

  0.240935 +- 0.000546 seconds time elapsed  ( +-  0.23% )

Note 1.0 billion cycles (1 per iteration), and 4B fused-domain uops_issued.any,
i.e. 4 uops per loop iteration.

(5 uops *executed* is because one of those front-end uops has a load
micro-fused).

Changing SUB to CMP has no effect.

With SUB changed to XOR, the loop takes 1.25 cycles per iteration, and the
front-end issues 5 uops per iteration.  Other counters are the same.

Skylake's pipeline is 4-wide, like all Intel since Core2, so an extra uop for
the front-end creates a bottleneck.

--

On Intel pre Haswell, the decoders will only make at most 1 fusion per decode
group, so you may need to make the loop larger to still get fusion.  Or use
this as the loop-branch, e.g. with a  1  in memory

   sub  rax, [fs: 0x28]
   jnz  .loop

or with a 0 in memory, sub or cmp or xor will all set flags according to the
register being non-zero.  But sub or xor will introduce an extra cycle of
latency on the critical path for the loop counter.

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

--- Comment #1 from Peter Cordes  ---
https://godbolt.org/z/hHCVTc

Forgot to mention, stack-protector also disables use of the red-zone for no
apparent reason, so that's another missed optimization.  (Perhaps rarely
relevant; probably most functions that get stack protection are big enough that
they need more stack, or non-leaf.  I sidestepped that with volatile.)

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

--- Comment #3 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #2)
> The xor there is intentional, for security reasons we do not want the stack
> canary to stay in the register afterwards, because then it could be later
> spilled or accessible to some exploit in another way.

Ok, so we can't use CMP, therefore we should use SUB, which as I showed does
help on Sandybridge-family vs. XOR.

x - x = 0   just like 
x ^ x = 0

Otherwise SUB wouldn't set ZF.

SUB is not worse than XOR on any other CPUs; there are no CPUs with better XOR
throughput than ADD/SUB.

In the canary mismatch case, leaving  attacker_value - key  in a register seems
no worse than leaving attacker_value ^ key in a register.  Either value
trivially reveals the canary value to an attacker that knows what they
overwrote the stack with, if it does somehow leak.  We jump to __stack_chk_fail
in that case, not relying on the return value on the stack, so a ROP attack
wouldn't be sufficient to leak that value anywhere.

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

--- Comment #5 from Peter Cordes  ---
And BTW, this only helps if the SUB and JNE are consecutive, which GCC
(correctly) doesn't currently optimize for with XOR.

If this sub/jne is different from a normal sub/branch and won't already get
optimized for macro-fusion, we may get even more benefit from this change by
teaching gcc to keep them adjacent.

GCC currently sometimes splits up the instructions like this:

xorq%fs:40, %rdx
movl%ebx, %eax
jne .L7

from gcc8.3 (but not 9.1 or trunk in this case) on https://godbolt.org/z/nNjQ8u


#include <random>
unsigned int get_random_seed() {
std::random_device rd;
return rd();
}

Even with -O3 -march=skylake.
That's not wrong because XOR can't macro-fuse, but the point of switching to
SUB is that it *can* macro-fuse into a single sub-and-branch uop on
Sandybridge-family.  So we might need to teach gcc about that.

So when you change this, please make it aware of optimizing for macro-fusion by
keeping the sub and jne back to back.  Preferably with tune=generic (because
Sandybridge-family is fairly widespread and it doesn't hurt on other CPUs), but
definitely with -mtune=intel or -mtune=sandybridge or later.

Nehalem and earlier can only macro-fuse test/cmp

The potential downside of putting it adjacent instead of 1 or 2 insns earlier
for uarches that can't macro-fuse SUB/JNE should be about zero on average. 
These branches should predict very well, and there are no in-order x86 CPUs
still being sold.  So it's mostly just going to be variations in fetch/decode
that help sometimes, hurt sometimes, like any code alignment change.

[Bug target/90582] New: AArch64 stack-protector wastes an instruction on address-generation

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90582

Bug ID: 90582
   Summary: AArch64 stack-protector wastes an instruction on
address-generation
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

void protect_me() {
volatile int buf[2];
buf[1] = 3;
}

https://godbolt.org/z/xdlr5w AArch64 gcc8.2 -O3 -fstack-protector-strong

protect_me:
stp x29, x30, [sp, -32]!
adrpx0, __stack_chk_guard
add x0, x0, :lo12:__stack_chk_guard ### this instruction
mov x29, sp # frame pointer even though
-fomit-frame-pointer is part of -O3.  Goes away with explicit
-fomit-frame-pointer

ldr x1, [x0]# copy the cookie
str x1, [sp, 24]
mov x1,0# and destroy the reg

mov w1, 3   # right before it's already
destroyed
str w1, [sp, 20] # buf[1] = 3

ldr x1, [sp, 24]# canary
ldr x0, [x0]# key destroys the key pointer
eor x0, x1, x0
cbnzx0, .L5
ldp x29, x30, [sp], 32  # FP and LR save/restore (for
some reason?)
ret
.L5:
  # can the store of the link register go here, for backtracing?
bl  __stack_chk_fail

A function that returns a global can embed the low 12 bits of the address into
the load instruction.  AArch64 instructions are fixed-width, so there's no
reason (AFAIK) not to do this.

f:
adrpx0, foo
ldr w0, [x0, #:lo12:foo]
ret

I'm not an AArch64 performance expert; it's plausible that zero displacements
are worth spending an extra instruction on for addresses that are used twice,
but unlikely.

So we should be doing 

adrpx0, __stack_chk_guard
ldr x1, [x0, #:lo12:__stack_chk_guard]  # in prologue to copy
cookie
... 
ldr x0, [x0, #:lo12:__stack_chk_guard]  # in epilogue to check
cookie

This also avoids leaving an exact pointer right to __stack_chk_guard in a
register, in case a vulnerable callee or code in the function body can be
tricked into dereferencing it and leaking the cookie.  (In non-leaf functions,
we generate the pointer in a call-preserved register like x19, so yes it will
be floating around in a register for callees).

I'd hate to suggest destroying the pointer when copying to the stack, because
that would require another adrp later.

Finding a gadget that has exactly the right offset (the low 12 bits of
__stack_chk_guard's address) is a lot less likely than finding an  ldr from
[x0].  Of course this will introduce a lot of LDR instructions with an
#:lo12:__stack_chk_guard offset, but hopefully they won't be part of useful
gadgets because they lead to writing the stack, or to EOR/CBNZ to
__stack_chk_fail



I don't see a way to optimize canary^key == 0 any further, unlike x86-64 PR
90568.  I assume EOR / CBNZ is as at least as efficient as SUBS / BNE on
all/most AArch64 microarchitectures, but someone should check.



-O3 includes -fomit-frame-pointer according to -fverbose-asm, but functions
protected with -fstack-protector-strong still get a frame pointer in x29
(costing a MOV x29, sp instruction, and save/restore with STP/LDP along with
x30.)

However, explicitly using -fomit-frame-pointer stops that from happening.  Is
that a separate bug, or am I missing something?



Without stack-protector, the function is vastly simpler

protect_me:
sub sp, sp, #16
mov w0, 3
str w0, [sp, 12]
add sp, sp, 16
ret

Does stack-protector really need to spill/reload x29/x30 (FP and LR)?  Bouncing
the return address through memory seems inefficient, even though branch
prediction does hide that latency.

Is that just so __stack_chk_fail can backtrace?  Can we move the store of the
link register into the __stack_chk_fail branch, off the fast path?

Or if we do unconditionally store x30 (the link register), at least don't
bother reloading it in a leaf function if register allocation didn't need to
clobber it.  Unlike x86-64, the return address can't be attacked with buffer
overflows if it stays safe in a register the whole function.

Obviously my test-case with a volatile array and no inputs at all is making
-fstack-protector-strong look dumb by protecting a perfectly safe function. 
IDK how common it is to have leaf functions with arrays or structs that just
use them for some computation on function args or globals and then return,
maybe after copying the array b

[Bug target/91103] New: AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-06 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

Bug ID: 91103
   Summary: AVX512 vector element extract uses more than 1 shuffle
instruction; VALIGND can grab any element
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

GCC9.1 and current trunk aren't good at extracting high elements, whether it's
with GNU C native vector syntax, or when auto-vectorizing something that ends
with the result in the high element.

Using VALIGND we can get any element with one immediate instruction, but its
better to use AVX2 VPERMPD(immediate) when possible.  Or inside loops,
VPERMPS(vector), or VPERMT2PS(vector).  Or of course vextractf32x4 if possible
(element at the bottom of a 128-bit lane).

Or with only AVX2 available, VPERMPD(immediate) for high elements in __m256 and
__m256d vectors is still a big win.

#include <immintrin.h>
float elem12(__m512 v) {  return v[12]; }
float elem15(__m512 v) {  return v[15]; }

gcc -Ofast -march=skylake-avx512
https://godbolt.org/z/241r8p

elem15:
vextractf32x8   ymm0, zmm0, 0x1
vextractf128xmm0, ymm0, 0x1# elem12 ends here, after these 2
insns
vshufps xmm0, xmm0, xmm0, 255
 # no vzeroupper I guess because the caller must have __m512 vars too,
recent optimization
ret

But AVX512F has vextractf32x4 to extract a 128-bit lane, which would preclude
the need for AVX2 vextractf128.  That's what clang does.

Obviously inside a loop it would be *much* better to use a single lane-crossing
VPERMPS to also avoid the shufps.  Intel Skylake easily bottlenecks on shuffle
throughput.  We'd need a 15 in an XMM register as a control vector, but loading
it would be off the latency critical path.  (If we needed the scalar
zero-extended instead of garbage in high elements, we could VPERMI2PS or
VPERMT2PS with a zeroed vector and a shuffle-control.)

---

If the element we want is an even element in the low 256 bits, we can get it
with a VPERMPD-immediate.  GCC does this:

elem6(float __vector(16)): # GCC 10 trunk
vextractf128xmm0, ymm0, 0x1
vunpckhps   xmm0, xmm0, xmm0
ret

Instead it should be AVX2   vpermpd ymm0, ymm0, 3
This bug also applies to __m256, not just __m512

https://www.felixcloutier.com/x86/vpermpd
VPERMPD is a 64-bit granularity lane-crossing shuffle.  The AVX512F immediate
version reuses the immediate for another 256-bit wide shuffle in the upper
half; only the vector-control version can bring an element from the top half of
a ZMM down to the bottom.  But if we're going to use a vector control, we might
as well use VPERMPS.

For the integer version of this bug, use VPERMQ

--

But we can do even better by using an integer VALIGND (AVX512F) shuffle on FP
data.  There unfortunately isn't an FP flavour of VALIGND, just integer.

AFAIK, Skylake-AVX512 still has no bypass-delay penalty for integer shuffles
between FP math instructions, i.e. the shift unit is connected to both FP and
integer forwarding networks.  Intel's optimization manual for Skylake (client)
has a bypass-latency table that shows 0 extra latency cycles for SHUF/5/1,3
reading from anything, or anything reading from it.

https://www.felixcloutier.com/x86/valignd:valignq  It's a 4 or 8-byte
granularity version of palignr, except that it's lane-crossing so the 256 and
512-bit versions are actually useful.  The immediate shift count can thus bring
*any* element down to the bottom.  (Using the same input twice makes it a
rotate).

VALIGND is good on Knight's Landing, too: unlike most 2-input shuffles, it has
1 per clock throughput.

For *any* compile-time-constant index, we can always compile v[i] to this:

extract15:
   valignd    zmm0, zmm0, zmm0, 15   # I think this is right.
   ret

The only downside I'm aware of is that some future AVX512 CPU might not run
VALIGND as efficiently as SKX and KNL.
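
For reference, the VALIGND version is already expressible with intrinsics
(a sketch, function name is mine):

#include <immintrin.h>

float extract15_valignd(__m512 v)
{
    __m512i vi  = _mm512_castps_si512(v);
    __m512i rot = _mm512_alignr_epi32(vi, vi, 15);      // valignd zmm0, zmm0, zmm0, 15
    return _mm512_cvtss_f32(_mm512_castsi512_ps(rot));  // element 15 is now element 0
}

The casts are free; the only real instruction is the VALIGND.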




For vector elements narrower than 32 bits, we may need 2 shuffles even if we
consider using a shuffle-control vector.  On Skylake-AVX512,  AVX512BW  vpermw 
will get the job done, but costs 2 shuffle uops.  On CannonLake (and presumably
other future Intel), it and  AVX512VBMI vpermb are only 1 uop, so it's
definitely worth creating a shuffle-control vector if it can be reused.


Also worth considering instead of 2 shuffles: *unaligned* spill / reload like
ICC does for GNU C native vector indexing.  Store-forwarding latency is only 6
or 7 cycles I think, and it avoids any port 5 pressure.  Not generally a good
choice IMO when we can get the job done in one shuffle, but worth considering
if we need multiple elements.  If the function doe

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-08 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #4 from Peter Cordes  ---
We should not put any stock in what ICC does for GNU C native vector indexing. 
I think it doesn't know how to optimize that because it *always* spills/reloads
even for `vec[0]` which could be a no-op.  And it's always a full-width spill
(ZMM), not just the low XMM/YMM part that contains the desired element.  I
mainly mentioned ICC in my initial post to suggest the store/reload strategy in
general as an *option*.

ICC also doesn't optimize intriniscs: it pretty much always faithfully
transliterates them to asm.  e.g. v = _mm_add_epi32(v, _mm_set1_epi32(1)); 
twice compiles to two separate paddd instructions, instead of one with a
constant of set1(2).

If we want to see ICC's strided-store strategy, we'd need to write some pure C
that auto-vectorizes.



That said, store/reload is certainly a valid option when we want all the
elements, and gets *more* attractive with wider vectors, where the one extra
store amortizes over more elements.

Strided stores will typically bottleneck on cache/memory bandwidth unless the
destination lines are already hot in L1d.  But if there's other work in the
loop, we care about OoO exec of that work with the stores, so uop throughput
could be a factor.


If we're tuning for Intel Haswell/Skylake with 1 per clock shuffles but 2 loads
+ 1 store per clock throughput (if we avoid indexed addressing modes for
stores), then it's very attractive and unlikely to be a bottleneck.

There's typically spare load execution-unit cycles in a loop that's also doing
stores + other work.  You need every other uop to be (or include) a load to
bottleneck on that at 4 uops per clock, unless you have indexed stores (which
can't run on the simple store-AGU on port 7 and need to run on port 2/3, taking
a cycle from a load).   Cache-split loads do get replayed to grab the 2nd half,
so it costs extra execution-unit pressure as well as extra cache-read cycles.

Intel says Ice will have 2 load + 2 store pipes, and a 2nd shuffle unit.  A
mixed strategy there might be interesting: extract the high 256 bits to memory
with vextractf32x8 and reload it, but shuffle the low 128/256 bits.  That
strategy might be good on earlier CPUs, too.  At least with movss + extractps
stores from the low XMM where we can do that directly.

AMD before Ryzen 2 has only 2 AGUs, so only 2 memory ops per clock, up to one
of which can be a store.  It's definitely worth considering extracting the high
128-bit half of a YMM and using movss then shuffles like vextractps: 2 uops on
Ryzen or AMD.


-

If the stride is small enough (so more than 1 element fits in a vector), we
should consider  shuffle + vmaskmovps  masked stores, or with AVX512 then
AVX512 masked stores.

But for larger strides, AVX512 scatter may get better in the future.  It's
currently (SKX) 43 uops for VSCATTERDPS or ...DD ZMM, so not very friendly to
surrounding code.  It sustains one per 17 clock throughput, slightly worse than
1 element stored per clock cycle.  Same throughput on KNL, but only 4 uops so
it can overlap much better with surrounding code.




For qword elements, we have efficient stores of the high or low half of an XMM.
 A MOVHPS store doesn't need a shuffle uop on most Intel CPUs.  So we only need
1 (YMM) or 3 (ZMM) shuffles to get each of the high 128-bit lanes down to an
XMM register.

Unfortunately on Ryzen, MOVHPS [mem], xmm costs a shuffle+store.  But Ryzen has
shuffle EUs on multiple ports.

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2018-07-31 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

--- Comment #3 from Peter Cordes  ---
I had another look at this with current trunk.  Code-gen is similar to before
with -march=skylake-avx512 -mprefer-vector-width=512.  (If we improve code-gen
for that choice, it will make it a win in more cases.)

https://godbolt.org/g/2dfkNV

Loads are folding into the shifts now, unlike with gcc7.3.  (But they can't
micro-fuse because of the indexed addressing mode.  A pointer increment might
save 1 front-end uop even in the non-unrolled loop)

The separate integer loop counter is gone, replaced with a compare against an
end-index.

But we're still doing 2x vpmovwb + vinserti64x4 instead of vpackuswb + vpermq. 
Fewer instructions and (more importantly) 1/3 the shuffle uops.  GCC knows how
to do this for the 256-bit version, so it's apparently a failure of the
cost-model that it doesn't for the 512-bit version.  (Maybe requiring a
shuffle-control vector instead of immediate puts it off?  Or maybe it's
counting the cost of the useless vpand instructions for the pack / permq
option, even though they're not part of the shuffle-throughput bottleneck?)
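
For reference, the pack + lane-fixup strategy for the 512-bit case looks
roughly like this with intrinsics (my sketch; assumes the 16-bit elements are
already in the 0..255 range, as they are after the right shift in this loop):

#include <immintrin.h>

// a: elements 0..31 as u16,  b: elements 32..63 as u16  ->  one vector of u8
static __m512i pack_u16_to_u8(__m512i a, __m512i b)
{
    __m512i packed = _mm512_packus_epi16(a, b);          // vpackuswb: packs within 128-bit lanes
    const __m512i lanefix = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);
    return _mm512_permutexvar_epi64(lanefix, packed);    // vpermq zmm: restore element order
}

The pack is in-lane, so the single lane-crossing vpermq afterwards is all
that's needed to put the qwords back in order.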



We do use vpackuswb + vpermq for 256-bit, but we have redundant AND
instructions with set1_epi16(0x00FF) after a right shift already leaves the
high byte zero.

---

Even if vmovdqu8 is not slower, it's larger than AVX vmovdqu.  GCC should be
using the VEX encoding of an instruction whenever it does exactly the same
thing.  At least we didn't use vpandd or vpandq EVEX instructions.

(I haven't found any confirmation about vmovdqu8 costing an extra ALU uop as a
store with no masking.  Hopefully it's efficient.)

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2018-07-31 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

--- Comment #4 from Peter Cordes  ---
The VPAND instructions in the 256-bit version are a missed-optimization.


[Bug libstdc++/71660] [6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2018-03-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

--- Comment #17 from Peter Cordes  ---
(In reply to Jonathan Wakely from comment #16)
> But what we do care about is comment 2, i.e. _Atomic(T) and std::atomic
> should have the same alignment (both in an out of structs). Maybe that needs
> the C front-end to change how _Atomic works, or maybe it needs the C++
> library to change how std::atomic works, but I want to keep this bug open
> while comment 2 gives different answers for C and C++.

Right, gcc's C _Atomic ABI is still broken for long long on 32-bit x86.  It
only aligned _Atomic long long to 32 bits (inside structs), but then assumes
that 8-byte loads / stores (with x87 or SSE1/2) are atomic.

It also leads to abysmal performance for  LOCK CMPXCHG  or other RMW operations
if the atomic object is split across a cache line.

That's bug 65146, so we can close this one.  (I never got around to posting in
the google group for the ABI.  By far the best solution is giving _Atomic
long long (and other 8-byte objects) a boost to their _Alignof, up to 8 byte
alignment even inside structs.)
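
A quick way to see the C-side problem (my test, not from the original report;
on i386 the offsetof here can legally come out as 4 under the current rules,
which is the split-cache-line hazard described above):

#include <stdio.h>
#include <stddef.h>
#include <stdatomic.h>

struct with_atomic { char c; _Atomic long long v; };

int main(void)
{
    printf("_Alignof(_Atomic long long) = %zu, offsetof(v) = %zu\n",
           _Alignof(_Atomic long long), offsetof(struct with_atomic, v));
    return 0;
}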

[Bug target/85038] New: x32: unnecessary address-size prefix when a pointer register is already zero-extended

2018-03-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038

Bug ID: 85038
   Summary: x32: unnecessary address-size prefix when a pointer
register is already zero-extended
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

Bug 82267 was fixed for RSP only.  (Or interpreted narrowly as only being about
RSP vs. ESP).

This bug is about the general case of using address-size prefixes in cases
where we could prove they're not needed.  Either because out-of-bounds is UB so
we don't care about wrap vs. going outside 4GiB, or (simpler) the
single-register case when we know the pointer is already zero-extended.  Maybe
we want separate bugs to track parts of this that can be fixed with separate
patches, but I won't consider this fixed until -mx32 emits optimal code for all
the cases listed here.

I realize this won't be any time soon, but it's still code-size (and thus
indirectly performance) that gcc is leaving on the table.  Being smarter about
using 64-bit address-size is even more useful for AArch64 -mabi=ilp32, because
it doesn't have 32-bit address-size overrrides, so it always costs an extra
instruction every time we fail to prove that 64-bit is safe.  (And AArch64
ILP32 may get more use than x32 these days).  I intended this bug to be about
x32, though.



Useless 0x67 address-size override prefixes hurt code-size and thus performance
on everything, with more serious problems on some CPUs that have trouble with
more than 3 prefixes (especially Silvermont).  See Bug 82267 for the details
which I won't repeat.


We still have tons of useless 0x67 prefixes in the default -maddress-mode=short
mode (for every memory operand other than RSP, or RIP-relative), and
-maddress-mode=long has lots of missed optimizations resulting in wasted LEA
instructions, so neither one is good.


float doublederef(float **p){
return **p;
}
 // https://godbolt.org/g/exb74t
 // gcc 8.0.1 (trunk) -O3 -mx32 -march=haswell -maddress-mode=short
movl(%edi), %eax
vmovss  (%eax), %xmm0# could/should be (%rax)
ret

-maddress-mode=long gets that right, using (%rax), and also (%rdi) because the
ABI doc specifies that x32 passes pointers zero-extended.  mode=short still
ensures that, so failure to take advantage is still a missed-opt.

Note that clang -mx32 violates that ABI guarantee by compiling
pass_arg(unsigned long long ptr) { ext_func((void*)ptr); } to just a tailcall
(while gcc does zero-extend).  See output in the godbolt link above.  IDK if we
care about being bug-compatible with clang for that corner case for this rare
ABI, though.  A less contrived case would be a struct arg or return value
packed into a register passed on as just a pointer.


-

// arr+offset*4 is strictly within the low 32 bits because of range limits

float safe_offset(float *arr, unsigned offset){
unsigned tmp = (unsigned)arr;
arr = (void*)(tmp & -4096);  // round down to a page
offset &= 0xf;
return arr[offset];
}
   // on the above godbolt link
#mode=short
andl$-4096, %edi
andl$15, %esi
vmovss  (%edi,%esi,4), %xmm0
# (%rdi,%rsi,4) would have been safe, but that's maybe not worth
looking for.
# most cases have less pointer alignment than offset range

#mode=long
andl$-4096, %edi
andl$15, %esi
leal(%rdi,%rsi,4), %eax
vmovss  (%eax), %xmm0 # 32-bit addrmode after using a separate
LEA

So mode=long is just braindead here.  It gets the worst of both worlds, using a
separate LEA but then not taking advantage of the zero-extended pointer.  The
only way this could be worse is the LEA operand-size was 64-bit.

Without the masking, both modes just use  vmovss (%edi,%esi,4), %xmm0, but the
extra operations defeat mode=long's attempts to recognize this case, and it
picks an LEA instead of (or as well as?!?) an address-size prefix.

---

With a 64-bit offset, and a pointer that's definitely zero-extended to 64 bits:

   // same for signed or unsigned
float ptr_and_offset_zext(float **p, unsigned long long offset){
float *arr = *p;
return arr[offset];
}

# mode=short
movl(%edi), %eax  # mode=long uses (%rdi) here
vmovss  (%eax,%esi,4), %xmm0  # but still 32-bit here.
ret

Why are we using address-size prefixes to stop a base+index from going outside
4G on out of bounds UB?  (%rax,%rsi,4) should work for a signed / unsigned
64-bit offset when the pointer is known to be zero-extended.

ISO C11 says that pointer+integer produces a result of pointer type, with UB if
the result goes 

[Bug target/85038] x32: unnecessary address-size prefix when a pointer register is already zero-extended

2018-03-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038

--- Comment #1 from Peter Cordes  ---
Correction for AArch64: it supports addressing modes with a 64-bit base
register + 32-bit index register with zero or sign extension for the 32-bit
index.  But not 32-bit base registers.

As a hack that's better than nothing, AArch64 could use a 32-bit pointer as the
index with a UXTW mode, using a zeroed register as the base (unless indexed
modes have any perf downside on real AArch64 chips).  But unfortunately, the
architectural zero register isn't usable as the base: that encoding means the
stack pointer for this instruction.  ldr w1,[xzr,w2,uxtw] doesn't assemble,
only x0-x30 or SP.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0801b/BABBGCAC.html


http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/LDR_reg_gen.html
describes LDR  Wt, [Xn|SP, Rm{, extend {amount}}]
where Rm can be an X or W register, and "extend" can be SXTW or UXTW for word
regs, or LSL for X regs.  (SXTX is a synonym for LSL).  Any of the modes can
use a left-shift amount, applied *after* extension to 64-bit.

See
https://community.arm.com/processors/b/blog/posts/a64-shift-and-extend-operations-operand-modifiers
for details on operand-modifiers.


gcc6.3 doesn't take advantage with -mabi=ilp32, and Godbolt doesn't have later
AArch64 gcc.

So gcc will need to know about zero-extended pointers, and the signedness of
32-bit values, to take advantage of AArch64's addressing modes for the common
case of a 32-bit index.  Teaching gcc to track signed/unsigned in RTL would
benefit x32 and AArch64 ILP32, if I understand the situation correctly.

[Bug target/69576] New: tailcall could use a conditional branch on x86, but doesn't

2016-01-31 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69576

Bug ID: 69576
   Summary: tailcall could use a conditional branch on x86, but
doesn't
   Product: gcc
   Version: 5.3.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: i386-*, x86_64-*

In x86, both jmp and jcc can use either a rel8 or rel32 displacement.  Unless
I'm misunderstanding something, the rel32 displacement in a jcc can be
relocated at link time identically to the way the rel32 in a jmp can be.


void ext(void);
void foo(int x) {
  if (x > 10) ext();
}

compiles to (gcc 5.3 -O3 -mtune=haswell)

cmpl$10, %edi
jg  .L4
ret
.L4:
jmp ext

Is this a missed optimization, or is there some reason gcc must avoid
conditional branches for tail-calls that makes this not a bug?  This sequence
is clearly better, if it's safe:

cmpl$10, %edi
jg  ext
ret


If targeting a CPU which statically predicts unknown forward branches as
not-taken, and you can statically predict the tail-call as strongly taken, then
it could make sense to use clang 3.7.1's sequence:

cmpl$11, %edi
jl  .LBB0_1
jmp ext # TAILCALL
.LBB0_1:
retq

According to Agner Fog's microarch guide, AMD CPUs use this static prediction
strategy, but Pentium M / Core2 assign a BTB entry and use whatever prediction
was in that entry already.  He doesn't specifically mention static prediction
for later Intel CPUs, but they're probably similar.   (So using clang's
sequence only helps on (some?) AMD CPUs, even if the call to ext() always
happens.)

AFAICT, gcc's sequence has no advantages in any case.

Note that the code for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69569
demonstrates this bug as well, but is a separate issue.  It's pure coincidence
that I noticed this the day after that bug was filed.

[Bug rtl-optimization/69615] New: 0 to limit signed range checks don't always use unsigned compare

2016-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615

Bug ID: 69615
   Summary: 0 to limit signed range checks don't always use
unsigned compare
   Product: gcc
   Version: 5.3.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

gcc sometimes misses the unsigned-compare trick for checking if a signed value
is between 0 and limit (where limit is known to be <= INT_MAX).

It seems that gcc fails when the upper limit is a variable, even if I shift or
mask it down to a small range.  clang handles this case, so I'm sure I
constructed my test case in a way that could be optimized.



All the code in this bug report is on godbolt, for ease of trying with older
versions of gcc (including for ARM64/ARM/PPC), and with clang / icc13. 
http://goo.gl/V7PFmv.   (I used -x c to compile as C, even though it only
provides c++ compilers).

This appears to be arch-independent (unless my quick skim of asm for ISAs I
barely know misled me...)



The simplest case is when the upper limit is a compile-time constant.  There's
one case where gcc and clang fail to optimize:  x<=(INT_MAX-1), or
equivalently, x < INT_MAX.

#include <limits.h>
#include <stdint.h>
extern void ext(void);

// clang and gcc both optimize range checks up to INT_MAX-2 to a single
unsigned compare
void r0_to_imax_2(int x){ if (x>=0 && x<=(INT_MAX-2)) ext(); }  // good code
void r0_to_imax_1(int x){ if (x>=0 && x<=(INT_MAX-1)) ext(); }  // bad code
void r0_to_imax  (int x){ if (x>=0 && x<=(INT_MAX-0)) ext(); }  // good code
(test/js.  not shown)

gcc 5.3.0 -Ofast -mtune=haswell  compiles this to:

r0_to_imax_2:
cmpl    $2147483645, %edi   # that's 0x7ffffffd
jbe .L4
ret
.L4:jmp ext

r0_to_imax_1:
movl%edi, %eax
subl$0, %eax   ## Without any -mtune, uses test edi,edi
js  .L5
cmpl    $2147483647, %edi   # that's 0x7fffffff
je  .L5
jmp ext
.L5:ret

ICC13 compiles this last one to cmp  edi, 0x7ffffffe / ja, so unless my mental
logic is wrong *and* icc13 is buggy, gcc and clang should still be able to  use
the same optimization as for smaller upper-limits.  They don't: both clang and
gcc use two compare-and-branches for r0_to_imax_1.

BTW, the movl %edi, %eax / subl $0, %eax sequence is used instead of the test
instruction with -mtune=haswell, and even worse with -march=bdver2 where it
even prevents fusion into a compare-and-branch m-op.  I'll file a separate bug
report for that if anyone wants me to.  Agner Fog's microarch guide doesn't
mention anything that would give that sequence an advantage over test, unless
I'm missing something.  It slows AMD down more than (recent) Intel, but that's
not what tuning for Haswell means. :P



Now, on to the case where the limit is variable, but can easily be proven to
itself be in the range [0 .. INT_MAX-1) or much smaller.  (If the limit can be
negative (or unsigned greater than INT_MAX) the optimization is impossible: 
INT_MIN and other negative numbers could be "below" the limit.)
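
For reference, the transformation being asked for, written out by hand (same
ext() as above; valid whenever the caller guarantees 0 <= limit <= INT_MAX-1):

void rangecheck_byhand(int x, int limit) {
    // x < 0 wraps to a huge unsigned value, so one unsigned compare covers both tests
    if ((unsigned)x <= (unsigned)limit)
        ext();
}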


// gcc always fails to optimize this to an unsigned compare, but clang succeeds
void rangecheck_var(int64_t x, int64_t lim2) {
  //lim2 >>= 60;
  lim2 &= 0xf;  // let the compiler figure out the limited range of limit
  if (x>=0 && x<lim2) ext();
}

[Bug tree-optimization/69615] 0 to limit signed range checks don't always use unsigned compare

2016-02-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615

--- Comment #3 from Peter Cordes  ---
@Richard and Jakub:

That's just addressing the first part of my report, the problem with  x <=
(INT_MAX-1), right?

You may have missed the second part of the problem, since I probably buried it
under too much detail with the first:

In the case where the limit is variable, but can easily be proven to itself be
in the range [0 .. INT_MAX-1) or much smaller:

// gcc always fails to optimize this to an unsigned compare, but clang succeeds
void rangecheck_var(int64_t x, int64_t lim2) {
  //lim2 >>= 60;
  lim2 &= 0xf;  // let the compiler figure out the limited range of limit
  if (x>=0 && x<lim2) ext();
}

[Bug target/69622] New: compiler reordering of non-temporal (write-combining) stores produces significant performance hit

2016-02-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69622

Bug ID: 69622
   Summary: compiler reordering of non-temporal (write-combining)
stores produces significant performance hit
   Product: gcc
   Version: 5.3.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: i386-linux-gnu, x86_64-linux-gnu

IDK whether to mark this as "target" or something else.  Other architectures
might have similar write-combining stores that are sensitive to writing whole
cache-lines at once.


For background, see this SO question:
http://stackoverflow.com/questions/25778302/wrong-gcc-generated-assembly-ordering-results-in-performance-hit

In an unrolled copy loop, gcc decides to emit  vmovntdq  stores in a different
order than they appear in the source.  There's no correctness issue, but the
amount of fill-buffers is very limited (maybe each core has 10 or so?).  So
it's *much* better to write all of one cacheline, then all of the next
cacheline.  See my answer on that SO question for lots of discussion and links.

The poster of that question got a 33% speedup (from ~10.2M packets per second
to ~13.3M packets per second by putting the loads and stores in source order in
the binary.  (Unknown hardware and surrounding code, but presumably this loop
is *the* bottleneck in his app).  Anyway, real numbers show that this isn't
just a theoretical argument that some code would be better.


Compilable test-case that demonstrates the issue:

#include <immintrin.h>
#include <stddef.h>

//#define compiler_writebarrier()   __asm__ __volatile__ ("")
#define compiler_writebarrier()   // empty.

void copy_mcve(void *const destination, const void *const source, const size_t
bytes)
{
  __m256i *dst  = destination;
const __m256i *src  = source;
const __m256i *dst_endp = (destination + bytes);

while (dst < dst_endp)  { 
  __m256i m0 = _mm256_load_si256( src + 0 );
  __m256i m1 = _mm256_load_si256( src + 1 );
  __m256i m2 = _mm256_load_si256( src + 2 );
  __m256i m3 = _mm256_load_si256( src + 3 );

  _mm256_stream_si256( dst+0, m0 );
  compiler_writebarrier();   // even one anywhere in the loop is enough for
current gcc
  _mm256_stream_si256( dst+1, m1 );
  compiler_writebarrier();
  _mm256_stream_si256( dst+2, m2 );
  compiler_writebarrier();
  _mm256_stream_si256( dst+3, m3 );
  compiler_writebarrier();

  src += 4;
  dst += 4;
}

}

compiles (with the barriers defined as a no-op) to (gcc 5.3.0 -O3
-march=haswell:  http://goo.gl/CwtpS7): 

copy_mcve:
addq%rdi, %rdx
cmpq%rdx, %rdi
jnb .L7
.L5:
vmovdqa 32(%rsi), %ymm2
subq$-128, %rdi
subq$-128, %rsi
vmovdqa -64(%rsi), %ymm1
vmovdqa -32(%rsi), %ymm0
vmovdqa -128(%rsi), %ymm3
 # If dst is aligned, the four halves of two cache lines are {A B} {C D}:
vmovntdq%ymm2, -96(%rdi) # B
vmovntdq%ymm1, -64(%rdi) # C
vmovntdq%ymm0, -32(%rdi) # D
vmovntdq%ymm3, -128(%rdi)# A
cmpq%rdi, %rdx
ja  .L5
vzeroupper
.L7:ret


If the output buffer is aligned, that B C D A store ordering maximally
separates the two halves of the first cache line, giving the most opportunity
for partially-full fill buffers to get flushed.


Doing the +32 load first makes no sense with that placement of the
pointer-increment instructions.  Doing the +0 load first could save a byte of
code-size by not needing a displacement byte.  I'm guessing that's what one
optimizer function was going for when it put the subs there, but then something
else came along and re-ordered the loads.

Is there something that tries to touch both cache-lines as early as possible,
to trigger the loads?  Assuming the buffer is 64B-aligned?

Doing the subs after the last store would save another insn byte, because one
of the stores could use an empty displacement as well.  That's where clang puts
the pointer increments (and it keeps the loads and stores in source order). 
clang also uses vmovaps / vmovntps.  It's probably a holdover from saving an
insn byte in the non-VEX encoding of the 128b insn, but does make the output
work with AVX1 instead of requiring AVX2.


Using a 2-register addressing mode for the loads could save a sub instruction
inside the loop.  Increment dst normally, but reference src with a 2-register
addressing mode with dst and a register initialized with src-dst.  (In the
godbolt link, uncomment the #define ADDRESSING_MODE_HACK.  With ugly enough
source, gcc can be bludgeoned into making code like that.  It wastes insns in
the intro, though, a
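
For reference, the 2-register addressing-mode idea written directly in C (my
sketch, not the godbolt ADDRESSING_MODE_HACK; same headers as the testcase
above.  Subtracting pointers into different buffers is technically UB, but
that's how this hack is normally written):

void copy_mcve_offset(void *const destination, const void *const source, const size_t bytes)
{
  __m256i *dst = destination;
  const ptrdiff_t src_off = (const char *)source - (const char *)destination;
  __m256i *const dst_endp = (void *)((char *)destination + bytes);

  while (dst < dst_endp) {
      // src = dst + constant offset: loads can use a 2-register addressing mode,
      // and only dst needs to be incremented
      const __m256i *src = (const __m256i *)((const char *)dst + src_off);
      __m256i m0 = _mm256_load_si256( src + 0 );
      __m256i m1 = _mm256_load_si256( src + 1 );
      __m256i m2 = _mm256_load_si256( src + 2 );
      __m256i m3 = _mm256_load_si256( src + 3 );

      _mm256_stream_si256( dst+0, m0 );
      _mm256_stream_si256( dst+1, m1 );
      _mm256_stream_si256( dst+2, m2 );
      _mm256_stream_si256( dst+3, m3 );

      dst += 4;
  }
}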

[Bug tree-optimization/68557] Missed x86 peephole optimization for multiplying by a bool

2016-02-03 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #2 from Peter Cordes  ---
Besides code-size, uop-cache size is a factor for Intel CPUs.  imul is only a
single uop, while neg/and is 2 uops.  Total number of instructions is a factor
for other CPUs, too, but only locally.  (Saving uop-cache space can mean
speedups for *other* code that doesn't get evicted).


If the operation isn't part of a long dependency chain, imul is a better choice
on almost all CPUs.  Let OOO execution sort it out.

When latency matters some, we have to weigh the tradeoff of code-size / more
insns and uops vs. slightly (or much) higher latency.
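
For reference, the kind of source this bug is about (my reconstruction of the
testcase):

int mul_by_bool(int x, _Bool b) {
    return x * b;   // imul: 1 uop, 3c latency  vs.  neg+and: 2 uops, 2c of dependent latency
}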

Agner Fog's instruction tables indicate that 32bit imul is probably ok for
tune=generic, but 64bit imul should maybe only be used with -mtune=intel (but
absolutely not with tune=atom.  Maybe not with tune=silvermont either, but it
does have modest OOO capabilities to hide the latency.  It's not as wide, so
saving insns maybe matters more?).  I'm not sure if tune=intel is supposed to
put much weight on pre-Silvermont Atom.

From Agner Fog's spreadsheet, updated 2016-Jan09:

                 uops/m-ops   latency   recip-throughput   execution pipe/port
Intel: SnB-family (Sandybridge through Skylake)
imul r32,r32:    1            3         1                  p1
imul r64,r64:    1            3         1                  p1

AMD: bdver1-3
imul r32,r32:    1            4         2                  EX1
imul r64,r64:    1            6         4                  EX1

Intel: Silvermont
imul r32,r32:    1            3         1                  IP0
imul r64,r64:    1            5         2                  IP0

AMD: bobcat/jaguar
imul r32,r32:    1            3         1                  I0
imul r64,r64:    1            6         4                  I0


old HW
Intel: Nehalem
imul r32,r32:    1            3         1                  p1
imul r64,r64:    1            3         1                  p0
Intel: Merom/Penryn (Core2)
imul r32,r32:    1            3         1                  p1
imul r64,r64:    1            5         2                  p0  (same as FP mul, maybe borrows its wider multiplier?)

Intel: Atom
imul r32,r32:    1            5         2                  Alu0,Mul
imul r64,r64:    6            13        11                 Alu0,Mul

AMD: K8/K10
imul r32,r32:    1            3         1                  ALU0
imul r64,r64:    1            4         2                  ALU0_1 (uses units 0 and 1)

VIA: Nano3000
imul r32,r32:    1            2         1                  I2
imul r64,r64:    1            5         2                  MA


If gcc keeps track of execution port pressure at all, it should also avoid imul
when surrounding code is multiply-heavy (or doing other stuff that also
contends for the same resources as imul).  I didn't check on neg/and, but I
assume every microarchitecture can run them on any port with one cycle latency
each.

getting off topic here:

tune=generic should account for popularity of CPUs, right?  So I hope it won't
sacrifice much speed for SnB-family in order to avoid something that's slow on
Pentium4, I hope.  (e.g. P4 doesn't like inc/dec, but all other CPUs rename the
carry flag separately to avoid the false dep.  Not a great example, because
that only saves a couple code bytes.  shrd isn't a good example, because it's
slow even on AMD Bulldozer.)

Is there a tune=no_glass_jaws that *will* give up speed (or code size) for
common CPUs in order to avoid things that are *really* bad on some rare
microarchitectures, (especially old ones)?  Or maybe a tune=desktop to doesn't
care what's slow on Atom/Jaguar?  People distributing binaries that probably
won't be used on Atom/Silvermont netbooks might use that.

Anyway, I think it would be neat to have the option of making a binary that
will be quite good on SnB, not have major problems on recent AMD, but I don't
care if it has the occasional slow instruction on Atom or K8.  Or alternatively
to have a binary that doesn't suck badly anywhere.

[Bug middle-end/51837] Use of result from 64*64->128 bit multiply via __uint128_t not optimized

2016-02-04 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51837

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
This bug isn't limited to __uint128.  There seems to be a general problem with
full-multiplies (at least on x86).  I see similar excess-mov instructions (or
saving/restoring a register that is not touched) with a 32*32->64b full
multiply in 32bit code.  gcc does a bad job with mulx as well.


Reminder: mulx has two read-only input operands: implicit rdx and explicit
source operand (first operand in AT&T syntax).  The next two operands are
write-only output operands: high half and low half, in that order in AT&T
syntax.


summary of things I noticed:

* sometimes mul r32 is the right choice, even when imul r64,r64 is available. 
It's faster on some CPUs, but even -mtune=atom (the most extreme case of slow
64bit imul) uses imul r64,r64.

* mulx r64,r64,r64 has one cycle more latency than mul r64 on Haswell.

* gcc doesn't notice that it should put C in edx to for mulx when it needs B*C
and A*C.

* gcc is terrible at choosing output registers for mulx

* It can matter which order multiplies appear in the source.  If one of the
multiplies has only half its results used, it can save instructions to do it
first, since only one of rdx and rax need to be saved.  compilers should
re-order for optimal code regardless of source order.

* is it normal for -m32 -Os to default to making stack frames?  clang -m32 -Os
doesn't.



test cases:  (godbolt link: http://goo.gl/2iL7f2)


// 32bit version of Steven's testcase
#include <stdint.h>
uint64_t foo64(unsigned x, unsigned y) {
uint64_t z = (uint64_t)x * y;
return z ^ (z >> 32);
}
gcc 5.3.0 -O3 -m32
pushl   %ebx
movl    12(%esp), %eax
mull    8(%esp)
popl%ebx   # save/restore of an otherwise-unused register. 
Regression from 4.9.2
xorl%edx, %eax

gcc 5.3.0 -O3 -m32 -mbmi2
pushl   %ebx
movl    12(%esp), %eax
movl    %eax, %edx   # oops we're using mulx, not mul?
mulx    8(%esp), %ecx, %ebx
movl%ecx, %eax   # this is sub-optimal even with that choice of
output reg for mulx
movl%ebx, %edx
xorl%ebx, %eax   # note that even with ideal sequences, mulx
didn't gain anything
popl%ebx

64b: gcc 5.3.0 -O3 -mbmi2   # 64bit mode: 32bit mulx would help significantly,
but it isn't used
movl%edi, %eax
movl%esi, %esi
imulq   %rax, %rsi
movq%rsi, %rax
shrq$32, %rax
xorq%rsi, %rax

hand-optimized 64bit: same as the 32bit case.
mov %edi, %eax
mul %esi# mulx isn't helpful here
xor %edx, %eax
Even when inlined somewhere that doesn't need the upper halves of input
regs zeroed, its 3 uops on SnB for mul r32 vs 3 for imul r,r + mov + shr, with
better or equal latency (depending on mov being 0 or 1c)

I guess this would require recognizing when we want 2 halves of a multiply, and
using the otherwise-slower single-operand form of mul.  Note that AMD BD-family
runs `mul r32` faster than `imul r64, r64`, and so does Atom (but not
Silvermont).

---

//Steven's function:
uint64_t foo128(uint64_t x, uint64_t y) {
__uint128_t z = (__uint128_t)x * y;
return z ^ (z >> 64);
}
   gcc 5.3.0 -O3: same as Steven reported for gcc 4.7

   gcc 5.3.0 -O3 -march=haswell
movq%rdi, %rdx # correct startup sequence
mulx%rsi, %r9, %r10# bad choice of output regs, like 32bit
movq%r10, %rax # correct sequence for handling the
badly-chosen mulx outputs
xorq%r9, %rax
   At 64bit operand size, mul has one cycle lower latency than mulx on Haswell,
so it's only a better choice when the choice of outputs helps, or the different
implicit input (rdx instead of rax).

Obviously we can avoid the mov and the REX prefixes by choosing different
output registers.  clang uses rcx and rax as output registers for mulx, which
is the obvious choice.  (or overwrite an input register).



// A slightly more complex function:

struct DE64 { uint64_t D,E; };

struct DE64 f64_structret(uint64_t A, uint64_t B, uint64_t C) {
  __uint128_t AC = A * (__uint128_t)C;  // gcc makes slightly better code with
BC first.  Order shouldn't matter
  __uint128_t BC = B * (__uint128_t)C;
  uint64_t D = AC >> 64; // high half
  uint64_t E = AC + (BC >> 64);
  struct DE64 retval = { D, E };
  return retval;
}

 # C is already in rdx, which is perfect for mulx.  In the 32bit case (below),
gcc doesn't realize it should put C into edx for easy reuse.

  g

[Bug c++/67461] Multiple atomic stores generate a StoreLoad barrier between each one, not just at the end

2016-02-04 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461

--- Comment #2 from Peter Cordes  ---
(In reply to Andrew Pinski from comment #1)
> Hmm, I think there needs to be a barrier between each store as each store
> needs to be observed by the other threads.

On x86, stores are already ordered wrt. other stores.  A full-barrier
(including a StoreLoad barrier) after the last store will prevent it from
passing (appearing after) any subsequent loads.

StoreStore, LoadLoad, and LoadStore barriers are implicit between every memory
operation.  (except non-temporal ones). 
http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/

I *think* that's enough for sequential consistency.  If *I'm* misunderstanding
this (which is possible), then please clue me in.
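
For reference, the pattern under discussion in C11 form (the original testcase
is C++ std::atomic, but the code-gen question is identical):

#include <stdatomic.h>

atomic_int x, y;

void two_stores(void)
{
    atomic_store(&x, 1);   // seq_cst; on x86 a plain mov store is already StoreStore-ordered
    atomic_store(&y, 2);   // seq_cst; one trailing mfence (StoreLoad) should cover both
}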

There's definitely a problem on ARM, though.  There's no way two consecutive
dmb sy   instructions are useful.

[Bug tree-optimization/69908] New: recognizing idioms that check for a buffer of all-zeros could make *much* better code

2016-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69908

Bug ID: 69908
   Summary: recognizing idioms that check for a buffer of
all-zeros could make *much* better code
   Product: gcc
   Version: 5.3.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ssemmx
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

Checking a block of memory to see if it's all-zero, or to find the first
non-zero, seems to be a not-uncommon problem.  It's really hard to get gcc to
emit code that's even half-way efficient.

The most recent stackoverflow question about this (with links to previous ones)
is
http://stackoverflow.com/questions/35450237/fastest-way-to-check-mass-data-if-null-in-c

Summary:

* gcc would benefit a lot from recognizing zero-checking idioms (with a
suggested asm loop for x86).
* one zero-checking function compiles to bad x86 code in multiple ways
* even a simple byte-at-a-time loop on a fixed-size buffer compiles to
byte-at-a-time asm.
* gcc is bad at horizontal reductions, esp with AVX2

I'm using x86 asm for examples of whether gcc auto-vectorizes or not, but this
is architecture-independent.

-


Ideally we'd like the main loop in these functions to test 64B at a time (a
whole cache-line on all modern x86 microarchitectures), something like:

... some intro stuff ...
pxor %xmm5, %xmm5
.Lvec_loop:
movdqa    (%rsi), %xmm0
por       16(%rsi), %xmm0
por       32(%rsi), %xmm0
por       48(%rsi), %xmm0
#ptest %xmm0, %xmm0  # SSE4.1
#jnz  .Lnonzero_found
pcmpeqb   %xmm5, %xmm0
pmovmskb  %xmm0, %eax
cmp       $0xffff, %eax  # check that all the bytes compared equal to zero
jne       .Lnonzero_found

add       $64, %rsi
cmp       pointer, end
jb        .Lvec_loop
# Intel: 9 fused-domain uops in the loop: 
# one too many to issue in 2 cycles and saturate 2 loads per cycle.

# epilogue for the final partial cache-line
# We can test some bytes again,
# e.g. using 16B unaligned loads that end at the correct place
movdqu  -16(end), %xmm0
movdqu  -32(end), %xmm1
por %xmm1, %xmm0
movdqu  -48(end), %xmm1
por %xmm1, %xmm0
movdqu  -64(end), %xmm1
por %xmm1, %xmm0
# ptest or pcmpeq / pmovmskb / cmp

We'd have the intro code handle inputs smaller than 64B, so the epilogue
couldn't access memory from before the start of the buffer.

pcmpeq / pmovmsk / cmp is better than pshufd / por / movq / test, esp for 32bit
where another round of horizontal reduction would be needed.

It might be better to use two accumulators to make better use of two load ports
from a single cache-line, but hopefully the loads will dispatch mostly in
program order, so there hopefully won't be many cache-bank conflicts on SnB/IvB
when multiple iterations are in flight at once.  The POR dep chain is not
loop-carried, and out-of-order execution should hide it nicely.


I have no idea how to write C (without intrinsics) that would auto-vectorize to
anything like that, or even to something acceptable.  It would be nice if there
was some kind of idiom that compilers could recognize and make good code for,
without needing custom code for every platform where we want non-terrible
output.
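
With intrinsics it's easy enough; e.g. a sketch of the inner loop above (my
code, not from the report; it assumes a 16B-aligned buffer whose size is a
multiple of 64):

#include <emmintrin.h>   // SSE2
#include <stddef.h>

int all_zero_64B(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i += 64) {
        __m128i v = _mm_load_si128((const __m128i*)(buf + i));
        v = _mm_or_si128(v, _mm_load_si128((const __m128i*)(buf + i + 16)));
        v = _mm_or_si128(v, _mm_load_si128((const __m128i*)(buf + i + 32)));
        v = _mm_or_si128(v, _mm_load_si128((const __m128i*)(buf + i + 48)));
        __m128i cmp = _mm_cmpeq_epi8(v, _mm_setzero_si128());
        if (_mm_movemask_epi8(cmp) != 0xFFFF)
            return 0;          // found a non-zero byte in this cache line
    }
    return 1;
}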


--

The most solid attempt on that SO question ORs together eight size_t elements
in the main loop, then uses a byte cleanup loop.  gcc makes a mess of it:
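
(The SO code isn't quoted in this report; a reconstruction of its general
shape, with names of my own and memcpy for the word loads to keep it
well-defined, looks like this:)

#include <stddef.h>
#include <string.h>

int buf_is_all_zero(const unsigned char *p, size_t len)
{
    size_t acc = 0, i = 0;
    /* main loop: OR together eight size_t-sized words per iteration */
    for (; i + 8 * sizeof(size_t) <= len; i += 8 * sizeof(size_t)) {
        size_t w[8];
        memcpy(w, p + i, sizeof(w));
        acc |= w[0] | w[1] | w[2] | w[3] | w[4] | w[5] | w[6] | w[7];
    }
    /* word-at-a-time cleanup */
    for (; i + sizeof(size_t) <= len; i += sizeof(size_t)) {
        size_t w;
        memcpy(&w, p + i, sizeof(w));
        acc |= w;
    }
    /* byte-at-a-time cleanup */
    unsigned char tail = 0;
    for (; i < len; i++)
        tail |= p[i];
    return (acc | tail) == 0;
}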

Summary of problems with gcc 5.3.0 -O3 -march=haswell for this function:

(I can report separate bugs for the separate problems; Other than recognizing
zero-checking idioms, most of these problems could probably be fixed
separately.)

* gcc doesn't realize that we're ultimately testing for all-zero, and just
treats OR as any other associative operation.

* even a simple byte-loop over a fixed-size buffer doesn't get optimized at all
(different function, see below)

* main loop not auto-vectorized
* word-at-a-time and byte-at-a-time cleanup loops generate full loops:
  gcc doesn't realize they're just cleanup that will only do less than one
vector of data.
* word-at-a-time cleanup loop gets a bloated fully-unrolled scalar intro (which
is all that will ever run)
* byte cleanup loop auto-vectorization unpacks vectors of bytes to longs before
ORing, with a big chain of vextracti128 / vmovzx.
* Without AVX2, gcc does a full-unroll of the unaligned-epilogue for the byte
cleanup autovectorization.


The bad auto-vectorized cleanup-loop code will never run, only their scalar
intros, because of the logic of the function.  Presumably gcc would generate
the nasty pmovzx byte-unpacking code in situations where it would actually run.

The byte cleanup loop has a byte-at-a-time scalar intro loop (not unrolled).

[Bug rtl-optimization/69933] New: non-ideal branch layout for an early-out return

2016-02-23 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69933

Bug ID: 69933
   Summary: non-ideal branch layout for an early-out return
   Product: gcc
   Version: 5.3.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

(just guessing about this being an RTL bug, please reassign if it's
target-specific or something else).

This simple linked-list traversal compiles to slightly bulkier code than it
needs to:
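
(The list types aren't shown in the report; definitions like these, consistent
with the member offsets in the asm below, are assumed:)

struct foo { struct foo *n; int a; };   /* ->n at offset 0, ->a at offset 8 on x86-64 */
struct foo_head { struct foo *h; };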

int traverse(struct foo_head *ph)
{
  int a = -1;
  struct foo *p, *pprev;
  pprev = p = ph->h;
  while (p != NULL) {
pprev = p;
p = p->n;
  }
  if (pprev)
a = pprev->a;
  return a;
}

 (gcc 5.3.0 -O3 on godbolt: http://goo.gl/r8vb5L)

        movq    (%rdi), %rdx
        movl    $-1, %eax       ; only needs to happen in the early-out case
        testq   %rdx, %rdx
        jne     .L3             ; jne/ret or je / fall through would be better
        jmp     .L9
.L5:
        movq    %rax, %rdx
.L3:
        movq    (%rdx), %rax
        testq   %rax, %rax
        jne     .L5
        movl    8(%rdx), %eax
        ret
.L9:
        ; ARM / PPC gcc 4.8.2 put the a=-1 down here
        ret                     ; this is a rep ret without -mtune=intel


Clang 3.7 chooses a better layout with a je .early_out instead the jne / jmp. 
It arranges the loop so it can enter at the top.  It actually look pretty
optimal:

        movq    (%rdi), %rcx
        movl    $-1, %eax
        testq   %rcx, %rcx
        je      .LBB0_3
.LBB0_1:                        # %.lr.ph
        movq    %rcx, %rax
        movq    (%rax), %rcx
        testq   %rcx, %rcx
        jne     .LBB0_1
        movl    8(%rax), %eax
.LBB0_3:                        # %._crit_edge.thread
        retq

Getting the mov $-1 out of the common case would require a separate mov/ret
block after the normal ret, so it's a code-size tradeoff which isn't worth it,
because a mov-immediate is dirt cheap.

Anyway, there are a couple different ways to lay out the branches and the mov
$-1, %eax, but gcc's choice is in no way optimal. :(

[Bug tree-optimization/69935] New: load not hoisted out of linked-list traversal loop

2016-02-23 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69935

Bug ID: 69935
   Summary: load not hoisted out of linked-list traversal loop
   Product: gcc
   Version: 5.3.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

(please check the component.  I guessed tree-optimization since it's
cross-architecture.)

gcc doesn't hoist the p->a load out of the loop in this linked-list function

int traverse_loadloop(struct foo_head *ph)
{
  int a = -1;
  struct foo *p = ph->h;
  while (p) {
a = p->a;
p = p->n;
  }
  return a;
}

I checked on godbolt with gcc 4.8 on ARM/PPC/ARM64, and gcc 4.5.3 for AVR.
For x86, gcc 5.3.0 -O3 on godbolt (http://goo.gl/r8vb5L) does this:

        movq    (%rdi), %rdx
        movl    $-1, %eax
        testq   %rdx, %rdx
        je      .L10
.L11:
        movl    8(%rdx), %eax   ; load p->a inside the loop, not hoisted
        movq    (%rdx), %rdx
        testq   %rdx, %rdx
        jne     .L11
.L10:
        rep ret

This is nice and compact, but less hyperthreading-friendly than it could be. 
(The mov reg,reg alternative doesn't even take an execution unit on recent
CPUs).

The load of p->a every time through the loop might also delay the p->n load by
a cycle on CPUs with only one load port, or when there's a cache-bank conflict.
 This might take the loop from one iteration per 4c to one per 5c (if L1
load-use latency is 4c).

Clang hoists the load out of the loop, producing identical asm output for this
function and one with the load hoisted in the C source.  (The godbolt link has
both versions.  Also see bug 69933 which I just reported, since gcc showed a
separate branch-layout issue for the source-level hoisting version.)

[Bug rtl-optimization/69943] New: expressions with multiple associative operators don't always create instruction-level parallelism

2016-02-24 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69943

Bug ID: 69943
   Summary: expressions with multiple associative operators don't
always create instruction-level parallelism
   Product: gcc
   Version: 5.3.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

separate problems (which maybe should be separate bugs, let me know):

* associativity not exploited for ILP in integer operations
* using a mov from memory instead of an add
* FP ILP from associativity generates two extra mov instructions


gcc 5.3.0 -O3 (http://goo.gl/IRdw05) has two problems compiling this:

int sumi(int a, int b,int c,int d,int e,int f,int g,int h) {
  return a+b+c+d+e+f+g+h;
}
        addl    %edi, %esi
        movl    8(%rsp), %eax   # when an arg comes from memory, it forgets to use lea as a 3-arg add
        addl    %esi, %edx
        addl    %edx, %ecx
        addl    %ecx, %r8d
        addl    %r8d, %r9d
        addl    %r9d, %eax
        addl    16(%rsp), %eax

The expression is evaluated mostly in order from left to right, not as
((a+b) + (c+d)) + ((e+f) + (g+h)).  This gives a latency of 8 clocks.  If
the inputs became ready at one-per-clock, this would be ideal (only one add
depends on the last input), but we shouldn't assume that when we can't see the
code that generated them.

The same lack of parallelism happens on ARM, ARM64, and PPC.
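
A source-level workaround (my own sketch, not from the report) is to write the
balanced grouping explicitly:

int sumi_ilp(int a, int b, int c, int d, int e, int f, int g, int h) {
  return ((a+b) + (c+d)) + ((e+f) + (g+h));
}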

-

The FP version of the same *does* take advantage of associativity for
parallelism with -ffast-math, but uses two redundant mov instructions:

float sumf(float a, float b,float c,float d,float e,float f,float g,float h) {
  return a+b+c+d+e+f+g+h;
}
addss   %xmm4, %xmm5  # e, D.1876
addss   %xmm6, %xmm7  # g, D.1876
addss   %xmm2, %xmm3  # c, D.1876
addss   %xmm0, %xmm1  # a, D.1876
addss   %xmm7, %xmm5  # D.1876, D.1876
movaps  %xmm5, %xmm2  # D.1876, D.1876
addss   %xmm3, %xmm2  # D.1876, D.1876
movaps  %xmm2, %xmm0  # D.1876, D.1876
addss   %xmm1, %xmm0  # D.1876, D.1876

clang avoids any unnecessary instructions, but has less FP ILP, and the same
lack of integer ILP.

Interestingly, clang lightly auto-vectorizes sumf when the expression is
parenthesised for ILP, but only *without* -ffast-math.  http://goo.gl/Pqjtu1.

As usual, IDK whether to mark this as RTL, tree-ssa, or middle-end.  The
integer ILP problem is not target specific.

[Bug tree-optimization/69943] expressions with multiple associative operators don't always create instruction-level parallelism

2016-02-24 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69943

--- Comment #3 from Peter Cordes  ---
(In reply to ktkachov from comment #2)
> On second thought, reassociating signed addition is not legal in general
> because we might introduce signed overflow where one didn't exist before.

 In an intermediate result, yes.  The final result can't change on 2's
complement hardware, though, so it's a legal optimization.  Good thinking that
the compiler might treat signed and unsigned integers differently, though.

You only need to avoid it on hardware where signed overflow has side-effects. 
(e.g. setting a "sticky" flag that can be checked after a sequence of
operations to see if there were any overflows along the way.)  And I think MIPS
has signed vs. unsigned add instructions, and can raise an exception?

Anyway, x86 doesn't have any of those things, and the calling convention lets
flags be in any arbitrary state when the function returns.  So this optimization
is valid for signed integers on x86.

BTW, using unsigned results in two LEA instructions, even though there's still
a MOV from memory.  ADD is shorter to encode, and can run on more execution
ports.  It also has higher latency on Atom.
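
The unsigned variant being referred to (a sketch; the same expression with
unsigned types):

unsigned sumu(unsigned a, unsigned b, unsigned c, unsigned d,
              unsigned e, unsigned f, unsigned g, unsigned h) {
  return a+b+c+d+e+f+g+h;
}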

[Bug target/69986] New: smaller code possible with -Os by using push/pop to spill/reload

2016-02-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69986

Bug ID: 69986
   Summary: smaller code possible with -Os by using push/pop to
spill/reload
   Product: gcc
   Version: 5.3.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: minor
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86-64-*-*

#include <unistd.h>
int f(int a) { close(a); return a; }

 push   rbx
 mov    ebx,edi
 call   400490
 mov    eax,ebx
 pop    rbx
 ret

with gcc 5.3 -Os.

It could be smaller:

 push   rdi
 call   400490
 pop    rax
 ret

saving 4 bytes (mov reg,reg is two bytes).

More generally, push/pop are 1 byte each, much smaller than mov [rsp-8], edi or
something.

This might not be a desirable optimization, though, because a round-trip
through memory increases latency.  It's one of those code-size optimizations
that might often have a negative impact on performance in the case where
the function is already hot in L1 I-cache.

It would be nice if there was a way to optimize a bit for code-size without
making bad performance sacrifices, and also another option to optimize for code
size without much regard for performance.  -Oss vs. -Os?  Or -OS?  I assume
tuning these options is a lot of work.

[Bug rtl-optimization/70408] New: reusing the same call-preserved register would give smaller code in some cases

2016-03-25 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408

Bug ID: 70408
   Summary: reusing the same call-preserved register would give
smaller code in some cases
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

int foo(int);  // not inlineable
int bar(int a) {
  return foo(a+2) + 5 * foo (a);
}

gcc (and clang and icc) all make bigger code than necessary for x86.  gcc uses
two call-preserved registers to save `a` and `foo(a+2)`.  Besides the extra
push/pop, stack alignment requires a sub/add esp,8 pair.

Combining data-movement with arithmetic wherever possible is also a win (using
lea), but gcc also misses out on that.

# gcc6 snapshot 20160221 on godbolt (with -O3): http://goo.gl/dN5OXD
        pushq   %rbp
        pushq   %rbx
        movl    %edi, %ebx
        leal    2(%rdi), %edi       # why lea instead of add rdi,2?
        subq    $8, %rsp
        call    foo                 # foo(a+2)
        movl    %ebx, %edi
        movl    %eax, %ebp
        call    foo                 # foo(a)
        addq    $8, %rsp
        leal    (%rax,%rax,4), %eax
        popq    %rbx
        addl    %ebp, %eax
        popq    %rbp
        ret

clang 3.8 makes essentially the same code (but wastes an extra mov because it
doesn't produce the result in %eax).

By hand, the best I can come up with is:

        push    %rbx
        lea     2(%rdi), %ebx       # stash ebx=a+2
        call    foo                 # foo(a)
        mov     %ebx, %edi
        lea     (%rax,%rax,4), %ebx # reuse ebx to stash 5*foo(a)
        call    foo                 # foo(a+2)
        add     %ebx, %eax
        pop     %rbx
        ret

Note that I do the calls to foo() in the other order, which allows more folding
of MOV into LEA.  The savings from that are somewhat orthogonal to the savings
from reusing the same call-preserved register.

Should I open a separate bug report for the failure to optimize by reordering
the calls?

I haven't tried to look closely at ARM or PPC code to see if they succeed at
combining data movement with math (prob. worth testing with `foo(a) * 4` since
x86's shift+add LEA is not widely available).  I didn't mark this as an
i386/x86-64 but, because the reuse of call-preserved registers affects all
architectures.


IDK if teaching gcc about either of these tricks would help with real code in
many cases, or how hard it would be.

[Bug c/70408] reusing the same call-preserved register would give smaller code in some cases

2016-03-25 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408

--- Comment #2 from Peter Cordes  ---
Should I open a separate bug for the reusing call-preserved regs thing, and
retitle this one to the call-reordering issue we ended up talking about here?

I always have a hard time limiting an optimization bug report to a single
issue, sorry.

(In reply to Andrew Pinski from comment #1)
> Note teaching this trick requires a huge amount of work as you need to teach
> GCC more about order of operands does not matter; this requires work in the
> front-end and then in the gimple level and then maybe the middle-end.  

Ok :(

> Is it worth it for the gain, most likely not, you are more likely just to
> get better code by not depending on unspecified behavior in C.

Writing the code this way intentionally leaves it up to the compiler to choose
the optimal order to evaluate foo(a+2) and foo(a).  I don't see why forcing the
compiler into one choice or the other should be considered "better" for
performance, just because gcc doesn't take advantage of its options.  (Better
for maintainability in case someone adds side-effects to foo(), sure).

I should have used  __attribute__((pure)) int foo(int);
to make it clear that the order of the function calls didn't matter.  That
would make reordering legal even the calls were separated by a sequence point,
wouldn't it?  (Of course, it sounds like gcc still wouldn't consider doing the
reordering).
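
Concretely, the declaration I had in mind (a sketch; same bar() as above):

__attribute__((pure)) int foo(int);   /* no side effects: call order doesn't matter */

int bar(int a) {
  return foo(a+2) + 5 * foo(a);
}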

> ># why lea instead of add rdi,2?
> 
> Because lea does not clobber the flags, so this might be faster, it depends
> on the machine.

Every OOO x86 CPU renames EFLAGS, because almost every instruction writes
flags.  There aren't any CPUs where instructions that don't write flags are
faster for that reason.  (Not writing flags is useful when it lets you reuse
some already-set flags for another check with a different condition, or stuff
like that, but that's not the case here).

On Intel Haswell for example, the LEA can run on port 1 or 5, but the add can
run on port 0,1,5,6.  Otherwise they're the same (latency, total uops, and
code-size).  Using `-mtune=haswell` doesn't get it to choose  add edi,2  :(

(From http://agner.org/optimize/ instruction tables, and Agner's microarch pdf)

LEA is special on Atom.  I don't remember exactly what its effect is on latency
in Atom's in-order pipeline, but LEA happens at a different pipeline stage from
normal ALU instructions (actually running on the AGUs).  IIRC, that's an
earlier stage, so inputs need to be ready sooner.

> Also try -Os you might see a difference code.

No change with -Os

[Bug c++/71245] New: std::atomic<double> load/store bounces the data to the stack using fild/fistp

2016-05-23 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245

Bug ID: 71245
   Summary: std::atomic<double> load/store bounces the data to the
stack using fild/fistp
   Product: gcc
   Version: 6.1.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: i386-linux-gnu

Same result with gcc 4.8, gcc5, and gcc6.1.  Didn't test exhaustively.

#include <atomic>

std::atomic<double> d(5.0);
void foo_d(void) {
  d =  d + 1.0;
  // d+=1.0; // unimplemented
}

with gcc6.1 -m32 -O3 -march=i586 (https://godbolt.org/g/w3VKpG) this compiles
to

foo_d():
        subl    $20, %esp
        fildq   d
        fistpq  (%esp)          # %sfp  # copy `d`'s bits to the stack (with no loss)
        fldl    (%esp)          # %sfp
        fadds   .LC0
        fstpl   (%esp)          # %sfp  # store d + 1.0 to the stack
        fildq   (%esp)          # %sfp
        fistpq  d               # atomic store using fild
        lock orl $0, (%esp)     # mfence equivalent
        addl    $20, %esp
        ret

I assume fild/fistp is gcc's trick for implementing atomic loads and stores
without resorting to cmpxchg8b.  Clever, since 80bit float can't munge the
data.  The fild/fistp pairs are of course not necessary in this case, where the
data is already 64bit float.  The function should just fld / fstp to  d 
directly.

With -march=i486 or lower, gcc correctly doesn't assume that 64bit FP
loads/stores are atomic, so it calls a library function to do the atomic load
and store.  With SSE or SSE2 available, it uses an SSE load/store to copy to
the stack.  

With -msse2 and -mfpmath=sse, we finally load/store directly from/to d, with
movq / addsd / movq.  movq vs. movsd shouldn't make a performance difference, I
think.



We don't need to allocate any stack space.  We could implement the StoreLoad
barrier with  lock or $0, -4(%esp) instead of reserving extra stack to avoid
doing it to our return address (which would introduce extra store-forwarding
delay before the ret could eventually retire).

[Bug target/71321] New: [6 regression] x86: worse code for uint8_t % 10 and / 10

2016-05-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71321

Bug ID: 71321
   Summary: [6 regression] x86: worse code for uint8_t % 10 and /
10
   Product: gcc
   Version: 6.1.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: i386-linux-gnu, x86_64-linux-gnu

If we have an integer (0..99), we can modulo and divide by 10 to get two
decimal digits, then convert to a pair of ASCII bytes with a newline by adding
`00\n`.   When replacing div and mod with a multiplicative inverse, gcc 6.1
uses more instructions than gcc 5.3, due to poor choices.

See also https://godbolt.org/g/vvS5J6

#include <stdint.h>
// assuming little-endian
__attribute__((always_inline)) 
unsigned cvt_to_2digit(uint8_t i, uint8_t base) {
  return ((i / base) | (uint32_t)(i % base)<<8);
}
  // movzbl %dil,%eax    # 5.3 and 6.1, with -O3 -march=haswell
  // div    %sil
  // movzwl %ax,%eax

// at -Os, gcc uses a useless  AND eax, 0xFFF, instead of a movzx eax,ax.  I
think to avoid partial-register stalls?
unsigned cvt_to_2digit_ascii(uint8_t i) {
  return cvt_to_2digit(i, 10) + 0x0a3030;// + "00\n" converts to ASCII
}

Compiling with -O3 -march=haswell
## gcc 5.3                       ## gcc 6.1
movzbl  %dil, %edx               movzbl  %dil, %eax
leal    (%rdx,%rdx,4), %ecx      leal    0(,%rax,4), %edx    # requires a 4B zero displacement
leal    (%rdx,%rcx,8), %edx      movl    %eax, %ecx          # lea should let us avoid mov
leal    (%rdx,%rdx,4), %edx      addl    %eax, %edx
                                 leal    (%rcx,%rdx,8), %edx
                                 leal    0(,%rdx,4), %eax    # requires a 4B zero displacement
                                 addl    %eax, %edx
shrw    $11, %dx                 shrw    $11, %dx
leal    (%rdx,%rdx,4), %eax      leal    0(,%rdx,4), %eax    # requires a 4B zero displacement.  gcc5.3 didn't use any of these
                                 addl    %edx, %eax
movzbl  %dl, %edx                movzbl  %dl, %edx           # same after this
addl    %eax, %eax               addl    %eax, %eax
subl    %eax, %edi               subl    %eax, %edi
movzbl  %dil, %eax               movzbl  %dil, %eax
sall    $8, %eax                 sall    $8, %eax
orl     %eax, %edx               orl     %eax, %edx
leal    667696(%rdx), %eax       leal    667696(%rdx), %eax

with -mtune=haswell, it's  prob. best to merge with   mov ah, dil  or
something, rather than movzx/shift/or.  Haswell has no penalty for
partial-registers, but still has partial-reg renaming to avoid false
dependencies: the best of both worlds.



BTW, with -Os, both gcc versions compile it to

        movb    $10, %dl
        movzbl  %dil, %eax
        divb    %dl
        andl    $4095, %eax     # partial reg stall.  gcc does this even with -march=core2 where it matters
        addl    $667696, %eax

The AND appears to be totally useless, because the upper bytes of eax are
already zero (from movzbl %dil, %eax before div).  I thought the movzbl %ax,
%eax  in the unknown-divisor version was to avoid partial-register slowdowns,
but maybe it's just based on the possible range of the result.
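
Side note on the strength-reduction gcc is doing here (a quick check I'm
adding, not from the original report): for the uint8_t range, i/10 ==
(i*205) >> 11, and the chain of LEAs builds i*205 as ((i*5)*8 + i) * 5 =
i*5*41 before the shrw $11:

#include <assert.h>

int main(void) {
    for (unsigned i = 0; i < 256; i++) {
        unsigned q = (i * 205u) >> 11;   // multiplicative inverse of dividing by 10
        assert(q == i / 10);
        unsigned r = i - 10 * q;         // remainder without a second division
        assert(r == i % 10);
    }
    return 0;
}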

Off-topic, but I noticed this while writing FizzBuzz in asm. 
http://stackoverflow.com/a/37494090/224132

[Bug target/71245] std::atomic<double> load/store bounces the data to the stack using fild/fistp

2016-05-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245

--- Comment #3 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #2)
> Recently x86 linux changed the barrier to what you propose. If it is worth,
> we can change it without any problems.

I guess it costs a code byte for a disp8 in the addressing mode, but it avoids
adding a lot of latency to a critical path involving a spill/reload to (%esp),
in functions where there is something at (%esp).

If it's an object larger than 4B, the lock orl could even cause a
store-forwarding stall when the object is reloaded.  (e.g. a double or a
vector).

Ideally we could do the  lock orl  on some padding between two locals, or on
something in memory that wasn't going to be loaded soon, to avoid touching more
stack memory (which might be in the next page down).  But we still want to do
it on a cache line that's hot, so going way up above our own stack frame isn't
good either.

> OTOH, we have "orl" here - should we
> change it to "addl" to be consistent with kernel?

That's the common idiom I've seen, but there's no reason I know of to favour
ADD instead of OR.  They both write all the flags, and both can run on any ALU
port on every microarchitecture.  Since gcc has been using OR already with I
assume nobody reporting perf problems, we should keep it.

A 32bit operand size is still a good choice.  (The obvious alternative being
8bit, but that doesn't save any code size.  From Agner Fog's insn tables, I
don't see any different entry for locked instructions with m8 vs. m32 operands,
but naturally-aligned 32bit loads/stores are probably the safest bet.)

[Bug rtl-optimization/59511] [4.9 Regression] FAIL: gcc.target/i386/pr36222-1.c scan-assembler-not movdqa with -mtune=corei7

2016-06-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59511

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #6 from Peter Cordes  ---
Created attachment 38629
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38629&action=edit
extra-movdqa-with-gcc5-not-4.9.cpp

[Bug rtl-optimization/59511] [4.9 Regression] FAIL: gcc.target/i386/pr36222-1.c scan-assembler-not movdqa with -mtune=corei7

2016-06-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59511

--- Comment #7 from Peter Cordes  ---
I'm seeing the same symptom, affecting gcc4.9 through 5.3.  Not present in 6.1.

IDK if the cause is the same.

(code from an improvement to the horizontal_add functions in Agner Fog's vector
class library)

#include <immintrin.h>
int hsum16_gccmovdqa (__m128i const a) {
    __m128i lo    = _mm_cvtepi16_epi32(a);     // sign-extended a0, a1, a2, a3
    __m128i hi    = _mm_unpackhi_epi64(a,a);   // gcc4.9 through 5.3 wastes a movdqa on this
    hi            = _mm_cvtepi16_epi32(hi);
    __m128i sum1  = _mm_add_epi32(lo,hi);      // add sign-extended upper / lower halves
    //return horizontal_add(sum1);  // manually inlined.
    // Shortening the code below can avoid the movdqa
    __m128i shuf  = _mm_shuffle_epi32(sum1, 0xEE);
    __m128i sum2  = _mm_add_epi32(shuf,sum1);  // 2 sums
    shuf          = _mm_shufflelo_epi16(sum2, 0xEE);
    __m128i sum4  = _mm_add_epi32(shuf,sum2);
    return  _mm_cvtsi128_si32(sum4);           // 32 bit sum
}

gcc4.9 through gcc5.3 output (-O3 -mtune=generic -msse4.1):

        movdqa      %xmm0, %xmm1
        pmovsxwd    %xmm0, %xmm2
        punpckhqdq  %xmm0, %xmm1
        pmovsxwd    %xmm1, %xmm0
        paddd       %xmm2, %xmm0
...

gcc6.1 output:

        pmovsxwd    %xmm0, %xmm1
        punpckhqdq  %xmm0, %xmm0
        pmovsxwd    %xmm0, %xmm0
        paddd       %xmm0, %xmm1
...



In a more complicated case, when inlining this code or not, there's actually a
difference between gcc 4.9 and 5.x: gcc5 has the extra movdqa in more cases. 
See my attachment, copied from https://godbolt.org/g/e8iQsj

[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

2017-08-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837

--- Comment #3 from Peter Cordes  ---
Seems to be fixed in gcc7.2.0: https://godbolt.org/g/jRwtZN

gcc7.2 is fine with -m32, -mx32, and -m64, but x32 is the most compact.  -m64
just calls __atomic_load_16


gcc7.2 -O3 -mx32 output:
follow_nounion(std::atomic*):
        movq    (%edi), %rax
        movl    %eax, %eax
        ret

vs.

gcc7.1 -O3 -mx32
follow_nounion(std::atomic*):
        movq    (%edi), %rcx
        xorl    %edx, %edx
        movzbl  %ch, %eax
        movb    %cl, %dl
        movq    %rcx, %rsi
        movb    %al, %dh
        andl    $16711680, %esi
        andl    $4278190080, %ecx
        movzwl  %dx, %eax
        orq     %rsi, %rax
        orq     %rcx, %rax
        ret

---


gcc7.2 -O3 -m64 just forwards its arg to __atomic_load_16 and then returns:

follow_nounion(std::atomic*):
        subq    $8, %rsp
        movl    $2, %esi
        call    __atomic_load_16
        addq    $8, %rsp
        ret

It unfortunately doesn't optimize the tail-call to

        movl    $2, %esi
        jmp     __atomic_load_16

presumably because it hasn't realized early enough that it takes zero
instructions to extract the 8-byte low half of the 16-byte __atomic_load_16
return value.

[Bug inline-asm/82001] New: [5/6/7/8 regression] wrong code when two functions differ only in inline asm register constraints

2017-08-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82001

Bug ID: 82001
   Summary: [5/6/7/8 regression] wrong code when two functions
differ only in inline asm register constraints
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: inline-asm
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

When a single compilation unit contains two functions that are identical other
than specific-register constraints in an asm statement, they are incorrectly
treated as exactly identical and the same code is emitted for both.

This happens at -O2 or higher, not -O1.

I was able to construct this test-case with two functions that are both
plausibly useful.  (Although this actually came up while discussing a beginner
SO question about inline asm. 
https://stackoverflow.com/questions/45910530/how-to-write-a-short-block-of-inline-gnu-extended-assembly-to-swap-the-values-of#comment78780443_45910796)

int mullo(int a, int b) {
asm("mul %%edx   # %%1 was %1" : "+a" (a), "+d" (b));
return a;
}

int mulhi(int a, int b) {
asm("mul %%edx   # %%1 was %1" : "+d" (a), "+a" (b));
return a;
}


gcc8.0.0-snapshot 20170827 -O3 (https://godbolt.org/g/CYjnGg) compiles them
both to

        movl    %edi, %eax      # a, a
        movl    %esi, %edx      # b, b
        mul     %edx            # %1 was %edx    # b
        ret

Any difference in the asm string, or in the clobber registers, makes them not
match.  Also, an "m" constraint is seen as different from an "r" or
specific-register constraint, but "imr" can "match" an "r"

In gcc6/7/8, both functions use the correct asm for the 1st function.  Swapping
the order changes the asm to the other one.

In gcc5.x, both functions use the correct asm for the 2nd function.

In gcc4.9.4, both functions are compiled correctly.
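
Based on the observation above that any difference in the template string
prevents the bogus merging, a workaround (my sketch, not a proper fix) is to
make the strings differ, e.g. with a distinguishing comment:

int mullo(int a, int b) {
    asm("mul %%edx   # lo: %%1 was %1" : "+a" (a), "+d" (b));
    return a;
}

int mulhi(int a, int b) {
    asm("mul %%edx   # hi: %%1 was %1" : "+d" (a), "+a" (b));
    return a;
}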

[Bug target/53687] _mm_cmpistri generates redundant movslq %ecx,%rcx on x86-64

2017-09-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53687

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
This behaviour is pretty understandable.  gcc doesn't know that the
return-value range is only 0-16, i.e. guaranteed non-negative integers.  Since
you used a signed int offset, makes sense that it *sign* extends from 32 to 64.

If you use  unsigned offset, the missed-optimization becomes more obvious. 
gcc7.2 still uses a  movl %ecx, %ecx  to zero-extend into rcx.

https://godbolt.org/g/wWvqpa

(Incidentally, same,same is the worst possible choice of registers for Intel
CPUs.  It means the mov can never be eliminated in the rename stage, and always
needs an execution port with non-zero latency.)

Even uintptr_t offset doesn't avoid it, because then the conversion from the
intrinsic to the variable results in sign-extension up to 64-bit.  It treats it
exactly like a function that returns int, which in the SysV ABI is allowed to
have garbage in the upper32.


(BTW, this use of flags from inline asm is not guaranteed to be safe.  Nothing
stops the optimizer from doing the pointer-increment after the `pcmpistri`,
which would clobber flags.  You could do `pcmpistri` inside the asm and produce
a uintptr_t output operand, except that doesn't work with goto.  So really you
should write the whole loop in inline asm)


Or better, don't use inline asm at all: gcc can CSE _mm_cmpistri with
_mm_cmpistra, so you can just use the intrinsic twice to get multiple operands,
and it will compile to a single instruction.  This is like using `/` and `%`
operators to get both results of a `div`.
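
For example, something like this sketch (my code, not from the question; it
assumes the haystack is padded so 16-byte loads past the terminator are safe)
gets both the index and the end-of-string flag out of a single pcmpistri:

#include <nmmintrin.h>   // SSE4.2, compile with -msse4.2

// Return the offset of the first byte of 'haystack' that is in 'set'
// (set = needle bytes, zero-padded to 16), or -1 if the terminator comes first.
int find_first_in_set(const char *haystack, __m128i set)
{
    enum { MODE = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT };
    for (int pos = 0; ; pos += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i*)(haystack + pos));
        int idx = _mm_cmpistri(set, chunk, MODE);  // index of first match, or 16
        int end = _mm_cmpistrz(set, chunk, MODE);  // non-zero if chunk has a '\0'
        if (idx != 16)
            return pos + idx;
        if (end)
            return -1;
    }
}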

[Bug target/65146] alignment of _Atomic structure member is not correct

2017-09-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #4 from Peter Cordes  ---
Created attachment 42125
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42125&action=edit
C11 / pthreads test case for tearing of atomic_llong.  compile with -m32
-pthread

This is a real bug.  See attached C11 testcase for atomic_llong tearing of
loads/stores in practice on real x86 hardware with gcc -m32.  (It also compiles
as C++11 to show the difference).

One thread writes 0 and -1 alternating, the other thread reads and checks that
the value is either 0 or -1.  (Or with `double`, 0 and -1.1, or with a
two-member struct, {-1,-1}).

Compile with gcc -m32 -march=native -O3 -pthread
atomic_llong-underaligned-in-struct.c  && ./a.out

offset of x in AtomicStruct = 60. addr=0x565a80bc.  lockfree() = 1
sizeof(test_type) = 8.  test_type = long long
alignof(AtomicStruct) = 4, alignof(atomic_ttype) = 4, alignof(test_type) = 4
found tearing: tmp = 0x00

If there's a problem, the whole program uses under a millisecond of CPU time,
probably most of it on startup and printing.  (e.g. perf counters for
machine_clears.memory_ordering = 72, so it didn't spend long in the read loop
with the writer going).  I get the object to cross a cache-line boundary if the
type is under-aligned by using struct AtomicStruct { char filler[57];
atomic_ttype x; }; and using alignas(64).
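
A minimal sketch of such a test (assumed structure; the real attachment has
more reporting), compiled with  gcc -m32 -O3 -pthread tearing.c :

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

typedef long long test_type;
typedef _Atomic long long atomic_ttype;

/* With gcc -m32 the atomic member only gets 4B alignment, so it sits at offset
   60 of a 64B-aligned object and spans a cache-line boundary. */
_Alignas(64) struct AtomicStruct { char filler[57]; atomic_ttype x; } s;

static void *writer(void *arg) {
    (void)arg;
    for (;;) {
        atomic_store_explicit(&s.x, 0, memory_order_relaxed);
        atomic_store_explicit(&s.x, -1LL, memory_order_relaxed);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    for (long i = 0; i < 100000000; i++) {
        test_type tmp = atomic_load_explicit(&s.x, memory_order_relaxed);
        if (tmp != 0 && tmp != -1LL) {
            printf("found tearing: tmp = %#llx\n", (unsigned long long)tmp);
            return 1;
        }
    }
    puts("no tearing observed");
    return 0;
}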


In the x86 32-bit System V ABI, gcc under-aligns an atomic_llong so it can
split across the boundary between two cache lines.  This makes loads and stores
non-atomic on every CPU (except uniprocessor of course).  Some AMD CPUs are
potentially non-atomic when 8B or 16B boundaries are crossed. 
(https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic).

Here are some ways to fix this for the i386 System V ABI.  (I think the Windows
32-bit ABI aligns long long and double to 64b, but an _Atomic struct still
potentially needs more than its default alignment to be efficiently lock-free).

1. Change the C stdatomic ABI to match the C++ std::atomic ABI, requiring
lock-free atomic objects to be naturally aligned.  (But don't change anything
for non-atomic objects.  The i386 SysV ABI only aligns 64-bit long long to 4B,
and we do need to preserve that.)
2. Use lock cmpxchg8b to implement .load() and .store() instead of SSE or x87
load or store instructions on all 8-byte objects, like for 16-byte objects in
x86-64 with cmpxchg16b.
3. Make 64-bit objects check alignment before using lock-free sequences,
otherwise use locking (or lock cmpxchg8b).  Checks can be optimized out in
cases where the compiler can prove that an object is 8B-aligned.  e.g. an
object with static storage since we get to align it.  Unless we're linking with
code compiled by an old gcc that under-aligned static 64b objects.
4. Make 64-bit objects never lock-free in the i386 SysV ABI.
5. Option 4 + define a new 32-bit ABI that doesn't suck.  (pass args in
registers, use SSE FP, etc., and align 64-bit types to 64-bit.)  Not realistic
because nobody cares enough about 32-bit code outside of Windows, so the small
benefit wouldn't justify the pain of having 2 incompatible ABIs.

Clang is already doing option 1, so gcc and clang are currently incompatible
for struct layout for structs with a C11 _Atomic member.

Option 1 is by far the best long-term option for performance and simplicity. 
(Not counting option 5).  

Option 2 will work, but is always horrible for performance with pure loads or
non-seq-cst stores, even with aligned objects.

lock cmpxchg8b is atomic even when it crosses a cache-line or page boundary
(like all x86 atomic RMW operations using the lock prefix), or whatever
boundary is atomic for regular loads/stores.  This is **catastrophic** for
performance, though, because instead of just internally locking a line of L1D
cache (by delaying responses to Invalidate its copy), the CPU has to make sure
the change to both cache lines propagates all the way to memory, I think.  (x86
locked instructions are atomic with respect to I/O and DMA observers, not just
other CPUs, so it can't just keep both cache lines locked).  On my Skylake
i7-6700k, it's literally a 132x slowdown for a single thread doing `lock add`
aligned vs. crossing a cache line boundary.

These penalties will happen by chance for more 8B objects on AMD hardware if
crossing a 16B or 32B boundary really is non-atomic for regular loads/stores,
instead of only having a penalty at 64B boundaries.

004000e0 <_start.loop>:
  4000e0:   f0 48 83 47 3f 01   lock add QWORD PTR [rdi+0x3f],0x1   ## rdi is page-aligned
  4000e6:   f0 48 83 47 7f 01   lock add QWORD PTR [rdi+0x7f],0x1

[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2017-09-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

--- Comment #7 from Peter Cordes  ---
C++11 std::atomic<> is correct, and the change was necessary.

8B alignment is required for 8B objects to be efficiently lock-free (using SSE
load / store for .load() and .store(), see
https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic),
and to avoid a factor of ~100 slowdown if lock cmpxchg8b is split across a
cache-line boundary.

What needs to change is the C11 stdatomic default alignment for 64-bit objects
in and out of structs.  (This includes _Atomic struct {int a,b;};)

Currently, atomic_llong **is not atomic** in gcc, only in g++.  I attached a
testcase showing tearing to the still-unfixed C11 bug
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4).  (It was filed at the
same time as the C++11 bug that led to the change in std::atomic.)


re: long double: it can't be lock-free in -m32.  10-byte x87 load / store
instructions are not guaranteed to be atomic, and in fact even on real Intel
CPUs are done as two separate load or store uops.

alignof(long double) in 32-bit is different from alignof(long double) in
64-bit.  std::atomic<long double> or _Atomic long double should always have the
same alignment as long double.

[Bug target/65146] alignment of _Atomic structure member is not correct

2017-09-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146

--- Comment #6 from Peter Cordes  ---
My test-case on godbolt: https://godbolt.org/g/MmLycw.  gcc8 snapshot still
only has 4B alignment

Fun fact: clang4.0 -m32 inlines lock cmpxchg8b for 8-byte atomic load/store. 
This is ironic, because it *does* align  _Atomic 64-bit objects to 8 bytes so
it could safely use SSE loads/stores.  It would work correctly if called from
gcc-compiled code that passed it a misaligned atomic_llong *.  But since gcc
and clang don't agree on i386 SysV struct layout for _Atomic 64-bit members, so
clang should really just start using movq for 64-bit atomic objects in 32-bit
mode.

[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2017-09-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

--- Comment #11 from Peter Cordes  ---
(In reply to Thiago Macieira from comment #10)
> Actually, PR 65146 points out that the problem is not efficiency but
> correctness. An under-aligned type could cross a cacheline boundary and thus
> fail to be atomic in the first place.

As I pointed out there, you technically could solve the correctness problem by
checking alignment and falling back to locking for objects where a plain 8B
load or 8B store wouldn't be atomic.  That's what I meant by "efficiently
lock-free".  And we're talking about *huge* inefficiencies here compared to
always being able to inline an SSE load/store.

That would let you keep struct layouts the same, but it would still be an ABI
change, since everything has to agree about which objects are lock-free and
which aren't.  Now that I think about it, all of my suggested fixes on PR 65146
are effectively ABI changes.

> Those structures were disasters waiting to happen.

Yes, exactly.

Basically any existing binaries compiled with a gcc that allows under-aligned
atomic objects are unsafe, so keeping compatibility with them is not important.

[Bug target/65146] alignment of _Atomic structure member is not correct

2017-09-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146

--- Comment #8 from Peter Cordes  ---
BTW, all of my proposals are really ABI changes, even if struct layout stays
the same.

All code has to agree on which objects are lock-free or not, and whether they
need to check alignment before using an SSE load instead of lock cmpxchg8b or
something.

It won't be safe to link with binaries compiled with old gcc that assumes a
simple SSE load/store is atomic on an 8B atomic_llong*, if the new design can
still pass it underaligned pointers.  The existing ABI is broken.

Some code may happen to not be affected, especially when running on Intel
hardware (where only 64B boundaries matter, not 8B boundaries for x86 in
general).  Or because they only depend on atomic RMW being atomic, not pure
load or pure store, so they just take the ~100x performance hit without losing
correctness in cases where a boundary is crossed.

[Bug tree-optimization/82135] New: Missed constant propagation through possible unsigned wraparound, with std::align() variable pointer, constant everything else.

2017-09-07 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82135

Bug ID: 82135
   Summary: Missed constant propagation through possible unsigned
wraparound, with std::align() variable pointer,
constant everything else.
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

The code in this report is easiest to look at here:
https://godbolt.org/g/DffP3J, with asm output.

When g++ inlines this (copied version of std::align from include/c++/memory),
it fails to optimize to just rounding the pointer up to the next multiple of 64 when
align=size=64 and space=1024, but ptr is variable.

(If __ptr is also constant, it's fine.)

#include <cstddef>
#include <cstdint>
inline void*
libalign(size_t __align, size_t __size, void*& __ptr, size_t& __space) noexcept
{
  const auto __intptr = reinterpret_cast<uintptr_t>(__ptr);
  const auto __aligned = (__intptr - 1u + __align) & -__align;
//if (__aligned < __size)   __builtin_unreachable();
  const auto __diff = __aligned - __intptr;
//if (__diff > __size)  __builtin_unreachable();
  if ((__size + __diff) > __space)
return (void*)123456; //nullptr;   // non-zero constant is obvious in the
asm
  else
{
  __space -= __diff;
      return __ptr = reinterpret_cast<void*>(__aligned);
}
}

void *libalign64(void *voidp) {
std::size_t len = 1024;
 //if (voidp+len < voidp) __builtin_unreachable();   // doesn't
help
voidp = 
  libalign(64, 64, voidp, len);
return voidp;
}

g++ -O3 -std=c++14  -Wall -Wextra  (trunk 8.0.0 20170906)

# x86-64.  Other targets do the same compare/cmov or branch
        leaq    63(%rdi), %rax
        andq    $-64, %rax
        movq    %rax, %rdx
        subq    %rdi, %rdx
        addq    $65, %rdx
        cmpq    $1025, %rdx
        movl    $123456, %edx
        cmovnb  %rdx, %rax
        ret


libalign64 gives exactly the same result as just rounding up to the next
multiple of 64 (including wrapping around to zero with addresses very close to the top).
But gcc doesn't spot this, I think getting confused about what can happen with
unsigned wraparound.

char *roundup2(char *p) {
auto t = (uintptr_t)p;
t = (t+63) & -64;
return (char*)t;
}

        leaq    63(%rdi), %rax
        andq    $-64, %rax
        ret

For easy testing, I made wrappers that call with a constant pointer, so I can
test that it really does wrap around at exactly the same place as roundup2(). 
(It does: libalign64(-64) = -64, libalign64(-63) = 0.)  So it can safely be
compiled to 2 instructions on targets where unsigned integer wraparound works
normally, without all that adding constants and comparing against constants.

static char* const test_constant = (char*)-63ULL;

char *test_roundup2() {
return roundup2(test_constant);
}
void *test_libalign() {
return libalign64(test_constant);
}


Uncommenting this line I added:
   if (__diff > __size)  __builtin_unreachable();

lets it compile to just two instructions, but that condition isn't really
always true.  __diff will be huge when __aligned wraps around.

clang, icc, and msvc also fail to make this optimization.  IDK if it's
particularly useful in real life for anything other than abusing std::align as
a simple round-up function.

[Bug target/80568] x86 -mavx256-split-unaligned-load (and store) is affecting AVX2 code, but probably shouldn't be.

2017-09-07 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568

Peter Cordes  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #3 from Peter Cordes  ---
Bug 78762 is asking for the same thing: disable at least load-splitting in
-mtune=generic when -mavx2 is enabled.

Or more generally, ISA-aware tune=generic.

*** This bug has been marked as a duplicate of bug 78762 ***
