https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

            Bug ID: 82459
           Summary: AVX512F instruction costs: vmovdqu8 stores may be an
                    extra uop, and vpmovwb is 2 uops on Skylake and not
                    always worth using
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

gcc bottlenecks on shuffle uops when auto-vectorizing this for skylake-avx512

* Perhaps the cost model is wrong for vpmovwb (it's 2 port5 uops), or gcc
doesn't consider any cheaper alternatives.  My version with 2x shift +
vpacksswb + vpermq has 3x the theoretical throughput (with hot caches).  In
general, AVX512BW lane-crossing shuffles of 8- or 16-bit elements are multi-uop
on SKX, but in-lane byte/word shuffles are single-uop, just like their AVX2
versions.

* Using vmovdqu8 as a store costs a port5 ALU uop even with no masking,
according to Intel (not tested).  We should always use AVX512F vmovdqu32 or 64
for unmasked loads/stores, not AVX512BW vmovdqu8 or 16.  Intel's docs indicate
that current hardware doesn't handle unmasked vmovdqu8/16 as efficiently as
32/64, and there's no downside.

* Using vinserti64x4 instead of 2 separate stores is worse because it makes the
shuffle bottleneck worse, and 2 stores wouldn't bottleneck on load/store
throughput.  (Avoiding vpmovwb makes this moot in this case, but presumably
whatever decided to shuffle + store instead of store + store will make that
mistake in other cases too.)

SKX shuts down port 1 (except for scalar integer) when there are 512b uops in
flight, so extra loads/stores are relatively cheaper than using more ALU uops,
compared to 256b or 128b vectors where the back-end can keep up even when 3 of
the 4 uops per clock are vector-ALU (if they go to different ports).

#include <stdint.h>
#include <stddef.h>
void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t *__restrict__ src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     *dst++ = *src++ >> 8;
  } while(dst < end_dst);
}

// https://godbolt.org/g/kXjEp1
gcc8 -O3 -march=skylake-avx512

.L5:  # inner loop
        vmovdqa64       (%rsi,%rax,2), %zmm0
        vmovdqa64       64(%rsi,%rax,2), %zmm1
        vpsrlw  $8, %zmm0, %zmm0             # memory operand not folded: bug 82370
        vpsrlw  $8, %zmm1, %zmm1
        vpmovwb %zmm0, %ymm0                 # 2 uops each
        vpmovwb %zmm1, %ymm1
        vinserti64x4    $0x1, %ymm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rcx,%rax)   # Intel says this is worse than vmovdqu64
        addq    $64, %rax
        cmpq    %rax, %rdi         # using an indexed addr mode, but still doing separate add/cmp
        jne     .L5

IACA says gcc's loop will run at one 64B store per 6 clocks, bottlenecked on 6
port5 uops (including the vmovdqu8; vmovdqu64 gives one store per 5 clocks,
still bottlenecked on port5).  Using 2 stores instead of vinserti64x4 gives us
one store per 4 clocks.  (Still twice as slow as with vpacksswb + vpermq, which
produces one 512b vector per 2 shuffle uops instead of one 256b vector per 2
shuffle uops.)

See
https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it
for more about Intel's static analysis tool.


Related: PR 82370 mentions vectorization strategies for this.

Fortunately gcc doesn't unroll the startup loop to reach an alignment boundary.
(And BTW, aligned pointers are more important with AVX512 than AVX2, in my
testing with manual vectorization of other code on Skylake-AVX512.)  Of course,
a potentially-overlapping unaligned first vector would be much better than a
scalar loop here.

----

Anyway, does gcc know that vpmovwb %zmm, %ymm is 2 uops for port 5, while
vpackuswb zmm,zmm,zmm (an in-lane 2-input shuffle) is 1 uop (for port 5)?  The
xmm-source version of vpmovwb is single-uop, because it's in-lane.

Source: Intel's IACA 2.3, not testing on real hardware.  SKX port-assignment
spreadsheet:
https://github.com/InstLatx64/InstLatx64/blob/master/AVX512_SKX_PortAssign_v102_PUB.ods
It's based on IACA output for uops, but throughputs and latencies are from real
hardware (AIDA64 InstLatx64), with a 2nd column for Intel's published
tput/latency (which, as usual, doesn't always match).  vpmovwb's real
throughput is one per 2 clocks, which is consistent with it being 2 uops for p5.

It makes some sense from a HW-design perspective that all lane-crossing
shuffles with element size smaller than 32-bit are multi-uop.  It's cool that
in-lane AVX512 vpshufb zmm and vpacksswb zmm are single-uop, but it means it's
often better to use more instructions to do the same work in fewer total
shuffle uops.  (Any loop that involves any shuffling can *easily* bottleneck on
shuffle throughput.)
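To make the pack + fixup concrete, here's a portable scalar model of what the vpackuswb zmm + vpermq pair computes.  The helper names are mine, purely for illustration: vpackuswb packs within each 128-bit lane independently, so a qword permute with control 0,2,4,6,1,3,5,7 is what restores element order afterwards.

```c
#include <stdint.h>
#include <string.h>

/* vpackuswb saturates *signed* input words to unsigned bytes. */
static uint8_t sat_u8(uint16_t w) {
    int16_t s = (int16_t)w;
    return s < 0 ? 0 : (s > 255 ? 255 : (uint8_t)s);
}

/* Model of vpackuswb zmm: two 32-word sources -> 64 bytes.  Each of the
 * four 128-bit lanes gets 8 bytes from the first source's lane, then
 * 8 bytes from the second source's lane.  */
static void model_packuswb_512(uint8_t out[64],
                               const uint16_t a[32], const uint16_t b[32]) {
    for (int lane = 0; lane < 4; lane++) {
        for (int i = 0; i < 8; i++) {
            out[lane * 16 + i]     = sat_u8(a[lane * 8 + i]);
            out[lane * 16 + 8 + i] = sat_u8(b[lane * 8 + i]);
        }
    }
}

/* Model of vpermq zmm with control {0,2,4,6,1,3,5,7}: result qword q
 * comes from source qword ctrl[q], gathering the first-source halves
 * ahead of the second-source halves.  */
static void model_permq_fixup(uint8_t out[64], const uint8_t in[64]) {
    static const int ctrl[8] = {0, 2, 4, 6, 1, 3, 5, 7};
    for (int q = 0; q < 8; q++)
        memcpy(out + q * 8, in + ctrl[q] * 8, 8);
}
```

With inputs that fit in a byte (as they do here after the >>8), pack + fixup yields all of the first source's bytes followed by all of the second source's.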

Related: AVX512 merge-masking can turn some instructions into a 2-input
shuffle.  But vpsrlw  $8, 64(%rsi), %zmm0{k1}  doesn't work to get all our data
into one vector, because it masks at word granularity, not byte.  And vmovdqu8
with masking needs an ALU uop (on SKX, according to IACA).

-------------

Here's a hand-crafted efficient version of the inner loop.  It doesn't use any
weird tricks (I haven't thought of any that are actually a win on SKX), so it
should be possible to get gcc to emit something like this.

.Lloop:
    vpsrlw  $8, 0(%rsi), %zmm0
    vpsrlw  $8, 64(%rsi), %zmm1
    vpackuswb %zmm1, %zmm0, %zmm0            # 1 uop for a 2-input shuffle
    vpermq   %zmm0, %zmm7, %zmm0             # lane-crossing fixup for vpackuswb (%zmm7 = qword control)
    vmovdqu64 %zmm0, (%rdi, %rdx)

    add   $(2*64), %rsi
    add   $64, %rdx          # counts up towards zero
    jnc .Lloop

Note that the folded loads use non-indexed addressing modes so they can stay
micro-fused.
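For reference, the same inner loop could be expressed with intrinsics, which is roughly what we'd want gcc's vectorizer to produce.  This is a sketch under some assumptions: the function name and target attribute are mine, size is a multiple of 64, and the caller has verified AVX512BW support at runtime.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

__attribute__((target("avx512f,avx512bw")))
void pack_high8_avx512bw_c(uint8_t *__restrict__ dst,
                           const uint16_t *__restrict__ src, size_t bytes) {
    /* vpermq control to undo vpackuswb's per-128-bit-lane interleave */
    const __m512i fixup = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);
    uint8_t *end_dst = dst + bytes;
    do {
        __m512i hi0 = _mm512_srli_epi16(_mm512_loadu_si512(src), 8);      // vpsrlw
        __m512i hi1 = _mm512_srli_epi16(_mm512_loadu_si512(src + 32), 8);
        __m512i packed = _mm512_packus_epi16(hi0, hi1);   // vpackuswb: 1 shuffle uop
        packed = _mm512_permutexvar_epi64(fixup, packed); // vpermq: 1 shuffle uop
        _mm512_storeu_si512(dst, packed);
        src += 64;   /* 2x 64B of words in */
        dst += 64;   /* 1x 64B of bytes out */
    } while (dst < end_dst);
}
```

The unsigned saturation in vpackuswb is safe here because the >>8 guarantees every word is in 0..255.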

The store will stay micro-fused even with an indexed addressing mode, so we can
count an index up towards zero (and index from the end of the array) saving a
CMP instruction in the loop.  (add/jnc will macro-fuse on SKX).  A second
pointer-increment + cmp/jcc would be ok.
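In C terms, the count-up-towards-zero trick looks like the following scalar sketch (hypothetical function name).  The point is the negative index from the end of the array: when it reaches zero, the flags from the index increment itself can feed the loop branch, so no separate cmp is needed.

```c
#include <stdint.h>
#include <stddef.h>

void pack_high8_negidx(uint8_t *__restrict__ dst,
                       const uint16_t *__restrict__ src, size_t bytes) {
    dst += bytes;                       /* point one past the end */
    ptrdiff_t idx = -(ptrdiff_t)bytes;  /* negative index counts up to 0 */
    do {
        dst[idx] = (uint8_t)(*src++ >> 8);
    } while (++idx != 0);               /* compiles to inc/jnz (or add/jnc) */
}
```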

IACA thinks that indexed stores don't stay micro-fused, but that's only true
for SnB/IvB.  Testing on Haswell/Skylake (desktop) shows they do stay fused,
and IACA is wrong about that.  I think it's a good guess that AVX512 indexed
stores will not un-laminate either.

IACA analysis says it will run at 1x 64B store per 2 clocks.  If the store
stays micro-fused, it's really only 7 fused-domain uops per iteration (3.5 per
clock), so we have front-end bandwidth to spare and the only bottleneck is the
ALU ports (p0 and p5 saturated with shifts and shuffles respectively).

Because of the ALU bottleneck, I didn't need the store to be able to run on p7.
p2/p3 only need to handle 1.5 uops per 2 clocks.  Using a non-indexed store
addressing mode would let it use p7, but that increases loop overhead slightly.


$ iaca -arch SKX testloop.o

Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Analyzed File - testloop.o
Binary Format - 64Bit
Architecture  - SKX
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles       Throughput Bottleneck: FrontEnd

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 2.0    0.0  | 1.0  | 1.5    1.0  | 1.5    1.0  | 1.0  | 2.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV -
Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   2^   | 1.0       |     | 1.0   1.0 |           |     |     |     |     | CP | vpsrlw zmm0, zmmword ptr [rsi], 0x8
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vpsrlw zmm1, zmmword ptr [rsi+0x40], 0x8
|   1    |           |     |           |           |     | 1.0 |     |     | CP | vpackuswb zmm0, zmm0, zmm1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | vpermq zmm0, zmm7, zmm0
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     |    | vmovdqu64 zmmword ptr [rdi+rdx*1], zmm0
|   1    |           | 1.0 |           |           |     |     |     |     |    | add rsi, 0x80
|   1    |           |     |           |           |     |     | 1.0 |     |    | add rdx, 0x40
|   0F   |           |     |           |           |     |     |     |     |    | jnb 0xffffffffffffffd3
Total Num Of Uops: 10

(The count is unfused-domain, which is pretty dumb to total up.)

Source for this:

#if 1
#define IACA_start  mov $111, %ebx; .byte 0x64, 0x67, 0x90
#define IACA_end    mov $222, %ebx; .byte 0x64, 0x67, 0x90
#else
#define IACA_start
#define IACA_end
#endif

.global pack_high8_avx512bw
pack_high8_avx512bw:   # (dst, src, size)

#define unroll 1   // unroll factor

.altmacro
.macro packblock count
    .if \count
        packblock %(count-1)
    .endif
#    vmovdqu8 \count*2*64 + 1(%rsi),  %ymm0 {%k1}{z}   # IACA says: x/y/zmm load uses an ALU port with masking, otherwise not.
#    vmovdqu8 \count*2*64 + 65(%rsi), %ymm1 {%k1}{z}
    vpsrlw  $8, \count*2*64 + 0(%rsi), %zmm0
    vpsrlw  $8, \count*2*64 + 64(%rsi), %zmm1
    vpackuswb %zmm1, %zmm0, %zmm0            # 1 uop for a 2-input shuffle
    vpermq   %zmm0, %zmm7, %zmm0             # lane-crossing fixup for vpackuswb (%zmm7 = qword control)
    vmovdqu64 %zmm0, 64*\count(%rdi, %rdx)
.endm

    # set up %zmm7
    #movabs $0x5555555555555555, %rax
    #kmovq   %rax, %k1              # for offset vmovdqu8 instead of shift

    # mov   $1024, %rdx

    add   %rdx, %rdi
    neg   %rdx         # then use (%rdi, %rdx) for store addresses?  No port 7, but should micro-fuse.
    IACA_start
.Lloop:
    packblock (unroll-1)
    add   $(unroll*2*64), %rsi
          # add   $(unroll*1*64), %rdi
    add   $(unroll*1*64), %rdx
    jnc .Lloop
IACA_end

    # not shown: unaligned and/or cleanup loop
    ret
