[Bug inline-asm/102264] New: Macro Intrinsics fail to use all the registers on the machine

2021-09-09 Thread ntukanov at cmu dot edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102264

Bug ID: 102264
Summary: Macro Intrinsics fail to use all the registers on the machine
Product: gcc
Version: 9.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: inline-asm
Assignee: unassigned at gcc dot gnu.org
Reporter: ntukanov at cmu dot edu
Target Milestone: ---

I am trying to use custom intrinsics in order to have more control over the
assembly that the compiler is generating. The concept of these custom
intrinsics comes from http://users.ece.cmu.edu/~franzf/papers/wpmvp16.pdf.

For performance reasons, my code needs to use all the available SIMD
registers on the machine, but when I use my custom intrinsics, the compiler
allocates only half of the SIMD registers, which leads to register spilling.

This is the code and generated assembly in question:
https://godbolt.org/z/fqn53G9qT

Any help would be greatly appreciated.

[Bug inline-asm/102264] Macro Intrinsics fail to use all the registers on the machine

2021-09-09 Thread ntukanov at cmu dot edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102264

--- Comment #2 from Nicholai Tukanov ---
(In reply to Andrew Pinski from comment #1)
> There seems to be some extra moves the register allocator cannot remove and
> that is causing some extra spilling.
>
> Your loop has 32 live variables and that is just at the limit.

Can the register allocator be modified to recognize the other registers? The
problem seems limited to the compute instruction (vpdpwssd in this case).

I specifically chose 32 live variables to max out the registers. Since the
compute instruction is limited to half of them (zmm0-zmm15), the extra moves
are killing performance.
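
Note: one possible cause of the zmm0-zmm15 limit is the operand constraint on
the inline asm. In GCC's x86 machine constraints, "x" means any SSE register
and in practice covers only the first 16, while "v" means any EVEX-encodable
register and covers all 32. The macros in the godbolt link are not reproduced
in this thread, so whether they use "x" is an assumption. The sketch below is
not the reporter's code: the macro name VNNI_DPWSSD and the helper functions
are hypothetical, and it only shows how a macro intrinsic for vpdpwssd might
be written with "v". It falls back to a scalar loop on CPUs without AVX512-VNNI.

```c
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_VNNI_PATH 1

/* Hypothetical macro intrinsic (illustrative name, not from the report).
   The "v" constraint lets the allocator choose any of zmm0-zmm31;
   with "x" the choice would be restricted to zmm0-zmm15. */
#define VNNI_DPWSSD(acc, a, b) \
    __asm__("vpdpwssd %2, %1, %0" : "+v"(acc) : "v"(a), "v"(b))

/* acc[i] += a[2i]*b[2i] + a[2i+1]*b[2i+1], using the asm macro above. */
__attribute__((target("avx512f,avx512vnni")))
static void vnni_dot(int32_t acc[16], const int16_t a[32], const int16_t b[32])
{
    __m512i vacc = _mm512_loadu_si512(acc);
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    VNNI_DPWSSD(vacc, va, vb);
    _mm512_storeu_si512(acc, vacc);
}
#else
#define HAVE_VNNI_PATH 0
#endif

/* Scalar reference with the same semantics (inputs kept small, no overflow). */
static void scalar_dot(int32_t acc[16], const int16_t a[32], const int16_t b[32])
{
    for (int i = 0; i < 16; i++)
        acc[i] += (int32_t)a[2 * i] * b[2 * i]
                + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Returns 1 when the asm path (or the scalar fallback) matches the reference. */
int vnni_matches_scalar(void)
{
    int16_t a[32], b[32];
    int32_t ref[16], out[16];

    for (int i = 0; i < 32; i++) {
        a[i] = (int16_t)(i - 7);
        b[i] = (int16_t)(3 * i + 1);
    }
    for (int i = 0; i < 16; i++)
        ref[i] = out[i] = i;

    scalar_dot(ref, a, b);
#if HAVE_VNNI_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512vnni"))
        vnni_dot(out, a, b);
    else
        scalar_dot(out, a, b);   /* CPU lacks AVX512-VNNI */
#else
    scalar_dot(out, a, b);       /* not an x86-64 GNU-compatible build */
#endif
    return memcmp(ref, out, sizeof ref) == 0;
}
```

If the real macros do use "x", switching to "v" may let the compute
instruction reach zmm16-zmm31. Compiling with -O2 -S and checking the
register numbers in the generated vpdpwssd instructions would confirm
whether the change has any effect.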