https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117487
            Bug ID: 117487
           Summary: Power8 optimizations for math library aren't done in
                    power9 or power10 (PR target/71977)
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: meissner at gcc dot gnu.org
  Target Milestone: ---

I was answering an email about something else, and I wanted to look up code that I added on January 4th, 2017 (PR target/71977, PR target/70568, PR target/78823).

I noticed that while this code is optimized on power8, it is not optimized on power9 or power10.

The code (gcc.target/pr71977-1.c) is:

    #include <stdint.h>

    typedef union
    {
      float value;
      uint32_t word;
    } ieee_float_shape_type;

    float
    mask_and_float_var (float f, uint32_t mask)
    {
      ieee_float_shape_type u;

      u.value = f;
      u.word &= mask;

      return u.value;
    }

The initial code generated before the January 4th, 2017 changes was:

        xscvdpspn 0,1
        mfvsrwz 9,0
        and 9,9,4
        sldi 9,9,32
        mtvsrd 1,9
        xscvspdpn 1,1
        blr

Note that there is a direct move from the FPR/vector registers, the logical operation is done in the GPR registers, and then there is a direct move back to the FPR/vector registers.

After the changes, the code for power8 is:

        xscvdpspn 0,1
        sldi 9,4,32
        mtvsrd 32,9
        xxland 1,0,32
        xscvspdpn 1,1
        blr

In this case, we avoid a direct move from the FPR/vector registers to the GPR registers, and we do the logical operation in the vector registers.

If we look at the power10/power9 code, it is:

        xscvdpspn 0,1
        mfvsrwz 2,0
        and 2,2,4
        mtvsrws 1,2
        xscvspdpn 1,1
        blr

I.e. we do 2 direct moves between the GPR registers and the FPR/vector registers and do the logical operation in the GPR registers.

The reason for this is that we have the MTVSRWS instruction in power9/power10 (splat the bottom 32 bits of a GPR register into a FPR register). In the power8 case, we don't have MTVSRWS, so instead we need to do a shift left by 32 bits (SLDI) and then a direct move to the FPR/vector registers before we can do XSCVSPDPN.
The XSCVSPDPN instruction wants the value in the upper 32 bits. We get it there either by a left shift or by a splat operation. To fix this, we would need a define_peephole2 similar to the one around line 6318 of vsx.md, but one that matches the splat operation instead of the shift and 64-bit move.
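
To illustrate why the shift form and the splat form are interchangeable for this purpose, here is a minimal C sketch that models only the upper doubleword of a vector register as a uint64_t. It is not GCC code and not the proposed peephole. The function names are hypothetical, and the model assumes (as stated above) that XSCVSPDPN only looks at the upper 32 bits, and that MTVSRWS leaves the same 32-bit word in both halves of that doubleword.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of "sldi rT,rS,32" followed by "mtvsrd": the 32-bit
   word ends up in the upper half of the doubleword, the lower half is 0.  */
static uint64_t
model_shift_then_move (uint32_t w)
{
  return (uint64_t) w << 32;
}

/* Hypothetical model of "mtvsrws" restricted to the upper doubleword:
   the 32-bit word is replicated into both halves.  */
static uint64_t
model_splat_word (uint32_t w)
{
  return ((uint64_t) w << 32) | w;
}

/* The part of the doubleword that XSCVSPDPN consumes in this model.  */
static uint32_t
upper_word (uint64_t dw)
{
  return (uint32_t) (dw >> 32);
}

/* AND done in the "vector" domain (like xxland), mask moved via shift.  */
static uint32_t
masked_upper_via_shift (uint64_t vsr_dw0, uint32_t mask)
{
  return upper_word (vsr_dw0 & model_shift_then_move (mask));
}

/* AND done in the "vector" domain (like xxland), mask moved via splat.  */
static uint32_t
masked_upper_via_splat (uint64_t vsr_dw0, uint32_t mask)
{
  return upper_word (vsr_dw0 & model_splat_word (mask));
}

/* Both forms must agree on the upper word, whatever the lower word holds.  */
static void
check_same_upper_word (uint64_t vsr_dw0, uint32_t mask)
{
  assert (masked_upper_via_shift (vsr_dw0, mask)
          == masked_upper_via_splat (vsr_dw0, mask));
}
```

The two models differ only in the lower 32 bits of the doubleword, which the final conversion ignores in this model, so a pattern that keeps the AND in the vector registers should be equally valid with either way of moving the mask.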