https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064
--- Comment #16 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #13)
> So, both the following patches should fix it IMHO, but no idea which one if
> any is right.
> With
> --- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100
> +++ gcc/config/rs6000/vsx.md 2019-01-18 18:07:37.194899062 +0100
> @@ -4356,7 +4356,9 @@
> ""
> [(const_int 0)]
> {
> - rtx hi = gen_highpart (DFmode, operands[1]);
> + rtx hi = (BYTES_BIG_ENDIAN
> + ? gen_highpart (DFmode, operands[1])
> + : gen_lowpart (DFmode, operands[1]));
> rtx lo = (GET_CODE (operands[2]) == SCRATCH)
> ? gen_reg_rtx (DFmode)
> : operands[2];
>
> the assembly changes:
> --- reduction-3.s1 2019-01-18 18:05:14.313229730 +0100
> +++ reduction-3.s2 2019-01-18 18:10:20.617233358 +0100
> @@ -27,7 +27,7 @@ MAIN__._omp_fn.0:
> addi 9,9,16
> bdnz .L2
> # vec_extract to same register
> - lfd 12,-8(1)
> + lfd 12,-16(1)
> xsmaxdp 0,12,0
> stfd 0,0(10)
> blr
> with:
> --- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100
> +++ gcc/config/rs6000/vsx.md 2019-01-18 18:16:30.680186709 +0100
> @@ -4361,7 +4361,9 @@
> ? gen_reg_rtx (DFmode)
> : operands[2];
>
> - emit_insn (gen_vsx_extract_v2df (lo, operands[1], const1_rtx));
> + emit_insn (gen_vsx_extract_v2df (lo, operands[1],
> + BYTES_BIG_ENDIAN
> + ? const1_rtx : const0_rtx));
> emit_insn (gen_<VEC_reduc_rtx>df3 (operands[0], hi, lo));
> DONE;
> }
This second patch is what looks right to me. This code all pre-dates
little-endian support, and I think we missed changing the element to be
extracted in this spot. There is probably something wrong with _v4sf_scalar
as well: the gen_vsx_xxsldwi_v4sf call probably needs to be adjusted for
little-endian too, but I have a hard time following this code and I'm not
certain.
Bill
> the assembly changes:
> --- reduction-3.s1 2019-01-18 18:05:14.313229730 +0100
> +++ reduction-3.s3 2019-01-18 18:17:18.977397458 +0100
> @@ -26,7 +26,7 @@ MAIN__._omp_fn.0:
> xxpermdi 0,0,0,2
> addi 9,9,16
> bdnz .L2
> - # vec_extract to same register
> + xxpermdi 0,0,0,3
> lfd 12,-8(1)
> xsmaxdp 0,12,0
> stfd 0,0(10)
>
> So just judging from this exact testcase, the first patch seems to be more
> efficient, though still unsure about that, because it goes through memory in
> either case, wouldn't it be better to emit a xxpermdi from 0 to 12 that
> swaps the two elements instead of loading it from memory?
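
For readers following the element-selection issue in the reply above, here is a toy C model of it. It is a sketch only, not GCC code. It assumes that gen_highpart on a V2DF register picks element 0 on big-endian and element 1 on little-endian (consistent with the -8(1) load in the unpatched assembly), and it models the extract index as const1_rtx before the second patch and as BYTES_BIG_ENDIAN ? const1_rtx : const0_rtx after it. The names v2df, highpart_index, extract_index and reduc_smax are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a V2DF value as two doubles, indexed by element number. */
typedef struct { double elt[2]; } v2df;

/* Assumption: the "high part" is element 0 on big-endian and element 1
   on little-endian.  Stand-in for gen_highpart (DFmode, operands[1]).  */
static int highpart_index(bool big_endian)
{
    return big_endian ? 0 : 1;
}

/* Element index passed to the extract step.  Unpatched: always 1.
   Patched (second patch): 1 on big-endian, 0 on little-endian.  */
static int extract_index(bool big_endian, bool patched)
{
    if (!patched)
        return 1;
    return big_endian ? 1 : 0;
}

/* Model of the two-element max reduction: max (hi, lo). */
static double reduc_smax(v2df v, bool big_endian, bool patched)
{
    int hi_idx = highpart_index(big_endian);
    int lo_idx = extract_index(big_endian, patched);
    assert(hi_idx >= 0 && hi_idx < 2 && lo_idx >= 0 && lo_idx < 2);

    double hi = v.elt[hi_idx];
    double lo = v.elt[lo_idx];
    return hi > lo ? hi : lo;
}
```

In this model, the unpatched little-endian case reads element 1 for both hi and lo, so element 0 never takes part in the reduction. Under the same assumptions, the first patch (gen_lowpart on little-endian) would also make the two indices differ.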