https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064
--- Comment #16 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #13)
> So, both the following patches should fix it IMHO, but no idea which one if
> any is right.
> With
> --- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100
> +++ gcc/config/rs6000/vsx.md 2019-01-18 18:07:37.194899062 +0100
> @@ -4356,7 +4356,9 @@
>    ""
>    [(const_int 0)]
>  {
> -  rtx hi = gen_highpart (DFmode, operands[1]);
> +  rtx hi = (BYTES_BIG_ENDIAN
> +            ? gen_highpart (DFmode, operands[1])
> +            : gen_lowpart (DFmode, operands[1]));
>    rtx lo = (GET_CODE (operands[2]) == SCRATCH)
>             ? gen_reg_rtx (DFmode)
>             : operands[2];
>
> the assembly changes:
> --- reduction-3.s1 2019-01-18 18:05:14.313229730 +0100
> +++ reduction-3.s2 2019-01-18 18:10:20.617233358 +0100
> @@ -27,7 +27,7 @@ MAIN__._omp_fn.0:
>         addi 9,9,16
>         bdnz .L2
>          # vec_extract to same register
> -       lfd 12,-8(1)
> +       lfd 12,-16(1)
>         xsmaxdp 0,12,0
>         stfd 0,0(10)
>         blr
> with:
> --- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100
> +++ gcc/config/rs6000/vsx.md 2019-01-18 18:16:30.680186709 +0100
> @@ -4361,7 +4361,9 @@
>             ? gen_reg_rtx (DFmode)
>             : operands[2];
>
> -  emit_insn (gen_vsx_extract_v2df (lo, operands[1], const1_rtx));
> +  emit_insn (gen_vsx_extract_v2df (lo, operands[1],
> +                                   BYTES_BIG_ENDIAN
> +                                   ? const1_rtx : const0_rtx));
>    emit_insn (gen_<VEC_reduc_rtx>df3 (operands[0], hi, lo));
>    DONE;
>  }

This is what looks right to me.  This code all pre-dates little-endian
support, and I think we missed changing the element to be extracted in
this spot.

There is probably something wrong with _v4sf_scalar also -- the
gen_vsx_xxsldwi_v4sf probably needs to be adjusted also for little-endian,
but I have a hard time following this code and I'm not certain.
Bill

> the assembly changes:
> --- reduction-3.s1 2019-01-18 18:05:14.313229730 +0100
> +++ reduction-3.s3 2019-01-18 18:17:18.977397458 +0100
> @@ -26,7 +26,7 @@ MAIN__._omp_fn.0:
>         xxpermdi 0,0,0,2
>         addi 9,9,16
>         bdnz .L2
> -        # vec_extract to same register
> +       xxpermdi 0,0,0,3
>         lfd 12,-8(1)
>         xsmaxdp 0,12,0
>         stfd 0,0(10)
>
> So just judging from this exact testcase, the first patch seems to be more
> efficient, though still unsure about that, because it goes through memory in
> either case, wouldn't it be better to emit a xxpermdi from 0 to 12 that
> swaps the two elements instead of loading it from memory?
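
For anyone following along, here is a rough plain-C model of the element-selection problem discussed in this comment. It is only a sketch and not GCC code: the helper names (`highpart_index`, `lowpart_index`, `reduce_max_*`) are made up for illustration, and it assumes that the "high part" of a two-double vector names element 0 on big-endian and element 1 on little-endian, which is what the two patches imply.

```c
#include <stdbool.h>

/* Hypothetical model: which element of a two-double vector the
   "high part" and "low part" refer to.  Assumption: element 0 is the
   high part on big-endian and element 1 is the high part on
   little-endian.  */
static int highpart_index (bool big_endian) { return big_endian ? 0 : 1; }
static int lowpart_index (bool big_endian) { return big_endian ? 1 : 0; }

static double max2 (double a, double b) { return a > b ? a : b; }

/* Models the unpatched split: hi is always the high part and lo is
   always element 1.  On little-endian both name element 1, so
   element 0 never takes part in the reduction.  */
double reduce_max_original (const double v[2], bool big_endian)
{
  double hi = v[highpart_index (big_endian)];
  double lo = v[1];
  return max2 (hi, lo);
}

/* Models the first patch: take the low part for hi on little-endian.  */
double reduce_max_patch1 (const double v[2], bool big_endian)
{
  double hi = big_endian ? v[highpart_index (big_endian)]
                         : v[lowpart_index (big_endian)];
  double lo = v[1];
  return max2 (hi, lo);
}

/* Models the second patch: extract element 0 for lo on little-endian.  */
double reduce_max_patch2 (const double v[2], bool big_endian)
{
  double hi = v[highpart_index (big_endian)];
  double lo = v[big_endian ? 1 : 0];
  return max2 (hi, lo);
}
```

With `v = {5.0, 2.0}`, the model of the unpatched code returns 2.0 in the little-endian case because element 0 is never read, while both patched variants return 5.0. Under these assumptions the two patches are equivalent in result and differ only in which operand is adjusted.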