https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dje at gcc dot gnu.org,
                   |                            |meissner at gcc dot gnu.org,
                   |                            |segher at gcc dot gnu.org
          Component|libgomp                     |target
--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Seems to be a powerpc64le backend bug or RA bug.
Reduced testcase for -fopenacc -O1:
program reduction_3
  implicit none
  integer, parameter :: n = 10, vl = 32
  integer :: i
  double precision :: vresult, rv
  double precision, parameter :: e = 0.001
  double precision, dimension (n) :: array
  do i = 1, n
    array(i) = i
  end do
  rv = 0
  vresult = 0
  !$acc parallel vector_length(vl) copy(rv)
  !$acc loop reduction(max:rv) vector
  do i = 1, n
    rv = max (rv, array(i))
  end do
  !$acc end parallel
  do i = 1, n
    vresult = max (vresult, array(i))
  end do
  if (abs (rv - vresult) .ge. e) STOP 11
end program reduction_3
In *.optimized it all looks correct:
<bb 3> [local count: 437450368]:
# vect_M.23_45 = PHI <vect_cst__39(2), vect_M.27_34(3)>
# ivtmp.34_3 = PHI <ivtmp.34_43(2), ivtmp.34_4(3)>
_2 = (void *) ivtmp.34_3;
vect__28.26_44 = MEM[base: _2, offset: 0B];
vect_M.27_34 = MAX_EXPR <vect__28.26_44, vect_M.23_45>;
ivtmp.34_4 = ivtmp.34_3 + 16;
if (ivtmp.34_4 != _25)
goto <bb 3>; [80.00%]
else
goto <bb 4>; [20.00%]
<bb 4> [local count: 437450371]:
stmp_M.28_8 = .REDUC_MAX (vect_M.27_34);
*_10 = stmp_M.28_8;
and the loop indeed iterates properly: we end up with a { 10.0, 9.0 } vector,
which the REDUC_MAX ifn should reduce to 10.0.
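As a rough model of the GIMPLE above (a sketch only, not GCC code): the loop keeps a per-lane running maximum over two doubles per iteration, and .REDUC_MAX collapses the lanes at the end. The function names are hypothetical, and the sketch assumes the initial accumulator vect_cst__39 holds the initial value of rv (0) in both lanes. The lane order in the model is arbitrary and is not meant to mirror the register layout.

```python
def vector_max_loop(array, lanes=2, init=0.0):
    """Model of <bb 3>: load `lanes` elements per iteration, keep a per-lane max."""
    acc = [init] * lanes                      # stands in for vect_cst__39 (assumed)
    for base in range(0, len(array), lanes):  # ivtmp advances by 16 bytes = 2 doubles
        chunk = array[base:base + lanes]
        acc = [max(a, c) for a, c in zip(acc, chunk)]  # MAX_EXPR on each lane
    return acc


def reduc_max(vec):
    """Model of .REDUC_MAX in <bb 4>: maximum over all lanes."""
    return max(vec)


array = [float(i) for i in range(1, 11)]  # array(i) = i for i = 1..10
acc = vector_max_loop(array)
print(sorted(acc), reduc_max(acc))
```

The two lanes end up holding 9.0 and 10.0, so a correct reduction has to look at both of them.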
During early RTL opts it also looks correct:
(insn 20 19 21 4 (parallel [
(set (reg:V2DF 134)
(smax:V2DF (vec_concat:V2DF (vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
(parallel [
(const_int 1 [0x1])
]))
(vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
(parallel [
(const_int 0 [0])
])))
(reg:V2DF 128 [ vect_M.23 ])))
(clobber (scratch:V2DF))
]) 1330 {vsx_reduc_smax_v2df}
(nil))
(insn 21 20 22 4 (set (reg:DF 123 [ stmp_M.28 ])
(vec_select:DF (reg:V2DF 134)
(parallel [
(const_int 0 [0])
]))) 1219 {vsx_extract_v2df}
(nil))
Then combine turns that into:
(insn 21 20 22 4 (parallel [
(set (reg:DF 123 [ stmp_M.28 ])
(vec_select:DF (smax:V2DF (vec_concat:V2DF (vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
(parallel [
(const_int 1 [0x1])
]))
(vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
(parallel [
(const_int 0 [0])
])))
(reg:V2DF 128 [ vect_M.23 ]))
(parallel [
(const_int 1 [0x1])
])))
(clobber (scratch:DF))
]) 1336 {*vsx_reduc_smax_v2df_scalar}
(expr_list:REG_DEAD (reg:V2DF 128 [ vect_M.23 ])
(nil)))
That is then split into:
(insn 34 20 35 4 (set (reg:DF 137)
(vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
(parallel [
(const_int 1 [0x1])
]))) -1
(nil))
(insn 35 34 22 4 (set (reg:DF 123 [ stmp_M.28 ])
(smax:DF (subreg:DF (reg:V2DF 128 [ vect_M.23 ]) 8)
(reg:DF 137))) -1
(nil))
At that point I'm no longer sure whether it is correct. As I said, at least in
the debugger the input to this .REDUC_MAX contains the value { 10, 9 }.
Is the vec_select extracting the second elt (i.e. 9.0), and is (subreg 8) also
the second one?
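One way to reason about that question, as a sketch under stated assumptions rather than a statement about the rs6000 backend: assume that vec_select indices and subreg byte offsets both follow memory order, so that element i of a V2DF sits at byte offset 8*i. The helper names are hypothetical, and the { 10, 9 } layout is taken from the value reported for the .REDUC_MAX input.

```python
import struct


def vec_select(vec_bytes, index):
    """Element `index` of a V2DF, assuming elements are numbered in memory order."""
    return struct.unpack_from("<d", vec_bytes, 8 * index)[0]


def subreg_df(vec_bytes, byte_offset):
    """DF subreg at `byte_offset`, assuming a memory-order byte offset."""
    return struct.unpack_from("<d", vec_bytes, byte_offset)[0]


# Input of .REDUC_MAX as reported: 10 in the first element, 9 in the second.
reg128 = struct.pack("<2d", 10.0, 9.0)

op137 = vec_select(reg128, 1)   # insn 34: (vec_select ... (const_int 1))
op128 = subreg_df(reg128, 8)    # insn 35: (subreg:DF (reg:V2DF 128) 8)
split_result = max(op128, op137)

print(op137, op128, split_result)  # both operands name the lane holding 9.0
```

If these assumptions hold, the split computes max (9, 9) and never reads the lane holding 10.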
In the end, that is what happens; the resulting assembly is:
0x000000001000086c <+32>: lxvd2x vs0,0,r9
0x0000000010000870 <+36>: addi r8,r1,-16
0x0000000010000874 <+40>: lxvd2x vs12,0,r8
0x0000000010000878 <+44>: xxswapd vs12,vs12
0x000000001000087c <+48>: xvmaxdp vs0,vs12,vs0
0x0000000010000880 <+52>: xxswapd vs0,vs0
0x0000000010000884 <+56>: stxvd2x vs0,0,r8
0x0000000010000888 <+60>: xxswapd vs0,vs0
0x000000001000088c <+64>: addi r9,r9,16
0x0000000010000890 <+68>: bdnz 0x1000086c <MAIN__._omp_fn.0+32>
=> 0x0000000010000894 <+72>: lfd f12,-8(r1)
0x0000000010000898 <+76>: xsmaxdp vs0,vs12,vs0
0x000000001000089c <+80>: stfd f0,0(r10)
0x00000000100008a0 <+84>: blr
and at that point
x/2fg $r1-16
0x3fffffffed90: 10 9
p $vs0.v2_double
$6 = {10, 9}
p $vs12.v2_double
$7 = {8, 7}
Now, the lfd loads the second element (i.e. 9) into f12; after the lfd insn
the debugger shows
p $vs12.v2_double
$8 = {0, 9}
and xsmaxdp {10, 9}, {0, 9} gives {0, 9}, which is what we store.
So, does the vsx_reduc_smax_v2df_scalar expander need adjustments for
little-endian?
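To make the failure concrete, here is a small replay of the values printed in the debugger session. It is a sketch only: the names are hypothetical, and it assumes xsmaxdp compares just the doubleword that gdb displays as the second v2_double element on little-endian.

```python
def xsmaxdp(a, b):
    """Scalar max on the second displayed element only (assumed scalar doubleword)."""
    return max(a[1], b[1])


vs0 = (10.0, 9.0)    # p $vs0.v2_double before the lfd
vs12 = (0.0, 9.0)    # p $vs12.v2_double after lfd f12,-8(r1)

rv = xsmaxdp(vs12, vs0)                        # value stored by stfd f0,0(r10)
vresult = max(float(i) for i in range(1, 11))  # host-side loop in the testcase

e = 0.001
stop_11 = abs(rv - vresult) >= e               # condition guarding STOP 11
print(rv, vresult, stop_11)
```

With these inputs rv is 9 while vresult is 10, so the testcase's final check fires.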