[Bug target/119234] openblas miscompiled with -mcpu=power8

jeevitha at gcc dot gnu.org via Gcc-bugs Wed, 16 Apr 2025 23:35:15 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119234


Jeevitha <jeevitha at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jeevitha at gcc dot gnu.org

--- Comment #5 from Jeevitha <jeevitha at gcc dot gnu.org> ---
I am able to reproduce the issue on the Power8 build. The issue occurs in the
SGEMV kernels 4x8 and 4x4.

For the below source in sgemv_4x8 below assemble was generated by the compiler,

        for (i = 0; i < n/4; i ++) {
            temp0 += v_x[i] * va0[i];
            temp1 += v_x[i] * va1[i];
            temp2 += v_x[i] * va2[i];
            temp3 += v_x[i] * va3[i];
            temp4 += v_x[i] * va4[i];
            temp5 += v_x[i] * va5[i];
            temp6 += v_x[i] * va6[i];
            temp7 += v_x[i] * va7[i];
        }
Assembly:

   0x000020000051d8a0 <+96>:    add     r30,r6,r9
   0x000020000051d8a4 <+100>:   lvx     v12,r5,r9
   0x000020000051d8a8 <+104>:   lvx     v13,r7,r9
   0x000020000051d8ac <+108>:   clrrdi  r30,r30,4
   0x000020000051d8b0 <+112>:   lvx     v1,r12,r9
   0x000020000051d8b4 <+116>:   lxvd2x  vs0,0,r30
   0x000020000051d8b8 <+120>:   add     r30,r11,r9
   0x000020000051d8bc <+124>:   clrrdi  r30,r30,4
   0x000020000051d8c0 <+128>:   lxvd2x  vs13,0,r30
   0x000020000051d8c4 <+132>:   add     r30,r3,r9
   0x000020000051d8c8 <+136>:   clrrdi  r30,r30,4
   0x000020000051d8cc <+140>:   xxswapd vs0,vs0
   0x000020000051d8d0 <+144>:   lxvd2x  vs2,0,r30
   0x000020000051d8d4 <+148>:   add     r30,r4,r9
   0x000020000051d8d8 <+152>:   clrrdi  r30,r30,4
   0x000020000051d8dc <+156>:   xvmaddasp vs6,vs0,vs44
   0x000020000051d8e0 <+160>:   xvmaddasp vs7,vs0,vs45
   0x000020000051d8e4 <+164>:   xxswapd vs13,vs13
   0x000020000051d8e8 <+168>:   xvmaddasp vs10,vs0,vs33
   0x000020000051d8ec <+172>:   lxvd2x  vs3,0,r30
   0x000020000051d8f0 <+176>:   add     r30,r8,r9
   0x000020000051d8f4 <+180>:   clrrdi  r30,r30,4
   0x000020000051d8f8 <+184>:   xvmaddasp vs12,vs0,vs13
   0x000020000051d8fc <+188>:   xxswapd vs2,vs2
   0x000020000051d900 <+192>:   lxvd2x  vs4,0,r30
   0x000020000051d904 <+196>:   add     r30,r10,r9
   0x000020000051d908 <+200>:   addi    r9,r9,16
   0x000020000051d90c <+204>:   clrrdi  r30,r30,4
   0x000020000051d910 <+208>:   xvmaddasp vs8,vs0,vs2
   0x000020000051d914 <+212>:   xxswapd vs3,vs3
   0x000020000051d918 <+216>:   lxvd2x  vs5,0,r30
   0x000020000051d91c <+220>:   xvmaddasp vs9,vs0,vs3
   0x000020000051d920 <+224>:   xxswapd vs4,vs4
   0x000020000051d924 <+228>:   xvmaddasp vs11,vs0,vs4
   0x000020000051d928 <+232>:   xxswapd vs5,vs5
   0x000020000051d92c <+236>:   xvmaddasp vs32,vs0,vs5
   0x000020000051d930 <+240>:   bdnz    0x20000051d8a0 <sgemv_kernel_4x8+96>

While debugging the issue, I observed a misaligned lvx instruction in
sgemv_kernel_4x8, which results in incorrect vector loads 

0x000020000051d8a4      71                  temp0 += v_x[i] * va0[i];
=> 0x000020000051d8a4 <sgemv_kernel_4x8+100>:   ce 48 85 7d     lvx    
v12,r5,r9
(gdb) p/x $r5
$3 = 0x10033940
(gdb) p/x $r9
$4 = 0x0
(gdb) x/10f $r5
0x10033940:     0       1       2       3
0x10033950:     4       5       6       7
0x10033960:     8       9
(gdb) si
(gdb) p $v12
$6 = {uint128 = 85402898748987885467211345242806550528, v4_float = {0, 1, 2,
3}, v4_int32 = {0, 1065353216, 1073741824,
    1077936128}, v8_int16 = {0, 0, 0, 16256, 0, 16384, 0, 16448}, v16_int8 =
{0, 0, 0, 0, 0, 0, -128, 63, 0, 0, 0, 64,
    0, 0, 64, 64}}

Above lvx has no issue because, Address is 16-byte aligned. lvx fetches
expected values.

0x000020000051d8a8      72                  temp1 += v_x[i] * va1[i];
=> 0x000020000051d8a8 <sgemv_kernel_4x8+104>:   ce 48 a7 7d     lvx    
v13,r7,r9
(gdb) p/x $r7
$7 = 0x10033a08
(gdb) p/x $r9
$8 = 0x0
(gdb) x/10f $r7
0x10033a08:     50      51      52      53
0x10033a18:     54      55      56      57
0x10033a28:     58      59
(gdb) si
(gdb) p $v13
$9 = {uint128 = 88123662303566147823415553041968922624, v4_float = {48, 49, 50,
51}, v4_int32 = {1111490560, 1111752704,
    1112014848, 1112276992}, v8_int16 = {0, 16960, 0, 16964, 0, 16968, 0,
16972}, v16_int8 = {0, 0, 64, 66, 0, 0, 68,
    66, 0, 0, 72, 66, 0, 0, 76, 66}}

In this case, Address 0x10033a08 is not 16-byte aligned. Internally, lvx masks
(0xFFFF_FFFF_FFFF_FFF0) changed address to 0x10033a00, so it fetches values
starting two floats earlier than intended.//v4_float = {48, 49, 50, 51}

This causes incorrect computation in temp1 during the loop, and the error
propagates through further vectorized operations


Sample Values (for one loop iteration):
i = 0
v_x[i] = {1.000000, 1.000000, 1.000000, 1.000000}
va0[i] = {0.000000, 1.000000, 2.000000, 3.000000}
va1[i] = {50.000000, 51.000000, 52.000000, 53.000000}
va2[i] = {100.000000, 101.000000, 102.000000, 103.000000}
va3[i] = {150.000000, 151.000000, 152.000000, 153.000000}
va4[i] = {200.000000, 201.000000, 202.000000, 203.000000}
va5[i] = {250.000000, 251.000000, 252.000000, 253.000000}
va6[i] = {300.000000, 301.000000, 302.000000, 303.000000}
va7[i] = {350.000000, 351.000000, 352.000000, 353.000000}

Accumulated temp values:
temp0 = {0.000000, 1.000000, 2.000000, 3.000000}
temp1 = {48.000000, 49.000000, 50.000000, 51.000000} /* it should have
{50.000000, 51.000000, 52.000000, 53.000000}*/
temp2 = {100.000000, 101.000000, 102.000000, 103.000000}
temp3 = {148.000000, 149.000000, 150.000000, 151.000000}//wrong
temp4 = {200.000000, 201.000000, 202.000000, 203.000000}
temp5 = {248.000000, 249.000000, 250.000000, 251.000000}//wrong
temp6 = {300.000000, 301.000000, 302.000000, 303.000000}
temp7 = {348.000000, 349.000000, 350.000000, 351.000000}//wrong

If you look at the above, incorrect values are being accumulated in specific
temp variables inside sgemv_kernel_4x8. Specifically, temp1, temp3, temp5, and
temp7 have incorrect values due to misaligned memory access in the lvx
instruction. 
Note temp1 = va1;  // should load {50.0, 51.0, 52.0, 53.0}


If they are not vectorizing the kernel, then the for loop generates the
following code on Power8:


  0x000020000051dc70 <+896>:    lxvd2x  vs0,r14,r30
   0x000020000051dc74 <+900>:   lxvd2x  vs1,r27,r30
   0x000020000051dc78 <+904>:   lxvd2x  vs13,r25,r30
   0x000020000051dc7c <+908>:   lxvd2x  vs2,r24,r30
   0x000020000051dc80 <+912>:   lxvd2x  vs3,r23,r30
   0x000020000051dc84 <+916>:   lxvd2x  vs4,r22,r30
   0x000020000051dc88 <+920>:   lxvd2x  vs5,r21,r30
   0x000020000051dc8c <+924>:   lxvd2x  vs6,r28,r30
   0x000020000051dc90 <+928>:   addi    r30,r30,16
   0x000020000051dc94 <+932>:   xvmaddasp vs7,vs0,vs1
   0x000020000051dc98 <+936>:   xvmaddasp vs8,vs0,vs13
   0x000020000051dc9c <+940>:   xvmaddasp vs9,vs0,vs2
   0x000020000051dca0 <+944>:   xvmaddasp vs10,vs0,vs3
   0x000020000051dca4 <+948>:   xvmaddasp vs11,vs0,vs4
   0x000020000051dca8 <+952>:   xvmaddasp vs12,vs0,vs5
   0x000020000051dcac <+956>:   xvmaddasp vs32,vs0,vs6
   0x000020000051dcb0 <+960>:   bdnz    0x20000051dc70 <sgemv_kernel_4x8+896>

In the above assembly, there is no alignment issue with lxvd2x because it does
not perform any internal alignment adjustments — the data is loaded correctly.

When I manually vectorized the for loop, it generated proper lxvd2x
instructions like the above, and there was no issue:
"All Openblas sgemm test passed" with below code and little changes in v_y
computation

for (i = 0; i < n; i += 4) {
      vector float vx = vec_vsx_ld(0, &x[i]);
        vector float vva0 = vec_vsx_ld(0, &a0[i]);
        vector float vva1 = vec_vsx_ld(0, &a1[i]);
        vector float vva2 = vec_vsx_ld(0, &a2[i]);
        vector float vva3 = vec_vsx_ld(0, &a3[i]);
        vector float vva4 = vec_vsx_ld(0, &a4[i]);
        vector float vva5 = vec_vsx_ld(0, &a5[i]);
        vector float vva6 = vec_vsx_ld(0, &a6[i]);
        vector float vva7 = vec_vsx_ld(0, &a7[i]);

        temp0 = vec_madd(vx, vva0, temp0);
        temp1 = vec_madd(vx, vva1, temp1);
        temp2 = vec_madd(vx, vva2, temp2);
        temp3 = vec_madd(vx, vva3, temp3);
        temp4 = vec_madd(vx, vva4, temp4);
        temp5 = vec_madd(vx, vva5, temp5);
        temp6 = vec_madd(vx, vva6, temp6);
        temp7 = vec_madd(vx, vva7, temp7);
        }

On Power9 and Power10, we don’t face this issue because they generate the lxvx
instruction no alignment issue.

[Bug target/119234] openblas miscompiled with -mcpu=power8

Reply via email to