https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116685

            Bug ID: 116685
           Summary: RISC-V: missed optimization on vector dot products
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ewlu at rivosinc dot com
  Target Milestone: ---

Godbolt: https://godbolt.org/z/arM88Pa81

The vector definitions and expanded routines were extracted from POV-Ray:
https://github.com/POV-Ray/povray/blob/master/source/core/math/vector.h and
related files.

I expanded some of the templates and added a couple of basic vector routines that
may be used in a workload.
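
For readers without the Godbolt link at hand, the routines are roughly of the
following shape. This is a minimal sketch reconstructed from the function
signatures in the assembly below, not the exact source: the struct layout and
the member name `vect` are assumptions loosely modelled on POV-Ray's vector.h.

```cpp
#include <cstddef>

// Assumed layout: a thin wrapper around a fixed-size array.
// The member name `vect` is an assumption, not taken from the report.
template <typename T>
struct GenericVector2d {
    T vect[2];
    const T& operator[](std::size_t i) const { return vect[i]; }
};

template <typename T>
struct GenericVector3d {
    T vect[3];
    const T& operator[](std::size_t i) const { return vect[i]; }
};

// 2-element dot product on the wrapper type.
double dotDbl_2d(const GenericVector2d<double>& a,
                 const GenericVector2d<double>& b) {
    return a[0] * b[0] + a[1] * b[1];
}

// 3-element dot product on the wrapper type.
double dotDbl_3d(const GenericVector3d<double>& a,
                 const GenericVector3d<double>& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// 4-element dot product on raw pointers, result written through a reference.
void V4D_Dot(double& out, const double* a, const double* b) {
    out = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```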

GCC tends either not to vectorize dot products at all or to vectorize them
with a suboptimal vector length:

2 element vector dot product (failure to vectorize): 
dotDbl_2d(GenericVector2d<double> const&, GenericVector2d<double> const&):
        fld     fa5,8(a0)  <-- could reduce number of loads with vsetivli zero,2
        fld     fa3,8(a1)
        fld     fa0,0(a0)
        fld     fa4,0(a1)
        fmul.d  fa5,fa5,fa3
        fmadd.d fa0,fa0,fa4,fa5
        ret
3 element vector dot product (failure to vectorize):
dotDbl_3d(GenericVector3d<double> const&, GenericVector3d<double> const&):
        fld     fa3,8(a1) <-- could use vsetivli zero,2 or vsetivli zero,3
        fld     fa4,8(a0)
        fld     fa5,0(a0)
        fld     fa2,0(a1)
        fmul.d  fa4,fa4,fa3
        fld     fa0,16(a0)
        fld     fa3,16(a1)
        fmadd.d fa5,fa5,fa2,fa4
        fmadd.d fa0,fa0,fa3,fa5
        ret
4 element vector dot product (improper vectorization):
V4D_Dot(double&, double const*, double const*):
        vsetivli        zero,2,e64,m1,ta,ma <-- should be vsetivli zero,4
        addi    a4,a1,16
        addi    a5,a2,16
        vle64.v v3,0(a5)
        vle64.v v2,0(a4)
        vle64.v v1,0(a1)
        vle64.v v4,0(a2)
        vfmul.vv        v2,v2,v3
        vmv.s.x v3,zero
        vfmadd.vv       v1,v4,v2
        vfredusum.vs    v1,v1,v3
        vfmv.f.s        fa5,v1
        fsd     fa5,0(a0)
        ret

For odd element counts such as the 3 element vector case, clang currently
uses vsetivli zero,2 and performs the last part of the dot product using scalar
code, which GCC could potentially do as well.
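
At the source level, the split described above has roughly the following
shape. This is only an illustration of the idea, not clang's actual output:
`dot3_split` is a hypothetical helper name, and whether a compiler maps the
two-lane part to vector instructions depends on its flags and cost model.

```cpp
// Hypothetical helper: 3-element dot product split into a 2-lane part
// and a scalar tail.
double dot3_split(const double* a, const double* b) {
    // Two-lane part: elements 0 and 1 of each operand are adjacent in
    // memory, so each pair is a candidate for a single vector load and
    // one vector multiply under vsetivli zero,2.
    double lane0 = a[0] * b[0];
    double lane1 = a[1] * b[1];

    // Reduce the two lanes to a scalar.
    double partial = lane0 + lane1;

    // Scalar tail: the third element is handled with ordinary scalar
    // code (a candidate for a single fmadd.d).
    return a[2] * b[2] + partial;
}
```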

While the current scalar code uses one fewer instruction than the vectorized
code, it issues around twice as many loads, which may impact performance.
