https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116685
Bug ID: 116685
Summary: RISC-V: missed optimization on vector dot products
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: other
Assignee: unassigned at gcc dot gnu.org
Reporter: ewlu at rivosinc dot com
Target Milestone: ---

Godbolt: https://godbolt.org/z/arM88Pa81

Vector definitions and expanded routines extracted from povray
https://github.com/POV-Ray/povray/blob/master/source/core/math/vector.h and
related files.

I expanded some of the templates and added a couple of basic vector routines
that may be used in a workload.

GCC tends to avoid vectorizing dot products, or fails to vectorize them
appropriately.

2 element vector dot product (failure to vectorize):

dotDbl_2d(GenericVector2d<double> const&, GenericVector2d<double> const&):
        fld     fa5,8(a0)    <-- could reduce number of loads with vsetivli zero,2
        fld     fa3,8(a1)
        fld     fa0,0(a0)
        fld     fa4,0(a1)
        fmul.d  fa5,fa5,fa3
        fmadd.d fa0,fa0,fa4,fa5
        ret

3 element vector dot product (failure to vectorize):

dotDbl_3d(GenericVector3d<double> const&, GenericVector3d<double> const&):
        fld     fa3,8(a1)    <-- could use vsetivli zero,2 or vsetivli zero,3
        fld     fa4,8(a0)
        fld     fa5,0(a0)
        fld     fa2,0(a1)
        fmul.d  fa4,fa4,fa3
        fld     fa0,16(a0)
        fld     fa3,16(a1)
        fmadd.d fa5,fa5,fa2,fa4
        fmadd.d fa0,fa0,fa3,fa5
        ret

4 element vector dot product (improper vectorization):

V4D_Dot(double&, double const*, double const*):
        vsetivli        zero,2,e64,m1,ta,ma    <-- should be vsetivli zero,4
        addi    a4,a1,16
        addi    a5,a2,16
        vle64.v v3,0(a5)
        vle64.v v2,0(a4)
        vle64.v v1,0(a1)
        vle64.v v4,0(a2)
        vfmul.vv        v2,v2,v3
        vmv.s.x v3,zero
        vfmadd.vv       v1,v4,v2
        vfredusum.vs    v1,v1,v3
        vfmv.f.s        fa5,v1
        fsd     fa5,0(a0)
        ret

For vectors with an odd number of elements, such as the 3 element case, clang
currently uses vsetivli zero,2 and performs the last part of the dot product
using scalar code. GCC could potentially do the same.
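To illustrate the 2+1 split described for the 3 element case, here is a minimal source-level sketch. It is not clang's actual output and does not use RVV intrinsics. It assumes GCC or Clang (it relies on the `vector_size` vector extension), and the names `v2df` and `dot3_split` are hypothetical, introduced only for this example.

```cpp
#include <cstring>

// Two-lane double vector via the GCC/Clang vector extension.
// Assumption: built with GCC or Clang; other compilers may reject this.
typedef double v2df __attribute__((vector_size(16)));

// Hypothetical helper: 3 element dot product computed as a 2-lane vector
// part plus a scalar tail for the last element.
double dot3_split(const double *a, const double *b)
{
    v2df va, vb;
    std::memcpy(&va, a, sizeof va);   // intended to become one 2 element load
    std::memcpy(&vb, b, sizeof vb);   // likewise for the second operand

    v2df prod = va * vb;              // lane-wise multiply of elements 0 and 1
    double head = prod[0] + prod[1];  // horizontal sum of the two lanes

    return head + a[2] * b[2];        // scalar tail for element 2
}
```

Whether the compiler actually selects vector loads and a reduction for this source depends on the target flags and cost model; the sketch only shows the shape of the computation.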
Although the current scalar code uses one fewer instruction than the vectorized code would, it issues around twice as many loads, which may impact performance.
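For readers without access to the Godbolt link, the three routines can be approximated as below. This is a reconstruction from the function signatures in the listings, not the exact code from the report: the struct layout (a single `vect` array member) is an assumption, and POV-Ray's real classes carry constructors and operators that are omitted here.

```cpp
// Assumed minimal layout; the real POV-Ray templates have more members.
template <typename T>
struct GenericVector2d { T vect[2]; };

template <typename T>
struct GenericVector3d { T vect[3]; };

// 2 element dot product, matching the dotDbl_2d signature in the listing.
double dotDbl_2d(const GenericVector2d<double>& a,
                 const GenericVector2d<double>& b)
{
    return a.vect[0] * b.vect[0] + a.vect[1] * b.vect[1];
}

// 3 element dot product, matching the dotDbl_3d signature in the listing.
double dotDbl_3d(const GenericVector3d<double>& a,
                 const GenericVector3d<double>& b)
{
    return a.vect[0] * b.vect[0] + a.vect[1] * b.vect[1]
         + a.vect[2] * b.vect[2];
}

// 4 element dot product on raw arrays; the result is written through `out`.
void V4D_Dot(double& out, const double* b, const double* c)
{
    out = b[0] * c[0] + b[1] * c[1] + b[2] * c[2] + b[3] * c[3];
}
```

Compiling something like this for a RISC-V target with the vector extension enabled (for example `-O3 -march=rv64gcv`, an assumed flag set; the report's exact options are in the Godbolt link) should give a starting point for comparing the scalar and vector code paths.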