everton.constantino added a comment.

@fhahn When I mentioned the splats I was talking about the IR, not the final code. The Godbolt links you sent show the same thing I see. However, take a look at the IR your example generates:

  %vec.cast = bitcast [4 x float]* %A to <2 x float>*
  %col.load = load <2 x float>, <2 x float>* %vec.cast, align 4
  %vec.gep = getelementptr [4 x float], [4 x float]* %A, i64 0, i64 2
  %vec.cast2 = bitcast float* %vec.gep to <2 x float>*
  %col.load3 = load <2 x float>, <2 x float>* %vec.cast2, align 4
  %vec.cast4 = bitcast [4 x float]* %B to <2 x float>*
  %col.load5 = load <2 x float>, <2 x float>* %vec.cast4, align 4
  %vec.gep6 = getelementptr [4 x float], [4 x float]* %B, i64 0, i64 2
  %vec.cast7 = bitcast float* %vec.gep6 to <2 x float>*
  %col.load8 = load <2 x float>, <2 x float>* %vec.cast7, align 4
  %splat.splat = shufflevector <2 x float> %col.load5, <2 x float> poison, <2 x i32> zeroinitializer
  %0 = fmul <2 x float> %col.load, %splat.splat
  %splat.splat11 = shufflevector <2 x float> %col.load5, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %1 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat11, <2 x float> %0)
  %splat.splat14 = shufflevector <2 x float> %col.load8, <2 x float> poison, <2 x i32> zeroinitializer
  %2 = fmul <2 x float> %col.load, %splat.splat14
  %splat.splat17 = shufflevector <2 x float> %col.load8, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %3 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat17, <2 x float> %2)
  %vec.cast18 = bitcast [4 x float]* %C to <2 x float>*
  %col.load19 = load <2 x float>, <2 x float>* %vec.cast18, align 4
  %vec.gep20 = getelementptr [4 x float], [4 x float]* %C, i64 0, i64 2
  %vec.cast21 = bitcast float* %vec.gep20 to <2 x float>*
  %col.load22 = load <2 x float>, <2 x float>* %vec.cast21, align 4
  %4 = fadd <2 x float> %1, %col.load19
  %5 = fadd <2 x float> %3, %col.load22
  store <2 x float> %4, <2 x float>* %vec.cast18, align 4
  store <2 x float> %5, <2 x float>* %vec.cast21, align 4

I don't see a simple, reliable pattern to match the operands of %4 with %0, for example, and this is what I meant by the splat in the middle.
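
For readers following along, here is a rough scalar C++ sketch of what the IR above appears to compute: a 2x2 column-major `C += A * B`, with the accumulator `C` added only after the product is complete. The names `Mat2` and `multiplyAdd2x2` are hypothetical and exist only for this illustration; `std::fma` stands in for `llvm.fmuladd`, which, unlike `std::fma`, is allowed to lower to an unfused multiply and add.

```cpp
#include <array>
#include <cmath>

// Hypothetical type for illustration: a 2x2 float matrix in column-major
// order, matching the [4 x float] arrays loaded as two <2 x float> columns.
using Mat2 = std::array<float, 4>;

// Computes C += A * B in the same order as the IR:
// fmul, then fmuladd, then a separate fadd with the loaded column of C.
void multiplyAdd2x2(const Mat2 &A, const Mat2 &B, Mat2 &C) {
  for (int j = 0; j < 2; ++j) {
    // Each element of column j of B is broadcast across the lanes
    // (the shufflevector splats in the IR).
    float b0 = B[2 * j + 0];
    float b1 = B[2 * j + 1];
    for (int i = 0; i < 2; ++i) {
      float prod = A[i] * b0;                    // %0 / %2: fmul with column 0 of A
      float sum = std::fma(A[2 + i], b1, prod);  // %1 / %3: fmuladd with column 1 of A
      C[2 * j + i] = sum + C[2 * j + i];         // %4 / %5: fadd with the old C
    }
  }
}
```

In this form the old value of `C` only enters at the final add, which is the step that would need to be matched back to the first `fmul` in order to start the accumulation from `C` instead.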

The pragma approach assumes that we're always working with architectures where the better approach is to fuse the fmuls and fadds. The real decision here is whether to preload the accumulator or not. On IBM Power10's MMA, for example, fusing would be pretty far from optimal, because there are specific instructions to load accumulators.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99433/new/

https://reviews.llvm.org/D99433

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits