everton.constantino added a comment.

@fhahn When I mentioned the splats, I was talking about the IR, not the final 
code. The Godbolt links you sent show the same thing I see. However, take a 
look at the IR your example generates:

  %vec.cast = bitcast [4 x float]* %A to <2 x float>*
  %col.load = load <2 x float>, <2 x float>* %vec.cast, align 4
  %vec.gep = getelementptr [4 x float], [4 x float]* %A, i64 0, i64 2
  %vec.cast2 = bitcast float* %vec.gep to <2 x float>*
  %col.load3 = load <2 x float>, <2 x float>* %vec.cast2, align 4
  %vec.cast4 = bitcast [4 x float]* %B to <2 x float>*
  %col.load5 = load <2 x float>, <2 x float>* %vec.cast4, align 4
  %vec.gep6 = getelementptr [4 x float], [4 x float]* %B, i64 0, i64 2
  %vec.cast7 = bitcast float* %vec.gep6 to <2 x float>*
  %col.load8 = load <2 x float>, <2 x float>* %vec.cast7, align 4
  %splat.splat = shufflevector <2 x float> %col.load5, <2 x float> poison, <2 x i32> zeroinitializer
  %0 = fmul <2 x float> %col.load, %splat.splat
  %splat.splat11 = shufflevector <2 x float> %col.load5, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %1 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat11, <2 x float> %0)
  %splat.splat14 = shufflevector <2 x float> %col.load8, <2 x float> poison, <2 x i32> zeroinitializer
  %2 = fmul <2 x float> %col.load, %splat.splat14
  %splat.splat17 = shufflevector <2 x float> %col.load8, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %3 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat17, <2 x float> %2)
  %vec.cast18 = bitcast [4 x float]* %C to <2 x float>*
  %col.load19 = load <2 x float>, <2 x float>* %vec.cast18, align 4
  %vec.gep20 = getelementptr [4 x float], [4 x float]* %C, i64 0, i64 2
  %vec.cast21 = bitcast float* %vec.gep20 to <2 x float>*
  %col.load22 = load <2 x float>, <2 x float>* %vec.cast21, align 4
  %4 = fadd <2 x float> %1, %col.load19
  %5 = fadd <2 x float> %3, %col.load22
  store <2 x float> %4, <2 x float>* %vec.cast18, align 4
  store <2 x float> %5, <2 x float>* %vec.cast21, align 4

I don't see a simple, reliable pattern to match the operands of %4 with %0, 
for example, and this is what I meant by the splat in the middle. The pragma 
approach assumes that we're always working with architectures where the better 
approach is to fuse the fmul and fadds. The problem here is that what you have 
to decide is whether or not to preload the accumulator. On IBM Power10's MMA 
this would be pretty far from optimal, for example, because you have specific 
instructions to load accumulators.
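
To make the two orderings concrete, here is a minimal Python sketch of the 
2x2 column-major case from the IR above. It is only an illustration, not LLVM 
code: the function names (columns, multiply_then_add, preload_accumulator) are 
hypothetical, and plain Python floats stand in for the <2 x float> vectors.

```python
def columns(m):
    """Split a column-major 2x2 matrix (flat list of 4 values) into 2 columns."""
    return [m[0:2], m[2:4]]


def multiply_then_add(a, b, c):
    """C + A*B in the order the IR above uses: fmul, fmuladd, trailing fadd."""
    a_cols, b_cols, c_cols = columns(a), columns(b), columns(c)
    out = []
    for j in range(2):
        # fmul: the first partial product starts the accumulator
        acc = [x * b_cols[j][0] for x in a_cols[0]]
        # fmuladd: second partial product is added to the first
        acc = [x * b_cols[j][1] + s for x, s in zip(a_cols[1], acc)]
        # fadd: the column of C is only added at the very end
        out += [s + t for s, t in zip(acc, c_cols[j])]
    return out


def preload_accumulator(a, b, c):
    """C + A*B with the accumulator preloaded from C before any multiply."""
    a_cols, b_cols, c_cols = columns(a), columns(b), columns(c)
    out = []
    for j in range(2):
        # accumulator starts as the column of C
        acc = list(c_cols[j])
        # every partial product is a multiply-add into the accumulator
        for k in range(2):
            acc = [x * b_cols[j][k] + s for x, s in zip(a_cols[k], acc)]
        out += acc
    return out


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
c = [1.0, 1.0, 1.0, 1.0]
print(multiply_then_add(a, b, c))
print(preload_accumulator(a, b, c))
```

For these small exact inputs both functions return the same values; with 
general floating-point inputs the results may differ in rounding, since the 
additions happen in a different order. The first form mirrors the IR above, 
where the add of C is separated from the initial fmul by the splats. The 
second is the shape I would expect to suit a target with dedicated 
accumulator loads.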


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99433/new/

https://reviews.llvm.org/D99433
