Thanks for the wonderful library. It's a pleasure to work with it.
I am a bit puzzled by the code generated for a matrix multiplication. I see neither 256-bit SIMD vectors nor FMA instructions, although I specify the target as `llvm -mcpu=alderlake -mattr=+avx2 -num-cores=16`. What could be the reason? My code is:

```python
import tvm
import numpy as np
from tvm import meta_schedule as ms
from tvm.script import tir as T


# Naive 1024x1024x1024 float32 matmul expressed in TVMScript.
@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(
        A: T.Buffer[(1024, 1024), "float32"],
        B: T.Buffer[(1024, 1024), "float32"],
        C: T.Buffer[(1024, 1024), "float32"],
    ):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i, j, k in T.grid(1024, 1024, 1024):
            with T.block("C"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = 0.0
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]


dtype = "float32"
a_np = np.random.rand(1024, 1024).astype(dtype)
b_np = np.random.rand(1024, 1024).astype(dtype)
a_nd = tvm.nd.array(a_np)
b_nd = tvm.nd.array(b_np)
c_nd = tvm.nd.empty((1024, 1024), dtype=dtype)

target = "llvm -mcpu=alderlake -mattr=+avx2 -num-cores=16"

# Tune with MetaSchedule and compile the best schedule found.
database = ms.tune_tir(
    mod=MyModule,
    target=target,
    max_trials_global=64,
    num_trials_per_iter=64,
    work_dir="./tune_tmp",
)
sch_tuned = ms.tir_integration.compile_tir(database, MyModule, target=target)
print(sch_tuned.mod.script())

# Build and dump the generated assembly for inspection.
lib = tvm.build(sch_tuned.mod, target="llvm")
with open("/tmp/my_module.S", "w") as f:
    f.write(lib.get_source("asm"))
```

Looking at the assembly, I see only 128-bit (`xmm`) vectors being used, and separate `mulps` and `addps` instructions instead of fused multiply-adds. What can I do to improve the codegen?
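One thing I noticed while writing this up (just a guess on my part, I have not confirmed it is the cause): the final `tvm.build` call passes the bare `"llvm"` target, which as far as I understand falls back to a generic x86-64 CPU model and drops the `-mcpu=alderlake -mattr=+avx2` flags used during tuning. A small variant I plan to try, reusing the tuned target string:

```python
# Rebuild with the full target string used during tuning, so LLVM keeps
# the alderlake/+avx2 feature flags instead of a generic baseline.
# (The output path is arbitrary; a separate file just for comparison.)
lib = tvm.build(sch_tuned.mod, target=target)
with open("/tmp/my_module_full_target.S", "w") as f:
    f.write(lib.get_source("asm"))
```

Would that explain the missing 256-bit vectors and FMAs, or is something else going on?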