LuoYuanke added a comment.

> IIUC you need this to transfer/convert data from a consecutive vector to an `AMX` tile. To express that, emitting an intrinsic for the conversion instead a `bit cast` seems the right thing to me.
Yes. We need to transfer/convert data from a consecutive vector to an `AMX` tile, because in the C language interface the tile is defined as a vector: `typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64)));`

Take the code below (https://gcc.godbolt.org/z/noaWEWd6n) as an example.

  #include <immintrin.h>

  char buf[1024];

  void foo() {
    _tile1024i tile;
    tile = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
  }

Compiling it with "clang -S -emit-llvm simple_amx.c -mamx-int8" gives the IR below.

  define dso_local void @foo() #0 !dbg !15 {
    %1 = alloca <256 x i32>, align 64
    call void @llvm.dbg.declare(metadata <256 x i32>* %1, metadata !18, metadata !DIExpression()), !dbg !25
    %2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64), !dbg !26
    %3 = bitcast x86_amx %2 to <256 x i32>, !dbg !26
    store <256 x i32> %3, <256 x i32>* %1, align 64, !dbg !27
    ret void, !dbg !28
  }

The front end emits an alloca of `<256 x i32>` for the local variable `tile`. When the return value of `__builtin_ia32_tileloadd64_internal` is assigned to `tile`, the front end bitcasts the `x86_amx` value to `<256 x i32>`, because `x86_amx` is the type returned by `__builtin_ia32_tileloadd64_internal`.

> IIUC Roman was saying that from that example alone it is not clear why the explicit conversion in IR is actually needed (please correct me if I am wrong). For the example, you *could* have a version of `llvm.x86.tilestored64.internal` that takes an `<256 x i32>` and does the conversion internally. Having a separate intrinsic to do the conversion gives greater composability in the IR, but I think at the moment it is hard to judge if that is needed, because it is not easy to get an overview of all AMX operations that need support. Is there a summary/documentation of the AMX builtins supported in Clang?

I plan to add the AMX operations to the Clang documentation when the AMX support in LLVM is stable. There are only load/store, zero, and dot-product operations for AMX; we don't have full ISA support for matrix operations. The builtins are listed below, and a usage sketch follows at the end of this message.

  __builtin_ia32_tileloadd64_internal
  __builtin_ia32_tdpbssd_internal
  __builtin_ia32_tilestored64_internal
  __builtin_ia32_tilezero_internal

> With respect to the `load` issue, it is not clear to me at the moment under which circumstances regular `load` instructions are generated & interact with AMX. If `load` is used to load `x` consecutive elements, than that's fine. But if the actual intended operation is a strided load, then `load` should not be used (this has also been discussed on llvm-dev).

The `load` instructions are generated because the tile is a vector at the C level. See https://gcc.godbolt.org/z/qv5jnjK48: with -O0 a load instruction is generated, while with -O2 it is eliminated. The -O2 version is what we want; there is no `<256 x i32>` in the generated code.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99152/new/

https://reviews.llvm.org/D99152

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
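To make the builtin list above concrete, here is a minimal sketch of how the four internal builtins compose into a load / zero / dot-product / store sequence. The buffer names, function name, and the single 16-row x 64-byte tile shape are made up for illustration; the argument order is assumed to match the _tile_*_internal wrappers in clang's amxintrin.h. Compile with -mamx-int8.

  #include <immintrin.h>

  // Illustrative buffers: one 16 x 64-byte tile each for the int8 sources A
  // and B, and for the int32 accumulator C.
  static char A[1024], B[1024], C[1024];

  void dot_product(void) {
    // Load two int8 source tiles: 16 rows, 64 bytes per row, stride 64.
    _tile1024i a = __builtin_ia32_tileloadd64_internal(16, 64, A, 64);
    _tile1024i b = __builtin_ia32_tileloadd64_internal(16, 64, B, 64);
    // Start from a zeroed 16 x 64-byte accumulator tile.
    _tile1024i c = __builtin_ia32_tilezero_internal(16, 64);
    // c += a * b: signed int8 dot product accumulated into int32 elements.
    c = __builtin_ia32_tdpbssd_internal(16, 64, 64, c, a, b);
    // Store the accumulator tile back to memory.
    __builtin_ia32_tilestored64_internal(16, 64, C, 64, c);
  }

As in the earlier example, each `_tile1024i` value is a `<256 x i32>` at the C level, and the front end inserts the bitcasts between `<256 x i32>` and `x86_amx` around the intrinsic calls.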