LuoYuanke added a comment.

> IIUC you need this to transfer/convert data from a consecutive vector to an 
> `AMX` tile. To express that, emitting an intrinsic for the conversion instead 
> a `bit cast` seems the right thing to me.

Yes. We need to transfer/convert data from a consecutive vector to an `AMX` 
tile, because in the C language interface the tile is defined as a vector: 
`typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64)));`. 
Take the code below (https://gcc.godbolt.org/z/noaWEWd6n) as an example.

  #include <immintrin.h>
  
  char buf[1024];
  void foo() {
    _tile1024i tile;
    tile = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
  }

Compiling it with "clang -S -emit-llvm simple_amx.c -mamx-int8" gives the IR below.

  define dso_local void @foo() #0 !dbg !15 {
    %1 = alloca <256 x i32>, align 64
    call void @llvm.dbg.declare(metadata <256 x i32>* %1, metadata !18, metadata !DIExpression()), !dbg !25
    %2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64), !dbg !26
    %3 = bitcast x86_amx %2 to <256 x i32>, !dbg !26
    store <256 x i32> %3, <256 x i32>* %1, align 64, !dbg !27
    ret void, !dbg !28
  }

The front end allocas a <256 x i32> for the local variable `tile`. When the 
return value of __builtin_ia32_tileloadd64_internal is assigned to `tile`, the 
front end bitcasts the x86_amx value to <256 x i32>, because x86_amx is the 
type returned by __builtin_ia32_tileloadd64_internal while `tile` is a 
<256 x i32> vector.

> IIUC Roman was saying that from that example alone it is not clear why the 
> explicit conversion in IR is actually needed (please correct me if I am 
> wrong). For the example, you *could* have a version of 
> `llvm.x86.tilestored64.internal` that takes an `<256 x i32>` and does the 
> conversion internally. Having a separate intrinsic to do the conversion gives 
> greater composability in the IR, but I think at the moment it is hard to 
> judge if that is needed, because it is not easy to get an overview of all AMX 
> operations that need support. Is there a summary/documentation of the AMX 
> builtins supported in Clang?

I plan to add the AMX operations to the Clang documentation once the AMX 
support in LLVM is stable. There are only load/store, zero and dot-product 
operations for AMX; we don't have full ISA support for matrix operations. The 
builtins are:

  __builtin_ia32_tileloadd64_internal
  __builtin_ia32_tdpbssd_internal
  __builtin_ia32_tilestored64_internal
  __builtin_ia32_tilezero_internal
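
For reference, here is a minimal sketch of how these builtins compose into a 
load / zero / dot-product / store sequence. The tileloadd signature matches 
the example above; the argument order I use for the other three builtins is an 
assumption based on my reading of amxintrin.h, not something this patch defines.

  #include <immintrin.h>
  
  char bufA[1024], bufB[1024], bufC[1024];
  
  // Sketch only: the argument order of the tdpbssd/tilestored/tilezero
  // builtins below is assumed (rows, colsb, ... as in the tileloadd example).
  void dot_product() {
    // Load two 16x64-byte source tiles from consecutive memory.
    _tile1024i a = __builtin_ia32_tileloadd64_internal(16, 64, bufA, 64);
    _tile1024i b = __builtin_ia32_tileloadd64_internal(16, 64, bufB, 64);
    // Zero the accumulator tile.
    _tile1024i c = __builtin_ia32_tilezero_internal(16, 64);
    // c += a * b, signed int8 dot product accumulating into int32.
    c = __builtin_ia32_tdpbssd_internal(16, 64, 64, c, a, b);
    // Store the accumulator back to consecutive memory.
    __builtin_ia32_tilestored64_internal(16, 64, bufC, 64, c);
  }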



> With respect to the `load` issue, it is not clear to me at the moment under 
> which circumstances regular `load` instructions are generated & interact with 
> AMX. If `load` is used to load `x` consecutive elements, than that's fine. 
> But if the actual intended operation is a strided load, then `load` should 
> not be used (this has also been discussed on llvm-dev).

The `load` instructions are generated because the tile is a vector in the C 
language. See https://gcc.godbolt.org/z/qv5jnjK48. With -O0 a load instruction 
is generated; with -O2 the load instruction is eliminated. The -O2 version is 
what we want: there is no <256 x i32> in the generated code.
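
To make that concrete, here is a hedged sketch of a case where the tile value 
has a later use (argument order of the store builtin assumed, as above). At 
-O0 the front end keeps `tile` in its <256 x i32> alloca, so the IR contains a 
store and a reload of <256 x i32> plus bitcasts to and from x86_amx; at -O2 
the stack slot is optimized away and only x86_amx values flow between the two 
intrinsics.

  #include <immintrin.h>
  
  char src[1024], dst[1024];
  
  // At -O0: `tile` lives in an alloca of <256 x i32>, so a
  // `load <256 x i32>` (plus a bitcast to x86_amx) appears before the store
  // intrinsic. At -O2 the alloca and the load are eliminated.
  void copy_tile() {
    _tile1024i tile = __builtin_ia32_tileloadd64_internal(16, 64, src, 64);
    __builtin_ia32_tilestored64_internal(16, 64, dst, 64, tile);
  }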


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99152/new/

https://reviews.llvm.org/D99152
