[PATCH] D99152: [AMX] Prototype for vector and amx bitcast.

Florian Hahn via Phabricator via cfe-commits Mon, 29 Mar 2021 07:04:33 -0700

fhahn added a comment.

In D99152#2655691 <https://reviews.llvm.org/D99152#2655691>, @LuoYuanke wrote:


>> I think that point was not really clear during the discussion. Using `load 
>> <256 x i32>` to lower `__tile_loadd() ` would indeed be incorrect. But I 
>> don't think that's happening at the moment, at least going from a simple 
>> example https://gcc.godbolt.org/z/KT5rczn8j
>
> The `load/store <256 x i32>` is generated by front-end, because in C language 
> tile is a vector <256 x i32>. The `load/store <256 x i32>` is transformed to 
> `llvm.x86.tileloadd64.internal/llvm.x86.tilestored64.internal` in 
> `lib/Target/X86/X86LowerAMXType.cpp` if the load result is to be an operand 
> of amx intrinsics or the store value is returned from amx intrinsics.

Sure, you can get rid of unnecessary conversions/loads/stores during 
optimizations, if you can operate on AMX tiles directly and do not need to 
store the intermediate results in flat vectors. You can also use strided 
loads/stores to store continuous memory (no stride between columns/rows). But 
what optimizations are applied later do not impact the whether IR emitted by 
Clang is correct or now. It always needs to be correct. Whether to further 
optimizations are correct is a different problem, but we need a specification 
for the builtins, intrinsics and the type before going any further in that 
direction.

>>   void foo() {
>>     tilea = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
>>   }
>>
>> is lowered to
>>
>>   define dso_local void @foo() #0 {
>>     %1 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* 
>> getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 
>> 64)
>>     %2 = bitcast x86_amx %1 to <256 x i32>
>>     store <256 x i32> %2, <256 x i32>* @tilea, align 64
>>     ret void
>>   }
>>
>> So we emit an intrinsic to do the strided load and the result is stored to 
>> continuous memory, which is what the type `_tile1024i` requires. What's not 
>> modeled correctly is the conversion between the result of 
>> `@llvm.x86.tileloadd64.internal` and the `store`. It needs to be transferred 
>> in a flat vector.
>
> Yes.  I agree that it needs to be transferred in a flat vector.
>
>> Whether we should have `x86_amx` in the first place is a separate question I 
>> think. Having a builtin type that does not work properly with fundamental 
>> instructions like `load` or `store` seems prone for errors (what 
>> instructions actually work with `x86_amx`? Do binary operators work?). 
>> Perhaps it would be possible and sufficient to have the intrinsics use an 
>> opaque type instead of a builtin type, like
>
> We only support tileload, tilestore, tilezero, tiletdp (dot product) 
> instructions/intrinsics for `x86_amx`. Is there any opaque type example llvm 
> source code for builtin? This example has some error at 
> https://gcc.godbolt.org/z/ar6WhjTMz.

I think you need to set the input to `LLVM IR`: 
https://gcc.godbolt.org/z/WexMjsas9

You should be able to use opaque types with overloaded intrinsics. I don't 
think you define an intrinsic to take a specific opaque type (because it's not 
known up front).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99152/new/

https://reviews.llvm.org/D99152

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D99152: [AMX] Prototype for vector and amx bitcast.

Reply via email to