fhahn added a comment. In D99152#2655304 <https://reviews.llvm.org/D99152#2655304>, @lebedev.ri wrote:
> In D99152#2655274 <https://reviews.llvm.org/D99152#2655274>, @fhahn wrote: > >> In D99152#2649520 <https://reviews.llvm.org/D99152#2649520>, @LuoYuanke >> wrote: >> >>> In D99152#2647681 <https://reviews.llvm.org/D99152#2647681>, @fhahn wrote: >>> >>>> I can't see any `load <256 x i32>` in the linked example, just a store. >>>> Could you check the example? >>> >>> I create another example at https://gcc.godbolt.org/z/v6od5ceEz. In bar() >>> function, you can see the `load <256 x i32>*` in the IR. The bar() function >>> is actually buggy code, because tilec is not initialized by amx intrinsics. >>> We want user call amx intrinsic to load/store tile explicitly. Ideally >>> front-end can detect the issue and report error. >> >> Thanks AFAIK in those cases the conversion intrinsic makes sense to use, >> because you effectively need to convert between 2 types in a non-trivial >> way. @lebedev.ri WDYT? > > I'm not sure. I think first and foremost the `load`/`store` miscompile should > be addressed. > I think the rest is confusing because it seems to me that the only reason why > that `bitcast` > is needed is not correctness reason, but as an opaque optimization barrier. I'm not sure if the loads and store are actually incorrect. `_tile1024i ` is defined as `typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64)));` and the loads/stores are for assignments/reads from variables that have that type, which is `<256 x i32>` in IR. So it is not obvious to me why the loads/stores would be wrong, as long as `_tile1024i` is defined as it is (if it would be a different type, that all changes). As a consequence, `__builtin_ia32_tilezero_internal` & the other builtins need to be defined as returning `_tile1024i` / `<256 x i32>`. I don't think there's any other way to specify this, unless you have a dedicated AMX type in the frontend. IIUC the current lowering is to call an intrinsic that returns `x86_amx` and then a `bitcast` is used for the conversion to the result type `<256 x i32>`, with the (incorrect) assumption that the `bitcast` does complex conversion between types. Another consequence of the builtins returning `_tile1024i` / `<256 x i32>` is that the conversion from the intrinsic result to `<256 x i32>` should happen at the place where Clang emits the call to the intrinsic directly, not patched up later as it is done now. Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D99152/new/ https://reviews.llvm.org/D99152 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits