On Fri, May 31, 2024 at 4:59 AM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
> I've recently been trying to hand-write code to trigger automatic
> vectorization optimizations in GCC on Intel x86 machines (without
> using the interfaces in immintrin.h), but I'm running into a problem
> where I can't seem to get the concise `vpmovzxbd` or similar
> instructions.
>
> My requirement is to convert 8 `uint8_t` elements to `int32_t` type
> and print the output. If I use the interface (_mm256_cvtepu8_epi32) in
> immintrin.h, the code is as follows:
>
> int immintrin () {
> int size = 10000, offset = 3;
> uint8_t* a = malloc(sizeof(char) * size);
>
> __v8si b = (__v8si)_mm256_cvtepu8_epi32(*(__m128i *)(a + offset));
>
> for (int i = 0; i < 8; i++) {
> printf("%d\n", b[i]);
> }
> }
>
> After compiling with -mavx2 -O3, you can get concise and efficient
> instructions. (You can see it here: https://godbolt.org/z/8ojzdav47)
>
> But if I do not use this interface and instead use a for-loop or the
> `__builtin_convertvector` interface provided by GCC, I cannot achieve
> the above effect. The code is as follows:
>
> typedef uint8_t v8qiu __attribute__ ((__vector_size__ (8)));
> int forloop () {
> int size = 10000, offset = 3;
> uint8_t* a = malloc(sizeof(char) * size);
>
> v8qiu av = *(v8qiu *)(a + offset);
> __v8si b = {};
> for (int i = 0; i < 8; i++) {
> b[i] = (a + offset)[i];
> }
>
> for (int i = 0; i < 8; i++) {
> printf("%d\n", b[i]);
> }
> }
>
> int builtin_cvt () {
> int size = 10000, offset = 3;
> uint8_t* a = malloc(sizeof(char) * size);
>
> v8qiu av = *(v8qiu *)(a + offset);
> __v8si b = __builtin_convertvector(av, __v8si);
>
> for (int i = 0; i < 8; i++) {
> printf("%d\n", b[i]);
> }
> }

Ideally both should work.  The loop case does work once the loop
vectorizer is disabled: with -O3 -fno-tree-loop-vectorize (on top of
-mavx2) it produces

        vpmovzxbd       3(%rax), %ymm0
        vmovdqa %ymm0, (%rsp)

The loop vectorizer is constrained to using the same vector size
throughout, and thus makes a mess of it by unpacking the 8-element
char vector twice, into four 2-element int vectors.
I do have plans to address this, but I am not sure whether they can
materialize for GCC 15.
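
For experimenting with the flags above, here is a minimal sketch of
the two source forms as standalone functions.  The type and function
names (u8x8, i32x8, widen_loop, widen_convertvector) are made up for
illustration, and the memcpy for the unaligned load and the restrict
qualifiers are my assumptions, not taken from your testcase.  Which
instructions you actually get depends on the GCC version and flags.

```c
#include <stdint.h>
#include <string.h>

/* Generic GCC vector types (hypothetical names, not from immintrin.h). */
typedef uint8_t u8x8  __attribute__ ((vector_size (8)));
typedef int32_t i32x8 __attribute__ ((vector_size (32)));

/* Scalar loop form: zero-extend 8 bytes to 8 ints.  'restrict' is an
   assumption so that the stores cannot alias the byte loads.  */
void
widen_loop (const uint8_t *restrict src, int32_t *restrict dst)
{
  for (int i = 0; i < 8; i++)
    dst[i] = src[i];
}

/* Vector form: load 8 bytes with memcpy (no alignment requirement),
   then widen element-wise with __builtin_convertvector.  */
void
widen_convertvector (const uint8_t *src, int32_t *dst)
{
  u8x8 in;
  i32x8 out;

  memcpy (&in, src, sizeof in);
  out = __builtin_convertvector (in, i32x8);
  memcpy (dst, &out, sizeof out);
}
```

Compile with -mavx2 -O3, with and without -fno-tree-loop-vectorize,
and compare the assembly of the two functions.
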
> The instructions generated by both functions are redundant and
> complex, and are quite difficult to read compared to calling
> `_mm256_cvtepu8_epi32` directly. (You can see it here as well:
> https://godbolt.org/z/8ojzdav47)
>
> What I want to ask is: How should I write the source code to get
> assembly instructions similar to directly calling
> _mm256_cvtepu8_epi32?
>
> Or would it be easier if I modified the GIMPLE directly? But it seems
> that there is no relevant expression or interface directly
> corresponding to `vpmovzxbd` in GIMPLE.
>
> Thanks
> Hanke Zhang