Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

Christophe LYON via Gcc-patches Fri, 03 Sep 2021 03:58:38 -0700


On 03/09/2021 10:35, Prathamesh Kulkarni via Gcc-patches wrote:

On Thu, 2 Sept 2021 at 14:32, Christophe Lyon
<christophe.lyon....@gmail.com> wrote:



On Tue, Aug 24, 2021 at 10:17 AM Kyrylo Tkachov <kyrylo.tkac...@arm.com> wrote:

-----Original Message-----
From: Prathamesh Kulkarni <prathamesh.kulka...@linaro.org>
Sent: 24 August 2021 09:01
To: Christophe Lyon <christophe.lyon....@gmail.com>
Cc: Kyrylo Tkachov <kyrylo.tkac...@arm.com>; gcc Patches <gcc-
patc...@gcc.gnu.org>
Subject: Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n
intrinsics

On Tue, 17 Aug 2021 at 11:55, Prathamesh Kulkarni
<prathamesh.kulka...@linaro.org> wrote:

On Thu, 12 Aug 2021 at 19:04, Christophe Lyon
<christophe.lyon....@gmail.com> wrote:



On Thu, Aug 12, 2021 at 1:54 PM Prathamesh Kulkarni

<prathamesh.kulka...@linaro.org> wrote:

On Wed, 11 Aug 2021 at 22:23, Christophe Lyon
<christophe.lyon....@gmail.com> wrote:



On Thu, Jun 24, 2021 at 6:29 PM Kyrylo Tkachov via Gcc-patches <gcc-

patc...@gcc.gnu.org> wrote:

-----Original Message-----
From: Prathamesh Kulkarni <prathamesh.kulka...@linaro.org>
Sent: 24 June 2021 12:11
To: gcc Patches <gcc-patches@gcc.gnu.org>; Kyrylo Tkachov
<kyrylo.tkac...@arm.com>
Subject: [ARM] PR66791: Replace builtins for vdup_n and vmov_n

intrinsics

Hi,
This patch replaces builtins for vdup_n and vmov_n.
The patch results in regression for pr51534.c.
Consider following function:

uint8x8_t f1 (uint8x8_t a) {
   return vcgt_u8(a, vdup_n_u8(0));
}

code-gen before patch:
f1:
         vmov.i32  d16, #0  @ v8qi
         vcgt.u8     d0, d0, d16
         bx             lr

code-gen after patch:
f1:
         vceq.i8 d0, d0, #0
         vmvn    d0, d0
         bx         lr

I am not sure which one is better tho ?

Hi Prathamesh,

This patch introduces a regression on non-hardfp configs (eg arm-

linux-gnueabi or arm-eabi):

FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-

assembler-times vmov.i32[ \t]+[dD][0-9]+, #0xffffffff 3

FAIL:  gcc:gcc.target/arm/arm.exp=gcc.target/arm/pr51534.c scan-

assembler-times vmov.i32[ \t]+[qQ][0-9]+, #4294967295 3

Can you fix this?

The issue is, for following test:

#include <arm_neon.h>

uint8x8_t f1 (uint8x8_t a) {
   return vcge_u8(a, vdup_n_u8(0));
}

armhf code-gen:
f1:
         vmov.i32  d0, #0xffffffff  @ v8qi
         bx            lr

arm softfp code-gen:
f1:
         mov     r0, #-1
         mov     r1, #-1
         bx      lr

The code-gen for both is same upto split2 pass:

(insn 10 6 11 2 (set (reg/i:V8QI 16 s0)
         (const_vector:V8QI [
                 (const_int -1 [0xffffffffffffffff]) repeated x8
             ])) "foo.c":5:1 1052 {*neon_movv8qi}
      (expr_list:REG_EQUAL (const_vector:V8QI [
                 (const_int -1 [0xffffffffffffffff]) repeated x8
             ])
         (nil)))
(insn 11 10 13 2 (use (reg/i:V8QI 16 s0)) "foo.c":5:1 -1
      (nil))

and for softfp target, split2 pass splits the assignment to r0 and r1:

(insn 15 6 16 2 (set (reg:SI 0 r0)
         (const_int -1 [0xffffffffffffffff])) "foo.c":5:1 740

{*thumb2_movsi_vfp}

      (nil))
(insn 16 15 11 2 (set (reg:SI 1 r1 [+4 ])
         (const_int -1 [0xffffffffffffffff])) "foo.c":5:1 740

{*thumb2_movsi_vfp}

      (nil))
(insn 11 16 13 2 (use (reg/i:V8QI 0 r0)) "foo.c":5:1 -1
      (nil))

I suppose we could use a dg-scan for r[0-9]+, #-1 for softfp targets ?

Yes, probably, or try with check-function-bodies.

Hi,
Sorry for the late response. Does the attached patch look OK ?

ping https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577532.html

Ok.



Sorry Prathamesh, this does not quite work. See  
https://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/r12-3294-g7a6f40d0452ec76e126c2612dcfa32f3c73e2315/report-build-info.html
(red cells in the gcc column)

Can you have a look?

Ah, for arm-none-eabi target the code-gen for test_vcge_u8_uint8x8_t is:
         mvn     r0, #0
         mvn     r1, #0
         bx      lr

instead of:
         mov     r0, #-1
         mov     r1, #-1
         bx      lr

I guess both code-gen sequences are correct, but I am not sure under
which circumstance either sequence gets generated.
To accomodate for both, I tried to check for either pattern with OR:
/* { dg-final { scan-assembler-times "(mov\[    \]+r\[0-9\]+,
#-1)|(mvn\[    \]+r\[0-9\]+, #0)" 6 { target { arm_softfp_ok } } } }
*/
but that doesn't seem to work.

Do you have any suggestions on how to proceed ?


Sorry I have no magic syntax for this :-)

Maybe it's also worth checking why we have this codegen difference?


Thanks,


Christophe


Thanks,
Prathamesh

Thanks

Christophe

Thanks,
Kyrill

Thanks,
Prathamesh

Thanks,
Prathamesh

  Christophe

Thanks,
Prathamesh

Thanks

Christophe

I think they're equivalent in practice, in any case the patch itself is

good (move away from RTL builtins).

Ok.
Thanks,
Kyrill

Also, this patch regressed bf16_dup.c on arm-linux-gnueabi,
which is due to a missed opt in lowering. I had filed it as
PR98435, and posted a fix for it here:
https://gcc.gnu.org/pipermail/gcc-patches/2021-

June/572648.html

Thanks,
Prathamesh

Re: [ARM] PR66791: Replace builtins for vdup_n and vmov_n intrinsics

Reply via email to