On Tue, Jun 30, 2026 at 3:40 PM Konstantinos Eleftheriou
<[email protected]> wrote:
>
> Actually, ASF runs before reload, v28 is allocated during early_ra.
> If we decide to gate this, should this be done in ASF, before calling
> `store_bit_field`?

Maybe just avoid doing ASF on hardregs, or move the pass before early_ra or move
early_ra later ...  IMO exposing hardregs "early" is going in the
wrong direction ...

Richard.

> Konstantinos
>
> On Tue, Jun 30, 2026 at 4:18 PM Richard Biener <[email protected]> 
> wrote:
>>
>> On Tue, Jun 30, 2026 at 3:14 PM Konstantinos Eleftheriou
>> <[email protected]> wrote:
>> >
>> > Yes, exactly. It succeeds the second time, because the new temp is a 
>> > pseudo.
>>
>> That looks wrong.  Possibly wrong in the sense of using any of this
>> after reload(?)
>>
>> > Konstantinos.
>> >
>> > On Tue, Jun 30, 2026 at 4:12 PM Richard Biener 
>> > <[email protected]> wrote:
>> >>
>> >> On Tue, Jun 30, 2026 at 2:55 PM Konstantinos Eleftheriou
>> >> <[email protected]> wrote:
>> >> >
>> >> > `x` is store_bit_field's `str_rtx`, the V4x4BF register. You can see 
>> >> > ASF's dumps in this comment
>> >> > in the PR:
>> >> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125988#c6
>> >> >
>> >> > `x` is `(reg:V4x4BF 60 v28)` in this case.
>> >>
>> >> How does copy_to_reg for a (reg:V4x4BF ..) help then?  Is that hardreg
>> >> vs pseudo and
>> >> "somehow" validate_subreg is happy with one but not the other?
>> >>
>> >> > Konstantinos.
>> >> >
>> >> > On Tue, Jun 30, 2026 at 3:36 PM Richard Biener 
>> >> > <[email protected]> wrote:
>> >> >>
>> >> >> On Tue, Jun 30, 2026 at 2:20 PM Konstantinos Eleftheriou
>> >> >> <[email protected]> wrote:
>> >> >> >
>> >> >> > The call chain for our testcase is the following:
>> >> >> > gen_lowpart -> gen_lowpart_general -> gen_lowpart_common -> 
>> >> >> > lowpart_subreg -> simplify_gen_subreg -> validate_subreg.
>> >> >> >
>> >> >> > `validate_subreg` for OI and V4x4BF modes fails, leading to this 
>> >> >> > fallback in `gen_lowpart_general`:
>> >> >> >
>> >> >> > /* Handle SUBREGs and hard REGs that were rejected by
>> >> >> >      simplify_gen_subreg.  */
>> >> >> >   else if (REG_P (x) || GET_CODE (x) == SUBREG)
>> >> >> >     {
>> >> >> >       result = gen_lowpart_common (mode, copy_to_reg (x));
>> >> >> >       gcc_assert (result != 0);
>> >> >> >       return result;
>> >> >> >     }
>> >> >> >
>> >> >> > `copy_to_reg (x)` creates the new reg.
>> >> >>
>> >> >> And what's 'x' here?  That said, either gen_lowpart () isn't supposed
>> >> >> to be used in
>> >> >> lvalue context or there's sth else wrong around store_bit_field
>> >> >> (not to say, OI and V4x4BF?  really?)
>> >> >>
>> >> >> Richard.
>> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Jun 30, 2026 at 1:57 PM Richard Biener 
>> >> >> > <[email protected]> wrote:
>> >> >> >>
>> >> >> >> On Tue, Jun 30, 2026 at 11:22 AM Konstantinos Eleftheriou
>> >> >> >> <[email protected]> wrote:
>> >> >> >> >
>> >> >> >> > Hi Richard,
>> >> >> >> >
>> >> >> >> > We've been sending this as part of the ASF default enablement 
>> >> >> >> > patchset for nearly a year now.
>> >> >> >> > We are splitting it now as we judged that it requires separate 
>> >> >> >> > attention.
>> >> >> >> >
>> >> >> >> > The idea is to overwrite the original to-be-updated register -- 
>> >> >> >> > stripping the subregs to extract it --
>> >> >> >> > with the newly generated one. Otherwise, str_rtx is left 
>> >> >> >> > unchanged.
>> >> >> >> > The copy is done by `emit_mov_insn`.
>> >> >> >> >
>> >> >> >> > Any ideas on how we could handle this better? It definitely looks 
>> >> >> >> > like the right place for
>> >> >> >> > the fix though.
>> >> >> >>
>> >> >> >> I don't think changing the destination was intended when doing
>> >> >> >>
>> >> >> >>   op0 = gen_lowpart (int_mode_for_mode (GET_MODE (op0), op0);
>> >> >> >>
>> >> >> >> but instead a subreg was intended (and it's full-size).  I wasn't
>> >> >> >> aware that gen_lowpart
>> >> >> >> eventually creates a new reg -- when would it do that?
>> >> >> >>
>> >> >> >> >
>> >> >> >> > Thanks,
>> >> >> >> > Konstantinos
>> >> >> >> >
>> >> >> >> > On Tue, Jun 30, 2026 at 10:22 AM Richard Biener 
>> >> >> >> > <[email protected]> wrote:
>> >> >> >> >>
>> >> >> >> >> On Mon, Jun 29, 2026 at 4:13 PM Konstantinos Eleftheriou
>> >> >> >> >> <[email protected]> wrote:
>> >> >> >> >> >
>> >> >> >> >> > The call to `gen_lowpart` in `store_bit_field_1` might copy 
>> >> >> >> >> > the destination
>> >> >> >> >> > register into a new one, which may lead to wrong code 
>> >> >> >> >> > generation, as the bit
>> >> >> >> >> > insertions update the new register instead of updating 
>> >> >> >> >> > `str_rtx`.
>> >> >> >> >> >
>> >> >> >> >> > This patch copies back the new destination register into 
>> >> >> >> >> > `str_rtx` when needed.
>> >> >> >> >> >
>> >> >> >> >> > Bootstrapped/regtested on AArch64 and x86-64.
>> >> >> >> >> >
>> >> >> >> >> >         PR rtl-optimization/125988
>> >> >> >> >> >
>> >> >> >> >> > gcc/ChangeLog:
>> >> >> >> >> >
>> >> >> >> >> >         * expmed.cc (store_bit_field_1): Copy back the new 
>> >> >> >> >> > destination
>> >> >> >> >> >         register into `str_rtx` when needed.
>> >> >> >> >> >
>> >> >> >> >> > gcc/testsuite/ChangeLog:
>> >> >> >> >> >
>> >> >> >> >> >         * gcc.target/aarch64/pr125988.c: New test.
>> >> >> >> >> > ---
>> >> >> >> >> >  gcc/expmed.cc                               | 22 +++++++--
>> >> >> >> >> >  gcc/testsuite/gcc.target/aarch64/pr125988.c | 51 
>> >> >> >> >> > +++++++++++++++++++++
>> >> >> >> >> >  2 files changed, 70 insertions(+), 3 deletions(-)
>> >> >> >> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr125988.c
>> >> >> >> >> >
>> >> >> >> >> > diff --git a/gcc/expmed.cc b/gcc/expmed.cc
>> >> >> >> >> > index da1b5b632876..1f4611a6ed89 100644
>> >> >> >> >> > --- a/gcc/expmed.cc
>> >> >> >> >> > +++ b/gcc/expmed.cc
>> >> >> >> >> > @@ -888,9 +888,25 @@ store_bit_field_1 (rtx str_rtx, 
>> >> >> >> >> > poly_uint64 bitsize, poly_uint64 bitnum,
>> >> >> >> >> >         op0 = gen_lowpart (op0_mode.require (), op0);
>> >> >> >> >> >      }
>> >> >> >> >> >
>> >> >> >> >> > -  return store_integral_bit_field (op0, op0_mode, ibitsize, 
>> >> >> >> >> > ibitnum,
>> >> >> >> >> > -                                  bitregion_start, 
>> >> >> >> >> > bitregion_end,
>> >> >> >> >> > -                                  fieldmode, value, reverse, 
>> >> >> >> >> > fallback_p);
>> >> >> >> >> > +  if (!store_integral_bit_field (op0, op0_mode, ibitsize, 
>> >> >> >> >> > ibitnum,
>> >> >> >> >> > +                                bitregion_start, 
>> >> >> >> >> > bitregion_end,
>> >> >> >> >> > +                                fieldmode, value, reverse, 
>> >> >> >> >> > fallback_p))
>> >> >> >> >> > +    return false;
>> >> >> >> >> > +
>> >> >> >> >> > +  rtx op0_reg = op0;
>> >> >> >> >> > +  rtx str_rtx_reg = str_rtx;
>> >> >> >> >> > +  while (SUBREG_P (op0_reg))
>> >> >> >> >> > +    op0_reg = SUBREG_REG (op0_reg);
>> >> >> >> >> > +  while (SUBREG_P (str_rtx_reg))
>> >> >> >> >> > +    str_rtx_reg = SUBREG_REG (str_rtx_reg);
>> >> >> >> >>
>> >> >> >> >> That looks definitely fishy.
>> >> >> >> >>
>> >> >> >> >> You do not quote the part that does the copy, but stripping all 
>> >> >> >> >> subregs
>> >> >> >> >> and then copying looks wrong.  It also looks this was produced 
>> >> >> >> >> by an AI?
>> >> >> >> >>
>> >> >> >> >> > +
>> >> >> >> >> > +  /* If a new destination register has been generated, copy 
>> >> >> >> >> > the value back
>> >> >> >> >> > +     into str_rtx.  */
>> >> >> >> >> > +  if (REG_P (op0_reg) && REG_P (str_rtx_reg)
>> >> >> >> >> > +      && REGNO (op0_reg) != REGNO (str_rtx_reg))
>> >> >> >> >> > +    emit_move_insn (str_rtx_reg, op0_reg);
>> >> >> >> >> > +
>> >> >> >> >> > +  return true;
>> >> >> >> >> >  }
>> >> >> >> >> >
>> >> >> >> >> >  /* Subroutine of store_bit_field_1, with the same arguments, 
>> >> >> >> >> > except
>> >> >> >> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr125988.c 
>> >> >> >> >> > b/gcc/testsuite/gcc.target/aarch64/pr125988.c
>> >> >> >> >> > new file mode 100644
>> >> >> >> >> > index 000000000000..3ac7be9b7b99
>> >> >> >> >> > --- /dev/null
>> >> >> >> >> > +++ b/gcc/testsuite/gcc.target/aarch64/pr125988.c
>> >> >> >> >> > @@ -0,0 +1,51 @@
>> >> >> >> >> > +/* PR rtl-optimization/125988 */
>> >> >> >> >> > +/* { dg-do run } */
>> >> >> >> >> > +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
>> >> >> >> >> > +/* { dg-options "-O3 -favoid-store-forwarding" } */
>> >> >> >> >> > +/* { dg-add-options arm_v8_2a_bf16_neon } */
>> >> >> >> >> > +
>> >> >> >> >> > +/* Verify that the lane inserted by vld4_lane_bf16 survives
>> >> >> >> >> > +   avoid-store-forwarding's bit-insert rewrite.  */
>> >> >> >> >> > +
>> >> >> >> >> > +#include <arm_neon.h>
>> >> >> >> >> > +
>> >> >> >> >> > +extern void abort (void);
>> >> >> >> >> > +
>> >> >> >> >> > +typedef union
>> >> >> >> >> > +{
>> >> >> >> >> > +  bfloat16_t bf;
>> >> >> >> >> > +  unsigned short u;
>> >> >> >> >> > +} bf16_u;
>> >> >> >> >> > +
>> >> >> >> >> > +__attribute__((noipa)) static int
>> >> >> >> >> > +test (const bf16_u *data, const bf16_u *overwrite)
>> >> >> >> >> > +{
>> >> >> >> >> > +  bfloat16x4x4_t v;
>> >> >> >> >> > +  bf16_u t[4];
>> >> >> >> >> > +  int i, j;
>> >> >> >> >> > +  for (i = 0; i < 4; i++, data += 4)
>> >> >> >> >> > +    v.val[i] = vld1_bf16 (&data->bf);
>> >> >> >> >> > +  v = vld4_lane_bf16 (&overwrite->bf, v, 3);
>> >> >> >> >> > +  while (--i >= 0)
>> >> >> >> >> > +    {
>> >> >> >> >> > +      vst1_bf16 (&t[0].bf, v.val[i]);
>> >> >> >> >> > +      data -= 4;
>> >> >> >> >> > +      for (j = 0; j < 4; j++)
>> >> >> >> >> > +       if (t[j].u != (j == 3 ? overwrite[i].u : data[j].u))
>> >> >> >> >> > +         return 1;
>> >> >> >> >> > +    }
>> >> >> >> >> > +  return 0;
>> >> >> >> >> > +}
>> >> >> >> >> > +
>> >> >> >> >> > +int
>> >> >> >> >> > +main (void)
>> >> >> >> >> > +{
>> >> >> >> >> > +  bf16_u d[16];
>> >> >> >> >> > +  for (int i = 0; i < 16; i++)
>> >> >> >> >> > +    d[i].u = 0x1000 + i;
>> >> >> >> >> > +  bf16_u ov[4] = { {.u = 0xABCD}, {.u = 0x1234},
>> >> >> >> >> > +                  {.u = 0xCAFE}, {.u = 0xBEEF} };
>> >> >> >> >> > +  if (test (d, ov))
>> >> >> >> >> > +    abort ();
>> >> >> >> >> > +  return 0;
>> >> >> >> >> > +}
>> >> >> >> >> > --
>> >> >> >> >> > 2.52.0
>> >> >> >> >> >

Reply via email to