Manolis Tsamis <manolis.tsa...@vrull.eu> writes:
> On Fri, Aug 16, 2024 at 5:33 PM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>> Manolis Tsamis <manolis.tsa...@vrull.eu> writes:
>> > +    }
>> > +
>> > +  virtual unsigned int execute (function *) override;
>> > +}; // class pass_rtl_avoid_store_forwarding
>> > +
>> > +static unsigned int stats_sf_detected = 0;
>> > +static unsigned int stats_sf_avoided = 0;
>>
>> Could you instead structure the code as a class with these as member
>> variables?  I realise it's a bit over-the-top for just 2 variables,
>> but the pass might be expanded in future.
>>
> I could do that but it would require making most of the functions
> member functions (as these are incremented in
> process_store_forwarding)

Right.

> and by looking at other passes/GCC code this is like never done.  So I
> felt that it would be a bit unconventional and I haven't implemented
> it yet. Thoughts about that?

Most passes were written while the codebase was still C.  I think newer
passes have generally been written in the way I described.  Where possible,
it'd be better for new code to avoid using globals to communicate
information between functions, to reduce the work involved in any future
parallelisation effort.  (There have been tentative steps towards reducing
existing uses of globals in the past.)

Using classes also allows pass-level RAII.  I realise that doesn't matter
for the current code, but like I say, we should consider future extensions
too.

>> > +
>> > +      FOR_EACH_VEC_ELT (stores, i, it)
>> > +     {
>> > +       fprintf (dump_file, "From: ");
>> > +       print_rtl_single (dump_file, it->store_insn);
>> > +     }
>> > +
>> > +      fprintf (dump_file, "To: ");
>> > +      print_rtl_single (dump_file, load_insn);
>> > +
>> > +      if (load_elim)
>> > +     fprintf (dump_file, "(Load elimination candidate)\n");
>> > +    }
>> > +
>> > +  rtx dest;
>> > +  if (load_elim)
>> > +    dest = gen_reg_rtx (load_inner_mode);
>> > +  else
>> > +    dest = SET_DEST (load);
>> > +
>> > +  int move_to_front = -1;
>> > +  int total_cost = 0;
>> > +
>> > +  /* Check if we can emit bit insert instructions for all forwarded 
>> > stores.  */
>> > +  FOR_EACH_VEC_ELT (stores, i, it)
>> > +    {
>> > +      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
>> > +      rtx_insn *insns = NULL;
>> > +
>> > +      /* If we're eliminating the load then find the store with zero 
>> > offset
>> > +      and use it as the base register to avoid a bit insert if possible.  
>> > */
>> > +      if (load_elim && it->offset == 0
>> > +       && validate_subreg (GET_MODE (dest), GET_MODE (it->mov_reg),
>> > +                           it->mov_reg, 0))
>> > +     {
>> > +       start_sequence ();
>> > +
>> > +       /* We can use a paradoxical subreg to force this to a wider mode, 
>> > as
>> > +          the only use will be inserting the bits (i.e., we don't care 
>> > about
>> > +          the value of the higher bits).  */
>> > +       rtx ext0 = gen_rtx_SUBREG (GET_MODE (dest), it->mov_reg, 0);
>>
>> IMO it'd be safer to call lowpart_subreg and check whether the result
>> is null.  This would also remove the need to check validate_subreg
>> directly.
>>
> I did change this to lowpart_subreg, but note that validate_subreg
> looks to still be necessary (for AArch64). Otherwise I get various
> ICEs, e.g. when a SF -> TI case is hit, which was the original reason
> for validate_subreg.

Do you have a testcase?

Direct uses of gen_rtx_SUBREG are generally suspicious.  We should be
using higher-level generators in most cases.

> I could have some help with that, because after the new changes a
> subreg related ICE also happens within store_bit_field when a DI ->
> V4SI case is hit. Afaik store_bit_field should just return NULL if it
> can't handle something so I don't really know how to address this ICE
> in the new version.

I don't think store_bit_field is expected to fail.  But yeah, if you have
a testcase, I can take a look.

>> > +       rtx_insn *move0 = emit_move_insn (dest, ext0);
>> > +       if (recog_memoized (move0) >= 0)
>> > +         {
>> > +           insns = get_insns ();
>> > +           move_to_front = (int) i;
>> > +         }
>>
>> Here too I'd expect emit_move_insn to produce only valid insns.
>> (And it can produce multiple insns.)
>>
> The original implementation emitted simple mov instructions without
> recog but then gradually each of these recog calls was added for ICE
> in various more obscure targets that Jeff helped find (even due to
> "simple" moves).

By "simple mov instructions", do you mean via emit_move_insn, or by
emitting a SET directly?

> So, although I don't like it and it complicates things, I had to add
> recog_memoized pretty much everywhere and also make sure to only emit
> if all of these are successful...

Do you have testcases?  emit_move_insn's job is to make the move
legitimate, so it seems like the checks might be papering over an issue
elsewhere.

>> > +     }
>> > +      else
>> > +     {
>> > +       rtx reg = SET_DEST (set);
>> > +
>> > +       while (GET_CODE (reg) == ZERO_EXTRACT
>> > +             || GET_CODE (reg) == STRICT_LOW_PART
>> > +             || GET_CODE (reg) == SUBREG)
>> > +         reg = XEXP (reg, 0);
>> > +
>> > +       /* Drop store forwarding candidates when the address register is
>> > +          overwritten.  */
>> > +       if (REG_P (reg))
>> > +         {
>> > +           bool remove_rest = false;
>> > +           unsigned int i;
>> > +           store_fwd_info *it;
>> > +           FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
>> > +             {
>> > +               if (remove_rest
>> > +                   || reg_overlap_mentioned_p (reg, it->store_mem))
>> > +                 {
>> > +                   it->remove = true;
>> > +                   removed_count++;
>> > +                   remove_rest = true;
>> > +                 }
>> > +             }
>> > +         }
>> > +       else
>> > +         {
>> > +           /* We can't understand INSN.  */
>> > +           store_exprs.truncate (0);
>> > +           continue;
>> > +         }
>> > +     }
>> > +
>> > +      if (removed_count)
>> > +     {
>> > +       unsigned int i, j;
>> > +       store_fwd_info *it;
>> > +       VEC_ORDERED_REMOVE_IF (store_exprs, i, j, it, it->remove);
>> > +     }
>>
>> It might be safer (and more general) to structure the code as follows:
>>
>>       if (!NONDEBUG_INSN_P (insn))
>>         continue;
>>
>>       vec_rtx_properties properties;
>>       properties.add_insn (insn, false);
>>
>>       rtx set = single_set (insn);
>>       bool is_simple = (set
>>                         && !properties.has_asm
>>                         && !properties.has_side_effects ());
>>       bool is_simple_store = (is_simple
>>                               && MEM_P (SET_DEST (set))
>>                               && !contains_mem_rtx_p (SET_SRC (set)));
>>       rtx load_mem = nullptr;
>>       bool is_simple_load = (is_simple
>>                              && !contains_mem_rtx_p (SET_DEST (set))
>>                              && (load_mem = get_load_mem (set)));
>>       if (is_simple_store)
>>         ...record store...
>>
>>       bool reads_mem = false;
>>       bool writes_mem = false;
>>       for (auto ref : properties.refs ())
>>         if (ref.is_mem ())
>>           {
>>             reads_mem |= ref.is_read ();
>>             writes_mem |= ref.is_write ();
>>           }
>>         else if (ref.is_write ())
>>           {
>>             // Check for the last store to mention ref.regno (), use
>>             // block_remove to remove it and all preceding stores.
>>           }
>>
>>       if (is_simple_load)
>>         {
>>           // Try store forwarding.  Remove the last store that
>>           // load_mem might depend on, and all previous stores.
>>         }
>>
>>       if ((writes_mem && !is_simple_store)
>>           || (reads_mem && !is_simple_load))
>>         store_exprs.truncate (0);
>>
>> It doesn't need to be exactly like that, but the idea is to test
>> directly for all references to memory and registers, without giving
>> up unnecessarily on (say) const calls or arithmetic double-sets.
>>
> Hmm, I have restructured the code based on your other feedback and it
> indeed looks more like this (except for not using vec_rtx_properties).
> I'll evaluate if it makes it better to further make it more like that
> (or use vec_rtx_properties); I'm a bit nervous to make a very large
> change there, but if it improves things I'll do it.

Thanks.  I'd appreciate if you could use vec_rtx_properties, since it
is designed for cases like this.  IMO:

   if (CALL_P (insn) || !set || volatile_refs_p (set)
       || GET_MODE (SET_DEST (set)) == BLKmode
       || (MEM_P (SET_DEST (set)) && MEM_P (SET_SRC (set))))
     {
       store_exprs.truncate (0);
       continue;
     }

and the later continue are too conservative.  In particular,
we shouldn't treat all non-single_sets as optimisation barriers,
since many of them are fine.

Richard

Reply via email to