Richard Biener <[email protected]> writes:
> On Tue, 12 Nov 2024, Richard Sandiford wrote:
>
>> Sorry for the slow review. I think Jeff's much better placed to comment
>> on this than I am, but here's a stab. Mostly it looks really good to me
>> FWIW.
>>
>> Andrew Carlotti <[email protected]> writes:
>> > This pass is used to optimise assignments to the FPMR register in
>> > aarch64. I chose to implement this as a middle-end pass because it
>> > mostly reuses the existing RTL PRE code within gcse.cc.
>> >
>> > Compared to RTL PRE, the key difference in this new pass is that we
>> > insert new writes directly to the destination hardreg, instead of
>> > writing to a new pseudo-register and copying the result later. This
>> > requires changes to the analysis portion of the pass, because sets
>> > cannot be moved before existing instructions that set, use or clobber
>> > the hardreg, and the value becomes unavailable after any uses of
>> > clobbers of the hardreg.
>> >
>> > This patch would currently break any debug instructions that use the
>> > value of fpmr in a region of code where that value is changed by this
>> > pass. I haven't worked out the best way to fix this, but I suspect the
>> > issue is uncommon and tricky enough that it would be best to just drop
>> > those debug instructions.
>>
>> Yeah, good question, and pass on that :) Will need to think more about it.
>>
>> > I've bootstrapped and regression tested this on aarch64, and it should be
>> > NFC
>> > on other targets. Aside from this, my testing so far has involved hacking
>> > in a
>> > single FP8 intrinsic and testing various parameters and control flow
>> > structures, and checking both the codegen and the LCM bitmaps. I intend to
>> > write better and more comprehensive tests once there are some real
>> > intrinsic
>> > implementations available to use.
>> >
>> >
>> > Is this approach good? Apart from fixing the debug instructions and
>> > adding tests, is there anything else I need to change?
>> >
>> >
>> > gcc/ChangeLog:
>> >
>> > * config/aarch64/aarch64.h (HARDREG_PRE_REGNOS): New macro.
>> > * gcse.cc (doing_hardreg_pre_p): New global variable.
>> > (current_hardreg_regno): Ditto.
>> > (compute_local_properties): Unset transp for hardreg clobbers.
>> > (prune_hardreg_uses): New.
>> > (want_to_gcse_p): Always return true for hardreg PRE.
>> > (hash_scan_set): Add checks for hardreg uses/clobbers.
>> > (oprs_unchanged_p): Disable load motion for hardreg PRE pass.
>> > (record_last_mem_set_info): Ditto.
>> > (compute_hash_table_work): Record hardreg uses.
>> > (prune_expressions): Mark hardreg sets as call-clobbered.
>> > (compute_pre_data): Add call to prune_hardreg_uses.
>> > (pre_expr_reaches_here_p_work): Add comment.
>> > (insert_insn_start_basic_block): New functions.
>> > (pre_edge_insert): Don't add hardreg sets to predecessor block.
>> > (pre_delete): Use hardreg for the reaching reg.
>> > (pre_gcse): Don't insert copies for hardreg PRE.
>> > (one_pre_gcse_pass): Disable load motion for hardreg PRE pass.
>> > (execute_hardreg_pre): New.
>> > (class pass_hardreg_pre): New.
>> > (pass_hardreg_pre::gate): New.
>> > (make_pass_hardreg_pre): New.
>> > * passes.def (pass_hardreg_pre): New pass.
>> > * tree-pass.h (make_pass_hardreg_pre): New.
>> >
>> > [...]
>> > @@ -728,6 +752,37 @@ compute_local_properties (sbitmap *transp, sbitmap
>> > *comp, sbitmap *antloc,
>> > }
>> > }
>> > }
>> > +
>> > +/* A hardreg set is not transparent in a block if there are any uses of
>> > that
>> > + hardreg. This filters the results of compute_local_properties, after
>> > the
>> > + result of that function has been used to define the kills bitmap.
>>
>> I think this is mostly my ignorance of the code, and would be obvious
>> if I tried it out locally, but: why do we need to do this after
>> computing the kills bitmap? For mode-switching, the kills bitmap
>> is the inverse of the transparency bitmap, but it sounds like here
>> you want the kills bitmap to be more selective.
>>
>> > +
>> > + TRANSP is the destination sbitmap to be updated.
>> > +
>> > + TABLE controls which hash table to look at. */
>> > +
>> > +static void
>> > +prune_hardreg_uses (sbitmap *transp, struct gcse_hash_table_d *table)
>> > +{
>> > + unsigned int i;
>> > + gcc_assert (doing_hardreg_pre_p);
>> > +
>> > + for (i = 0; i < table->size; i++)
>> > + {
>> > + struct gcse_expr *expr;
>> > +
>> > + for (expr = table->table[i]; expr != NULL; expr =
>> > expr->next_same_hash)
>> > + {
>> > + int indx = expr->bitmap_index;
>> > + df_ref def;
>> > +
>> > + for (def = DF_REG_USE_CHAIN (current_hardreg_regno);
>> > + def;
>> > + def = DF_REF_NEXT_REG (def))
>> > + bitmap_clear_bit (transp[DF_REF_BB (def)->index], indx);
>> > + }
>> > + }
>> > +}
>> >
>> > /* Hash table support. */
>>
>>
>> > @@ -747,6 +804,9 @@ static basic_block current_bb;
>> > static bool
>> > want_to_gcse_p (rtx x, machine_mode mode, HOST_WIDE_INT *max_distance_ptr)
>> > {
>> > + if (doing_hardreg_pre_p)
>> > + return true;
>> > +
>> > #ifdef STACK_REGS
>> > /* On register stack architectures, don't GCSE constants from the
>> > constant pool, as the benefits are often swamped by the overhead
>> > @@ -911,7 +971,7 @@ oprs_unchanged_p (const_rtx x, const rtx_insn *insn,
>> > bool avail_p)
>> > }
>> >
>> > case MEM:
>> > - if (! flag_gcse_lm
>> > + if (! flag_gcse_lm || doing_hardreg_pre_p
>>
>> This test occurs often enough that I think it's worth splitting out.
>> Something like: !do_load_motion ()?
>>
>> > || load_killed_in_block_p (current_bb, DF_INSN_LUID (insn),
>> > x, avail_p))
>> > return false;
>> > [...]
>> > @@ -1544,6 +1642,19 @@ compute_hash_table_work (struct gcse_hash_table_d
>> > *table)
>> > }
>> >
>> > note_stores (insn, record_last_set_info, insn);
>> > +
>> > + if (doing_hardreg_pre_p && hardreg_last_bb != current_bb)
>> > + {
>> > + /* We need to record the first use of a hardreg to determine if a
>> > + set of that hardreg is anticipatable. */
>> > + df_ref ref;
>> > + FOR_EACH_INSN_USE (ref, insn)
>> > + if (DF_REF_REGNO (ref) == current_hardreg_regno)
>> > + {
>> > + hardreg_last_bb = current_bb;
>> > + hardreg_first_use = DF_INSN_LUID (insn);
>> > + }
>> > + }
>> > }
>>
>> Couldn't we instead check whether the register is live on entry to the block?
>> That would avoid the extra bit of state.
>>
>> >
>> > /* The next pass builds the hash table. */
>> > @@ -1714,6 +1825,19 @@ prune_expressions (bool pre_p)
>> > {
>> > for (expr = expr_hash_table.table[ui]; expr; expr =
>> > expr->next_same_hash)
>> > {
>> > + /* For hardreg pre, we assume that all relevant hardregs are
>> > + call-clobbered, and set all bits in prune_exprs if the reg is call
>> > + clobbered.
>>
>> Not sure I understand this. But...
>>
>> > If the hardreg were merely call-used, then we would
>> > + need to remove the expression from the anticipatable and
>> > + transparent bitmaps only (after using this to compute the kills
>> > + bitmap). */
>> > +
>> > + if (doing_hardreg_pre_p)
>> > + {
>> > + bitmap_set_bit (prune_exprs, expr->bitmap_index);
>> > + continue;
>> > + }
>> > +
>>
>> ...the effect seems to be to set every bit of prune_exprs, in which
>> case it might be easier to skip this loop entirely and adjust the later
>> one to use bitmap_set_range.
>>
>> > /* Note potentially trapping expressions. */
>> > if (may_trap_p (expr->expr))
>> > {
>> > [...]
>> > @@ -4028,6 +4228,31 @@ execute_rtl_pre (void)
>> > return 0;
>> > }
>> >
>> > +static unsigned int
>> > +execute_hardreg_pre (void)
>> > +{
>> > + doing_hardreg_pre_p = true;
>> > + unsigned int regnos[] = HARDREG_PRE_REGNOS;
>> > + /* It's possible to avoid this loop, but it isn't worth doing so until
>> > + hardreg PRE is used for multiple hardregs. */
>>
>> Yeah, sounds ok to me. But out of curiosity, how difficult would it be
>> to structure the code so that this just works? Where are the main
>> difficulties? Having to maintain a list of which expressions are
>> associated with which register, and therefore which expressions
>> mutually kill each other?
>>
>> > + for (int i = 0; regnos[i] != 0; i++)
>> > + {
>> > + int changed;
>> > + current_hardreg_regno = regnos[i];
>> > + if (dump_file)
>> > + fprintf(dump_file, "Entering hardreg PRE for regno %d\n",
>> > + current_hardreg_regno);
>> > + delete_unreachable_blocks ();
>> > + df_analyze ();
>> > + changed = one_pre_gcse_pass ();
>> > + flag_rerun_cse_after_global_opts |= changed;
>>
>> Is this appropriate for the new pass? We're not really exposing general
>> CSE opportunities.
>>
>> > + if (changed)
>> > + cleanup_cfg (0);
>> > + }
>> > + doing_hardreg_pre_p = false;
>> > + return 0;
>> > +}
>> > +
>> > static unsigned int
>> > execute_rtl_hoist (void)
>> > {
>> > @@ -4096,6 +4321,56 @@ make_pass_rtl_pre (gcc::context *ctxt)
>> >
>> > namespace {
>> >
>> > +const pass_data pass_data_hardreg_pre =
>> > +{
>> > + RTL_PASS, /* type */
>> > + "hardreg_pre", /* name */
>> > + OPTGROUP_NONE, /* optinfo_flags */
>> > + TV_PRE, /* tv_id */
>> > + PROP_cfglayout, /* properties_required */
>> > + 0, /* properties_provided */
>> > + 0, /* properties_destroyed */
>> > + 0, /* todo_flags_start */
>> > + TODO_df_finish, /* todo_flags_finish */
>> > +};
>> > +
>> > +class pass_hardreg_pre : public rtl_opt_pass
>> > +{
>> > +public:
>> > + pass_hardreg_pre (gcc::context *ctxt)
>> > + : rtl_opt_pass (pass_data_hardreg_pre, ctxt)
>> > + {}
>> > +
>> > + /* opt_pass methods: */
>> > + bool gate (function *) final override;
>> > + unsigned int execute (function *) final override
>> > + {
>> > + return execute_hardreg_pre ();
>> > + }
>> > +
>> > +}; // class pass_rtl_pre
>> > +
>> > +bool
>> > +pass_hardreg_pre::gate (function *fun)
>> > +{
>> > +#ifdef HARDREG_PRE_REGNOS
>> > + return optimize > 0
>> > + && !fun->calls_setjmp;
>>
>> Huh. It looks like these setjmp exclusions go back to 1998. I wouldn't
>> have expected them to be needed now, since the modern cfg framework
>> should represent setjmp correctly. Jeff, do you agree? I'll try
>> removing them and see what breaks...
>
> The problem with setjmp on RTL is that we fail to preserve abnormal
> edges during RTL expansion and fail in the impossible job to recover
> all that are required for correctness (to prevent code motion).
Ah, ok. I'd wrongly thought that that was a solved problem now.
(Kind-of curious why it's impossible to recover the info. I guess that's
a bit of a distraction though. Leaning further into reconstructing
rather than preserving the cfg would be the wrong direction anyway.)
Thanks,
Richard
> With PRE that only operates on hardregs and in particular does not
> introduce alternate register usages or move memory ops there might
> be no issue.
>
> Richard.
>
>> Thanks,
>> Richard
>>
>> > +#else
>> > + return false;
>> > +#endif
>> > +}
>> > +
>> > +} // anon namespace
>> > +
>> > +rtl_opt_pass *
>> > +make_pass_hardreg_pre (gcc::context *ctxt)
>> > +{
>> > + return new pass_hardreg_pre (ctxt);
>> > +}
>> > +
>> > +namespace {
>> > +
>> > const pass_data pass_data_rtl_hoist =
>> > {
>> > RTL_PASS, /* type */
>> > diff --git a/gcc/passes.def b/gcc/passes.def
>> > index
>> > 7d01227eed1fcdda4e2db0b1b9dac80f21e221d9..374b2daf92c427355f93a69c028ddd794fc694c2
>> > 100644
>> > --- a/gcc/passes.def
>> > +++ b/gcc/passes.def
>> > @@ -462,6 +462,7 @@ along with GCC; see the file COPYING3. If not see
>> > NEXT_PASS (pass_rtl_cprop);
>> > NEXT_PASS (pass_rtl_pre);
>> > NEXT_PASS (pass_rtl_hoist);
>> > + NEXT_PASS (pass_hardreg_pre);
>> > NEXT_PASS (pass_rtl_cprop);
>> > NEXT_PASS (pass_rtl_store_motion);
>> > NEXT_PASS (pass_cse_after_global_opts);
>> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
>> > index
>> > a928cbe4557368ec483919a06cd3d29d733a7b66..d4cc85888d176ae603bc8c5aec1168749280511f
>> > 100644
>> > --- a/gcc/tree-pass.h
>> > +++ b/gcc/tree-pass.h
>> > @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context
>> > *ctxt);
>> > extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
>> > extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
>> > extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
>> > +extern rtl_opt_pass *make_pass_hardreg_pre (gcc::context *ctxt);
>> > extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
>> > extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
>> > extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
>>