Konstantinos Eleftheriou <konstantinos.elefther...@vrull.eu> writes:

> This patch uses `lowpart_subreg` for the base register initialization,
> instead of zero-extending it.  We had tried this solution before, but
> it was leaving undefined bytes in the upper part of the register.  This
> shouldn't happen, as we are supposed to write the whole register when
> the load is eliminated.  It occurred when there were multiple stores
> with the same offset as the load, generating a register move for each
> of them and overwriting the bit inserts that were emitted before them.
>
> In order to overcome this, we remove redundant stores from the
> sequence, i.e. stores that write to addresses that will be overwritten
> by stores that come after them in the sequence.  We use the same bitmap
> that is used for the load-elimination check to keep track of the bytes
> that are written by each store.
>
> Also, we now allow the load to be eliminated even when there are
> overlaps between the stores, as there is no obvious reason why we
> shouldn't; we just want the stores to cover all of the load's bytes.
>
> Bootstrapped/regtested on AArch64 and x86_64.
>
> PR rtl-optimization/119884
>
> gcc/ChangeLog:
>
> 	* avoid-store-forwarding.cc (process_store_forwarding): Use
> 	`lowpart_subreg` for the base register initialization, and
> 	remove redundant stores from the store/load sequence.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/i386/pr119884.c: New test.
>
> Signed-off-by: Konstantinos Eleftheriou <konstantinos.elefther...@vrull.eu>
> ---
>
> Changes in v3:
> 	- Remove redundant stores, instead of generating a register move
> 	  only for the first store that has the same offset as the load.
>
> Changes in v2:
> 	- Use `lowpart_subreg` for the base register initialization, but
> 	  only for the first store that has the same offset as the load.
>
> Changes in v1:
> 	- Add a check that the register modes match before calling
> 	  `emit_mov_insn`.
>
>  gcc/avoid-store-forwarding.cc            | 45 ++++++++++++++++++------
>  gcc/testsuite/gcc.target/i386/pr119884.c | 13 +++++++
>  2 files changed, 48 insertions(+), 10 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr119884.c
>
> diff --git a/gcc/avoid-store-forwarding.cc b/gcc/avoid-store-forwarding.cc
> index 5d960adec359..f88a001e5717 100644
> --- a/gcc/avoid-store-forwarding.cc
> +++ b/gcc/avoid-store-forwarding.cc
> @@ -176,20 +176,28 @@ process_store_forwarding (vec<store_fwd_info> &stores, rtx_insn *load_insn,
>    /* Memory sizes should be constants at this stage.  */
>    HOST_WIDE_INT load_size = MEM_SIZE (load_mem).to_constant ();
>
> -  /* If the stores cover all the bytes of the load without overlap then we can
> -     eliminate the load entirely and use the computed value instead.  */
> +  /* If the stores cover all the bytes of the load, then we can eliminate
> +     the load entirely and use the computed value instead.
> +     We can also eliminate stores on addresses that are overwritten
> +     by later stores.  */
>
>    sbitmap forwarded_bytes = sbitmap_alloc (load_size);
>    bitmap_clear (forwarded_bytes);
>
>    unsigned int i;
>    store_fwd_info* it;
> +  auto_vec<store_fwd_info> redundant_stores;
> +  auto_vec<int> store_ind_to_remove;
>    FOR_EACH_VEC_ELT (stores, i, it)
>      {
>        HOST_WIDE_INT store_size = MEM_SIZE (it->store_mem).to_constant ();
> -      if (bitmap_bit_in_range_p (forwarded_bytes, it->offset,
> +      if (bitmap_is_range_set_p (forwarded_bytes, it->offset,
>  				 it->offset + store_size - 1))
> -	break;
> +	{
> +	  redundant_stores.safe_push (*it);
> +	  store_ind_to_remove.safe_push (i);
> +	  continue;
> +	}
>        bitmap_set_range (forwarded_bytes, it->offset, store_size);
>      }
>
> @@ -215,6 +223,11 @@ process_store_forwarding (vec<store_fwd_info> &stores, rtx_insn *load_insn,
>  	  fprintf (dump_file, "(Load elimination candidate)\n");
>      }
>
> +  /* Remove redundant stores from the vector.  */
> +  store_ind_to_remove.reverse ();
> +  for (int i : store_ind_to_remove)
> +    stores.ordered_remove (i);
> +
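For readers outside GCC, the coverage/redundancy loop in the hunk above can be sketched in standalone C++.  This is a hedged illustration, not the pass itself: `byte_map` stands in for GCC's `sbitmap`, and `classify_stores` is a hypothetical name.  It assumes, as the patched loop does, that a store whose bytes are already fully covered by stores visited earlier in the vector is redundant, and that overlaps no longer block load elimination:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for GCC's sbitmap: one bit per byte of the load.
struct byte_map
{
  std::vector<bool> bits;
  explicit byte_map (size_t n) : bits (n, false) {}

  // Inclusive range check, mirroring bitmap_is_range_set_p.
  bool all_set (int lo, int hi) const
  {
    for (int i = lo; i <= hi; i++)
      if (!bits[i])
	return false;
    return true;
  }

  void set_range (int lo, int len)
  {
    for (int i = lo; i < lo + len; i++)
      bits[i] = true;
  }
};

// For stores given as (offset, size) pairs relative to the load, return
// the indices of redundant stores (every byte already covered by a store
// visited earlier) and whether the stores cover the whole load.
std::pair<std::vector<int>, bool>
classify_stores (int load_size,
		 const std::vector<std::pair<int, int>> &stores)
{
  byte_map covered (load_size);
  std::vector<int> redundant;
  for (size_t i = 0; i < stores.size (); i++)
    {
      auto [offset, size] = stores[i];
      if (covered.all_set (offset, offset + size - 1))
	redundant.push_back ((int) i);	   /* Fully overwritten: drop it.  */
      else
	covered.set_range (offset, size);  /* Overlaps are now allowed.  */
    }
  return {redundant, covered.all_set (0, load_size - 1)};
}
```

With stores `{0,4}, {0,2}, {4,4}` on an 8-byte load, the second store is classified redundant and the load is still fully covered, which is the case the commit message describes.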
This is quadratic.  That probably doesn't matter in practice though,
since the dependence checking is already quadratic, and the size is
already limited by a --param.  But I think it's worth at least a
comment.  Maybe:

  /* Remove redundant stores from the vector.

     ??? Although this is quadratic, there doesn't seem to be much
     point optimizing it.  The number of redundant stores is expected
     to be low and the length of the list is limited by a --param.
     The dependence checking that we did earlier is also quadratic
     in the size of this list.  */

From my POV, the patch is OK for trunk with that change (and with the
obvious rename after the comments on patch 2).  Please leave a couple
of days for others to comment though.

Thanks,
Richard