Richard Sandiford <richard.sandif...@arm.com> writes:
> Konstantinos Eleftheriou <konstantinos.elefther...@vrull.eu> writes:
>> This patch uses `lowpart_subreg` for the base register initialization,
>> instead of zero-extending it. We had tried this approach before, but
>> it left undefined bytes in the upper part of the register. That
>> shouldn't happen, as we are supposed to write the whole register when
>> the load is eliminated. It did happen because, when multiple stores
>> had the same offset as the load, we generated a register move for each
>> of them, overwriting the bit inserts that had been emitted before
>> them.
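A minimal sketch of the two approaches, for illustration only; "dest",
"value" and the modes are hypothetical stand-ins, not the exact
variables used in the patch:

  /* Hypothetical names: dest = the load's destination register,
     value = the (narrower) stored value being forwarded.  */

  /* Old: zero-extend the narrow stored value into the load's mode,
     defining every byte of the destination register.  */
  emit_move_insn (dest, gen_rtx_ZERO_EXTEND (dest_mode, value));

  /* New: move the value through a (possibly paradoxical) lowpart
     subreg instead.  The upper bytes of DEST are left undefined here
     and are expected to be filled by the bit inserts emitted for the
     remaining stores.  */
  rtx lowpart = lowpart_subreg (dest_mode, value, GET_MODE (value));
  if (lowpart)
    emit_move_insn (dest, lowpart);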
>>
>> To overcome this, we remove redundant stores from the sequence, i.e.
>> stores whose bytes are completely overwritten by stores that come
>> after them in the sequence. We use the same bitmap that is used for
>> the load elimination check to keep track of the bytes written by each
>> store.
>>
>> Also, we now allow the load to be eliminated even when the stores
>> overlap, as there is no obvious reason not to; all we need is for the
>> stores to cover all of the load's bytes.
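As an illustration of the bitmap bookkeeping, here is a standalone toy
version (not the pass's code; it assumes the stores are visited in an
order where a store whose bytes are already covered is one that is
overwritten by stores later in the sequence):

  #include <cstdio>
  #include <vector>

  struct store { int offset, size; };

  int main ()
  {
    const int load_size = 4;
    /* Two overlapping stores plus one that is fully overwritten.  */
    std::vector<store> stores = { { 0, 2 }, { 0, 1 }, { 1, 3 } };
    std::vector<bool> covered (load_size, false);

    for (const store &s : stores)
      {
        /* A store is redundant if every byte it writes is already
           covered.  */
        bool redundant = true;
        for (int b = s.offset; b < s.offset + s.size; b++)
          redundant &= covered[b];
        if (redundant)
          {
            printf ("store at offset %d is redundant\n", s.offset);
            continue;
          }
        for (int b = s.offset; b < s.offset + s.size; b++)
          covered[b] = true;
      }

    /* The load can be eliminated iff all of its bytes are covered,
       overlaps included.  */
    bool all_covered = true;
    for (bool b : covered)
      all_covered &= b;
    printf ("the load %s be eliminated\n", all_covered ? "can" : "cannot");
  }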
>>
>> Bootstrapped/regtested on AArch64 and x86_64.
>>
>>         PR rtl-optimization/119884
>>
>> gcc/ChangeLog:
>>
>>         * avoid-store-forwarding.cc (process_store_forwarding):
>>         Use `lowpart_subreg` for the base register initialization,
>>         and remove redundant stores from the store/load sequence.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.target/i386/pr119884.c: New test.
>>
>> Signed-off-by: Konstantinos Eleftheriou <konstantinos.elefther...@vrull.eu>
>> ---
>>
>> Changes in v3:
>> - Remove redundant stores, instead of generating a register move only
>> for the first store that has the same offset as the load.
>>
>> Changes in v2:
>> - Use `lowpart_subreg` for the base register initialization, but
>> only for the first store that has the same offset as the load.
>>
>> Changes in v1:
>> - Add a check for the register modes to match before calling `emit_mov_insn`.
>>
>>  gcc/avoid-store-forwarding.cc            | 45 ++++++++++++++++++------
>>  gcc/testsuite/gcc.target/i386/pr119884.c | 13 +++++++
>>  2 files changed, 48 insertions(+), 10 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr119884.c
>>
>> diff --git a/gcc/avoid-store-forwarding.cc b/gcc/avoid-store-forwarding.cc
>> index 5d960adec359..f88a001e5717 100644
>> --- a/gcc/avoid-store-forwarding.cc
>> +++ b/gcc/avoid-store-forwarding.cc
>> @@ -176,20 +176,28 @@ process_store_forwarding (vec<store_fwd_info> &stores, rtx_insn *load_insn,
>>    /* Memory sizes should be constants at this stage.  */
>>    HOST_WIDE_INT load_size = MEM_SIZE (load_mem).to_constant ();
>>  
>> -  /* If the stores cover all the bytes of the load without overlap then we can
>> -     eliminate the load entirely and use the computed value instead.  */
>> +  /* If the stores cover all the bytes of the load, then we can eliminate
>> +     the load entirely and use the computed value instead.
>> +     We can also eliminate stores on addresses that are overwritten
>> +     by later stores.  */
>>  
>>    sbitmap forwarded_bytes = sbitmap_alloc (load_size);
>>    bitmap_clear (forwarded_bytes);
>>  
>>    unsigned int i;
>>    store_fwd_info* it;
>> +  auto_vec<store_fwd_info> redundant_stores;
>> +  auto_vec<int> store_ind_to_remove;
>>    FOR_EACH_VEC_ELT (stores, i, it)
>>      {
>>        HOST_WIDE_INT store_size = MEM_SIZE (it->store_mem).to_constant ();
>> -      if (bitmap_bit_in_range_p (forwarded_bytes, it->offset,
>> +      if (bitmap_is_range_set_p (forwarded_bytes, it->offset,
>>                               it->offset + store_size - 1))
>> -    break;
>> +    {
>> +      redundant_stores.safe_push (*it);
>> +      store_ind_to_remove.safe_push (i);
>> +      continue;
>> +    }
>>        bitmap_set_range (forwarded_bytes, it->offset, store_size);
>>      }
>>  
>> @@ -215,6 +223,11 @@ process_store_forwarding (vec<store_fwd_info> &stores, rtx_insn *load_insn,
>>      fprintf (dump_file, "(Load elimination candidate)\n");
>>      }
>>  
>> +  /* Remove redundant stores from the vector.  */
>> +  store_ind_to_remove.reverse ();
>> +  for (int i : store_ind_to_remove)
>> +    stores.ordered_remove (i);
>> +
>
> This is quadratic.  That probably doesn't matter in practice though,
> since the dependence checking is already quadratic, and the size is
> already limited by a --param.  But I think it's worth at least a comment.
> Maybe:
>
>   /* Remove redundant stores from the vector.  ??? Although this is
>      quadratic, there doesn't to be seem much point optimizing it.

Gah, sorry for the typo: ...doesn't seem to be much point...

>      The number of redundant stores is expected to be low and the length
>      of the list is limited by a --param.  The dependence checking that
>      we did earlier is also quadratic in the size of this list.  */
>
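For reference, the linear alternative would be a single-pass
compaction; a sketch, where `is_redundant` is a hypothetical predicate
over the recorded indices, not something the patch defines:

  /* Keep only the non-redundant stores, preserving their order.
     (is_redundant is a hypothetical stand-in for a lookup into
     store_ind_to_remove.)  */
  unsigned int j = 0;
  for (unsigned int k = 0; k < stores.length (); k++)
    if (!is_redundant (k))
      stores[j++] = stores[k];
  stores.truncate (j);

As the suggested comment says, though, this isn't worth it here.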
> From my POV, the patch is OK for trunk with that change (and with the
> obvious rename after the comments on patch 2).  Please leave a couple
> of days for others to comment though.
>
> Thanks,
> Richard
