On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez <[email protected]> wrote:
> On 04/13/12 03:46, Richard Guenther wrote:
>>
>> On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez <[email protected]> wrote:
>
>
> Richard. Thanks so much for reviewing and providing an alternative
> approach, which AFAICT provides superior results.
>
>
>> A similar effect could be obtained by keeping a flag that tracks whether
>> we entered the path that stored the value in the pre-transform code.
>> Thus, do
>>
>>   lsm = g2;   // technically not needed, if you also handle loads
>>   flag = false;
>>   for (...)
>>     {
>>       if (g1)
>>         {
>>           if (flag)
>>             g2 = lsm;
>>         }
>>       else
>>         {
>>           lsm = 0;
>>           flag = true;
>>         }
>>     }
>>   if (flag)
>>     g2 = lsm;
>
>
> I have implemented this in the attached patch, and it seems to be generating
> better code than my other if-changed approach. It also avoids generating
> unnecessary loads that may trap.
>
> For the original testcase:
>
>
> int g_1 = 1;
> int g_2 = 0;
>
> int func_1(void)
> {
>   int l;
>   for (l = 0; l < 1234; l++)
>     {
>       if (g_1)
>         return l;
>       else
>         g_2 = 0;
>     }
>   return 999;
> }
>
> After all optimizations are done, we are now generating the following for
> the flags approach:
>
> new:
> func_1:
>         movl    g_1(%rip), %edx
>         xorl    %eax, %eax
>         testl   %edx, %edx
>         je      .L5
>         rep
>         ret
> .L5:
>         movl    $0, g_2(%rip)
>         movw    $999, %ax
>         ret
>
> Which is significantly better than what unmodified trunk generates:
>
> func_1:
>         movl    g_1(%rip), %edx
>         movl    g_2(%rip), %eax
>         testl   %edx, %edx
>         je      .L5
>         movl    %eax, g_2(%rip)
>         xorl    %eax, %eax
>         ret
> .L5:
>         movl    $0, g_2(%rip)
>         movl    $999, %eax
>         ret
>
> And in other less contrived cases, we generate correct code that avoids
> writes on code paths that did not have them. For example, in Hans'
> register promotion example:
>
> int count;
>
> struct obj {
>   int data;
>   struct obj *next;
> } *q;
>
> void func()
> {
>   struct obj *p;
>   for (p = q; p; p = p->next)
>     if (p->data > 0)
>       count++;
> }
>
> Under the new memory model we should avoid promoting "count" to a register
> and unilaterally writing to it upon exiting the loop.
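>
> Conceptually this is the same flag-based store motion as above; a rough
> source-level sketch of the intended transform (the names count_lsm and
> count_flag are just illustrative, not what the pass actually emits):
>
>   count_lsm = count;
>   count_flag = false;
>   for (p = q; p; p = p->next)
>     if (p->data > 0)
>       {
>         count_lsm++;
>         count_flag = true;
>       }
>   if (count_flag)
>     count = count_lsm;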
>
> With the attached patch, we now generate:
>
> func:
> .LFB0:
>         movq    q(%rip), %rax
>         xorl    %ecx, %ecx
>         movl    count(%rip), %edx
>         testq   %rax, %rax
>         je      .L1
> .L9:
>         movl    (%rax), %esi
>         testl   %esi, %esi
>         jle     .L3
>         addl    $1, %edx
>         movl    $1, %ecx
> .L3:
>         movq    8(%rax), %rax
>         testq   %rax, %rax
>         jne     .L9
>         testb   %cl, %cl
>         jne     .L12
> .L1:
>         rep
>         ret
> .L12:
>         movl    %edx, count(%rip)
>         ret
>
> Which is as expected.
>
> I don't understand your suggestion of:
>
>
>>   lsm = g2;   // technically not needed, if you also handle loads
>>
>> that would allow extending this to cover the cases where the access may
>> eventually trap (if you omit the load before the loop).
>
>
> Can't I just remove the lsm=g2? What's this about also handling loads?
>
>
>>
>> Not sure which is more efficient (you can eventually combine flag
>> variables for different store locations; combining the if-changed
>> compare is not so easy, I guess).
>
>
> So far, I see better code generated with the flags approach, so...
>
>
>>> 1. No pass can figure out the equality (or inequality) of g_2_lsm and
>>> g_2.
>>> In the PR, Richard mentions that both FRE/PRE can figure this out, but
>>> they
>>> are not run after store motion. DOM should also be able to, but
>>> apparently
>>> does not :(. Tips?
>>
>>
>> Well. Schedule FRE after loop opts ...
>>
>> DOM is not prepared to handle loads/stores with differing VOPs - it was
>> never really updated after the alias-improvements merge.
>>
>>> 2. The GIMPLE_CONDs I create in this patch are currently causing problems
>>> with complex floats/doubles. do_jump is complaining that it can't
>>> compare
>>> them, so I assume it is too late (in tree-ssa-loop-im.c) to compare
>>> complex
>>> values since complex lowering has already happened (??). Is there a more
>>> canonical way of creating a GIMPLE_COND that may contain complex floats
>>> at
>>> this stage?
>>
>>
>> As you are handling redundant stores you want a bitwise comparison anyway,
>> but yes, complex compares need to be lowered. But as said, you want a
>> bitwise comparison, not a value-comparison. You can try using
>
>
> I can ignore all this. I have implemented both alternatives, with a target
> hook as suggested, but I'm thinking of removing it altogether and just
> leaving the flags approach.
>
>> Few comments on the patch itself.
>>
>> +  new_bb = split_edge (ex);
>> +  then_bb = create_empty_bb (new_bb);
>> +  if (current_loops && new_bb->loop_father)
>> +    add_bb_to_loop (then_bb, new_bb->loop_father);
>> +
>> +  gsi = gsi_start_bb (new_bb);
>> +  t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
>>
>> Hmm, why do you need to re-load the value? Don't you have both
>> the value as it was loaded before the loop and the new value (the lsm tmp
>> reg)
>> already? Why not simply remember the SSA name used? Ah, because
>> store motion also uses make_rename_temp. If you don't want to change
>> that (which would be good but inconvenient because of the required PHI
>> insertions), I'd change the code that performs the load in the pre-header
>> to do
>>
>> SSA_NAME = ref->mem;
>> rename-lsm-tmp = SSA_NAME;
>>
>> so you can remember SSA_NAME and re-use it here. That probably solves
>> your optimization issue, too.
>
>
> Fixed.
>
>
>> +  for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi); gsi_next (&gsi))
>> +    {
>> +      gimple phi = gsi_stmt (gsi);
>> +      unsigned i;
>> +
>> +      for (i = 0; i < gimple_phi_num_args (phi); i++)
>> +        if (gimple_phi_arg_edge (phi, i)->src == new_bb)
>> +          {
>> +            tree arg = gimple_phi_arg_def (phi, i);
>> +            add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
>> +            update_stmt (phi);
>> +          }
>> +    }
>>
>> Hmm. This is of course doing unnecessary work in case there are multiple
>> moved stores. In fact, splitting the exit edge is only necessary if there
>> is more than one predecessor. Otherwise you can simply split the exit block
>> after inserting the new conditional after its labels.
>
>
> Is this still the case with the current patch? If so, I'm a bit confused as
> to what you want here.
>
>
>>
>> +  if (opts->x_flag_tm)
>> +    {
>> +      if (opts->x_flag_non_call_exceptions)
>> +        sorry ("transactional memory is not supported with non-call exceptions");
>> +
>> +      set_param_value ("allow-load-data-races", 0,
>> +                       opts->x_param_values, opts_set->x_param_values);
>> +      set_param_value ("allow-store-data-races", 0,
>> +                       opts->x_param_values, opts_set->x_param_values);
>>
>> Unless the user explicitly set those params? Which means using params
>> wasn't a very good idea ... ideally you'd diagnose an error when the user
>> would mix -fgnu-tm and -fallow-store-data-races. So - can we remove
>> allow-load-data-races (we are never going to implement that) and make
>> allow-store-data-races a proper -f flag please?
>
>
> Sorry, that was not meant for public consumption. I have rewritten this to
> take the param value as is and, in the case of flag_tm, to only restrict
> the optimization if the loop is inside an actual transaction (see the
> changes to trans-mem.c, because I know you'll hate bb_in_transaction() :-)).
>
>
>>
>> + }
>>
>> And that should be a separate patch
>
>
> I can certainly reimplement --param=allow-{load,store}-data-races as proper
> -f flags. I don't care either way. But I remember Ian Taylor having a
> different opinion. If y'all agree, I can implement whatever.
>
>
>> It would be nice if you could think about LIM/SM for possibly trapping
>> loads/stores
>> while you are doing this work.
>
>
> We are no longer adding additional trapping loads (as with the if-changed
> approach). Is this something else you'd like me to concentrate on?
>
> Please take a look at the attached patch. I tested the flags alternative on
> a full bootstrap, and there's only one regression which I will look at, but
> I'd like to know if the current WIP is aligned with what you had in mind.
+  /* ?? Perhaps we should cache this somewhere in the BB, or are
+     multiple levels of empty BB's common. ?? */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      int res = bb_in_transaction_1 (e->src);
+      if (res != -1)
+        return (bool) res;
+    }
+  return false;
That's weird code - if the predecessors do not all agree (some inside a
transaction, some not), you return whatever the first predecessor with a
definite answer says, which is essentially a random value. So this seems
somewhat fragile.
If a block's status can be derived by looking at a single stmt, why is the
transaction-ness not recorded as a basic-block flag instead? Put differently,
why do we have a bit on gimple stmts at all?
And what prevents someone from merging a transaction and a non-transaction
basic block, making your predicate above wrong because only the 2nd stmt is
in a transaction?
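For illustration only (a sketch reusing your bb_in_transaction_1 helper, not
a suggestion for the final patch), a conservative variant would answer false
only when no predecessor can be inside a transaction:

  /* Sketch: assume we are inside a transaction unless every predecessor
     is known not to be, instead of trusting whichever predecessor happens
     to give a definite answer first.  */
  FOR_EACH_EDGE (e, ei, bb->preds)
    if (bb_in_transaction_1 (e->src) != 0)
      return true;
  return false;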
+ bool single_exit_p = single_pred_p (ex->dest);
that's a strange variable name ;)
+/* Helper function for execute_sm.  On every path that sets REF, set
+   an appropriate flag indicating the store.  */
+
+static tree
+execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
+{
...
+  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+    {
+      gimple_stmt_iterator gsi;
+      gimple stmt;
+
+      gsi = gsi_start_bb (gimple_bb (loc->stmt));
+      for (; !gsi_end_p (gsi); gsi_next (&gsi))
+        if (gsi_stmt (gsi) == loc->stmt)
This does not seem to do it on every path but rather on every REF setter.
And instead of the innermost loop, just use gsi_for_stmt (loc->stmt).
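Roughly like this (a sketch only; I am assuming the omitted body builds the
flag assignment into stmt and inserts it right after the store it found):

  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
    {
      gimple_stmt_iterator gsi = gsi_for_stmt (loc->stmt);
      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
    }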
+  /* Emit the load code into the latch, so that we are sure it will
+     be processed after all dependencies.  */
+  latch_edge = loop_latch_edge (loop);
+  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
   lim_data = init_lim_data (load);
   lim_data->max_loop = loop;
   lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
+  load = gimple_build_assign (tmp_var, mem_ssa);
+  lim_data = init_lim_data (load);
+  lim_data->max_loop = loop;
+  lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
the 2nd assign is no longer a "load", so I suppose you don't need any lim_data
for it?
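If so, the latch insertions would boil down to something like this (a sketch
only; I renamed the second temporary to "copy" just for clarity):

  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
  lim_data = init_lim_data (load);
  lim_data->max_loop = loop;
  lim_data->tgt_loop = loop;
  gsi_insert_on_edge (latch_edge, load);
  copy = gimple_build_assign (tmp_var, mem_ssa);
  gsi_insert_on_edge (latch_edge, copy);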
+  else if (store == STORE_IF_CHANGED_FLAG)
+    execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
+                           store_flag);
and this can pass NULL instead of mem_ssa?
Overall this looks reasonable. By also handling trapping things, I meant
that we should be able to apply store-motion to
int a[256];
int flag;

void foo (int *p)
{
  int i;
  for (i = 0; i < 256; i++)
    if (flag)
      *p = a[i];
}
but we cannot, because the transform

  lsm = *p;
  for (i = 0; i < 256; ++i)
    if (flag)
      lsm = a[i];
  *p = lsm;

leaves the load unconditional even when the store is properly conditional
(and you do not change that). The unconditional load may trap here. What I
meant to say is that, with the flag-based approach, we do not need the load
before the loop at all unless there is also a load involved inside the loop.
If there is also a load inside the loop, we'd need to perform the load there
(or not handle this case).
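With the flag, no load of *p is needed at all; roughly (a sketch, with the
store flag named "stored" to avoid clashing with the example's own "flag"):

  stored = false;
  for (i = 0; i < 256; ++i)
    if (flag)
      {
        lsm = a[i];
        stored = true;
      }
  if (stored)
    *p = lsm;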
But overall this looks nice!
Thanks,
Richard.
> Thanks again.
> Aldy