On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez <[email protected]> wrote:
> On 04/13/12 03:46, Richard Guenther wrote:
>>
>> On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez <[email protected]> wrote:
>
>
> Richard. Thanks so much for reviewing and providing an alternative
> approach, which AFAICT provides superior results.
>
>
>> A similar effect could be obtained by keeping a flag that tracks whether
>> we entered the path that stored the value in the pre-transform code.
>> Thus, do
>>
>>   lsm = g2;   // technically not needed, if you also handle loads
>>   flag = false;
>>   for (...)
>>     {
>>       if (g1)
>>         {
>>           if (flag)
>>             g2 = lsm;
>>         }
>>       else
>>         {
>>           lsm = 0;
>>           flag = true;
>>         }
>>     }
>>   if (flag)
>>     g2 = lsm;
>
>
> I have implemented this in the attached patch, and it seems to be generating
> better code than my other if-changed approach. It also avoids generating
> unnecessary loads that may trap.
>
> For the original testcase:
>
>
> int g_1 = 1;
> int g_2 = 0;
>
> int func_1(void)
> {
>   int l;
>   for (l = 0; l < 1234; l++)
>     {
>       if (g_1)
>         return l;
>       else
>         g_2 = 0;
>     }
>   return 999;
> }
>
> After all optimizations are done, we are now generating the following for
> the flags approach:
>
> new:
> func_1:
>         movl    g_1(%rip), %edx
>         xorl    %eax, %eax
>         testl   %edx, %edx
>         je      .L5
>         rep
>         ret
> .L5:
>         movl    $0, g_2(%rip)
>         movw    $999, %ax
>         ret
>
> Which is significantly better than what unmodified trunk generates:
>
> func_1:
>         movl    g_1(%rip), %edx
>         movl    g_2(%rip), %eax
>         testl   %edx, %edx
>         je      .L5
>         movl    %eax, g_2(%rip)
>         xorl    %eax, %eax
>         ret
> .L5:
>         movl    $0, g_2(%rip)
>         movl    $999, %eax
>         ret
>
> And in other less contrived cases, we generate correct code that avoids
> writes on code paths that did not have them. For example, in Hans'
> register promotion example:
>
> int count;
>
> struct obj {
>   int data;
>   struct obj *next;
> } *q;
>
> void func()
> {
>   struct obj *p;
>   for (p = q; p; p = p->next)
>     if (p->data > 0)
>       count++;
> }
>
> Under the new memory model we should avoid promoting "count" to a register
> and unilaterally writing to it upon exiting the loop.
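>
> Conceptually this is the same flag-based store motion as above; a rough
> source-level sketch of the intended transform (the names count_lsm and
> count_flag are just illustrative, not what the pass actually emits):
>
>   count_lsm = count;
>   count_flag = false;
>   for (p = q; p; p = p->next)
>     if (p->data > 0)
>       {
>         count_lsm++;
>         count_flag = true;
>       }
>   if (count_flag)
>     count = count_lsm;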
>
> With the attached patch, we now generate:
>
> func:
> .LFB0:
>         movq    q(%rip), %rax
>         xorl    %ecx, %ecx
>         movl    count(%rip), %edx
>         testq   %rax, %rax
>         je      .L1
> .L9:
>         movl    (%rax), %esi
>         testl   %esi, %esi
>         jle     .L3
>         addl    $1, %edx
>         movl    $1, %ecx
> .L3:
>         movq    8(%rax), %rax
>         testq   %rax, %rax
>         jne     .L9
>         testb   %cl, %cl
>         jne     .L12
> .L1:
>         rep
>         ret
> .L12:
>         movl    %edx, count(%rip)
>         ret
>
> Which is as expected.
>
> I don't understand your suggestion of:
>
>
>>   lsm = g2;   // technically not needed, if you also handle loads
>>
>> that would allow extending this to cover the cases where the access may
>> eventually trap (if you omit the load before the loop).
>
>
> Can't I just remove the lsm=g2? What's this about also handling loads?
>
>
>>
>> Not sure which is more efficient (you can eventually combine flag
>> variables for different store locations; combining the if-changed
>> compare is not so easy, I guess).
>
>
> So far, I see better code generated with the flags approach, so...
>
>
>>> 1. No pass can figure out the equality (or inequality) of g_2_lsm and
>>> g_2.
>>> In the PR, Richard mentions that both FRE/PRE can figure this out, but
>>> they
>>> are not run after store motion. DOM should also be able to, but
>>> apparently
>>> does not :(. Tips?
>>
>>
>> Well. Schedule FRE after loop opts ...
>>
>> DOM is not prepared to handle loads/stores with differing VOPs - it was
>> never really updated after the alias-improvements merge.
>>
>>> 2. The GIMPLE_CONDs I create in this patch are currently causing problems
>>> with complex floats/doubles. do_jump is complaining that it can't
>>> compare
>>> them, so I assume it is too late (in tree-ssa-loop-im.c) to compare
>>> complex
>>> values since complex lowering has already happened (??). Is there a more
>>> canonical way of creating a GIMPLE_COND that may contain complex floats
>>> at
>>> this stage?
>>
>>
>> As you are handling redundant stores you want a bitwise comparison anyway,
>> but yes, complex compares need to be lowered. But as said, you want a
>> bitwise comparison, not a value-comparison. You can try using
>
>
> I can ignore all this. I have implemented both alternatives, with a target
> hook as suggested, but I'm thinking of removing it altogether and just
> leaving the flags approach.
>
>> Few comments on the patch itself.
>>
>> +  new_bb = split_edge (ex);
>> +  then_bb = create_empty_bb (new_bb);
>> +  if (current_loops && new_bb->loop_father)
>> +    add_bb_to_loop (then_bb, new_bb->loop_father);
>> +
>> +  gsi = gsi_start_bb (new_bb);
>> +  t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
>>
>> Hmm, why do you need to re-load the value? Don't you have both
>> the value as it was loaded before the loop and the new value (the lsm tmp
>> reg)
>> already? Why not simply remember the SSA name used? Ah, because
>> store motion also uses make_rename_temp. If you don't want to change
>> that (which would be good but inconvenient because of the required PHI
>> insertions), I'd change the code that performs the load in the pre-header
>> to do
>>
>> SSA_NAME = ref->mem;
>> rename-lsm-tmp = SSA_NAME;
>>
>> so you can remember SSA_NAME and re-use it here. That probably solves
>> your optimization issue, too.
>
>
> Fixed.
>
>
>> +  for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi); gsi_next (&gsi))
>> +    {
>> +      gimple phi = gsi_stmt (gsi);
>> +      unsigned i;
>> +
>> +      for (i = 0; i < gimple_phi_num_args (phi); i++)
>> +        if (gimple_phi_arg_edge (phi, i)->src == new_bb)
>> +          {
>> +            tree arg = gimple_phi_arg_def (phi, i);
>> +            add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
>> +            update_stmt (phi);
>> +          }
>> +    }
>>
>> Hmm. This is of course doing unnecessary work in case there are multiple
>> moved stores. In fact, splitting the exit edge is only necessary if there
>> is more than one predecessor. Otherwise you can simply split the exit block
>> after inserting the new conditional after its labels.
>
>
> Is this still the case with the current patch? If so, I'm a bit confused as
> to what you want here.
>
>
>>
>> +  if (opts->x_flag_tm)
>> +    {
>> +      if (opts->x_flag_non_call_exceptions)
>> +        sorry ("transactional memory is not supported with non-call exceptions");
>> +
>> +      set_param_value ("allow-load-data-races", 0,
>> +                       opts->x_param_values, opts_set->x_param_values);
>> +      set_param_value ("allow-store-data-races", 0,
>> +                       opts->x_param_values, opts_set->x_param_values);
>>
>> Unless the user explicitly set those params? Which means using params
>> wasn't a very good idea ... ideally you'd diagnose an error when the user
>> would mix -fgnu-tm and -fallow-store-data-races. So - can we remove
>> allow-load-data-races (we are never going to implement that) and make
>> allow-store-data-races a proper -f flag please?
>
>
> Sorry, that was not meant for public consumption. I have rewritten this to
> take the param value as is and, in the case of flag_tm, to only restrict
> the optimization if the loop is inside an actual transaction (see the
> changes to trans-mem.c, because I know you'll hate bb_in_transaction() :-)).
>
>
>>
>> + }
>>
>> And that should be a separate patch
>
>
> I can certainly reimplement --param=allow-{load,store}-data-races as proper
> -f flags. I don't care either way. But I remember Ian Taylor having a
> different opinion. If y'all agree, I can implement whatever.
>
>
>> It would be nice if you could think about LIM/SM for possibly trapping
>> loads/stores
>> while you are doing this work.
>
>
> We are no longer adding additional trapping loads (as with the if-changed
> approach). Is this something else you'd like me to concentrate on?
>
> Please take a look at the attached patch. I tested the flags alternative on
> a full bootstrap, and there's only one regression which I will look at, but
> I'd like to know if the current WIP is aligned with what you had in mind.
+  /* ?? Perhaps we should cache this somewhere in the BB, or are
+     multiple levels of empty BB's common. ?? */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      int res = bb_in_transaction_1 (e->src);
+      if (res != -1)
+        return (bool) res;
+    }
+  return false;
That's weird code - if the predecessors do not all agree (some inside a
transaction, some not), you return whatever the first predecessor with a
definite answer says, which is essentially a random value. So this seems
somewhat fragile.
If a block's status can be derived by looking at a single stmt, why is the
transaction-ness not recorded as a basic-block flag instead? Put differently,
why do we have a bit on gimple stmts at all?
And what prevents someone from merging a transaction and a non-transaction
basic block, making your predicate above wrong because only the 2nd stmt is
in a transaction?
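For illustration only (a sketch reusing your bb_in_transaction_1 helper, not
a suggestion for the final patch), a conservative variant would answer false
only when no predecessor can be inside a transaction:

  /* Sketch: assume we are inside a transaction unless every predecessor
     is known not to be, instead of trusting whichever predecessor happens
     to give a definite answer first.  */
  FOR_EACH_EDGE (e, ei, bb->preds)
    if (bb_in_transaction_1 (e->src) != 0)
      return true;
  return false;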
+ bool single_exit_p = single_pred_p (ex->dest);
that's a strange variable name ;)
+/* Helper function for execute_sm.  On every path that sets REF, set
+   an appropriate flag indicating the store.  */
+
+static tree
+execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
+{
...
+  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+    {
+      gimple_stmt_iterator gsi;
+      gimple stmt;
+
+      gsi = gsi_start_bb (gimple_bb (loc->stmt));
+      for (; !gsi_end_p (gsi); gsi_next (&gsi))
+        if (gsi_stmt (gsi) == loc->stmt)
This does not seem to do it on every path but rather on every REF setter.
And instead of the innermost loop, just use gsi_for_stmt (loc->stmt).
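Roughly like this (a sketch only; I am assuming the omitted body builds the
flag assignment into stmt and inserts it right after the store it found):

  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
    {
      gimple_stmt_iterator gsi = gsi_for_stmt (loc->stmt);
      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
    }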
+  /* Emit the load code into the latch, so that we are sure it will
+     be processed after all dependencies.  */
+  latch_edge = loop_latch_edge (loop);
+  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
   lim_data = init_lim_data (load);
   lim_data->max_loop = loop;
   lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
+  load = gimple_build_assign (tmp_var, mem_ssa);
+  lim_data = init_lim_data (load);
+  lim_data->max_loop = loop;
+  lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
the 2nd assign is no longer a "load", so I suppose you don't need any lim_data
for it?
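If so, the latch insertions would boil down to something like this (a sketch
only; I renamed the second temporary to "copy" just for clarity):

  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
  lim_data = init_lim_data (load);
  lim_data->max_loop = loop;
  lim_data->tgt_loop = loop;
  gsi_insert_on_edge (latch_edge, load);
  copy = gimple_build_assign (tmp_var, mem_ssa);
  gsi_insert_on_edge (latch_edge, copy);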
+  else if (store == STORE_IF_CHANGED_FLAG)
+    execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
+                           store_flag);
and this can pass NULL instead of mem_ssa?
Overall this looks reasonable. By also handling trapping things, I meant
that we should be able to apply store-motion to
int a[256];
int flag;

void foo (int *p)
{
  int i;
  for (i = 0; i < 256; i++)
    if (flag)
      *p = a[i];
}
but we cannot, because the transform

  lsm = *p;
  for (i = 0; i < 256; ++i)
    if (flag)
      lsm = a[i];
  *p = lsm;

leaves the load unconditional even when the store is properly conditional
(and you do not change that). The unconditional load may trap here. What I
meant to say is that, with the flag-based approach, we do not need the load
before the loop at all unless there is also a load involved inside the loop.
If there is also a load inside the loop, we'd need to perform the load there
(or not handle this case).
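With the flag, no load of *p is needed at all; roughly (a sketch, with the
store flag named "stored" to avoid clashing with the example's own "flag"):

  stored = false;
  for (i = 0; i < 256; ++i)
    if (flag)
      {
        lsm = a[i];
        stored = true;
      }
  if (stored)
    *p = lsm;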
But overall this looks nice!
Thanks,
Richard.
> Thanks again.
> Aldy