Re: [i386] Scalar DImode instructions on XMM registers

2015-05-21 Thread Ilya Enkovich
On 20 May 23:27, Vladimir Makarov wrote:
> 
> 
> On 20/05/15 04:17 AM, Ilya Enkovich wrote:
> >On 19 May 11:22, Vladimir Makarov wrote:
> >>On 05/18/2015 08:13 AM, Ilya Enkovich wrote:
> >>>2015-05-06 17:18 GMT+03:00 Ilya Enkovich :
> >>>Hi Vladimir,
> >>>
> >>>Could you please comment on this?
> >>>
> >>>
> >>Ilya, I think that the idea is worth trying but the results might be
> >>mixed.  It is hard to say until you actually try it (as an example, Jan
> >>implemented -fpmath=both and it looks like a pretty good idea, at least
> >>to me, but when I checked SPEC2000 the results were not so good even
> >>with IRA/LRA).
> >>
> >>Long ago I did some experiments and found that spilling into SSE
> >>would be beneficial for Intel CPUs but not for AMD ones.  As I remember,
> >>I also found that storing several scalar values into one SSE reg and
> >>extracting them when you need to do some (fp) arithmetic would be
> >>beneficial for AMD but not for Intel CPUs.  In the literature a more
> >>general approach is called bitwise register allocation.  Actually it
> >>would be a pretty big IRA/LRA project from which some targets might
> >>benefit.
> >I suspect such things are not trivially done in IRA/LRA and want to make 
> >this an independent optimization because its application seems to be quite 
> >narrow.
> Yes, that is true.  The complications and implementation complexity
> of this project will probably be very high and the positive results
> are not certain.  So the project might have only small value.
> >>
> >>As for the wrong code, it is hard for me to say anything w/o RA
> >>dumps.  If you send me the dump (-fira-verbose=16), I might say more
> >>about what is going on.
> >>
> >>
> >Here are some dumps from my reproducer.  The problematic register is r108.
> >
> Thanks.  To me it looks like an inheritance bug.  It is really hard
> to fix the bug w/o the source code.  Could you send me your patch so
> that I can debug RA with it and investigate more.
> 

Sure! Here is a patch and a testcase.  I applied patch to r222125.  Cmd to 
reproduce:

gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE

Thanks,
Ilya
void
counter (long long l);

void
test (long long *arr)
{
  register unsigned long long tmp;

  tmp = arr[0] | arr[1] & arr[2];
  while (tmp)
{
  counter (tmp);
  tmp = *(arr++) & tmp;
}
}
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a607ef4..a9dbfea 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2554,6 +2554,789 @@ rest_of_handle_insert_vzeroupper (void)
   return 0;
 }
 
+static bool
+has_non_address_hard_reg (rtx_insn *insn)
+{
+  df_ref ref;
+  FOR_EACH_INSN_DEF (ref, insn)
+    if (HARD_REGISTER_P (DF_REF_REAL_REG (ref))
+        && !DF_REF_FLAGS_IS_SET (ref, DF_REF_MUST_CLOBBER))
+      return true;
+
+  FOR_EACH_INSN_USE (ref, insn)
+    if (!DF_REF_REG_MEM_P (ref) && HARD_REGISTER_P (DF_REF_REAL_REG (ref)))
+      return true;
+
+  return false;
+}
+
+static bool
+scalar_to_vector_candidate_p (rtx_insn *insn)
+{
+  rtx def_set = single_set (insn);
+
+  if (!def_set)
+    return false;
+
+  if (has_non_address_hard_reg (insn))
+    return false;
+
+  rtx src = SET_SRC (def_set);
+  rtx dst = SET_DEST (def_set);
+
+  /* We are interested in DImode -> V1DI promotion
+     only.  */
+  if (GET_MODE (src) != DImode
+      || GET_MODE (dst) != DImode)
+    return false;
+
+  if (!REG_P (dst) && !MEM_P (dst))
+    return false;
+
+  switch (GET_CODE (src))
+    {
+    case PLUS:
+    case MINUS:
+    case IOR:
+    case XOR:
+    case AND:
+      break;
+
+    default:
+      return false;
+    }
+
+  if (!REG_P (XEXP (src, 0)) && !MEM_P (XEXP (src, 0)))
+    return false;
+
+  if (!REG_P (XEXP (src, 1)) && !MEM_P (XEXP (src, 1)))
+    return false;
+
+  if (GET_MODE (XEXP (src, 0)) != DImode
+      || GET_MODE (XEXP (src, 1)) != DImode)
+    return false;
+
+  return true;
+}
+
+/* Remove regs having both convertible and
+   not convertible definitions.  */
+static void
+remove_non_convertible_regs (bitmap insns)
+{
+  bitmap_iterator bi;
+  unsigned id;
+  bitmap regs = BITMAP_ALLOC (NULL);
+
+  EXECUTE_IF_SET_IN_BITMAP (insns, 0, id, bi)
+    {
+      rtx def_set = single_set (DF_INSN_UID_GET (id)->insn);
+      rtx reg = SET_DEST (def_set);
+
+      if (!REG_P (reg) || bitmap_bit_p (regs, REGNO (reg)))
+        continue;
+
+      for (df_ref def = DF_REG_DEF_CHAIN (REGNO (reg));
+           def;
+           def = DF_REF_NEXT_REG (def))
+        {
+          if (!bitmap_bit_p (insns, DF_REF_INSN_UID (def)))
+            {
+              if (dump_file)
+                fprintf (dump_file,
+                         "r%d has non convertible definition in insn %d\n",
+                         REGNO (reg), DF_REF_INSN_UID (def));
+
+              bitmap_set_bit (regs, REGNO (reg));
+              break;
+            }
+        }
+    }
+
+  EXECUTE_IF_SET_IN_BITMAP (regs, 0, id, bi)
+    {
+      for (df_ref def = DF_REG_DEF_CHAIN (id);
+           def;
+           def = DF_REF_NEXT_REG (def))
+   

[RFC] Combine related fail of gcc.target/powerpc/ti_math1.c

2015-05-21 Thread Alan Modra
FAIL: gcc.target/powerpc/ti_math1.c scan-assembler-times adde 1
is seen on powerpc64le-linux since somewhere between revision 218587
and 218616.  See
https://gcc.gnu.org/ml/gcc-testresults/2014-12/msg01287.html and
https://gcc.gnu.org/ml/gcc-testresults/2014-12/msg01325.html

A regression hunt fingers one of Segher's 2014-12-10 patches to the
rs6000 backend, git commit 0f1bedb4 or svn rev 218595.  Segher might
argue that generated code is better after this commit, and I'd agree
that his change is a good one in general, but even so it would be nice
to generate the ideal code.  Curiously, the ideal code is generated at
-O1, but we regress at -O2.

before              after               ideal (-O1)
add_128:            add_128:            add_128:
  ld 10,0(3)          ld 9,0(3)           ld 9,0(3)
  ld 11,8(3)          ld 10,8(3)          ld 10,8(3)
  addc 8,4,10         addc 3,4,9          addc 3,4,9
  adde 9,5,11         addze 5,5           adde 4,5,10
  mr 3,8              add 4,5,10          blr
  mr 4,9              blr
  blr

I went looking into where the addze appeared, and found combine.

Trying 18, 9 -> 24:
Failed to match this instruction:
(set (reg:DI 4 4 [+8 ])
(plus:DI (plus:DI (reg:DI 5 5 [ val+8 ])
(reg:DI 76 ca))
(reg:DI 169 [+8 ])))
Successfully matched this instruction:
(set (reg:DI 167 [ D.2366+8 ])
(plus:DI (reg:DI 5 5 [ val+8 ])
(reg:DI 76 ca)))
Successfully matched this instruction:
(set (reg:DI 4 4 [+8 ])
(plus:DI (reg:DI 167 [ D.2366+8 ])
(reg:DI 169 [+8 ])))
allowing combination of insns 18, 9 and 24
original costs 4 + 8 + 4 = 16
replacement costs 4 + 4 = 8

Here are the three insns involved, sans source line numbers and notes.

(insn 18 17 4 2 (set (reg:DI 165 [ val+8 ])
(reg:DI 5 5 [ val+8 ])) {*movdi_internal64})
...
(insn 9 8 23 2 (parallel [
(set (reg:DI 167 [ D.2366+8 ])
(plus:DI (plus:DI (reg:DI 165 [ val+8 ])
(reg:DI 169 [+8 ]))
(reg:DI 76 ca)))
(clobber (reg:DI 76 ca))
]) {*adddi3_carry_in_internal})
...
(insn 24 23 15 2 (set (reg:DI 4 4 [+8 ])
(reg:DI 167 [ D.2366+8 ])) {*movdi_internal64})

So, a move copying an argument register to a pseudo, one insn from the
body of the function, and a move copying a pseudo to a result
register.  The thought I had was: Is it really combine's business to
look at copies from/to ABI-mandated hard registers?  Isn't removing
the copies something that register allocation can do better?  If so,
then combine is doing unnecessary work.

As a quick hack, I tried the following.

Index: gcc/combine.c
===
--- gcc/combine.c   (revision 223431)
+++ gcc/combine.c   (working copy)
@@ -1281,6 +1281,16 @@ combine_instructions (rtx_insn *f, unsigned int nr
  if (!NONDEBUG_INSN_P (insn))
continue;
 
+ if (this_basic_block == EXIT_BLOCK_PTR_FOR_FN (cfun)->prev_bb)
+   {
+ rtx set = single_set (insn);
+ if (set
+ && REG_P (SET_DEST (set))
+ && HARD_REGISTER_P (SET_DEST (set))
+ && REG_P (SET_SRC (set)))
+   continue;
+   }
+
  while (last_combined_insn
 && last_combined_insn->deleted ())
last_combined_insn = PREV_INSN (last_combined_insn);

This cures the powerpc64le testcase failure, but Segher said on irc I
was risking breaking x86 and other targets.  Perhaps that was trying
to push me to fix the underlying combine problem.  :)  In any case, I
didn't believe him, and tested the patch on powerpc64le-linux and
x86_64-linux.  No regressions with --languages=all,go, and an objdump -d
comparison of gcc/*.o against virgin sources shows no unexpected
changes.  powerpc64le-linux actually shows no changes at all apart
from combine.o, while x86_64-linux shows some changes in register
allocation and cmove argument swapping with inversion of the condition.
There were no extra instructions.

So, is this worth pursuing in order to speed up combine?  I'd be
inclined to patch create_log_links instead for a proper patch.


Incidentally, the underlying problem in combine (well, the first one I
spotted; there might be more) is that

  if (flag_expensive_optimizations)
    {
      /* Pass pc_rtx so no substitutions are done, just
         simplifications.  */

"simplifies" this i2src

(plus:DI (plus:DI (reg:DI 165 [ val+8 ])
(reg:DI 169 [+8 ]))
(reg:DI 76 ca))

to this

(plus:DI (plus:DI (reg:DI 76 ca)
(reg:DI 165 [ val+8 ]))
(reg:DI 169 [+8 ]))

and the latter has the ca register in the wrong place.  So a split is
tried and you get addze.  I'm working on this.  The reordering happens
inside simplify_plus_minus.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [RFC] Combine related fail of gcc.target/powerpc/ti_math1.c

2015-05-21 Thread Segher Boessenkool
On Thu, May 21, 2015 at 08:06:04PM +0930, Alan Modra wrote:
> FAIL: gcc.target/powerpc/ti_math1.c scan-assembler-times adde 1
> is seen on powerpc64le-linux since somewhere between revision 218587
> and 218616.  See
> https://gcc.gnu.org/ml/gcc-testresults/2014-12/msg01287.html and
> https://gcc.gnu.org/ml/gcc-testresults/2014-12/msg01325.html
> 
> A regression hunt fingers one of Segher's 2014-12-10 patches to the
> rs6000 backend, git commit 0f1bedb4 or svn rev 218595.

It doesn't trigger on big-endian; what is different?  The order of tried
combinations I suppose, because the loads are swapped?

> Segher might
> argue that generated code is better after this commit, and I'd agree
> that his change is a good one in general, but even so it would be nice
> to generate the ideal code.

Yes (to all of it).  It was a big rip-up, left better generated code
than we had before, and simplified the rs6000 backend code.  Tuning it
again now can only make things better :-)

> Curiously, the ideal code is generated at
> -O1, but we regress at -O2.

Skipping all the way to the end of your mail, that's because of
flag_expensive_optimizations (if nothing else).

> before              after               ideal (-O1)
> add_128:            add_128:            add_128:
>   ld 10,0(3)          ld 9,0(3)           ld 9,0(3)
>   ld 11,8(3)          ld 10,8(3)          ld 10,8(3)
>   addc 8,4,10         addc 3,4,9          addc 3,4,9
>   adde 9,5,11         addze 5,5           adde 4,5,10
>   mr 3,8              add 4,5,10          blr
>   mr 4,9              blr
>   blr
> 
> I went looking into where the addze appeared, and found combine.

That wasn't a big surprise I hope :-)

> Trying 18, 9 -> 24:
> Failed to match this instruction:
> (set (reg:DI 4 4 [+8 ])
> (plus:DI (plus:DI (reg:DI 5 5 [ val+8 ])
> (reg:DI 76 ca))
> (reg:DI 169 [+8 ])))

For some reason it has the CA reg not last.  I think we should add to
the canonicalisation rules so that fixed regs sort after other regs.
That requires a lot of testing.

For even slightly less trivial code, combine usually tries something
in the "correct" order as well, btw, which is why the current code
works as well as it does.
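To illustrate the kind of rule meant here (a rough sketch only, not an actual
or proposed GCC change; the helper name and the bias value are invented), one
could bias the precedence metric behind swap_commutative_operands_p so that
fixed hard registers such as CA sort after ordinary registers and end up last
in a commutative chain:

/* Sketch only: lower the ordering precedence of fixed hard registers
   so canonicalisation sorts them after ordinary registers.  */
static int
biased_operand_precedence (rtx op)
{
  int prec = commutative_operand_precedence (op);

  if (REG_P (op) && HARD_REGISTER_P (op) && fixed_regs[REGNO (op)])
    prec -= 10;   /* Arbitrary bias, for illustration.  */

  return prec;
}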

> Successfully matched this instruction:
> (set (reg:DI 167 [ D.2366+8 ])
> (plus:DI (reg:DI 5 5 [ val+8 ])
> (reg:DI 76 ca)))
> Successfully matched this instruction:
> (set (reg:DI 4 4 [+8 ])
> (plus:DI (reg:DI 167 [ D.2366+8 ])
> (reg:DI 169 [+8 ])))
> allowing combination of insns 18, 9 and 24
> original costs 4 + 8 + 4 = 16
> replacement costs 4 + 4 = 8

Still need to fix the costs as well (but they work as-is; well enough
that is).

> Here are the three insns involved, sans source line numbers and notes.
> 
> (insn 18 17 4 2 (set (reg:DI 165 [ val+8 ])
> (reg:DI 5 5 [ val+8 ])) {*movdi_internal64})
> ...
> (insn 9 8 23 2 (parallel [
> (set (reg:DI 167 [ D.2366+8 ])
> (plus:DI (plus:DI (reg:DI 165 [ val+8 ])
> (reg:DI 169 [+8 ]))
> (reg:DI 76 ca)))
> (clobber (reg:DI 76 ca))
> ]) {*adddi3_carry_in_internal})
> ...
> (insn 24 23 15 2 (set (reg:DI 4 4 [+8 ])
> (reg:DI 167 [ D.2366+8 ])) {*movdi_internal64})
> 
> So, a move copying an argument register to a pseudo, one insn from the
> body of the function, and a move copying a pseudo to a result
> register.  The thought I had was: Is it really combine's business to
> look at copies from/to ABI-mandated hard registers?  Isn't removing
> the copies something that register allocation can do better?

Yes, a well-known problem, and one I intended to fix for GCC 6.

> If so, then combine is doing unnecessary work.

Not only that, it pessimises generated code (as you see here; but
more often it prevents optimal register allocation because it
already has put a hard reg somewhere).

> As a quick hack, I tried the following.
> 
> Index: gcc/combine.c
> ===
> --- gcc/combine.c (revision 223431)
> +++ gcc/combine.c (working copy)
> @@ -1281,6 +1281,16 @@ combine_instructions (rtx_insn *f, unsigned int nr
> if (!NONDEBUG_INSN_P (insn))
>   continue;
>  
> +   if (this_basic_block == EXIT_BLOCK_PTR_FOR_FN (cfun)->prev_bb)

Are these copies guaranteed to (still) be in this basic block,
after the passes before combine?  Did those passes do anything to
prevent moving it?  I'm asking because it would be good to use the
same conditions in that case.

> + {
> +   rtx set = single_set (insn);
> +   if (set
> +   && REG_P (SET_DEST (set))
> +   && HARD_REGISTER_P (SET_DEST (set))
> +   && REG_P (SET_SRC (set)))
> + continue;
> + }
> +
> while (last_combined_insn
>&& last_combined_insn->delete

Re: optimization question

2015-05-21 Thread mark maule



On 5/20/2015 2:13 PM, Martin Uecker wrote:

  mark maule :

On 5/20/2015 3:27 AM, Richard Biener wrote:

On Mon, May 18, 2015 at 10:01 PM, mark maule  wrote:
The usual issue with this kind of behavior is out-of-bounds accesses of
arrays in a loop, or invoking undefined behavior when signed integer
operations wrap.


 uint32_t outLun[ BS_CFG_DRIVE_GROUPS ];

and

while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) &&
...
   dgDestageOut = bs_destageData.outLun[ dgHandle ];

looks like this might access outLun[BS_CFG_DRIVE_GROUPS] which is
out-of-bounds.

Richard.

You are correct, and when I change outLun[] to be size
BS_CFG_DRIVE_GROUPS+1, the generated asm looks like it will account for
dgHandle in the while() loop.  I will pass this back to our development
team to get a proper fix.

Now, the follow-on: something in the compiler/optimizer recognized this
out-of-bounds situation - should a warning have been emitted?  Or are
there ambiguities which make generating a warning here inappropriate?

Yes, ideally a compiler should emit a warning. C compilers traditionally
were not very good at this, but it turns out very recent versions of GCC
can do this:

test.c:14:23: warning: iteration 10u invokes undefined behavior 
[-Waggressive-loop-optimizations]
   dgDestageOut = outLun[ dgHandle ];
^
test.c:11:13: note: containing loop
while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) )


For this simplified test case:

#include <stdint.h>

#define BS_CFG_DRIVE_GROUPS 10
uint32_t dgDestageLimit = 0;
uint32_t outLun[ BS_CFG_DRIVE_GROUPS ];

void test(void)
{
int dgHandle = 0;

   while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) )
   {
  uint32_t dgDestageOut;
  dgDestageOut = outLun[ dgHandle ];
  dgHandle++;
   }
}


Martin


Thanks Martin.

My gcc (4.8.3) must not be recent enough.  Any idea when this sort of 
checking went in?


#include <stdint.h>

#define BS_CFG_DRIVE_GROUPS 0x1e0

uint32_t dgDestageLimit = 0;
uint32_t outLun[ 10 ];

int
test(void)
{
int dgHandle = 0;
int total;

while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) ) {
uint32_t dgDestageOut;
dgDestageOut = outLun[ dgHandle ];
total += dgDestageOut;
dgHandle++;
}

return dgHandle;
}

int
main(int argc, char **argv)
{
test();
return 0;
}


ca-bld-ol71-02$ cc -o foo foo.c -Wall
ca-bld-ol71-02$
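
For reference, a bounds-consistent variant of the same loop (a sketch only):
with the array size and the loop bound agreeing, the final index stays inside
outLun[], there is no undefined behavior, and nothing for
-Waggressive-loop-optimizations to complain about:

#include <stdint.h>

#define BS_CFG_DRIVE_GROUPS 10
uint32_t outLun[ BS_CFG_DRIVE_GROUPS ];

int
test_fixed (void)
{
    int dgHandle = 0;
    int total = 0;   /* initialized, unlike the reproducer above */

    /* Stopping at BS_CFG_DRIVE_GROUPS keeps dgHandle in bounds.  */
    while ( dgHandle < BS_CFG_DRIVE_GROUPS ) {
        total += outLun[ dgHandle ];
        dgHandle++;
    }

    return total;
}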




Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Michael Matz
Hi,

On Wed, 20 May 2015, Paul E. McKenney wrote:

> > > I'm not sure... you'd require the compiler to perform static analysis of
> > > loops to determine the state of the machine when they exit (if they exit!)
> > > in order to show whether or not a dependency is carried to subsequent
> > > operations. If it can't prove otherwise, it would have to assume that a
> > > dependency *is* carried, and it's not clear to me how it would use this
> > > information to restrict any subsequent dependency removing optimisations.
> > 
> > It'd just convert consume to acquire.
> 
> It should not need to, actually.

[with GCC hat, and having only lightly read your document]

Then you need to provide language or at least informal reasons why the 
compiler is allowed to not do that.  Without that a compiler would have to 
be conservative, if it can't _prove_ that a dependency chain is stopped, 
then it has to assume it hasn't.

For instance I can't really make out easily what your document says about 
the following simple situation (well, actually I have difficulty 
distinguishing between what you're proposing as the good new model of all 
this, and what you're merely describing as different current states of affairs):

  char * fancy_assign (char *in) { return in; }
  ...
  char *x, *y;
  
  x = atomic_load_explicit(p, memory_order_consume);
  y = fancy_assign (x);
  atomic_store_explicit(q, y, memory_order_relaxed);

So, is there, or is there not a dependency carried from x to y in your 
proposed model (and which rule in your document states so)?  Clearly, 
without any other language the compiler would have to assume that there is 
(because the equivalent 'y = x' assignment would carry the dependency).  

If it has to assume this, then the whole model is not going to work very 
well, as usual with models that assume a certain less-optimal fact 
("carries-dep" is less optimal for code generation purposes that 
"not-carries-dep") unless very specific circumstances say it can be 
ignored.


Ciao,
Michael.


Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 04:22:38PM +0200, Michael Matz wrote:
> Hi,
> 
> On Wed, 20 May 2015, Paul E. McKenney wrote:
> 
> > > > I'm not sure... you'd require the compiler to perform static analysis of
> > > > loops to determine the state of the machine when they exit (if they 
> > > > exit!)
> > > > in order to show whether or not a dependency is carried to subsequent
> > > > operations. If it can't prove otherwise, it would have to assume that a
> > > > dependency *is* carried, and it's not clear to me how it would use this
> > > > information to restrict any subsequent dependency removing 
> > > > optimisations.
> > > 
> > > It'd just convert consume to acquire.
> > 
> > It should not need to, actually.
> 
> [with GCC hat, and having only lightly read your document]

Understood.

> Then you need to provide language or at least informal reasons why the 
> compiler is allowed to not do that.  Without that a compiler would have to 
> be conservative, if it can't _prove_ that a dependency chain is stopped, 
> then it has to assume it hasn't.
> 
> For instance I can't really make out easily what your document says about 
> the following simple situation (well, actually I have difficulty 
> distinguishing between what you're proposing as the good new model of all 
> this, and what you're merely describing as different current states of affairs):

The point is -exactly- to codify the current state of affairs.  I expect
a follow-on effort to specify some sort of marking regimen, as noted in
the last paragraph of 7.9 and as discussed with Torvald Riegel.  However,
given that there are not yet any implementations or practical experience
with such markings, I suspect that some time will be required to hammer
out a good marking scheme.

>   char * fancy_assign (char *in) { return in; }
>   ...
>   char *x, *y;
>   
>   x = atomic_load_explicit(p, memory_order_consume);
>   y = fancy_assign (x);
>   atomic_store_explicit(q, y, memory_order_relaxed);
> 
> So, is there, or is there not a dependency carried from x to y in your 
> proposed model (and which rule in your document states so)?  Clearly, 
> without any other language the compiler would have to assume that there is 
> (because the equivalent 'y = x' assignment would carry the dependency).  

The dependency is not carried, though this is due to the current set
of rules not covering atomic loads and stores, which I need to fix.
Here is the sequence of events:

o   A memory_order_consume load heads a dependency chain.

o   Rule 2 says that if a value is part of a dependency chain and
is used as the right-hand side of an assignment operator,
the expression extends the chain to cover the assignment.
And I switched to numbered bullet items here:
http://www2.rdrop.com/users/paulmck/RCU/consume.2015.05.21a.pdf

o   Rule 14 says that if a value is part of a dependency chain and
is used as the actual parameter of a function call, then the
dependency chain extends to the corresponding formal parameter,
namely "in" of fancy_assign().

o   Rule 15 says that if a value is part of a dependency chain and
is returned from a function, then the dependency chain extends
to the returned value in the calling function.

o   And you are right.  I need to make the first and second rules
cover the relaxed atomic operations, or at least atomic loads and
stores.  Not that this is an issue for existing Linux-kernel code.

But given such a change, the new version of rule 2 would
extend the dependency chain to cover the atomic_store_explicit().

> If it has to assume this, then the whole model is not going to work very 
> well, as usual with models that assume a certain less-optimal fact 
> ("carries-dep" is less optimal for code generation purposes that 
> "not-carries-dep") unless very specific circumstances say it can be 
> ignored.

Although that is a good general rule of thumb, I do not believe that it
applies to this situation, with the exception that I do indeed assume
that no one is insane enough to do value-speculation optimizations for
non-NULL values on loads from pointers.

So what am I missing here?  Do you have a specific example where the
compiler would need to suppress a production-quality optimization?

Thanx, Paul



Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Michael Matz
Hi,

On Thu, 21 May 2015, Paul E. McKenney wrote:

> The point is -exactly- to codify the current state of affairs.

Ah, I see, so it's not yet about creating a more useful (for compilers, 
that is) model.

> >   char * fancy_assign (char *in) { return in; }
> >   ...
> >   char *x, *y;
> >   
> >   x = atomic_load_explicit(p, memory_order_consume);
> >   y = fancy_assign (x);
> >   atomic_store_explicit(q, y, memory_order_relaxed);
> > 
> > So, is there, or is there not a dependency carried from x to y in your 
> > proposed model (and which rule in your document states so)?  Clearly, 
> > without any other language the compiler would have to assume that there is 
> > (because the equivalent 'y = x' assignment would carry the dependency).  
> 
> The dependency is not carried, though this is due to the current set
> of rules not covering atomic loads and stores, which I need to fix.

Okay, so with the current regime(s), the dependency carries ...

> o Rule 14 says that if a value is part of a dependency chain and
>   is used as the actual parameter of a function call, then the
>   dependency chain extends to the corresponding formal parameter,
>   namely "in" of fancy_assign().
> 
> o Rule 15 says that if a value is part of a dependency chain and
>   is returned from a function, then the dependency chain extends
>   to the returned value in the calling function.
> 
> o And you are right.  I need to make the first and second rules
>   cover the relaxed atomic operations, or at least atomic loads and
>   stores.  Not that this is an issue for existing Linux-kernel code.
> 
>   But given such a change, the new version of rule 2 would
>   extend the dependency chain to cover the atomic_store_explicit().

... (if this detail would be fixed).  Okay, that's quite awful ...

> > If it has to assume this, then the whole model is not going to work 
> > very well, as usual with models that assume a certain less-optimal 
> > fact ("carries-dep" is less optimal for code generation purposes that 
> > "not-carries-dep") unless very specific circumstances say it can be 
> > ignored.
> 
> Although that is a good general rule of thumb, I do not believe that it 
> applies to this situation, with the exception that I do indeed assume 
> that no one is insane enough to do value-speculation optimizations for 
> non-NULL values on loads from pointers.
> 
> So what am I missing here?

... because you are then missing that if "carries-dep" can flow through 
function calls from arguments to return values by default, the compiler 
has to assume this in fact always happens when it can't see the function 
body, or can't analyze it.  In effect that's making the whole "carries-dep 
stops at these and those uses" a useless exercise because a malicious 
user (malicious in the sense of abusing the model to show that it's 
hindering optimizations), i.e. me, can hide all such carries-dep stopping 
effects inside a function, et voila, the dependency carries through.  So 
for a slightly simpler example:

  extern void *foo (void *);  // body not available
  x = load
  y = foo (x);
  store (y)

the compiler has to assume that there's a dep-chain from x to y; always.  
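
For concreteness, the same example spelled out with C11 atomics might look
roughly like this (the atomic object declarations are added here purely for
illustration):

#include <stdatomic.h>

extern void *foo (void *);              /* body not available */

extern _Atomic (void *) p, q;

void
example (void)
{
  void *x = atomic_load_explicit (&p, memory_order_consume);
  void *y = foo (x);                    /* dep-chain has to be assumed to
                                           flow through the call */
  atomic_store_explicit (&q, y, memory_order_relaxed);
}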
What's worse, it also has to assume a carries-dep for this:

  extern void foo (void *in, void **out1, void **out2);
  x = load
  foo (x, &o1, &o2);
  store (o1);
  store (o2);

Now the compiler has to assume that the body of 'foo' is just mean enough 
to make the dep-chain carry from in to *out1 or *out2 (i.e. it has to 
assume that for both).  This extends to _all_ memory accessible from foo's 
body, i.e. generally all global and all local address-taken variables, so 
as soon as you have a function call into which a dep-chain value flows 
you're creating a dep-chain extension from that value to each and every 
global piece of memory, because the compiler cannot assume that the 
black box called foo is not mean.  This could conceivably be stopped by 
making normal stores not carry the dependency; then only the return 
value might be infected; but I don't see that in your rules, as a "normal 
store" is just an assignment in your model and hence rules 1 and 2 apply 
(that is, carries-dep flows through all assignments, incl. loads and 
stores).

Basically whenever you can construct black boxes for the compiler, you 
have to limit their effects on such transitive relations like carries-dep 
by default, at the border of such black boxes; otherwise that transitive 
relation quickly becomes a most-x-everything relation (i.e. 
most things carry a dep to everything), and as such is then totally 
useless, because such a universally filled relation (like an empty one) 
doesn't bear any interesting information, at which point it is then 
questionable why the compiler should jump through hoops to analyse the few 
cases that would be allowed to stop the carries-dep flow, when it more 
often than not has to give up anyway and generate slow c

Re: [RFC] Combine related fail of gcc.target/powerpc/ti_math1.c

2015-05-21 Thread Richard Henderson
On 05/21/2015 05:39 AM, Segher Boessenkool wrote:
>> > Trying 18, 9 -> 24:
>> > Failed to match this instruction:
>> > (set (reg:DI 4 4 [+8 ])
>> > (plus:DI (plus:DI (reg:DI 5 5 [ val+8 ])
>> > (reg:DI 76 ca))
>> > (reg:DI 169 [+8 ])))
> For some reason it has the CA reg not last.  I think we should add to
> the canonicalisation rules so that fixed regs sort after other regs.
> That requires a lot of testing.

Actually, I believe that the way CA is modeled at the moment is dangerous.
It's not a 64-bit value, but a 1-bit value.

If we rearrange the expanded rtl to be (zero_extend:DI (reg:BI CA)), then
normal canonicalization rules will apply and it'll always appear first in the
chain of PLUS.


r~


Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 06:17:43PM +0200, Michael Matz wrote:
> Hi,
> 
> On Thu, 21 May 2015, Paul E. McKenney wrote:
> 
> > The point is -exactly- to codify the current state of affairs.
> 
> Ah, I see, so it's not yet about creating a more useful (for compilers, 
> that is) model.

There are several approaches being considered for that as well, but we
do need to codify current usage.

> > >   char * fancy_assign (char *in) { return in; }
> > >   ...
> > >   char *x, *y;
> > >   
> > >   x = atomic_load_explicit(p, memory_order_consume);
> > >   y = fancy_assign (x);
> > >   atomic_store_explicit(q, y, memory_order_relaxed);
> > > 
> > > So, is there, or is there not a dependency carried from x to y in your 
> > > proposed model (and which rule in your document states so)?  Clearly, 
> > > without any other language the compiler would have to assume that there 
> > > is 
> > > (because the equivalent 'y = x' assignment would carry the dependency).  
> > 
> > The dependency is not carried, though this is due to the current set
> > of rules not covering atomic loads and stores, which I need to fix.
> 
> Okay, so with the current regime(s), the dependency carries ...

Yes, that is the intent.

> > o   Rule 14 says that if a value is part of a dependency chain and
> > is used as the actual parameter of a function call, then the
> > dependency chain extends to the corresponding formal parameter,
> > namely "in" of fancy_assign().
> > 
> > o   Rule 15 says that if a value is part of a dependency chain and
> > is returned from a function, then the dependency chain extends
> > to the returned value in the calling function.
> > 
> > o   And you are right.  I need to make the first and second rules
> > cover the relaxed atomic operations, or at least atomic loads and
> > stores.  Not that this is an issue for existing Linux-kernel code.
> > 
> > But given such a change, the new version of rule 2 would
> > extend the dependency chain to cover the atomic_store_explicit().
> 
> ... (if this detail would be fixed).  Okay, that's quite awful ...
> 
> > > If it has to assume this, then the whole model is not going to work 
> > > very well, as usual with models that assume a certain less-optimal 
> > > fact ("carries-dep" is less optimal for code generation purposes that 
> > > "not-carries-dep") unless very specific circumstances say it can be 
> > > ignored.
> > 
> > Although that is a good general rule of thumb, I do not believe that it 
> > applies to this situation, with the exception that I do indeed assume 
> > that no one is insane enough to do value-speculation optimizations for 
> > non-NULL values on loads from pointers.
> > 
> > So what am I missing here?
> 
> ... because you are then missing that if "carries-dep" can flow through 
> function calls from arguments to return values by default, the compiler 
> has to assume this in fact always happens when it can't see the function 
> body, or can't analyze it.  In effect that's making the whole "carries-dep 
> stops at these and those uses" a useless exercise because a malicious 
> user (malicious in the sense of abusing the model to show that it's 
> hindering optimizations), i.e. me, can hide all such carries-dep stopping 
> effects inside a function, et voila, the dependency carries through.  So 
> for a slightly simpler example:
> 
>   extern void *foo (void *);  // body not available
>   x = load
>   y = foo (x);
>   store (y)
> 
> the compiler has to assume that there's a dep-chain from x to y; always.  

Yes, the compiler does have to make this assumption.  And the intent
behind the rules is to ensure that this assumption does not get in the
way of reasonable optimizations.  So although I am sure that you are as
busy as the rest of us, I really do need you to go through the rules in
detail before you get too much more excited about this.

> What's worse, it also has to assume a carries-dep for this:
> 
>   extern void foo (void *in, void **out1, void **out2);
>   x = load
>   foo (x, &o1, &o2);
>   store (o1);
>   store (o2);
> 
> Now the compiler has to assume that the body of 'foo' is just mean enough 
> to make the dep-chain carry from in to *out1 or *out2 (i.e. it has to 
> assume that for both).  This extends to _all_ memory accessible from foo's 
> body, i.e. generally all global and all local address-taken variables, so 
> as soon as you have a function call into which a dep-chain value flows 
> you're creating a dep-chain extension from that value to each and every 
> global piece of memory, because the compiler cannot assume that the 
> black box called foo is not mean.  This could conceivably be stopped by 
> making normal stores not carry the dependency; then only the return 
> value might be infected; but I don't see that in your rules, as a "normal 
> store" is just an assignment in your model and hence rules 1 and 2 apply 
> (that is, carries-dep flows through all assignments, incl. loads and 
> stor

Re: [RFC] Combine related fail of gcc.target/powerpc/ti_math1.c

2015-05-21 Thread Segher Boessenkool
On Thu, May 21, 2015 at 11:34:14AM -0700, Richard Henderson wrote:
> On 05/21/2015 05:39 AM, Segher Boessenkool wrote:
> >> > Trying 18, 9 -> 24:
> >> > Failed to match this instruction:
> >> > (set (reg:DI 4 4 [+8 ])
> >> > (plus:DI (plus:DI (reg:DI 5 5 [ val+8 ])
> >> > (reg:DI 76 ca))
> >> > (reg:DI 169 [+8 ])))
> > For some reason it has the CA reg not last.  I think we should add to
> > the canonicalisation rules so that fixed regs sort after other regs.
> > That requires a lot of testing.
> 
> Actually, I believe that the way CA is modeled at the moment is dangerous.
> It's not a 64-bit value, but a 1-bit value.

It's a fixed register and it is only ever set to 0 or 1.  There are
more targets that do such things, and it is safe.

I've tried with BImode before, with two effects: 1) the patterns become
much more unmanageable; and 2) the optimisers do a lousy job on it.
BImode isn't so well supported.

This was when I still modeled the carry output of "adde" and friends
though (it is always a clobber now).

> If we rearrange the expanded rtl to be (zero_extend:DI (reg:BI CA)), then
> normal canonicalization rules will apply and it'll always appear first in the
> chain of PLUS.

Let's wait for Alan's patch that makes combine not reorder things
unnecessarily, that should take care of it all as far as I see.


Segher


Re: [RFC] Combine related fail of gcc.target/powerpc/ti_math1.c

2015-05-21 Thread Richard Henderson
On 05/21/2015 11:44 AM, Segher Boessenkool wrote:
> On Thu, May 21, 2015 at 11:34:14AM -0700, Richard Henderson wrote:
>> Actually, I believe that the way CA is modeled at the moment is dangerous.
>> It's not a 64-bit value, but a 1-bit value.
> 
> It's a fixed register and it is only ever set to 0 or 1.  There are
> more targets that do such things, and it is safe.

Old Cygnus proverb: Lie to the compiler and it will always bite you in the end.

> I've tried with BImode before, with two effects: 1) the patterns become
> much more unmanageable; and 2) the optimisers do a lousy job on it.
> BImode isn't so well supported.

Really?  Zero-extending from BImode should be no different than from SImode,
and we handle that all the time.

> Let's wait for Alan's patch that makes combine not reorder things
> unnecessarily, that should take care of it all as far as I see.

I remain skeptical, but I'm also willing to let someone else worry about it.  
;-)


r~



Re: [i386] Scalar DImode instructions on XMM registers

2015-05-21 Thread Vladimir Makarov

On 05/21/2015 05:54 AM, Ilya Enkovich wrote:

> Thanks.  To me it looks like an inheritance bug.  It is really hard
> to fix the bug w/o the source code.  Could you send me your patch so
> that I can debug RA with it and investigate more.
>

Sure! Here is a patch and a testcase.  I applied patch to r222125.  Cmd to 
reproduce:

gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE
The problem is sharing a subreg between different insns.  Pseudos should be 
shared but not their subregs.


We have before inheritance:

   28: r132:V2DI=r132:V2DI|r126:DI#0
  REG_DEAD r126:DI
  REG_DEAD r118:DI
Inserting insn reload before:
   81: r132:V2DI=r118:DI#0
Inserting insn reload after:
   82: r108:DI#0=r132:V2DI
...
  Creating newreg=135, assigning class SSE_REGS to r135
   42: r135:V2DI=r135:V2DI&r108:DI#0
  REG_DEAD r127:DI
Inserting insn reload before:
   85: r135:V2DI=r127:DI#0
Inserting insn reload after:
   86: r108:DI#0=r135:V2DI

As the subregs of 108 in the original insns 28 and 42 are shared, the subregs 
of 108 in insns 82 and 86 are shared too.  During the inheritance subpass we 
change r108 in insn 82 to r137.  This changes insn 86 too.


  Creating newreg=137 from oldreg=108, assigning class 
NO_REX_SSE_REGS to inheritance r137

Original reg change 108->137 (bb2):
   82: r137:DI#0=r132:V2DI
  REG_DEAD r132:V2DI
Add original<-inheritance after:
   88: r108:DI=r137:DI

Inheritance reuse change 108->137 (bb2):
   68: r124:V2DI=r137:DI#0

And now we are trying to do inheritance for insn #86:

 Creating newreg=138 from oldreg=108, assigning class 
NO_REX_SSE_REGS to inheritance r138

Original reg change 108->138 (bb3):
   86: r137:DI#0=r135:V2DI
  REG_DEAD r135:V2DI
Add original<-inheritance after:
   89: r108:DI=r138:DI

Inheritance reuse change 108->138 (bb3):
   64: r123:V2DI=r137:DI#0

and after that we have a complete mess.  We are trying to change r108 
to r138, but r108 is already r137 because of sharing.  Later we undo 
the second inheritance, creating even more mess.


So, Ilya, to solve the problem you need to avoid sharing subregs for 
LRA/reload to work correctly.
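
A rough sketch of what that means on the emitting side (illustrative only,
assuming GCC's internal emit-rtl API; the variable names are invented):
generate a fresh SUBREG rtx for every insn instead of caching one and reusing
it:

/* Fragment for illustration; assumes GCC's rtl.h/emit-rtl interfaces.  */
static void
emit_two_uses (void)
{
  rtx scalar = gen_reg_rtx (DImode);     /* scalar DImode pseudo  */
  rtx vec1 = gen_reg_rtx (V2DImode);
  rtx vec2 = gen_reg_rtx (V2DImode);

  /* Wrong: one SUBREG rtx cached and reused, so the two insns share it.  */
  rtx shared = gen_rtx_SUBREG (V2DImode, scalar, 0);
  emit_move_insn (vec1, shared);
  emit_move_insn (vec2, shared);

  /* Right: build a fresh SUBREG for each insn that needs one.  */
  emit_move_insn (vec1, gen_rtx_SUBREG (V2DImode, scalar, 0));
  emit_move_insn (vec2, gen_rtx_SUBREG (V2DImode, scalar, 0));
}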





Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Will Deacon
On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
> On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote:
> > On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
> > > Indeed, something like this does -not- carry a dependency from the
> > > memory_order_consume load to q:
> > > 
> > >   char *p, *q;
> > > 
> > >   p = atomic_load_explicit(&gp, memory_order_consume);
> > >   q = gq + (intptr_t)p - (intptr_t)p;
> > > 
> > > If this was compiled with -O0, ARM and Power might well carry a
> > > dependency, but given any optimization, the assembly language would have
> > > no hint of any such dependency.  So I am not seeing any particular danger.
> > 
> > The above is a welcome relaxation over C11, since ARM doesn't even give
> > you ordering based off false data dependencies. My concern is more to do
with how this can be specified precisely without prohibiting honest compiler
> > and hardware optimisations.
> 
> That last is the challenge.  I believe that I am pretty close, but I am
> sure that additional adjustment will be required.  Especially given that
> we also need the memory model to be amenable to formal analysis.

Well, there's still the whole thin-air problem which unfortunately doesn't
go away with your proposal... (I was hoping that differentiating between
true and false dependencies would solve that, but your set of rules isn't
broad enough and I don't blame you at all for that!).

> > Out of interest, how do you tackle examples (4) and (5) of (assuming the
> > reads are promoted to consume loads)?:
> > 
> >   http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
> > 
> > my understanding is that you permit both outcomes (I appreciate you're
> > not directly tackling out-of-thin-air, but treatment of dependencies
> > is heavily related).

Thanks for taking the time to walk these two examples through.

> Let's see...  #4 is as follows, given promotion to memory_order_consume
> and (I am guessing) memory_order_relaxed:
> 
>   r1 = atomic_load_explicit(&x, memory_order_consume);
>   if (r1 == 42)
> atomic_store_explicit(&y, r1, memory_order_relaxed);
>   --
>   r2 = atomic_load_explicit(&y, memory_order_consume);
>   if (r2 == 42)
> atomic_store_explicit(&x, 42, memory_order_relaxed);
>   else
> atomic_store_explicit(&x, 42, memory_order_relaxed);
> 
> The second thread does not have a proper control dependency, even with
> the memory_order_consume load because both branches assign the same
> value to "x".  This means that the compiler is within its rights to
> optimize this into the following:
> 
>   r1 = atomic_load_explicit(&x, memory_order_consume);
>   if (r1 == 42)
> atomic_store_explicit(&y, r1, memory_order_relaxed);
>   --
>   r2 = atomic_load_explicit(&y, memory_order_consume);
>   atomic_store_explicit(&x, 42, memory_order_relaxed);
> 
> There is no dependency between the second thread's pair of statements,
> so both the compiler and the CPU are within their rights to optimize
> further as follows:
> 
>   r1 = atomic_load_explicit(&x, memory_order_consume);
>   if (r1 == 42)
> atomic_store_explicit(&y, r1, memory_order_relaxed);
>   --
>   atomic_store_explicit(&x, 42, memory_order_relaxed);
>   r2 = atomic_load_explicit(&y, memory_order_consume);
> 
> If the compiler makes this final optimization, even mythical SC hardware
> is within its rights to end up with (r1 == 42 && r2 == 42).  Which is
> fine, as far as I am concerned.  Or at least something that can be
> lived with.

Agreed.

> On to #5:
> 
>   r1 = atomic_load_explicit(&x, memory_order_consume);
>   if (r1 == 42)
> atomic_store_explicit(&y, r1, memory_order_relaxed);
>   
>   r2 = atomic_load_explicit(&y, memory_order_consume);
>   if (r2 == 42)
> atomic_store_explicit(&x, 42, memory_order_relaxed);
> 
> The first thread's accesses are dependency ordered.  The second thread's
> ordering is in a corner case that memory-barriers.txt does not cover.
> You are supposed to start control dependencies with READ_ONCE_CTRL(), not
> a memory_order_consume load (AKA rcu_dereference and friends).  However,
> Alpha would have a full barrier as part of the memory_order_consume load,
> and the rest of the processors would (one way or another) respect the
> control dependency.  And the compiler would have some fun trying to
> break it.

But this is interesting because the first thread is ordered whilst the
second is not, so doesn't that effectively forbid the compiler from
constant-folding values if it can't prove that there is no dependency
chain?

> So the current Linux memory model would allow (r1 == 42 && r2 == 42),
> but I don't know of any hardware/compiler combinat

Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote:
> On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
> > On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote:
> > > On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
> > > > Indeed, something like this does -not- carry a dependency from the
> > > > memory_order_consume load to q:
> > > > 
> > > > char *p, *q;
> > > > 
> > > > p = atomic_load_explicit(&gp, memory_order_consume);
> > > > q = gq + (intptr_t)p - (intptr_t)p;
> > > > 
> > > > If this was compiled with -O0, ARM and Power might well carry a
> > > > dependency, but given any optimization, the assembly language would have
> > > > no hint of any such dependency.  So I am not seeing any particular 
> > > > danger.
> > > 
> > > The above is a welcome relaxation over C11, since ARM doesn't even give
> > > you ordering based off false data dependencies. My concern is more to do
> > > with how this can be specified precisely without prohibiting honest compiler
> > > and hardware optimisations.
> > 
> > That last is the challenge.  I believe that I am pretty close, but I am
> > sure that additional adjustment will be required.  Especially given that
> > we also need the memory model to be amenable to formal analysis.
> 
> Well, there's still the whole thin-air problem which unfortunately doesn't
> go away with your proposal... (I was hoping that differentiating between
> true and false dependencies would solve that, but your set of rules isn't
> broad enough and I don't blame you at all for that!).
> 
> > > Out of interest, how do you tackle examples (4) and (5) of (assuming the
> > > reads are promoted to consume loads)?:
> > > 
> > >   http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
> > > 
> > > my understanding is that you permit both outcomes (I appreciate you're
> > > not directly tackling out-of-thin-air, but treatment of dependencies
> > > is heavily related).
> 
> Thanks for taking the time to walk these two examples through.

;-) ;-) ;-)

> > Let's see...  #4 is as follows, given promotion to memory_order_consume
> > and (I am guessing) memory_order_relaxed:
> > 
> > r1 = atomic_load_explicit(&x, memory_order_consume);
> > if (r1 == 42)
> >   atomic_store_explicit(&y, r1, memory_order_relaxed);
> > --
> > r2 = atomic_load_explicit(&y, memory_order_consume);
> > if (r2 == 42)
> >   atomic_store_explicit(&x, 42, memory_order_relaxed);
> > else
> >   atomic_store_explicit(&x, 42, memory_order_relaxed);
> > 
> > The second thread does not have a proper control dependency, even with
> > the memory_order_consume load because both branches assign the same
> > value to "x".  This means that the compiler is within its rights to
> > optimize this into the following:
> > 
> > r1 = atomic_load_explicit(&x, memory_order_consume);
> > if (r1 == 42)
> >   atomic_store_explicit(&y, r1, memory_order_relaxed);
> > --
> > r2 = atomic_load_explicit(&y, memory_order_consume);
> > atomic_store_explicit(&x, 42, memory_order_relaxed);
> > 
> > There is no dependency between the second thread's pair of statements,
> > so both the compiler and the CPU are within their rights to optimize
> > further as follows:
> > 
> > r1 = atomic_load_explicit(&x, memory_order_consume);
> > if (r1 == 42)
> >   atomic_store_explicit(&y, r1, memory_order_relaxed);
> > --
> > atomic_store_explicit(&x, 42, memory_order_relaxed);
> > r2 = atomic_load_explicit(&y, memory_order_consume);
> > 
> > If the compiler makes this final optimization, even mythical SC hardware
> > is within its rights to end up with (r1 == 42 && r2 == 42).  Which is
> > fine, as far as I am concerned.  Or at least something that can be
> > lived with.
> 
> Agreed.
> 
> > On to #5:
> > 
> > r1 = atomic_load_explicit(&x, memory_order_consume);
> > if (r1 == 42)
> >   atomic_store_explicit(&y, r1, memory_order_relaxed);
> > 
> > r2 = atomic_load_explicit(&y, memory_order_consume);
> > if (r2 == 42)
> >   atomic_store_explicit(&x, 42, memory_order_relaxed);
> > 
> > The first thread's accesses are dependency ordered.  The second thread's
> > ordering is in a corner case that memory-barriers.txt does not cover.
> > You are supposed to start control dependencies with READ_ONCE_CTRL(), not
> > a memory_order_consume load (AKA rcu_dereference and friends).  However,
> > Alpha would have a full barrier as part of the memory_order_consume load,
> > and the rest of the processors would (one way or another) respect the
> > control dependency.  And the compiler would have some fun trying to
> > break it.
> 
> But this is interesting because the first thread is ordered whilst the
> second is not, so d

Re: [i386] Scalar DImode instructions on XMM registers

2015-05-21 Thread Jeff Law

On 05/21/2015 01:08 PM, Vladimir Makarov wrote:

On 05/21/2015 05:54 AM, Ilya Enkovich wrote:

> Thanks.  To me it looks like an inheritance bug.  It is really hard
> to fix the bug w/o the source code.  Could you send me your patch so
> that I can debug RA with it and investigate more.
>

Sure! Here is a patch and a testcase.  I applied patch to r222125.
Cmd to reproduce:

gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE

The problem is sharing a subreg between different insns.  Pseudos should be
shared but not their subregs.

[ ... ]


So, Ilya, to solve the problem you need to avoid sharing subregs for
LRA/reload to work correctly.
If their code is sharing subregs, then most definitely that code is 
wrong.  GCC has very well defined rtx sharing rules that are defined in 
the developer documentation.


jeff


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-21 Thread Jakub Jelinek
On Thu, May 21, 2015 at 02:23:47PM -0600, Jeff Law wrote:
> On 05/21/2015 01:08 PM, Vladimir Makarov wrote:
> >On 05/21/2015 05:54 AM, Ilya Enkovich wrote:
> >>> Thanks.  To me it looks like an inheritance bug.  It is really hard
> >>> to fix the bug w/o the source code.  Could you send me your patch so
> >>> that I can debug RA with it and investigate more.
> 
> >>Sure! Here is a patch and a testcase.  I applied patch to r222125.
> >>Cmd to reproduce:
> >>
> >>gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE
> >The problem is sharing a subreg between different insns.  Pseudos should be
> >shared but not their subregs.
> [ ... ]
> >
> >So, Ilya, to solve the problem you need to avoid sharing subregs for
> >LRA/reload to work correctly.
> If their code is sharing subregs, then most definitely that code is wrong.
> GCC has very well defined rtx sharing rules that are defined in the
> developer documentation.

Shouldn't --enable-checking=rtl catch such bugs?

Jakub


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Linus Torvalds
On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
 wrote:
>
> The compiler can (and does) speculate non-atomic non-volatile writes
> in some cases, but I do not believe that it is permitted to speculate
> either volatile or atomic writes.

I do *not* believe that a compiler is ever allowed to speculate *any*
writes - volatile or not - unless the compiler can prove that the end
result is either single-threaded, or the write in question is
guaranteed to only be visible in that thread (ie local stack variable
etc).

Quite frankly, I'd be much happier if the C standard just said so outright.

Also, I do think that the whole "consume" read should be explained
better to compiler writers. Right now the language (including very
much in the "restricted dependency" model) is described in very
abstract terms. Yet those abstract terms are actually very subtle and
complex, and very opaque to a compiler writer.

If I was a compiler writer, I'd absolutely detest that definition.
It's very far removed from my problem space as a compiler writer, and
nothing in the language *explains* the odd and subtle abstract rules.
It smells ad-hoc to me.

Now, I actually understand the point of those odd and abstract rules,
but to a compiler writer that doesn't understand the background, the
whole section reads as "this is really painful for me to track all
those dependencies and what kills them".

So I would very much suggest that there would be language that
*explains* this. Basically, tell the compiler writer:

 (a) the "official" rules are completely pointless, and make sense
only because the standard is written for some random "abstract
machine" that doesn't actually exist.

 (b) in *real life*, the whole and only point of the rules is to make
sure that the compiler doesn't turn a data dependency into a control
dependency, which on ARM and POWERPC does not honor causal memory
ordering

 (c) on x86, since *all* memory accesses are causal, all the magical
dependency rules are just pointless anyway, and what it really means
is that you cannot re-order accesses with value speculation.

 (d) the *actual* relevant rule for a compiler writer is very simple:
the compiler must not do value speculation on a "consume" load, and
the abstract machine rules are written so that any other sane
optimization is legal.

 (e) if the compiler writer really thinks they want to do value
speculation, they have to turn the "consume" load into an "acquire"
load. And you have to do that anyway on architectures like alpha that
aren't causal even for data dependencies.
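
As a concrete illustration of the rewrite being ruled out (the types and
names below are invented), value speculation turns the data dependency from a
consume load into a control dependency, which per point (b) ARM and POWERPC
do not order:

#include <stdatomic.h>

struct foo { int a; };
extern _Atomic (struct foo *) gp;
extern struct foo *guess;               /* a value the compiler predicts */

/* As written: reading p->a carries a data dependency on the consume load.  */
int reader (void)
{
  struct foo *p = atomic_load_explicit (&gp, memory_order_consume);
  return p->a;
}

/* Value-speculated form: the data dependency has become a control
   dependency, which ARM/Power do not order -- exactly the rewrite a
   compiler must not apply to a consume load (or it must use acquire).  */
int reader_speculated (void)
{
  struct foo *p = atomic_load_explicit (&gp, memory_order_consume);
  if (p == guess)
    return guess->a;
  return p->a;
}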

I personally think the whole "abstract machine" model of the C
language is a mistake. It would be much better to talk about things in
terms of actual code generation and actual issues. Make all the
problems much more concrete, with actual examples of how memory
ordering matters on different architectures.

99% of all the problems with the whole "consume" memory ordering comes
not from anything relevant to a compiler writer. All of it comes from
trying to "define" the issue in the wrong terms.

 Linus


Re: Is there a way to adjust alignment of DImode and DFmode?

2015-05-21 Thread Jim Wilson
On 05/20/2015 10:00 AM, H.J. Lu wrote:
> By default, alignment of DImode and DFmode is set to 8 bytes.
> Intel MCU psABI specifies alignment of DImode and DFmode
> to be 4 bytes. I'd like to make get_mode_alignment to return
> 32 bits for DImode and DFmode.   Is there a way to adjust alignment
> of DImode and DFmode via ADJUST_ALIGNMENT?

I see that i386-modes.def already uses ADJUST_ALIGNMENT to change the
alignment of XFmode to 4 for ilp32 code.  ADJUST_ALIGNMENT should work
the same for DImode and DFmode.  Did you run into a problem when you
tried it?
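
For illustration, such adjustments in i386-modes.def would look roughly like
the lines below (a sketch only; TARGET_IAMCU stands in for whatever condition
the Intel MCU target actually uses, and 8 bytes is the default being
overridden):

ADJUST_ALIGNMENT (DI, TARGET_IAMCU ? 4 : 8);
ADJUST_ALIGNMENT (DF, TARGET_IAMCU ? 4 : 8);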

Jim



Re: Is there a way to adjust alignment of DImode and DFmode?

2015-05-21 Thread H.J. Lu
On Thu, May 21, 2015 at 1:56 PM, Jim Wilson  wrote:
> On 05/20/2015 10:00 AM, H.J. Lu wrote:
>> By default, alignment of DImode and DFmode is set to 8 bytes.
>> Intel MCU psABI specifies alignment of DImode and DFmode
>> to be 4 bytes. I'd like to make get_mode_alignment to return
>> 32 bits for DImode and DFmode.   Is there a way to adjust alignment
>> of DImode and DFmode via ADJUST_ALIGNMENT?
>
> I see that i386-modes.def already uses ADJUST_ALIGNMENT to change the
> alignment of XFmode to 4 for ilp32 code.  ADJUST_ALIGNMENT should work
> the same for DImode and DFmode.  Did you run into a problem when you
> tried it?

It seems to work.  I don't know why it failed for me last time.


-- 
H.J.


Re: [RFC] Combine related fail of gcc.target/powerpc/ti_math1.c

2015-05-21 Thread Oleg Endo
On Thu, 2015-05-21 at 11:59 -0700, Richard Henderson wrote:
> On 05/21/2015 11:44 AM, Segher Boessenkool wrote:
> > On Thu, May 21, 2015 at 11:34:14AM -0700, Richard Henderson wrote:
> >> Actually, I believe that the way CA is modeled at the moment is dangerous.
> >> It's not a 64-bit value, but a 1-bit value.
> > 
> > It's a fixed register and it is only ever set to 0 or 1.  There are
> > more targets that do such things, and it is safe.
> 
> Old Cygnus proverb: Lie to the compiler and it will always bite you in the 
> end.

Just for the record, the same is being done on SH with the T bit.  It's
a fixed 1 bit hardreg, but declared and treated as SImode, because all
the other integer arithmetic is done primarily in SImode, too.  No
significant problems with that.

Cheers,
Oleg



Re: Is there a way to adjust alignment of DImode and DFmode?

2015-05-21 Thread H.J. Lu
On Thu, May 21, 2015 at 2:08 PM, H.J. Lu  wrote:
> On Thu, May 21, 2015 at 1:56 PM, Jim Wilson  wrote:
>> On 05/20/2015 10:00 AM, H.J. Lu wrote:
>>> By default, alignment of DImode and DFmode is set to 8 bytes.
>>> Intel MCU psABI specifies alignment of DImode and DFmode
>>> to be 4 bytes. I'd like to make get_mode_alignment to return
>>> 32 bits for DImode and DFmode.   Is there a way to adjust alignment
>>> of DImode and DFmode via ADJUST_ALIGNMENT?
>>
>> I see that i386-modes.def already uses ADJUST_ALIGNMENT to change the
>> alignment of XFmode to 4 for ilp32 code.  ADJUST_ALIGNMENT should work
>> the same for DImode and DFmode.  Did you run into a problem when you
>> tried it?
>
> It seems to work.  I don't know why it failed for me last time.
>

Now I remembered.  It doesn't work for complex and decimal
floating point modes:

build/genmodes: config/i386/i386-modes.def:41: no mode "DD"
build/genmodes: config/i386/i386-modes.def:42: no mode "TD"



-- 
H.J.


Re: Is there a way to adjust alignment of DImode and DFmode?

2015-05-21 Thread H.J. Lu
On Thu, May 21, 2015 at 2:25 PM, H.J. Lu  wrote:
> On Thu, May 21, 2015 at 2:08 PM, H.J. Lu  wrote:
>> On Thu, May 21, 2015 at 1:56 PM, Jim Wilson  wrote:
>>> On 05/20/2015 10:00 AM, H.J. Lu wrote:
 By default, alignment of DImode and DFmode is set to 8 bytes.
 Intel MCU psABI specifies alignment of DImode and DFmode
 to be 4 bytes. I'd like to make get_mode_alignment to return
 32 bits for DImode and DFmode.   Is there a way to adjust alignment
 of DImode and DFmode via ADJUST_ALIGNMENT?
>>>
>>> I see that i386-modes.def already uses ADJUST_ALIGNMENT to change the
>>> alignment of XFmode to 4 for ilp32 code.  ADJUST_ALIGNMENT should work
>>> the same for DImode and DFmode.  Did you run into a problem when you
>>> tried it?
>>
>> It seems to work.  I don't know why it failed for me last time.
>>
>
> Now I remembered.  It doesn't work for complex and decimal
> floating point modes:
>
> build/genmodes: config/i386/i386-modes.def:41: no mode "DD"
> build/genmodes: config/i386/i386-modes.def:42: no mode "TD"
>

machmode.def has

/* Allow the target to specify additional modes of various kinds.  */
#if HAVE_EXTRA_MODES
# include EXTRA_MODES_FILE
#endif

/* Complex modes.  */
COMPLEX_MODES (INT);
COMPLEX_MODES (FLOAT);

/* Decimal floating point modes.  */
DECIMAL_FLOAT_MODE (SD, 4, decimal_single_format);
DECIMAL_FLOAT_MODE (DD, 8, decimal_double_format);
DECIMAL_FLOAT_MODE (TD, 16, decimal_quad_format);

We can't adjust the complex and decimal floating point modes in
i386-modes.def since they aren't defined yet at that point.  But we
need to include i386-modes.def
before

COMPLEX_MODES (FLOAT);

to get XCmode.
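
For illustration, the kind of adjustment that fails here would be
something along these lines (a sketch, using the psABI alignment of
4 bytes):

/* Sketch only: genmodes rejects these in i386-modes.def with
   'no mode "DD"' / 'no mode "TD"', because the decimal float modes
   are only created later in machmode.def.  */
ADJUST_ALIGNMENT (DD, 4);
ADJUST_ALIGNMENT (TD, 4);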

Should we add an EXTRA_ALIGNMENTS_FILE and include it
after all modes are created?


-- 
H.J.


Re: Is there a way to adjust alignment of DImode and DFmode?

2015-05-21 Thread H.J. Lu
On Thu, May 21, 2015 at 2:40 PM, H.J. Lu  wrote:
> On Thu, May 21, 2015 at 2:25 PM, H.J. Lu  wrote:
>> On Thu, May 21, 2015 at 2:08 PM, H.J. Lu  wrote:
>>> On Thu, May 21, 2015 at 1:56 PM, Jim Wilson  wrote:
>>>> On 05/20/2015 10:00 AM, H.J. Lu wrote:
>>>>> By default, alignment of DImode and DFmode is set to 8 bytes.
>>>>> Intel MCU psABI specifies alignment of DImode and DFmode
>>>>> to be 4 bytes. I'd like to make get_mode_alignment return
>>>>> 32 bits for DImode and DFmode.   Is there a way to adjust alignment
>>>>> of DImode and DFmode via ADJUST_ALIGNMENT?
>>>>
>>>> I see that i386-modes.def already uses ADJUST_ALIGNMENT to change the
>>>> alignment of XFmode to 4 for ilp32 code.  ADJUST_ALIGNMENT should work
>>>> the same for DImode and DFmode.  Did you run into a problem when you
>>>> tried it?
>>>
>>> It seems to work.  I don't know why it failed for me last time.
>>>
>>
>> Now I remembered.  It doesn't work for complex and decimal
>> floating point modes:
>>
>> build/genmodes: config/i386/i386-modes.def:41: no mode "DD"
>> build/genmodes: config/i386/i386-modes.def:42: no mode "TD"
>>
>
> machmode.def has
>
> /* Allow the target to specify additional modes of various kinds.  */
> #if HAVE_EXTRA_MODES
> # include EXTRA_MODES_FILE
> #endif
>
> /* Complex modes.  */
> COMPLEX_MODES (INT);
> COMPLEX_MODES (FLOAT);
>
> /* Decimal floating point modes.  */
> DECIMAL_FLOAT_MODE (SD, 4, decimal_single_format);
> DECIMAL_FLOAT_MODE (DD, 8, decimal_double_format);
> DECIMAL_FLOAT_MODE (TD, 16, decimal_quad_format);
>
> We can't adjust the complex and decimal floating point modes in
> i386-modes.def since they aren't defined yet at that point.  But we
> need to include i386-modes.def
> before
>
> COMPLEX_MODES (FLOAT);
>
> to get XCmode.
>
> Should we add an EXTRA_ALIGNMENTS_FILE and include it
> after all modes are created?

I opened:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66250

-- 
H.J.


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 01:42:11PM -0700, Linus Torvalds wrote:
> On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
>  wrote:
> >
> > The compiler can (and does) speculate non-atomic non-volatile writes
> > in some cases, but I do not believe that it is permitted to speculate
> > either volatile or atomic writes.
> 
> I do *not* believe that a compiler is ever allowed to speculate *any*
> writes - volatile or not - unless the compiler can prove that the end
> result is either single-threaded, or the write in question is
> guaranteed to only be visible in that thread (ie local stack variable
> etc).
> 
> Quite frankly, I'd be much happier if the C standard just said so outright.

So would I.  ;-)

The usual example is the following, where x is non-volatile and
non-atomic:

  if (a)
    x = 42;
  else
    x = 7;

The current rules allow this to be transformed as follows:

  x = 7;
  if (a)
    x = 42;

So the variable x takes on the value 7 momentarily when it otherwise
would not have.

At least C11 now prohibits "drive-by" speculative writes, where the
compiler writes a larger size and then fixes things up afterwards.
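
A minimal sketch of the kind of invented wider store that is now
forbidden (example and names are mine):

struct s { char a; char b; };

void
set_a (struct s *p, char v)
{
  /* C11 requires this to stay a plain byte store.  The old
     "drive-by" transformation -- load 16 bits covering both a and b,
     merge in v, store 16 bits back -- would also write to p->b and
     could wipe out a concurrent update of b by another thread.  */
  p->a = v;
}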

> Also, I do think that the whole "consume" read should be explained
> better to compiler writers. Right now the language (including very
> much in the "restricted dependency" model) is described in very
> abstract terms. Yet those abstract terms are actually very subtle and
> complex, and very opaque to a compiler writer.
> 
> If I was a compiler writer, I'd absolutely detest that definition.
> It's very far removed from my problem space as a compiler writer, and
> nothing in the language *explains* the odd and subtle abstract rules.
> It smells ad-hoc to me.
> 
> Now, I actually understand the point of those odd and abstract rules,
> but to a compiler writer that doesn't understand the background, the
> whole section reads as "this is really painful for me to track all
> those dependencies and what kills them".
> 
> So I would very much suggest that there would be language that
> *explains* this. Basically, tell the compiler writer:
> 
>  (a) the "official" rules are completely pointless, and make sense
> only because the standard is written for some random "abstract
> machine" that doesn't actually exist.
> 
>  (b) in *real life*, the whole and only point of the rules is to make
> sure that the compiler doesn't turn a data dependency into a control
> dependency, which on ARM and POWERPC does not honor causal memory
> ordering.
> 
>  (c) on x86, since *all* memory accesses are causal, all the magical
> dependency rules are just pointless anyway, and what it really means
> is that you cannot re-order accesses with value speculation.
> 
>  (d) the *actual* relevant rule for a compiler writer is very simple:
> the compiler must not do value speculation on a "consume" load, and
> the abstract machine rules are written so that any other sane
> optimization is legal.
> 
>  (e) if the compiler writer really thinks they want to do value
> speculation, they have to turn the "consume" load into an "acquire"
> load. And you have to do that anyway on architectures like alpha that
> aren't causal even for data dependencies.

I am certainly more than willing to add this sort of wording to
the document.
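
For concreteness, a minimal C11 sketch (the names are made up) of the
pattern these rules are protecting:

#include <stdatomic.h>

struct foo { int data; };
_Atomic(struct foo *) gp;	/* assumed to be published elsewhere */

int
reader (void)
{
  /* The address dependency from p to p->data is what carries the
     ordering on ARM and POWER.  Value-speculating p would turn that
     data dependency into a control dependency, which those
     architectures do not order, so a compiler that wants to
     value-speculate has to use an acquire load instead.  */
  struct foo *p = atomic_load_explicit (&gp, memory_order_consume);
  return p->data;
}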

> I personally think the whole "abstract machine" model of the C
> language is a mistake. It would be much better to talk about things in
> terms of actual code generation and actual issues. Make all the
> problems much more concrete, with actual examples of how memory
> ordering matters on different architectures.
> 
> 99% of all the problems with the whole "consume" memory ordering comes
> not from anything relevant to a compiler writer. All of it comes from
> trying to "define" the issue in the wrong terms.

I certainly cannot deny that there seems to be significant confusion
in the discussion thus far.

Thanx, Paul



gcc-4.8-20150521 is now available

2015-05-21 Thread gccadmin
Snapshot gcc-4.8-20150521 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.8-20150521/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.8 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_8-branch 
revision 223505

You'll find:

 gcc-4.8-20150521.tar.bz2 Complete GCC

  MD5=8d0bef153f23afce2e3303da4b122d17
  SHA1=f965139b06ba671b1873f75c100046ad7403d3a9

Diffs from 4.8-20150514 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.8
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: [RFC] Combine related fail of gcc.target/powerpc/ti_math1.c

2015-05-21 Thread Alan Modra
On Thu, May 21, 2015 at 07:39:16AM -0500, Segher Boessenkool wrote:
> On Thu, May 21, 2015 at 08:06:04PM +0930, Alan Modra wrote:
> > FAIL: gcc.target/powerpc/ti_math1.c scan-assembler-times adde 1
> 
> It doesn't trigger on big-endian; what is different?

Register dependencies.  One of the arguments is in r4,r5, the return
value in r3,r4.  We calculate the low 64 bits first, which go to r4
on big-endian, overlapping the argument.

> > Trying 18, 9 -> 24:
> > Failed to match this instruction:
> > (set (reg:DI 4 4 [+8 ])
> > (plus:DI (plus:DI (reg:DI 5 5 [ val+8 ])
> > (reg:DI 76 ca))
> > (reg:DI 169 [+8 ])))
> 
> For some reason it has the CA reg not last.

simplify-rtx.c:simplify_plus_minus_op_data_cmp

>  I think we should add to
> the canonicalisation rules so that fixed regs sort after other regs.
> That requires a lot of testing.

What if you have two hard regs as above?  Which of reg 5 and reg 76
sorts first?  If they are sorted by register number, then ca appears
in the wrong place.  Reverse sorting hard regs might work for this
pattern on powerpc, but that seems an odd choice.  And if you say hard
regs ought to keep their original order in rtl like the above, then it
is no more difficult to keep all regs in their original order.

> > original costs 4 + 8 + 4 = 16
> > replacement costs 4 + 4 = 8
> 
> Still need to fix the costs as well (but they work as-is; well enough
> that is).

Yes, I noticed that too.

> Are these copies guaranteed to (still) be in this basic block,
> after the passes before combine?  Did those passes do anything to
> prevent moving it?  I'm asking because it would be good to use the
> same conditions in that case.

Something I need to investigate.  As I said, the patch was just a
quick hack.

-- 
Alan Modra
Australia Development Lab, IBM


Re: optimization question

2015-05-21 Thread Martin Uecker
On Thu, 21 May 2015 07:45:19 -0500,
mark maule  wrote:

> 
> 
> On 5/20/2015 2:13 PM, Martin Uecker wrote:
> >   mark maule :
> >> On 5/20/2015 3:27 AM, Richard Biener wrote:
> >>> On Mon, May 18, 2015 at 10:01 PM, mark maule  
> >>> wrote:
> >>> The usual issue with this kind of behavior is out-of-bound accesses of
> >>> arrays in a loop
> >>> or invoking undefined behavior when signed integer operations wrap.
> >>>
> >>>
> >>>  uint32_t outLun[ BS_CFG_DRIVE_GROUPS ];
> >>>
> >>> and
> >>>
> >>> while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) &&
> >>> ...
> >>>dgDestageOut = bs_destageData.outLun[ dgHandle ];
> >>>
> >>> looks like this might access outLun[BS_CFG_DRIVE_GROUPS] which is
> >>> out-of-bounds.
> >>>
> >>> Richard.
> >> You are correct, and when I change outLun[] to be size
> >> BS_CFG_DRIVE_GROUPS+1, the generated asm looks like it will account for
> >> dgHandle in the while() loop.  I will pass this back to our development
> >> team to get a proper fix.
> >>
> >> Now, the followon:  Something in the compiler/optimizer recognized this
> >> out of bounds situation - should a warning have been emitted? Or are
> >> there ambiguities which make a warning generation here inappropriate?
> > Yes, ideally a compiler should emit a warning. C compilers traditionally
> > were not very good at this, but it turns out very recent versions of GCC
> > can do this:
> >
> > test.c:14:23: warning: iteration 10u invokes undefined behavior 
> > [-Waggressive-loop-optimizations]
> >dgDestageOut = outLun[ dgHandle ];
> > ^
> > test.c:11:13: note: containing loop
> > while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) )
> >
> >
> > For this simplified test case:
> >
> > #include <stdint.h>
> >
> > #define BS_CFG_DRIVE_GROUPS 10
> > uint32_t dgDestageLimit = 0;
> > uint32_t outLun[ BS_CFG_DRIVE_GROUPS ];
> >
> > void test(void)
> > {
> > int dgHandle = 0;
> >
> >while ( ( dgHandle < ( BS_CFG_DRIVE_GROUPS + 1 ) ) )
> >{
> >   uint32_t dgDestageOut;
> >   dgDestageOut = outLun[ dgHandle ];
> >   dgHandle++;
> >}
> > }
> >
> >
> > Martin
> 
> Thanks Martin.
> 
> My gcc (4.8.3) must not be recent enough.  Any idea when this sort of 
> checking went in?
> 

Sorry, I don't know.  I think 4.8 already has this warning,
but maybe it isn't smart enough to detect this case.  Also, I think
there is no guarantee that all errors of this type are detected
with GCC 5.  And this may also depend on your optimizer flags
and the phase of the moon.

If you want to detect this kind of bug, you may find
-fsanitize=undefined useful for detecting undefined behaviour
at runtime.  There are also other code-analysis tools which may
help.
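
For the sanitizer, something along these lines (a sketch; the exact
diagnostics depend on the GCC version and flags used):

gcc -O2 -g -fsanitize=undefined test.c -o test && ./test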

Martin




Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Ingo Molnar

* Linus Torvalds  wrote:

>  (a) the "official" rules are completely pointless, and make sense 
> only because the standard is written for some random "abstract 
> machine" that doesn't actually exist.

Presuming the intent of the abstract machine specification is to avoid 
being seen as biased towards any specific machine (politics), maybe 
write this as:

   (a) the "official" rules are written for a somewhat weird and 
   complex "union of all known and theoretically possible CPU 
   architectures that exist or which might exist in the future", 
   which machine does not actually exist in practice, but which 
   allows a single abstract set of rules to apply to all machines. 
   These rules are complex, but if applied to a specific machine 
   they become considerably simpler. Here's a few examples: ...

?

(Assuming it's a goal of this standard to be human parseable to more 
than a few dozen people on the planet.)

Thanks,

Ingo