Possibly latent issue with combine ?

2019-06-26 Thread Prathamesh Kulkarni
Hi,
For the following test case, taken from pr88152.C:

#include 

template 
using V [[gnu::vector_size(N)]] = T;

int f10 (V a)
{
  return _mm_movemask_pd (reinterpret_cast<__m128d> (a > __LONG_LONG_MAX__));
}

.optimized dump shows:

f10 (V a)
{
  vector(2) signed long _1;
  vector(2) long int _2;
  vector(2) double _3;
  int _6;

   [local count: 1073741824]:
  _1 = VIEW_CONVERT_EXPR(a_4(D));
  _2 = VEC_COND_EXPR <_1 < { 0, 0 }, { -1, -1 }, { 0, 0 }>;
  _3 = VIEW_CONVERT_EXPR<__m128d>(_2);
  _6 = __builtin_ia32_movmskpd (_3); [tail call]
  return _6;

}

IIUC, we're using -1 to represent true and 0 to represent false.
Combine then does the following combinations:

Trying 7 -> 9:
7: r90:V2DI=r89:V2DI>r93:V2DI
  REG_DEAD r93:V2DI
  REG_DEAD r89:V2DI
9: r91:V2DF=r90:V2DI#0
  REG_DEAD r90:V2DI
Successfully matched this instruction:
(set (subreg:V2DI (reg:V2DF 91) 0)
(gt:V2DI (reg:V2DI 89)
(reg:V2DI 93)))
allowing combination of insns 7 and 9

Trying 6, 9 -> 10:
6: r89:V2DI=const_vector
9: r91:V2DF#0=r89:V2DI>r93:V2DI
  REG_DEAD r89:V2DI
  REG_DEAD r93:V2DI
   10: r87:SI=unspec[r91:V2DF] 43
  REG_DEAD r91:V2DF
Successfully matched this instruction:
(set (reg:SI 87)
(unspec:SI [
(lt:V2DF (reg:V2DI 93)
(const_vector:V2DI [
(const_int 0 [0]) repeated x2
]))
] UNSPEC_MOVMSK))

Is the above folding correct, since lt has V2DF mode,
and reinterpreting -1 (literally, bit for bit) as DFmode would result in -NaN?
Also, should the result of lt have only integral modes?
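
To make the -NaN concern concrete, here is a minimal scalar sketch, assuming
IEEE 754 binary64 doubles. It models the V2DF subreg of a V2DI value as a
bit-for-bit reinterpretation of one 64-bit lane. The helper names
(bits_to_double, is_nan, sign_bit) are hypothetical and are not GCC internals.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a 64-bit integer as a double: the scalar
   analogue of taking a DF subreg of a DI value.  Assumes binary64.  */
static double
bits_to_double (int64_t bits)
{
  double d;
  assert (sizeof d == sizeof bits);
  memcpy (&d, &bits, sizeof d);
  return d;
}

/* A NaN is the only value that compares unequal to itself.  */
static int
is_nan (double d)
{
  return d != d;
}

/* Return bit 63 of the representation, i.e. the IEEE sign bit.  */
static int
sign_bit (double d)
{
  uint64_t bits;
  memcpy (&bits, &d, sizeof bits);
  return (int) (bits >> 63);
}
```

With these helpers, bits_to_double (-1) is a NaN with its sign bit set, while
bits_to_double (0) is +0.0. So the "true" mask is not a meaningful
floating-point truth value; only its sign bit carries information.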

split2 then folds insn 10 into:

(insn 22 9 16 2 (set (reg:SI 0 ax [87])
(unspec:SI [
(reg:V2DF 20 xmm0 [93])
] UNSPEC_MOVMSK))
"../../stage1-build/gcc/include/emmintrin.h":958:34 4222
{sse2_movmskpd}
 (nil))

deleting insn 10.
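
As a sanity check on why the final movmskpd can read the input register
directly, here is a minimal scalar sketch. It assumes each lane is modelled
as an int64_t and that movmskpd only extracts bit 63 of each lane. All
function names are hypothetical, and this is not a model of the RTL itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One lane of the vector compare: all-ones for true, zero for false.  */
static int64_t
lt_zero_mask (int64_t x)
{
  return x < 0 ? -1 : 0;
}

/* One lane of movmskpd: extract bit 63 of the lane.  */
static int
top_bit (int64_t x)
{
  return (int) ((uint64_t) x >> 63);
}

/* Two-lane movmskpd applied to the result of the compare.  */
static int
movmsk_of_compare (const int64_t v[2])
{
  return top_bit (lt_zero_mask (v[0])) | (top_bit (lt_zero_mask (v[1])) << 1);
}

/* Two-lane movmskpd applied directly to the compare's input.  */
static int
movmsk_of_input (const int64_t v[2])
{
  return top_bit (v[0]) | (top_bit (v[1]) << 1);
}

/* Check that both forms agree on a few boundary values.  */
static void
check_equivalence (void)
{
  static const int64_t samples[][2] = {
    { 0, 0 }, { -1, 1 }, { INT64_MIN, INT64_MAX }, { -5, -7 }
  };
  for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
    assert (movmsk_of_compare (samples[i]) == movmsk_of_input (samples[i]));
}
```

The two forms agree because the sign bit of (x < 0 ? -1 : 0) is the sign bit
of x, which is the property the fold relies on.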

The issue is that my patch for PR88833 results in the following propagation in forwprop1:

In insn 10, replacing
 (unspec:SI [
(reg:V2DF 91)
] UNSPEC_MOVMSK)
 with (unspec:SI [
(subreg:V2DF (reg:V2DI 90) 0)
] UNSPEC_MOVMSK)

deleting insn 9, and this inhibits the above combinations,
resulting in the failure of pr88152.C.

With the patch, combine shows:
Trying 7 -> 10:
7: r90:V2DI=r89:V2DI>r93:V2DI
  REG_DEAD r93:V2DI
  REG_DEAD r89:V2DI
   10: r87:SI=unspec[r90:V2DI#0] 43
  REG_DEAD r90:V2DI
Failed to match this instruction:
(set (reg:SI 87)
(unspec:SI [
(subreg:V2DF (gt:V2DI (reg:V2DI 89)
(reg:V2DI 93)) 0)
] UNSPEC_MOVMSK))

and subsequently fails to match 6, 7 -> 10

Patch:
http://people.linaro.org/~prathamesh.kulkarni/pr88833-10.diff

Upstream discussion about the issue:
https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01651.html

Thanks,
Prathamesh


CFG generation from C/C++ and JAVA

2019-06-26 Thread charfi asma via gcc
Hello,
I am interested in generating the CFG from several GCC front ends. Everything
works fine for GCC and G++, and we are also interested in Java.
We have seen that GCJ is no longer maintained or distributed with GCC. However,
we do not want to compile the input Java code into assembly or binary code; we
would just like to analyze (with our tools) the CFG produced by each front end,
including the gcj front end.
We downloaded GCC 6.5.0, which contains gcj, but when trying to run "gcj -v"
we got this error: "libgcj.spec: No such file or directory."
Any ideas? We really need to evaluate the Java front end, even in a previous
version of GCC (just to dump the GENERIC, GIMPLE and CFG intermediate
representations).
Thank you very much!
Asma 

Re: Possibly latent issue with combine ?

2019-06-26 Thread Segher Boessenkool
On Wed, Jun 26, 2019 at 07:27:20PM +0530, Prathamesh Kulkarni wrote:
> combine then does following combinations:
> 
> Trying 7 -> 9:
> 7: r90:V2DI=r89:V2DI>r93:V2DI
>   REG_DEAD r93:V2DI
>   REG_DEAD r89:V2DI
> 9: r91:V2DF=r90:V2DI#0
>   REG_DEAD r90:V2DI
> Successfully matched this instruction:
> (set (subreg:V2DI (reg:V2DF 91) 0)
> (gt:V2DI (reg:V2DI 89)
> (reg:V2DI 93)))
> allowing combination of insns 7 and 9
> 
> Trying 6, 9 -> 10:
> 6: r89:V2DI=const_vector
> 9: r91:V2DF#0=r89:V2DI>r93:V2DI
>   REG_DEAD r89:V2DI
>   REG_DEAD r93:V2DI
>10: r87:SI=unspec[r91:V2DF] 43
>   REG_DEAD r91:V2DF
> Successfully matched this instruction:
> (set (reg:SI 87)
> (unspec:SI [
> (lt:V2DF (reg:V2DI 93)
> (const_vector:V2DI [
> (const_int 0 [0]) repeated x2
> ]))
> ] UNSPEC_MOVMSK))

Both of these are obviously correct, they are simple substitutions.
If this does not do what you want, the problem is elsewhere, not in
combine, afaics?

> Is the above folding correct, since lt has V2DF mode,
> and casting -1 (literally) to DFmode would result in -NaN ?

Combine does not introduce any of that, it was there already.

> Also, should result of lt be having only integral modes ?

Apparently your machine description likes this fine.  Combine does not
ask questions.


Segher


Re: Possibly latent issue with combine ?

2019-06-26 Thread Richard Sandiford
Segher Boessenkool writes:
> On Wed, Jun 26, 2019 at 07:27:20PM +0530, Prathamesh Kulkarni wrote:
>> combine then does following combinations:
>> 
>> Trying 7 -> 9:
>> 7: r90:V2DI=r89:V2DI>r93:V2DI
>>   REG_DEAD r93:V2DI
>>   REG_DEAD r89:V2DI
>> 9: r91:V2DF=r90:V2DI#0
>>   REG_DEAD r90:V2DI
>> Successfully matched this instruction:
>> (set (subreg:V2DI (reg:V2DF 91) 0)
>> (gt:V2DI (reg:V2DI 89)
>> (reg:V2DI 93)))
>> allowing combination of insns 7 and 9
>> 
>> Trying 6, 9 -> 10:
>> 6: r89:V2DI=const_vector
>> 9: r91:V2DF#0=r89:V2DI>r93:V2DI
>>   REG_DEAD r89:V2DI
>>   REG_DEAD r93:V2DI
>>10: r87:SI=unspec[r91:V2DF] 43
>>   REG_DEAD r91:V2DF
>> Successfully matched this instruction:
>> (set (reg:SI 87)
>> (unspec:SI [
>> (lt:V2DF (reg:V2DI 93)
>> (const_vector:V2DI [
>> (const_int 0 [0]) repeated x2
>> ]))
>> ] UNSPEC_MOVMSK))
>
> Both of these are obviously correct, they are simple substitutions.
> If this does not do what you want, the problem is elsewhere, not in
> combine, afaics?

"Obviously" correct seems a stretch :-)  We can only fold:

  (subreg:V2DF (foo:V2DI X) 0)

to:

  (foo:V2DF X)

for certain operations.  E.g. it'd be wrong to do it for foo=plus.
IMO it's wrong for comparisons too.  A comparison between integers
that produces a floating-point result makes no sense, whatever the
target thinks about it.

>> Is the above folding correct, since lt has V2DF mode,
>> and casting -1 (literally) to DFmode would result in -NaN ?
>
> Combine does not introduce any of that, it was there already.

The original insns had an lt:V2DI between V2DI inputs and a V2DF
subreg of the result.  It's combine that turns that into a lt:V2DF
between V2DI inputs.

Richard

>> Also, should result of lt be having only integral modes ?
>
> Apparently your machine description likes this fine.  Combine does not
> ask questions.
>
>
> Segher


Re: Possibly latent issue with combine ?

2019-06-26 Thread Segher Boessenkool
On Wed, Jun 26, 2019 at 05:45:48PM +0100, Richard Sandiford wrote:
> "Obviously" correct seems a stretch :-)  We can only fold:
> 
>   (subreg:V2DF (foo:V2DI X) 0)
> 
> to:
> 
>   (foo:V2DF X)
> 
> for certain operations.
> 
> E.g. it'd be wrong to do it for foo=plus.

You would need to change X then, sure, so you cannot get that by doing a
simple substitution.  But this is lt, and it makes (structurally) perfect
sense here, the mode of lt does not depend on the mode of its args.  The
target should refuse it if it doesn't like it.  Simply by not having too
lenient patterns in the machine descriptions, probably.

> IMO it's wrong for comparisons too.  A comparison between integers
> that produces a floating-point result makes no sense, whatever the
> target thinks about it.

Then the target should not say it makes sense?

> >> Is the above folding correct, since lt has V2DF mode,
> >> and casting -1 (literally) to DFmode would result in -NaN ?
> >
> > Combine does not introduce any of that, it was there already.
> 
> The original insns had an lt:V2DI between V2DI inputs and a V2DF
> subreg of the result.  It's combine that turns that into a lt:V2DF
> between V2DI inputs.

Combine did only simple substitution as far as I can see.


Segher


[GSoC'19] First Evaluations: Implementing OpenMP Work Stealing Scheduling

2019-06-26 Thread 김규래
Hi everyone,
I'll share my status for the GSoC first evaluation.
 
Current status of libgomp task system:
I'll first summarize my understanding of libgomp.
Please correct me if I'm wrong.
Currently libgomp has 3 different queues: children_queue, taskloop_queue and 
team_queue.
These three queues are protected using a big lock (team->task_lock).
The current 3 queue setup makes the implementation of work-stealing hard 
because they must be inter-synchronized.
 
Implementation of competing systems:
The Intel OpenMP implementation [1] is simpler.
It uses a single queue for each thread and a single subroutine for dequeuing
and executing the tasks [2, 3].
The taskgroup tasks and children tasks are only counted (not queued) [4, 5, 6].
So taskwait or barrier-like constructs only have to check whether all the
tasks of interest have completed.
This unifies the task queuing system and makes scheduling much simpler.

What to do on libgomp:
I think we should follow a similar path to libomp.
Instead of using 3 different queues, we could simply use one and only count the 
tasks of interest.
This should also reduce the synchronization overhead between the queues (such 
as in gomp_task_run_post_remove_taskgroup). 
Then the big task_lock lock could be split into one lock for each thread.
I'm currently implementing this but do not yet have testable results.
I'll share results as soon as they become available.
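
To make the counting idea concrete, here is a minimal single-threaded sketch
in C11. It is not libgomp or libomp code: struct task, struct task_queue and
every function below are hypothetical names, and the per-thread lock and the
stealing path are left out on purpose. It only shows a parent tracking its
children through an atomic counter instead of a separate children queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical task record; not libgomp's struct gomp_task.  */
struct task
{
  struct task *parent;
  struct task *next;            /* link in the owning queue */
  atomic_int pending_children;  /* children created but not yet finished */
  void (*fn) (void *);
  void *arg;
};

/* Hypothetical per-thread FIFO.  A real one would carry its own lock
   and allow other threads to steal from it.  */
struct task_queue
{
  struct task *head;
  struct task *tail;
};

/* Initialize T and count it against PARENT instead of queuing it on a
   separate children queue.  */
static void
task_init (struct task *t, struct task *parent, void (*fn) (void *), void *arg)
{
  t->parent = parent;
  t->next = NULL;
  atomic_init (&t->pending_children, 0);
  t->fn = fn;
  t->arg = arg;
  if (parent)
    atomic_fetch_add (&parent->pending_children, 1);
}

/* Append T to the tail of Q.  */
static void
enqueue (struct task_queue *q, struct task *t)
{
  assert (t != NULL && t->next == NULL);
  if (q->tail)
    q->tail->next = t;
  else
    q->head = t;
  q->tail = t;
}

/* Remove and return the head of Q, or NULL if Q is empty.  */
static struct task *
dequeue (struct task_queue *q)
{
  struct task *t = q->head;
  if (t)
    {
      q->head = t->next;
      if (!q->head)
        q->tail = NULL;
      t->next = NULL;
    }
  return t;
}

/* Run one queued task and report its completion to the parent counter.
   Returns 0 if the queue was empty.  */
static int
run_one (struct task_queue *q)
{
  struct task *t = dequeue (q);
  if (!t)
    return 0;
  if (t->fn)
    t->fn (t->arg);
  if (t->parent)
    atomic_fetch_sub (&t->parent->pending_children, 1);
  return 1;
}

/* taskwait: keep executing tasks until SELF has no unfinished children.  */
static void
taskwait (struct task_queue *q, struct task *self)
{
  while (atomic_load (&self->pending_children) > 0)
    if (!run_one (q))
      break;  /* single-threaded sketch: nothing left to run */
}

/* Example task body: increments the int that ARG points to.  */
static void
bump (void *arg)
{
  ++*(int *) arg;
}
```

In this sketch taskwait never walks a list of children; it only polls the
counter, which is the property that should let the queues be unified.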

Any feedback would be much appreciated.

Ray Kim


[1] https://github.com/llvm-mirror/openmp
[2] 
https://github.com/llvm-mirror/openmp/blob/3f381f546ec9b065f9133d1fcd5d2711affb646a/runtime/src/kmp_tasking.cpp#L2889
[3] Each thread's dedicated queue. 
https://github.com/llvm-mirror/openmp/blob/bbb6f0170731679d690cee002be712f2d703b8fe/runtime/src/kmp.h#L2384
[4] The task executing part of taskwait. 
https://github.com/llvm-mirror/openmp/blob/bbb6f0170731679d690cee002be712f2d703b8fe/runtime/src/kmp.h#L2187
[5] Atomic variable holding number of uncompleted children. 
https://github.com/llvm-mirror/openmp/blob/bbb6f0170731679d690cee002be712f2d703b8fe/runtime/src/kmp.h#L2352
[6] Atomic variable holding number of uncompleted tasks in taskgroup. 
https://github.com/llvm-mirror/openmp/blob/bbb6f0170731679d690cee002be712f2d703b8fe/runtime/src/kmp.h#L2188
 


[RFC] Confusing fall through BB pc in conditional jump

2019-06-26 Thread Kewen.Lin
Hi all,

6: NOTE_INSN_BASIC_BLOCK 2

   12: r135:CC=cmp(r122:DI,0)
   13: pc={(r135:CC!=0)?L52:pc}
  REG_DEAD r135:CC
  REG_BR_PROB 1041558836
   31: L31:
   17: NOTE_INSN_BASIC_BLOCK 3

The above RTL sequence is from a dump of the doloop pass with
-fdump-rtl-all-slim. I misread it as saying that the fall-through BB of BB 2
is BB 3, since BB 3 is placed right after BB 2.
Then I found a contradiction: BB 3 would have some uninitialized regs if
that were true.

I can get the exact information by dumping with "-blocks", and even more
detail with "-details".
But I'm wondering whether it's worth giving some information in the "-slim"
dump (or, more exactly, without "-blocks") to avoid confusion, especially for
newcomers like me.
Or is it unnecessary, since we have the "-blocks" option for that and want to
keep it "slim"?

Any comments on that? 

Thanks in advance!


---

The patch below appends a hint about the pc target when we find a jump insn
with an if-then-else and the fall-through target BB starts with a known code
label.

The dumping will look like:
6: NOTE_INSN_BASIC_BLOCK 2
...
   12: r135:CC=cmp(r122:DI,0)
   13: pc={(r135:CC!=0)?L52:pc}
  REG_DEAD r135:CC
  REG_BR_PROB 1041558836
;;  pc (fall through) -> L67
   31: L31:
   17: NOTE_INSN_BASIC_BLOCK 3


diff --git a/gcc/cfgrtl.c b/gcc/cfgrtl.c
index a1ca5992c41..608bcd130c5 100644
--- a/gcc/cfgrtl.c
+++ b/gcc/cfgrtl.c
@@ -2164,7 +2164,27 @@ rtl_dump_bb (FILE *outf, basic_block bb, int indent, dump_flags_t flags)
 }

 }
-^L
+
+/* For dumping without specifying basic blocks option, when we see pc is one of
+   jump targets, it's easy to misunderstand the next basic block is the
+   fallthrough one, but it's not so true.  This function is to dump some hints
+   for that.  */
+
+static void
+dump_hints_for_jump_target_pc (FILE *outf, basic_block bb)
+{
+  gcc_assert (outf);
+  edge e = FALLTHRU_EDGE (bb);
+  basic_block dest = e->dest;
+  rtx_insn *label = BB_HEAD (dest);
+  if (!LABEL_P (label))
+    return;
+
+  fputs (";; ", outf);
+  fprintf (outf, " pc (fall through) -> L%d", INSN_UID (label));
+  fputc ('\n', outf);
+}
+
 /* Like dump_function_to_file, but for RTL.  Print out dataflow information
for the start of each basic block.  FLAGS are the TDF_* masks documented
in dumpfile.h.  */
@@ -2255,6 +2275,14 @@ print_rtl_with_bb (FILE *outf, const rtx_insn *rtx_first, dump_flags_t flags)
  putc ('\n', outf);
}
}
+ else if (GET_CODE (tmp_rtx) == JUMP_INSN
+  && GET_CODE (PATTERN (tmp_rtx)) == SET)
+   {
+ bb = BLOCK_FOR_INSN (tmp_rtx);
+ const_rtx src = SET_SRC (PATTERN (tmp_rtx));
+ if (bb != NULL && GET_CODE (src) == IF_THEN_ELSE)
+   dump_hints_for_jump_target_pc (outf, bb);
+   }
}

   free (start);