Re: On-Demand range technology [2/5] - Major Components : How it works

2019-06-05 Thread Richard Biener
On Tue, Jun 4, 2019 at 9:55 PM Andrew MacLeod wrote:
> On 6/4/19 1:07 PM, Richard Biener wrote:
> > On June 4, 2019 6:50:07 PM GMT+02:00, Andrew MacLeod wrote:
> >> On 6/4/19 11:37 AM, Richard Biener wrote:
> >>
> >>> where the single(!) query of the range of i on the alloca call
> >>> will populate N BBs caches with the Nth having ranges for
> >>> all SSA defs of i running into quadraticness in N.
> >> well, not really.  It's still linear, since the 'i' used in each foo()
> >> will be dead, so the next block will contain no range for it.
> >>
> >>
> >>   try {
> >>     long i_2 = _1;
> >>     i_3 += foo(i_2);
> >>     i_4 += foo(i_3);
> >>     i_5 += foo(i_4);
> >>     i_6 += foo(i_5);
> >>     ... repeat N times ...
> >>     alloca(i_6); // query range of i
> >>
> >> once i_2 is 'dead' after the first call to foo(), it will no longer be
> >> in the on-entry cache to any other block.
> >> The walk back never sees i_2 until the first call to foo(), so none of
> > It's actually
> >
> > utemp_2 = foo(i_1);
> >
> > bb:
> > i_3 = i_1 + (long) utemp_2;
> > utemp_4 = foo(i_3);
> >
> > Eliding the separate stmt for the conversion.  From your description of the 
> > cache it sounded like getting incoming values is a take-all operation. 
> > Otherwise you'd need negative entries as well (I thought the cache might 
> > not contain varying ranges for example).
>
> OK, utemp_2 and i_1 will be live, so there are 2 ranges on entry.  The
> next block will have only utemp_4 and i_3 live on entry, so it'll be 2
> ranges per block.
>
>
> It currently does have varying in the table, but the varying value is
> cached once for the entire ssa_name, so 500 BBs with a varying
> range will only consume a single range.
>
> By negative entries I assume you mean something like the bottom of a
> diamond?
>
> if (x_3 > 20)
>    blah1 (x_3);
> else
>    blah2 (x_3);
> blah3 (x_3);
>
> It does nothing special to come up with the range of x_3 at blah3 ().
> It simply unions all the incoming ranges, which is [21,MAX] union
> [MIN,20], to produce the value [MIN, MAX], which is then put into the
> cache.  We don't use anything special to represent varying;
> varying_p() merely checks if the range is [MIN, MAX].  And when putting
> a value into the cache, if it is varying_p() the cache simply uses its
> unique copy of [MIN, MAX] rather than creating a new one every time it
> is needed.
>
>
> I had originally considered not caching ranges spanning blocks in which
> the range never changes...  the trade-off is an increase in calculation
> time as you walk the CFG to find the last block in which it did change.
> I figured we'd first see whether the less error-prone full cache causes
> any real problems or not.  It's always an option to implement later,
> since the cache is self-contained and can resolve its queries
> internally however it wants.
>
> >
> >> blocks after that have a need for its range, so it will not be in their
> >> caches.
> >>
> >> The on-entry cache will only get a single range entry for each of those
> >> blocks: the incoming value of i_x, and that's it.  The implicitly known
> >> live-on-exit attribute is handy here.
> >>
> >> Now, that's not to say we can't construct cases where there are a lot
> >> of ranges being populated...  If we find the cache is really a problem,
> >> we can start caching ranges, so that if all these i_x's were live
> >> somehow, there would only be one range for each in this case, and the
> >> caches would simply all point to the same range.
> > I suppose using i in the catch block would create something like this, a 
> > PHI with a large in-degree and thus the switch case you already know about?
>
> It would be a PHI.  None of the ranges coming into the block are
> considered live on entry, so they don't get into the cache.  The PHI
> calculation asks for the incoming range on the edge for each parameter,
> and we get a global range for the PHI definition calculated, and that is
> it.

Thanks for all the clarifications.

> However, that's not actually the switch problem :-)  The switch problem
> isn't due to a large number of incoming edges, but rather the difficulty
> of calculating case ranges on the outgoing edges from the switch itself.
>
> Stepping back slightly: branches are handled generically the same way
> any expression is; the outgoing edge you want information about provides
> a constant range for the implicit LHS of the branch.
>
> So for an if:
>    c_2 = x_3 > 20
>    if (c_2 != 0)
>
> the TRUE edge has a range of bool [1,1], fed back as the implicit LHS
> of the branch expression:
> [1,1] = (c_2 != 0), for which range-ops for '!=' says c_2 must be [1,1]
> in order to satisfy the equation.
> [1,1] = x_3 > 20, for which range-ops for '>' says x_3 must have a range
> of [21, MAX].
>
> If you want the range on the FALSE edge, the edge starts with the range
> [0,0] into the equation solver, and out pops [MIN, 20].
>
> switch (i_3)
>
> for a switch, it work
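The back-solving described for the if case can be modelled standalone.
A toy in plain C (mine, not GCC's range-ops code):

  #include <limits.h>
  #include <stdio.h>

  struct range { long lo, hi; };

  /* Solve "lhs = (x > k)" for x: the TRUE edge feeds in lhs [1,1],
     the FALSE edge feeds in lhs [0,0].  */
  static struct range
  solve_gt (struct range lhs, long k)
  {
    if (lhs.lo == 1)
      return (struct range) { k + 1, LONG_MAX };   /* x in [k+1, MAX] */
    return (struct range) { LONG_MIN, k };         /* x in [MIN, k]   */
  }

  int
  main (void)
  {
    struct range t = solve_gt ((struct range) { 1, 1 }, 20);
    struct range f = solve_gt ((struct range) { 0, 0 }, 20);
    printf ("TRUE edge:  [%ld, MAX]\n", t.lo);     /* [21, MAX] */
    printf ("FALSE edge: [MIN, %ld]\n", f.hi);     /* [MIN, 20] */
    return 0;
  }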

Re: About GSOC.

2019-06-05 Thread Tejas Joshi
Hello.
The following patch has been tested against the following test cases in
the testsuite, and no tests failed.  Hopefully I have covered most of the
test cases this time.

/* { dg-do link } */

extern int link_error (int);

#define TEST(FN, VALUE, RESULT) \
  if (__builtin_##FN (VALUE) != RESULT) link_error (__LINE__);

int
main (void)
{
  TEST(roundeven,  0, 0);
  TEST(roundeven,  0.5, 0);
  TEST(roundeven,  -0.5, 0);
  TEST(roundeven,  6, 6);
  TEST(roundeven,  -8, -8);
  TEST(roundeven,  2.5, 2);
  TEST(roundeven,  3.5, 4);
  TEST(roundeven,  -1.5, -2);
  TEST(roundeven,  3.499, 3);
  TEST(roundeven,  3.501, 4);

  return 0;
}
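
For reference, roundeven rounds halfway cases to the nearest even
integer.  A minimal model consistent with the expectations above (my
sketch, not the GCC implementation):

  #include <math.h>
  #include <stdio.h>

  /* Round to nearest, ties to even: start from round() (ties away from
     zero) and pull halfway cases back when they landed on an odd.  */
  static double
  roundeven_model (double x)
  {
    double r = round (x);
    if (fabs (x - trunc (x)) == 0.5 && fmod (r, 2.0) != 0.0)
      r -= copysign (1.0, x);
    return r;
  }

  int
  main (void)
  {
    /* Matches the table above: 0.5 -> 0, 2.5 -> 2, 3.5 -> 4, -1.5 -> -2. */
    printf ("%g %g %g %g\n", roundeven_model (0.5), roundeven_model (2.5),
            roundeven_model (3.5), roundeven_model (-1.5));
    return 0;
  }

In the default rounding mode, nearbyint() also rounds ties to even, which
is another way to cross-check the model.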
_Float128 test cases:

/* { dg-do link } */
/* { dg-add-options float128 } */
/* { dg-require-effective-target float128 } */

extern int link_error (int);

#define TEST(FN, VALUE, RESULT) \
  if (__builtin_##FN##f128 (VALUE) != RESULT) link_error (__LINE__);

int
main (void)
{
  TEST(roundeven,  (0x1p64+0.5f128), (0x1p64f128));
  TEST(roundeven,  (0x1p63+0.5f128), (0x1p63f128));
  TEST(roundeven,  (0x1p63-0.5f128), (0x1p63f128));
  TEST(roundeven,  (0x1p64-0.5f128), (0x1p64f128));
  TEST(roundeven,  (0x1p64+0.501f128), (0x1p64+1.0f128));
  TEST(roundeven,  (0x1.C39A5653p1f128), (0x1p2f128));
  /* The hex number is 3.5000...01, which would fail for roundeven
     but not for f128.  */
  return 0;
}

Thanks,
-Tejas

On Tue, 4 Jun 2019 at 12:38, Tejas Joshi wrote:
>
> Hello.
>
> > NaN, and you should make sure it behaves accordingly.  (If it should never
> > be called for them, a gcc_assert would be appropriate.)
>
> I can't find any documentation about how and when to use gcc_assert,
> but I used it based on the comment at its definition and the locations
> where it is used.  Is this appropriate?  Or is it supposed to be used
> before calling the function? :
>
> +bool
> +is_even (REAL_VALUE_TYPE *r)
> +{
> +  /* The function is not supposed to be used for Inf and NaN.  */
> +  gcc_assert (r->cl != rvc_inf);
> +  gcc_assert (r->cl != rvc_nan);
>
> > So n is the bit position, and w is the word position, of the bit with
> > value 1; n-1 is the position of the bit with value 0.5.
> > If n is a multiple of HOST_BITS_PER_LONG (that is, the bits with values
> > 0.5 and 1 are in different words), this will incorrectly return false when
> > the 0.5 bit is set.
>
> I did not understand this.  What is the bit with value 1?
> But when n is a multiple of HOST_BITS_PER_LONG, the function was
> computing the value of w wrongly (e.g. for the number 2^63 + 0.5).  In
> that case, would the following modification be acceptable in
> is_halfway_below?
>
> +bool
> +is_halfway_below (const REAL_VALUE_TYPE *r)
> +{
> +  /* The function is not supposed to be used for Inf and NaN.  */
> +  gcc_assert (r->cl != rvc_inf);
> +  gcc_assert (r->cl != rvc_nan);
> +  int i;
> +
> +  /* For numbers in (-0.5,0) and (0,0.5).  */
> +  if (REAL_EXP (r) < 0)
> +return false;
> +
> +  else if (REAL_EXP (r) <= SIGNIFICAND_BITS)
> +  {
> +unsigned int n = SIGNIFICAND_BITS - REAL_EXP (r);
> +int w = n / HOST_BITS_PER_LONG;
> +
> +if (n % HOST_BITS_PER_LONG == 0)
> +{
> +  if (w > 0)
> +w = w - 1;
> +  else
> +w = 0;
> +}
> +
> +for (i = 0; i < w; ++i)
> +{
> +  if (r->sig[i] != 0)
> +return false;
> +}
>
> Thanks.
>
>
> On Mon, 3 Jun 2019 at 22:08, Joseph Myers wrote:
> >
> > On Fri, 31 May 2019, Tejas Joshi wrote:
> >
> > > +/* Return true if the integer part of R is even, else return false.  */
> > > +
> > > +bool
> > > +is_even (REAL_VALUE_TYPE *r)
> > > +{
> > > +  if (REAL_EXP (r) <= 0)
> > > +return false;
> >
> > But the integer part (truncation towards 0) of something in the interval
> > (-1, 1) is of course even.
> >
> > > +  else if (REAL_EXP (r) < SIGNIFICAND_BITS)
> > > +  {
> > > +unsigned int n = SIGNIFICAND_BITS - REAL_EXP (r);
> > > +int w = n / HOST_BITS_PER_LONG;
> > > +
> > > +unsigned long num = ((unsigned long)1 << (n % HOST_BITS_PER_LONG));
> > > +
> > > +if ((r->sig[w] & num) == 0)
> > > +  return true;
> > > +  }
> > > +  return false;
> > > +}
> >
> > Suppose REAL_EXP (r) == SIGNIFICAND_BITS.  Then you still need to check
> > the low bit in that case.
> >
> > Suppose REAL_EXP (r) > SIGNIFICAND_BITS.  Then the number is definitely
> > even, so you should return true, not false.
> >
> > The comment on this function needs to define what it does for infinity /
> > NaN, and you should make sure it behaves accordingly.  (If it should never
> > be called for them, a gcc_assert would be appropriate.)
> >
> > What does this function do for zero?  It should, of course, return that it
> > is even.
> >
> > > +/* Return true if R is halfway between two integers, else return false. 
> > > */
> >
> > Again, define what this does for infinity / NaN and make sure it behaves
> > accordingly.
> >
> > > +bool
> > > +is_halfway_below (const REAL_VALUE_TYPE *r)
> > > +{
> > > +  if (REAL_EXP (r) < 0)
> > > +return false;
> > > +
> > > +  if (REAL_EXP (r) == 0)
> > > +  {
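
Joseph's bit-numbering point can be seen standalone (plain C, my invented
helper, not the GCC patch): when n is a multiple of the word size, the
bit with value 1 and the bit with value 0.5 (index n-1) live in different
words of a sig[]-style significand array.

  #include <stdio.h>

  #define BITS_PER_WORD (8 * (int) sizeof (unsigned long))

  /* Which word of a sig[]-style array, and which bit within it,
     holds bit index b?  */
  static void
  locate (int b, int *word, int *bit)
  {
    *word = b / BITS_PER_WORD;
    *bit = b % BITS_PER_WORD;
  }

  int
  main (void)
  {
    int w1, b1, wh, bh;
    int n = 64;                  /* a multiple of the word size */
    locate (n, &w1, &b1);        /* bit with value 1   */
    locate (n - 1, &wh, &bh);    /* bit with value 0.5 */
    printf ("value-1 bit:   word %d, bit %d\n", w1, b1);
    printf ("value-0.5 bit: word %d, bit %d\n", wh, bh);
    return 0;                    /* the two bits land in different words */
  }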

Preventing ISO C errors when using macros for builtin types

2019-06-05 Thread Jozef Lawrynowicz
The MSP430 target in the large memory model uses the (non-ISO) __int20 type for
SIZE_TYPE and PTRDIFF_TYPE.
The preprocessor therefore expands a builtin such as __SIZE_TYPE__ to
"__int20 unsigned" in user code.
When compiling with the "-pedantic-errors" flag, the use of any of these
builtin macros results in an error of the form:

> tester.c:4:9: error: ISO C does not support '__int20' types [-Wpedantic]

The GCC documentation does instruct users *not* to use these types directly
(cpp.texi):
> You should not use these macros directly; instead, include the
> appropriate headers and use the typedefs.
When using the typedefs (e.g. size_t) in a program compiled with
-pedantic-errors, there is no ISO C error.

However, in the testsuite there is an ever-growing list of tests which use
the macros to avoid having to include any header files required for the
typedefs.
Since -pedantic-errors is often passed as a default flag in the testsuite,
there are many false failures when testing with -mlarge, caused by this ISO C
error.

I would like to try to find a way to address this issue within GCC itself, so
that constant updates to the testsuite are not required to filter these types
of failures out.

I tried one approach suggested here
https://gcc.gnu.org/ml/gcc-patches/2018-11/msg02219.html
which was to add "__extension__" to the definition of SIZE_TYPE/PTRDIFF_TYPE in
msp430.h, however it became clear that that will not work, since the following
is not valid:
> typedef __extension__ __int20 ptrdiff_t;

> error: expected identifier or '(' before '__extension__'

__extension__ must be placed at the beginning of the declaration.

I'm assuming it would not be valid to modify the behaviour of __extension__
so it can be placed within a declaration, and not just at the
beginning. However, there is minimal documentation on this keyword (it does not
state that it can be used in declarations, even though it can), so I wonder
what the "rules" are.

I would appreciate if anyone can help me decide if:
- It would be OK for the use of builtin macros such as __SIZE_TYPE__ to somehow
  not trigger the "pedantic errors", and what a valid approach might look like
  * would finding a way to sandwich __extension__ into the expansion of these
macros be acceptable?
  or,
- These types of failures should continue to be fixed in the tests
  themselves, for example by adding __extension__ before their usage.

Thanks,
Jozef


sha512.sum for gcc-7.4.0.tar.gz incorrect on at least three mirrors

2019-06-05 Thread Farid Parpia


Greetings,

It appears that the given sha512 sum for gcc-7.4.0.tar.gz may be incorrect
on at least the following mirrors:
   http://mirrors.concertpass.com/
   http://www.netgull.com/
   https://bigsearcher.com/

Regards,

Farid Parpia  IBM Corporation: 710-2-RF28, 2455 South Road,
Poughkeepsie, NY 12601, USA; Telephone: (845) 433-8420 = Tie Line 293-8420


Re: Preventing ISO C errors when using macros for builtin types

2019-06-05 Thread Segher Boessenkool
On Wed, Jun 05, 2019 at 02:25:59PM +0100, Jozef Lawrynowicz wrote:
> I'm assuming it would not be valid to modify the behaviour of __extension__
> so it can be placed within a declaration, and not just at the
> beginning. However, there is minimal documentation on this keyword (it does 
> not
> state that it can be used in declarations, even though it can), so I wonder
> what the "rules" are.

The documentation says

  '-pedantic' and other options cause warnings for many GNU C extensions.
  You can prevent such warnings within one expression by writing
  '__extension__' before the expression.  '__extension__' has no effect
  aside from this.

It's not clear to me why you cannot simply put __extension__ earlier in
your case?


Segher
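
For illustration (my example, compiled as e.g. "gcc -std=c90
-pedantic-errors -c ext.c"): 'long long' is a GNU extension in C90, and
the leading __extension__ is what keeps the typedef from being rejected.

  /* Without __extension__, -std=c90 -pedantic-errors rejects this with
     an error along the lines of "ISO C90 does not support 'long long'". */
  __extension__ typedef long long my_ll_t;

  my_ll_t value;   /* the typedef itself can then be used freely */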


Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime

2019-06-05 Thread 김규래
Hi, thanks for the detailed explanation.
I think I now get the picture.
Judging from my current understanding, the task-parallelism currently works as 
follows: 
1. Tasks are placed in a global shared queue.
2. Workers consume the tasks by bashing the queue in a while loop, just as 
in self-scheduling (dynamic scheduling).
 
Then the improvements including work-stealing must be done by:
1. Each worker holds a dedicated task queue reducing the resource contention.
2. The tasks are distributed in a round-robin fashion
3. work-stealing will resolve the load imbalance.
 
If the above statements are correct, I guess the task priority should be given 
some special treatment?
 
Ray Kim
 
-Original Message-
From: "Jakub Jelinek"
To: "김규래";
Cc: ;
Sent: 2019-06-04 (화) 03:21:01 (GMT+09:00)
Subject: Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
 
On Tue, Jun 04, 2019 at 03:01:13AM +0900, 김규래 wrote:
> Hi,
> I've been studying the libgomp task parallelism system.
> I have a few questions.
> First, Tracing the events shows that only the main thread calls GOMP_task.

No, any thread can call GOMP_task, in particular the thread that encountered
the #pragma omp task construct.

The GOMP_task function then decides based on the clauses of the construct
(passed in various ways through the arguments of that function) whether it
will be included (executed by the encountering thread), or queued for
later execution.  In the latter case, it will be scheduled during a barrier
(implicit or explicit), see gomp_barrier_handle_tasks called from the
bar.[ch] code, or at other spots, e.g. during taskwait construct
(GOMP_taskwait) or at the end of taskgroup (GOMP_taskgroup_end).

> How do the other worker threads enter the libgomp runtime?

If you never encounter a parallel, teams or target construct, then there is
just one thread that does everything (well, the library is written such that
if you by hand pthread_create, each such thread acts as a separate initial
thread from OpenMP POV).
Threads are created e.g. during parallel construct (GOMP_parallel), where
for non-nested parallelism as the standard requires it reuses existing
threads if possible or spawns new ones, see mainly team.c (gomp_team_start)
for the function that spawns new threads or awakes the ones waiting for
work, or gomp_thread_start in the same file for the function actually run by
the libgomp library created threads.

> I can't find the entry point of the worker threads from the event tracing and 
> the assembly dump.
> Second, How is the task priority set?

By the user, through priority clause, passed to GOMP_task and then taken
into account when handling tasks in the various queues.

Jakub
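
As a small illustration of the above (my example, assuming GCC with
-fopenmp and OpenMP 4.5 priority support): the thread reaching the task
construct calls GOMP_task, and deferred tasks get scheduled at the
taskwait and at the implicit barrier of the parallel region.

  #include <omp.h>
  #include <stdio.h>

  int
  main (void)
  {
    #pragma omp parallel
    #pragma omp single
    {
      int i;
      for (i = 0; i < 4; i++)
        {
          #pragma omp task firstprivate (i) priority (i)
          printf ("task %d run by thread %d\n", i, omp_get_thread_num ());
        }
      #pragma omp taskwait   /* deferred tasks are scheduled here... */
    }                        /* ...and at the implicit barrier.  */
    return 0;
  }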


Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime

2019-06-05 Thread Jakub Jelinek
On Thu, Jun 06, 2019 at 03:25:24AM +0900, 김규래 wrote:
> Hi, thanks for the detailed explanation.
> I think I now get the picture.
> Judging from my current understanding, the task-parallelism currently works 
> as follows: 
> 1. Tasks are placed in a global shared queue.

It isn't a global shared queue, but a per-team shared queue, in fact 3
different ones, guarded by the same per-team mutex team->task_lock though:
team->task_queue            used on barriers
task->children_queue        used for #pragma omp taskwait
taskgroup->taskgroup_queue  used at the end of #pragma omp taskgroup

> 2. Workers consume the tasks by bashing the queue in a while loop, just as 
> self-scheduling (dynamic scheduling).
>  
> Then the improvements including work-stealing must be done by:
> 1. Each worker holds a dedicated task queue reducing the resource contention.
> 2. The tasks are distributed in a round-robin fashion
> 3. work-stealing will resolve the load imbalance.
>  
> If the above statements are correct, I guess the task priority should be 
> given some special treatment?

Yes, one thing to consider is task priority; another is what to do on
those #pragma omp taskwait or #pragma omp taskgroup waits, where one can
schedule most of the tasks (the OpenMP specification has restrictions on
what can be scheduled, but the restrictions are mostly relevant to untied
tasks, which we don't support anyway (well, we ignore untied and schedule
them like tied tasks)), but it might be more beneficial (and is what is
currently implemented) to queue only the tasks that will help satisfy
what we are waiting on.  Both priority and scheduling only task or
taskgroup children instead of all tasks are just hints; the
implementation may choose other tasks.
There is another thing though: tasks with dependencies, where
gomp_task_handle_depend, which needs to look up addresses in hash tables
for the dependencies, also uses the team->task_lock mutex.

Jakub


Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime

2019-06-05 Thread Janne Blomqvist
On Wed, Jun 5, 2019 at 9:25 PM 김규래 wrote:
>
> Hi, thanks for the detailed explanation.
> I think I now get the picture.
> Judging from my current understanding, the task-parallelism currently works 
> as follows:
> 1. Tasks are placed in a global shared queue.
> 2. Workers consume the tasks by bashing the queue in a while loop, just as 
> self-scheduling (dynamic scheduling).
>
> Then the improvements including work-stealing must be done by:
> 1. Each worker holds a dedicated task queue reducing the resource contention.
> 2. The tasks are distributed in a round-robin fashion

For nested task submission (does OpenMP support that?) you probably
want to submit to the local queue rather than round-robin, no?

> 3. work-stealing will resolve the load imbalance.
>
> If the above statements are correct, I guess the task priority should be 
> given some special treatment?
>
> Ray Kim
>
> -Original Message-
> From: "Jakub Jelinek"
> To: "김규래";
> Cc: ;
> Sent: 2019-06-04 (화) 03:21:01 (GMT+09:00)
> Subject: Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime
>
> On Tue, Jun 04, 2019 at 03:01:13AM +0900, 김규래 wrote:
> > Hi,
> > I've been studying the libgomp task parallelism system.
> > I have a few questions.
> > First, Tracing the events shows that only the main thread calls GOMP_task.
>
> No, any thread can call GOMP_task, in particular the thread that encountered
> the #pragma omp task construct.
>
> The GOMP_task function then decides based on the clauses of the construct
> (passed in various ways through the arguments of that function) whether it
> will be included (executed by the encountering thread), or queued for
> later execution.  In the latter case, it will be scheduled during a barrier
> (implicit or explicit), see gomp_barrier_handle_tasks called from the
> bar.[ch] code, or at other spots, e.g. during taskwait construct
> (GOMP_taskwait) or at the end of taskgroup (GOMP_taskgroup_end).
>
> > How do the other worker threads enter the libgomp runtime?
>
> If you never encounter a parallel, teams or target construct, then there is
> just one thread that does everything (well, the library is written such that
> if you by hand pthread_create, each such thread acts as a separate initial
> thread from OpenMP POV).
> Threads are created e.g. during parallel construct (GOMP_parallel), where
> for non-nested parallelism as the standard requires it reuses existing
> threads if possible or spawns new ones, see mainly team.c (gomp_team_start)
> for the function that spawns new threads or awakes the ones waiting for
> work, or gomp_thread_start in the same file for the function actually run by
> the libgomp library created threads.
>
> > I can't find the entry point of the worker threads from the event tracing 
> > and the assembly dump.
> > Second, How is the task priority set?
>
> By the user, through priority clause, passed to GOMP_task and then taken
> into account when handling tasks in the various queues.
>
> Jakub



-- 
Janne Blomqvist


Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime

2019-06-05 Thread 김규래
> On Wed, Jun 5, 2019 at 9:25 PM 김규래 wrote: 
>
> > Hi, thanks for the detailed explanation.
> > I think I now get the picture.
> > Judging from my current understanding, the task-parallelism currently works 
> > as follows:
> > 1. Tasks are placed in a global shared queue.
> > 2. Workers consume the tasks by bashing the queue in a while loop, just as 
> > self-scheduling (dynamic scheduling).
> >
> > Then the improvements including work-stealing must be done by:
> > 1. Each worker holds a dedicated task queue reducing the resource 
> > contention.
> > 2. The tasks are distributed in a round-robin fashion

> For nested task submission (does OpenMP support that?) you probably
> want to submit to the local queue rather than round-robin, no? 
 
Hi Janne,
I believe you're talking about spawning child tasks within an already running 
task, which I believe is supported by OpenMP.
In this case, pushing to the local queue could introduce a deadlock if the 
master thread waits on the spawned tasks.
A short one though, since work-stealing can resolve it.
A better way to handle this is to round-robin the child tasks to the queues, 
excluding the queue of the thread consuming the current task.
Then waiting on the tasks should never cause a deadlock.
Or did I misunderstand your question?
Maybe there is a specific reason for avoiding round-robin that I missed?
 
Ray Kim
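
A sketch of that victim selection (invented names and code, not
libgomp's): round-robin placement of child tasks that skips the spawning
worker's own queue.

  /* Pick a destination queue for a child task, round-robin across the
     team but never the current worker's own queue (assumes nqueues >= 2),
     so a parent blocked waiting on its children cannot self-deadlock.  */
  static unsigned next_queue;   /* shared round-robin counter */

  static unsigned
  pick_queue (unsigned self, unsigned nqueues)
  {
    unsigned q = __atomic_fetch_add (&next_queue, 1, __ATOMIC_RELAXED)
                 % nqueues;
    if (q == self)
      q = (q + 1) % nqueues;
    return q;
  }

With nqueues == 1 there is no other queue, so a real implementation would
fall back to the single shared queue in that case.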


Re: Preventing ISO C errors when using macros for builtin types

2019-06-05 Thread Jozef Lawrynowicz
On Wed, 5 Jun 2019 11:49:21 -0500
Segher Boessenkool wrote:

> On Wed, Jun 05, 2019 at 02:25:59PM +0100, Jozef Lawrynowicz wrote:
> > I'm assuming it would not be valid to modify the behaviour of __extension__
> > so it can be placed within a declaration, and not just at the
> > beginning. However, there is minimal documentation on this keyword (it does 
> > not
> > state that it can be used in declarations, even though it can), so I wonder
> > what the "rules" are.  
> 
> The documentation says
> 
>   '-pedantic' and other options cause warnings for many GNU C extensions.
>   You can prevent such warnings within one expression by writing
>   '__extension__' before the expression.  '__extension__' has no effect
>   aside from this.
> 
> It's not clear to me why you cannot simply put __extension__ earlier in
> your case?

If I am modifying tests on an individual basis, then indeed I can put
__extension__ earlier in the declaration and it fixes the issue.

But for a fix within the compiler to prevent having to modify individual tests,
I could add __extension__ to the beginning of the macro - but the macro may
not end up at the beginning of a declaration in the source code.

For example, say __SIZE_TYPE__ now expands to "__extension__ __int20 unsigned",
then the following usage of __SIZE_TYPE__ would be ok, as __extension__ is at
the beginning of the declaration:

> __SIZE_TYPE__ size;

But the following would emit an error, because __extension__ is not at the
beginning of the declaration:

> typedef __SIZE_TYPE__ size_t;

I'm mainly just wondering if a change to the compiler to allow the usage of
__extension__ within a declaration would be acceptable.
However, since there are many contexts where usage of a type such as
__SIZE_TYPE__ is valid, but where __extension__ preceding the type wouldn't
be valid, this is probably not worth exploring.

I'm aware of some different ways that the tests can be fixed up to prevent
the ISO C errors caused by -pedantic-errors.
I'm currently thinking the best way to fix these issues in the testsuite might
be to implement a directive such as "dg-prune-options". This directive can then
prune "-pedantic-errors" from the options to be passed to GCC when some
specified condition (e.g. -mlarge is in the opt list) is matched.
This would at least avoid having to re-jig tests to get the __extension__
keyword into the right places, as the dg-prune-options can just be added to the
problematic tests.
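
In a problematic test, the proposed directive might look something like
this (hypothetical syntax, invented here; no such directive exists today),
with the condition (e.g. checking for -mlarge) handled by the directive's
implementation:

  /* { dg-prune-options "-pedantic-errors" } */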

Thanks,
Jozef


Re: [GSoC'19, libgomp work-stealing] Task parallelism runtime

2019-06-05 Thread Janne Blomqvist
On Wed, Jun 5, 2019 at 10:42 PM 김규래 wrote:

> > On Wed, Jun 5, 2019 at 9:25 PM 김규래 wrote:
>
> >
> > > Hi, thanks for the detailed explanation.
> > > I think I now get the picture.
> > > Judging from my current understanding, the task-parallelism currently
> works as follows:
> > > 1. Tasks are placed in a global shared queue.
> > > 2. Workers consume the tasks by bashing the queue in a while loop,
> just as self-scheduling (dynamic scheduling).
> > >
> > > Then the improvements including work-stealing must be done by:
> > > 1. Each worker holds a dedicated task queue reducing the resource
> contention.
> > > 2. The tasks are distributed in a round-robin fashion
>
> > For nested task submission (does OpenMP support that?) you probably
> > want to submit to the local queue rather than round-robin, no?
>
>
>
> Hi Janne,
>
> I believe you're talking about spawning child tasks within an already
> running task, which I believe is supported by OpenMP.
>
> In this case, pushing to the local queue could introduce a deadlock if the
> master thread waits on the spawned tasks.
>
> A short one though since work-stealing can resolve the deadlock.
>
> A better way to handle this is to round-robin the child tasks to the
> queues excluding the queue of the thread consuming the current task.
>
> Then waiting on the tasks should never cause a deadlock.
>
> Or did I misunderstand your question?
>
> Maybe there is a specific reason for avoiding round-robin that I
> missed?
>
Better cache locality.

Another option, which I guess starts to go out of scope of your GSoC, is
parallel depth first (PDF) search (Blelloch 1999) as an alternative to work
stealing. Here's a presentation about some recent work in this area,
although for Julia and not OpenMP (no idea if PDF would fit with OpenMP at
all): https://www.youtube.com/watch?v=YdiZa0Y3F3c

-- 
Janne Blomqvist


On-Demand range technology [6/5] - Integration

2019-06-05 Thread Andrew MacLeod
After the various discussions, I've evaluated how I think everything can 
fit together, so this is my proposal for integration with trunk.



The complete Ranger prototype consists of 5 major components, one of 
which is missing/unimplemented as yet :-)


1 - irange -  This is the non-symbolic range implementation we've 
been using which represents ranges as groups of ordered sub-ranges.
2 - range-ops -  This is the component which extracts ranges from 
statements, and so performs the functionality of extract_range_from_*,  
except it operates on the irange API and also allows for solving of 
operands other than just the LHS of the expression.
3 - GORI-computes -  This is the component which utilizes range-ops to 
compute a range on an outgoing edge for any ssa-name in the definition 
chain of the branch.

  a_3 = b_6 * 2
  d_8 = a_3 - 20
  if (d_8 < 30)

The GORI-compute component can generate ranges for d_8, a_3 and b_6 (a 
worked derivation follows this list).
4 - GORI-cache and the Ranger -  Working together, these provide the 
on-demand range functionality to resolve ranges.
5 - relational/equivalency tracker - This is the sketched-out but 
unimplemented bit which tracks the symbolic relationships (<, <=, >, >=, 
==, != and none) and removes the need for ranges to support symbolics.
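
As a worked example of (3), on the TRUE edge of "if (d_8 < 30)" the 
computation runs backwards through the definition chain.  My arithmetic, 
ignoring overflow/wrap-around:

  d_8 : [MIN, 29]   directly from the branch
  a_3 : [MIN, 49]   since d_8 = a_3 - 20 implies a_3 = d_8 + 20
  b_6 : [MIN, 24]   since a_3 = b_6 * 2 implies b_6 = a_3 / 2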


The consensus appears to be that range-ops and gori-computes are good 
candidates to replace aspects of vr-values and assert generation.


A)
Until I get to (5) (relational tracker), using (1) (irange) is a 
non-starter since it doesn't handle symbolics.


To eliminate the range issue from the equation,  Aldy is currently 
working on unifying the irange and value_range APIs.   This will allow 
the rest of the ranger code base to use the value_range implementation 
transparently.   We can talk about irange or some alternate 
implementation of ranges at some later point, but there'll be an API 
that works for all clients.


The existing value_range API gets a few tweaks/cleanups, but mostly 
there is an additional set of calls to query sub-ranges which the ranger 
and range-ops require. These routines basically translate the various 
value range formats into discrete sub-ranges.  Through these routines, 
ANTI_RANGE will appear as 2 sub-ranges, VARYING as a [MIN, MAX] range, 
and UNDEFINED as an empty range [].  These additions should allow 
value_range to function as the range implementation for both the ranger 
and VRP.


I suspect he will have patches coming shortly that will help to unify 
the 2 range implementations; we can discuss details over those patches.


B)
A unified range API then allows us to work on integrating the range-ops 
and GORI-computes components into the code base.   Range-ops would 
replace the various extract_range_from_* routines in vr_values for 
statement-level ranges.  GORI-computes would then replace the assert 
building code for calculating outgoing ranges on edges.  In theory, EVRP 
then simply calls range_on_edge() from gori_compute instead of 
register_edge_assert().


The range-ops code is designed to perform all evaluations assuming an 
arbitrary number of sub-ranges.  Aldy spent a lot of time last year 
unifying the VRP code and the range-ops code to get identical 
results, and they frequently share a common base.  He has taken 
excruciating care to ensure the calculations are identical, and verifies 
it by calculating everything using both code bases, comparing them, and 
aborting if the results ever diverge.


We will need to adjust the range-ops code to work with symbolics in 
certain places.  This means PLUS, MINUS, all the relations (<, >, etc.), 
copy, and probably other things as they are encountered.  This is 
un-sized as yet, but I'm hoping it won't be too bad, assuming we can 
utilize some of the existing code for those bits.  More details when we 
actually start doing this and find the lurking dragons.


We'll worry about bitmasks and equivalencies when we get closer to 
functioning, but I don't foresee too much of a problem since 
value_range_base is still being used.



C)  That will keep us busy for a while, and should result in the core 
integration.   Meanwhile, we'll try to figure out the relational code 
design. I'll go back to my original design, adjust that, then we can 
figure out how best to proceed to address the various needs.


D) Finally, the GORI-cache and on-demand ranger are blocked until the 
above work is finished.



One additional thing I would like to do eventually is tweak EVRP 
slightly to align with the ranger model.


The ranger API is basically just 5 entry points which the ranger uses to 
determine ranges.

    range_of_expr  - range of a use on a statement
    range_of_stmt  - range of the result of the statement (calculated 
by range-ops)

    range_on_edge  - range on an edge (provided by gori_computes)
    range_on_entry - range on entry to a block (provided by gori-cache)
    range_on_exit  - range after the last statement in a block
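
Pulling those together, a minimal sketch of the API shape (my invented C
signatures, not the actual ranger interface; the GCC types are stubbed so
the sketch stands alone):

  #include <stdbool.h>

  /* Stand-ins for GCC's tree, gimple, edge and basic_block types.  */
  typedef struct tree_node *tree;
  typedef struct gimple_stmt gimple;
  typedef struct edge_def *edge;
  typedef struct basic_block_def *basic_block;
  typedef struct irange irange;     /* multi-sub-range value, as in (1) */

  bool range_of_expr (irange *r, tree expr, gimple *stmt);
  bool range_of_stmt (irange *r, gimple *stmt);
  bool range_on_edge (irange *r, edge e, tree name);
  bool range_on_entry (irange *r, basic_block bb, tree name);
  bool range_on_exit (irange *r, basic_block bb, tree name);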

Abstracted and simplified, I beli

Re: Preventing ISO C errors when using macros for builtin types

2019-06-05 Thread Segher Boessenkool
On Wed, Jun 05, 2019 at 08:49:39PM +0100, Jozef Lawrynowicz wrote:
> On Wed, 5 Jun 2019 11:49:21 -0500
Segher Boessenkool wrote:
> > The documentation says
> > 
> >   '-pedantic' and other options cause warnings for many GNU C extensions.
> >   You can prevent such warnings within one expression by writing
> >   '__extension__' before the expression.  '__extension__' has no effect
> >   aside from this.
> > 
> > It's not clear to me why you cannot simply put __extension__ earlier in
> > your case?
> 
> If I am modifying tests on an individual basis, then indeed I can put
> __extension__ earlier in the declaration and it fixes the issue.

Or you could fix it like this in your header file?

> But for a fix within the compiler to prevent having to modify individual 
> tests,
> I could add __extension__ to the beginning of the macro - but the macro may
> not end up at the beginning of a declaration in the source code.
> 
> For example, say __SIZE_TYPE__ now expands to "__extension__ __int20 
> unsigned",
> then the following usage of __SIZE_TYPE__ would be ok, as __extension__ is at
> the beginning of the declaration:
> 
> > __SIZE_TYPE__ size;

Don't use macros for types?  You could use something like

  __extension__ typedef unsigned __int20 __SIZE_TYPE__;

(stddef.h already uses typedefs like that, btw, for __SIZE_TYPE__ in fact).

> I'm mainly just wondering if a change to the compiler to allow the usage of
> __extension__ within a declaration would be allowed.

Well, how would that work?  We'd need to modify the grammar to allow
__extension__ pretty much anywhere, and then change all compiler code
to not pedwarn until it has parsed a full expression (or statement, or
file, or however this will work).  Or make it not warn for things after
the __extension__ only, or make it only *guaranteed* not to warn for
things *after* the __extension__.

None of those options are very appetising, IMO.


Segher


Re: ARM peephole2 from 2003 never merged, still valid

2019-06-05 Thread Jeff Law
On 6/2/19 6:28 AM, Segher Boessenkool wrote:
> On Sat, Jun 01, 2019 at 11:41:30PM +, Fredrik Hederstierna wrote:
>> +(define_peephole2
>> +  [(set (match_operand:SI 0 "arm_general_register_operand" "")
>> +   (match_operand:SI 1 "arm_general_register_operand" ""))
>> +   (set (reg:CC CC_REGNUM)
>> +   (compare:CC (match_dup 0) (const_int 0)))]
>> +  "TARGET_ARM"
>> +  [(parallel [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 1) (const_int 
>> 0)))
>> + (set (match_dup 0) (match_dup 1))])]
>> +  ""
>> +)
> 
> Hi Fredrik,
> 
> Do you have a testcase for this?  I wonder if it would be better handled
> during combine, and what that then tried; or perhaps these opportunities
> are created later, making a peephole a more attractive solution.
We have two independent insns with no output/true dependency between
them.  So there's really not anything for combine to do here.

What is more interesting to me is whether or not this could be handled
by compare-elim and a define_insn rather than a define_peephole2.  The
existing pattern and the new one both seem well suited for compare-elim.

I do think a testcase is warranted here.  Fredrik, if you could reduce
one from CSiBE that might be appropriate.


jeff
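
Schematically, what the proposed peephole does (my rendering of the RTL
pattern above; "subs" is one ARM encoding of a flag-setting move):

  before:   mov  rD, rS          @ op0 := op1
            cmp  rD, #0          @ CC  := op0 cmp 0

  after:    subs rD, rS, #0      @ one insn sets both rD and CC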


Re: ARM peephole2 from 2003 never merged, still valid

2019-06-05 Thread Segher Boessenkool
On Wed, Jun 05, 2019 at 05:02:53PM -0600, Jeff Law wrote:
> On 6/2/19 6:28 AM, Segher Boessenkool wrote:
> > On Sat, Jun 01, 2019 at 11:41:30PM +, Fredrik Hederstierna wrote:
> >> +(define_peephole2
> >> +  [(set (match_operand:SI 0 "arm_general_register_operand" "")
> >> +   (match_operand:SI 1 "arm_general_register_operand" ""))
> >> +   (set (reg:CC CC_REGNUM)
> >> +   (compare:CC (match_dup 0) (const_int 0)))]
> >> +  "TARGET_ARM"
> >> +  [(parallel [(set (reg:CC CC_REGNUM) (compare:CC (match_dup 1) 
> >> (const_int 0)))
> >> + (set (match_dup 0) (match_dup 1))])]
> >> +  ""
> >> +)
> > 
> > Do you have a testcase for this?  I wonder if it would be better handled
> > during combine, and what that then tried; or perhaps these opportunities
> > are created later, making a peephole a more attractive solution.
> We have two independent insns with no output/true dependency between
> them.  So there's really not anything for combine to do here.

op0 := op1;
CC := op0 cmp 0;

That's a perfectly fine dependency I think?

> What is more interesting to me is whether or not this could be handled
> by compare-elim and a define_insn rather than a define_peephole2.  The
> existing pattern and the new one both seem well suited for compare-elim.
> 
> I do think a testcase is warranted here.  Fredrik, if you could reduce
> one from CSiBE that might be appropriate.

Yeah, let's see what is really going on :-)


Segher