[libgomp] MEMMODEL_* constants and OMP_STACKSIZE: a few questions/proposals

2013-07-15 Thread Kévin PETIT
Ping.

Hi,

I’ve recently started to work on libgomp with the goal of proposing
a new way of handling queues of tasks based on the work done by a
PhD student.

While working on libgomp’s code I noticed two things that puzzled me:

- The code uses gcc’s atomic builtins but doesn’t use the
__ATOMIC_* constants to select the desired memory model for the
operation. Instead it uses the MEMMODEL_* constants defined in
libgomp.h. Is there a good reason for that? I’d be very tempted to
(possibly) “fix” it with the attached patch.

- When using the OMP_STACKSIZE environment variable to select
the stack size of the threads, it only applies to the threads created
by pthread_create and has no effect on the main thread of the process.
This behaviour looks compliant with version 3.1 of the spec:

“OMP_STACKSIZE sets the stacksize-var ICV that specifies the size of
the stack for threads created by the OpenMP implementation”

but I was wondering whether that is the best thing to do. I discovered
this when playing with the UTS benchmark from the BOTS benchmark suite,
which can require quite big stacks for some input datasets. It uses
OMP_STACKSIZE to state its requirements, but that doesn't prevent a task
with stack requirements bigger than the default size from being
scheduled on the main thread of the process, causing the application to
crash with a stack overflow.

Should the application itself be setting the size of its main thread's
stack? Shouldn't something be done in the compiler/runtime to handle
this? That wouldn't be non-compliant with the spec, and it would be much
more intuitive for the programmer: “I can expect every thread to have
OMP_STACKSIZE worth of stack”.

I should hopefully write again soon with some more useful patches and
proposals. In the meantime, thank you for your answers.

Regards,

Kévin
From b1a6f296d6c4304b41f735859e226213881f861e Mon Sep 17 00:00:00 2001
From: Kevin PETIT 
Date: Mon, 20 May 2013 13:28:40 +0100
Subject: [PATCH] Don't redefine an equivalent to the __ATOMIC_* constants

If we can use the __atomic_* builtins it means that we also have
the __ATOMIC_* constants available. This patch removes the
definition of the MEMMODEL_* constants in libgomp.h and changes
the code to use __ATOMIC_*.
---
 libgomp/config/linux/affinity.c |2 +-
 libgomp/config/linux/bar.c  |   10 +-
 libgomp/config/linux/bar.h  |8 
 libgomp/config/linux/lock.c |   10 +-
 libgomp/config/linux/mutex.c|6 +++---
 libgomp/config/linux/mutex.h|4 ++--
 libgomp/config/linux/ptrlock.c  |6 +++---
 libgomp/config/linux/ptrlock.h  |6 +++---
 libgomp/config/linux/sem.c  |6 +++---
 libgomp/config/linux/sem.h  |4 ++--
 libgomp/config/linux/wait.h |2 +-
 libgomp/critical.c  |2 +-
 libgomp/libgomp.h   |   11 ---
 libgomp/ordered.c   |4 ++--
 libgomp/task.c  |4 ++--
 15 files changed, 37 insertions(+), 48 deletions(-)

diff --git a/libgomp/config/linux/affinity.c b/libgomp/config/linux/affinity.c
index dc6c7e5..fcbc323 100644
--- a/libgomp/config/linux/affinity.c
+++ b/libgomp/config/linux/affinity.c
@@ -108,7 +108,7 @@ gomp_init_thread_affinity (pthread_attr_t *attr)
   unsigned int cpu;
   cpu_set_t cpuset;
 
-  cpu = __atomic_fetch_add (&affinity_counter, 1, MEMMODEL_RELAXED);
+  cpu = __atomic_fetch_add (&affinity_counter, 1, __ATOMIC_RELAXED);
   cpu %= gomp_cpu_affinity_len;
   CPU_ZERO (&cpuset);
   CPU_SET (gomp_cpu_affinity[cpu], &cpuset);
diff --git a/libgomp/config/linux/bar.c b/libgomp/config/linux/bar.c
index 35baa88..8a81f7c 100644
--- a/libgomp/config/linux/bar.c
+++ b/libgomp/config/linux/bar.c
@@ -38,14 +38,14 @@ gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
   /* Next time we'll be awaiting TOTAL threads again.  */
   bar->awaited = bar->total;
   __atomic_store_n (&bar->generation, bar->generation + 4,
-   MEMMODEL_RELEASE);
+   __ATOMIC_RELEASE);
   futex_wake ((int *) &bar->generation, INT_MAX);
 }
   else
 {
   do
do_wait ((int *) &bar->generation, state);
-  while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE) == state);
+  while (__atomic_load_n (&bar->generation, __ATOMIC_ACQUIRE) == state);
 }
 }
 
@@ -95,7 +95,7 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
}
   else
{
- __atomic_store_n (&bar->generation, state + 3, MEMMODEL_RELEASE);
+ __atomic_store_n (&bar->generation, state + 3, __ATOMIC_RELEASE);
  futex_wake ((int *) &bar->generation, INT_MAX);
  return;
}
@@ -105,11 +105,11 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
   do
 {
   do_wait ((int *) &bar->generation, generation);
-  gen = __atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE);
+  gen = __atomic_load_n (&bar->generation, __ATOMIC_ACQUIRE);

Target options

2013-07-15 Thread Hendrik Greving
Hi,

when defining target options with Mask() and "Target" going to
target_flags, can I use Init(1) to define the default, or is "Init"
only used to initialize "Var(name)"-style options? If so, what's the
proper way to define defaults? It wasn't clear to me when checking the
mips/i386 definitions, for instance.

Thanks,
Hendrik Greving


Re: Target options

2013-07-15 Thread Hendrik Greving
Along the same lines, what's the difference between target_flags (which
I know from old compilers) and target_flags_explicit (which I do not)?

Thanks,
Regards,
Hendrik Greving

On Mon, Jul 15, 2013 at 10:30 AM, Hendrik Greving
 wrote:
> Hi,
>
> when defining target options with Mask() and "Target" going to
> target_flags. Can I use Init(1) to define the default, or is "Init"
> only used to initialize "Var(name)" kind of options? If so, what's the
> proper way to define defaults, it wasn't clear to me when checking
> mips/i386 definitions for instance.
>
> Thanks,
> Hendrik Greving


gettext prereq vs po/zh_TW

2013-07-15 Thread DJ Delorie

The gcc prereq page says gettext 0.14.5 is the minimum version, but
po/zh_TW.po has lines like this:

#, fuzzy
#~| msgid "Unexpected EOF"
#~ msgid "Unexpected type..."
#~ msgstr "未預期的型態…"

The `#~|` syntax appears to have been added in gettext 0.16, and gettext
0.14 can't process it.

Seems to have been a result of this request:

http://gcc.gnu.org/ml/gcc-patches/2013-06/msg01436.html


Re: Target options

2013-07-15 Thread Shiva Chen
2013/7/16 Hendrik Greving :
> Along the same lines, what's the difference of target_flags (I know
> from old compilers) and target_flags_explicit (I do not know)?
>
> Thanks,
> Regards,
> Hendrik Greving
>
> On Mon, Jul 15, 2013 at 10:30 AM, Hendrik Greving
>  wrote:
>> Hi,
>>
>> when defining target options with Mask() and "Target" going to
>> target_flags. Can I use Init(1) to define the default, or is "Init"
>> only used to initialize "Var(name)" kind of options? If so, what's the
>> proper way to define defaults, it wasn't clear to me when checking
>> mips/i386 definitions for instance.
>>
>> Thanks,
>> Hendrik Greving

Hi Greving,

To my understanding, when an option uses Mask(F) we shouldn't use
Init(); one approach is to initialize it in the option_override target
hook, e.g.:

  target_flags |= MASK_F;

target_flags_explicit records whether the user has explicitly given the
flag a value (enabled or disabled it) or not.

For example, suppose the flag's initial value depends on the CPU type:
when the CPU type is A, flag F should be enabled by default. However,
the user may disable the flag explicitly, and we want the user's choice
to take effect.

Therefore, the condition would be:

if (cpu_type == A
    && !(target_flags_explicit & MASK_F))
   target_flags |= MASK_F;

Cheers,
Shiva


Re: Delay scheduling due to possible future multiple issue in VLIW

2013-07-15 Thread Maxim Kuvyrkov
Paulo,

GCC's scheduler is not particularly designed for VLIW architectures, but it 
handles them reasonably well.  For your example code, both schedules take the 
same time to execute:

38: 0: r1 = e[r0]
40: 4: [r0] = r1
41: 5: r0 = r0+4
43: 5: p0 = r1!=0
44: 6: jump p0

and

38: 0: r1 = e[r0]
41: 1: r0 = r0+4
40: 4: [r0] = r1
43: 5: p0 = r1!=0
44: 6: jump p0

[It is true that the first schedule takes less space due to fortunate VLIW 
packing.]

You are correct that the GCC scheduler is greedy and that it tries to issue 
instructions as soon as possible (i.e., it is better to issue something on the 
current cycle than nothing at all), which is a sensible strategy.  For small 
basic blocks the greedy algorithm may cause artifacts like the one you describe.

You could try increasing the size of the regions on which the scheduler 
operates by switching your port to the sched-ebb scheduler, which was 
originally developed for ia64.

Regards,

--
Maxim Kuvyrkov
KugelWorks



On 27/06/2013, at 8:35 PM, Paulo Matos wrote:

> Let me add to my own post by saying that the problem seems to be that the 
> list scheduler is greedy, in the sense that it will take an instruction from 
> the ready list no matter what, when waiting and trying to pair it later with 
> another instruction might be more beneficial. The underlying idea seems to be 
> that 'issuing instructions as soon as possible is better', which may be true 
> for a single-issue chip, but a VLIW with multiple issue has to contend with 
> other problems.
> 
> Any thoughts on this?
> 
> Paulo Matos
> 
> 
>> -Original Message-
>> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Paulo
>> Matos
>> Sent: 26 June 2013 15:08
>> To: gcc@gcc.gnu.org
>> Subject: Delay scheduling due to possible future multiple issue in VLIW
>> 
>> Hello,
>> 
>> We have a port for a VLIW machine using gcc head 4.8 with a maximum issue of
>> 2 per clock cycle (sometimes only 1 due to machine constraints).
>> We are seeing the following situation in sched2:
>> 
>> ;;   --- forward dependences: 
>> 
>> ;;   --- Region Dependences --- b 3 bb 0
>> ;;  insn  codebb   dep  prio  cost   reservation
>> ;;    --   ---       ---
>> ;;   38  1395 3 0 6 4
>> (p0+long_imm+ldst0+lock0),nothing*3 : 44m 43 41 40
>> ;;   40   491 3 1 2 2   (p0+long_imm+ldst0+lock0),nothing
>> : 44m 41
>> ;;   41   536 3 2 1 1   (p0+no_stl2)|(p1+no_dual)   : 44
>> ;;   43  1340 3 1 2 1   (p0+no_stl2)|(p1+no_dual)   : 44m
>> ;;   44  1440 3 4 1 1   (p0+long_imm)   :
>> 
>> ;;  dependencies resolved: insn 38
>> ;;  tick updated: insn 38 into ready
>> ;;  dependencies resolved: insn 41
>> ;;  tick updated: insn 41 into ready
>> ;;  Advanced a state.
>> ;;  Ready list after queue_to_ready:41:4  38:2
>> ;;  Ready list after ready_sort:41:4  38:2
>> ;;  Ready list (t =   0):41:4  38:2
>> ;;  Chosen insn : 38
>> ;;0--> b  0: i  38r1=zxn([r0+`b'])
>> :(p0+long_imm+ldst0+lock0),nothing*3
>> ;;  dependencies resolved: insn 43
>> ;;  Ready-->Q: insn 43: queued for 4 cycles (change queue index).
>> ;;  tick updated: insn 43 into queue with cost=4
>> ;;  dependencies resolved: insn 40
>> ;;  Ready-->Q: insn 40: queued for 4 cycles (change queue index).
>> ;;  tick updated: insn 40 into queue with cost=4
>> ;;  Ready-->Q: insn 41: queued for 1 cycles (resource conflict).
>> ;;  Ready list (t =   0):
>> ;;  Advanced a state.
>> ;;  Q-->Ready: insn 41: moving to ready without stalls
>> ;;  Ready list after queue_to_ready:41:4
>> ;;  Ready list after ready_sort:41:4
>> ;;  Ready list (t =   1):41:4
>> ;;  Chosen insn : 41
>> ;;1--> b  0: i  41r0=r0+0x4
>> :(p0+no_stl2)|(p1+no_dual)
>> 
>> So, it is scheduling first insn 38 followed by 41.
>> The insn chain for bb3 before sched2 looks like:
>> (insn 38 36 40 3 (set (reg:DI 1 r1)
>>(zero_extend:DI (mem:SI (plus:SI (reg:SI 0 r0 [orig:119 ivtmp.13 ]
>> [119])
>>(symbol_ref:SI ("b") [flags 0x80]  > 0x2b9c011f75a0 b>)) [2 MEM[symbol: b, index: ivtmp.13_7, offset: 0B]+0 S4
>> A32]))) pr3115b.c:13 1395 {zero_extendsidi2}
>> (nil))
>> (insn 40 38 41 3 (set (mem:SI (plus:SI (reg:SI 0 r0 [orig:119 ivtmp.13 ]
>> [119])
>>(symbol_ref:SI ("a") [flags 0x80]  > a>)) [2 MEM[symbol: a, index: ivtmp.13_7, offset: 0B]+0 S4 A32])
>>(reg:SI 1 r1 [orig:118 D.3048 ] [118])) pr3115b.c:13 491 {fp_movsi}
>> (nil))
>> (insn 41 40 43 3 (set (reg:SI 0 r0 [orig:119 ivtmp.13 ] [119])
>>(plus:SI (reg:SI 0 r0 [orig:119 ivtmp.13 ] [119])
>>(const_int 4 [0x4]))) 536 {addsi3}
>> (nil))