Re: Extending jumps after block reordering

2007-11-02 Thread Eric Botcazou
> Questions:
> * shorten_branches() computes sizes of instructions so I know what the
>   distance is between a jump instr and its target label.  But how do I know
>   what is the maximum distance each kind of branch can safely take?
>   bb-reorder.c assumes that it's only when cold/hot partitions are crossed
>   that it has to use indirect jumps, which is not the appropriate test in
>   my case.

You cannot easily; it's buried in the architecture back-ends.

> * do I get it right that shorten_branches() does not really modify
> instructions but it helps to shorten branches by providing more accurate
> insns lengths?

Yes, but this should work automatically.  IOW, as Ian said, you shouldn't need 
to do anything special.  Maybe it's simply a latent bug in the PPC back-end.
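
For illustration only, here is a sketch of the kind of range test a back-end
ends up encoding once insn lengths are known.  The helper names are
hypothetical, not GCC functions.  The PowerPC widths are used as an example
of where such limits come from: bc has a 14-bit word displacement (16 bits
in bytes), b has a 24-bit word displacement (26 bits in bytes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true if the byte displacement DISP fits in a signed field of
   BITS bits, where BITS counts the full byte displacement width
   (including any implied low zero bits).  Hypothetical helper.  */
static bool
fits_signed_field (int64_t disp, unsigned bits)
{
  assert (bits >= 1 && bits <= 63);
  int64_t limit = (int64_t) 1 << (bits - 1);
  return disp >= -limit && disp < limit;
}

/* Example widths for PowerPC, expressed as byte displacements.  */
enum { PPC_COND_BRANCH_BITS = 16, PPC_UNCOND_BRANCH_BITS = 26 };

/* Could a conditional branch at address FROM reach address TO in its
   short form?  Both addresses would be derived from insn lengths such
   as those shorten_branches computes.  */
static bool
cond_branch_in_range (int64_t from, int64_t to)
{
  return fits_signed_field (to - from, PPC_COND_BRANCH_BITS);
}
```

A target that fails this test for a conditional branch would typically
invert the condition and jump around an unconditional branch, which has the
larger range.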

-- 
Eric Botcazou


Re: Tree-SSA and POST_INC address mode incompatible in GCC4?

2007-11-02 Thread Ramana Radhakrishnan
Hi Bingfeng,


On 11/2/07, Bingfeng Mei <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I looked at the following code to see what the difference is between
> GCC4 and GCC3 in using POST_INC address mode (or other similar modes).
>
> void tst(char * __restrict__ a, char * __restrict__ b){
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a = *b;
> }


We have seen this in a number of other ports as well - I had hacked up
a patch to sort this precise problem out but that was for trunk / 4.3
and is not applicable for 4.2.x since the autoincrement detector was
rewritten post 4.2.


http://gcc.gnu.org/ml/gcc-patches/2007-09/msg01060.html

I haven't yet had time to rework this based on the comments but it
surely is on my radar of things to do.

cheers
Ramana


-- 
Ramana Radhakrishnan
GNU Tools
Celunite Inc.


Re: Results of 7z-4.55 performance with current GCCs.

2007-11-02 Thread J.C. Pizarro
2007/11/2, NightStrike <[EMAIL PROTECTED]>:
> On 11/1/07, Ted Byers <[EMAIL PROTECTED]> wrote:
> > --- David Miller <[EMAIL PROTECTED]> wrote:
> > > ...
>
> I agree with you 100%.  It has always been my view that if you can't
> compile fast enough, then get another machine and use distcc, or get a
> quad core and do make -j5, etc etc.  Compile time should never
> outweigh code correctness, and if it takes longer to compile more
> correct code, then that's just the nature of moving forward into the
> future.

"Save the planet and don't add more wood to the fire".


Re: gomp slowness

2007-11-02 Thread Daniel Jacobowitz
On Fri, Nov 02, 2007 at 07:39:33AM -0700, Ian Lance Taylor wrote:
> The only way I can interpret your comments is that you are assuming
> that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared
> library).  But stack based thread local storage won't work for
> dlopen'ed shared libraries at all.

Actually, from context I assume he's talking about pthread_setspecific
and does not know about __thread.

-- 
Daniel Jacobowitz
CodeSourcery


Re: gomp slowness

2007-11-02 Thread Ian Lance Taylor
skaller <[EMAIL PROTECTED]> writes:

> A really cool (non-Posix) implementation would put TLS globals
> on the stack base .. but this does require at least one extra
> machine register in languages like C which don't provide
> a static display (pointer to parent function). For languages
> that do, such as Modula and most FPLs, the display pointer
> has to be provided anyhow, so the TLS globals come at
> no extra cost.

In a C executable, TLS requires one extra machine register.  TLS
variables are accessed via offsets from that register.  So what's the
significant difference between that and your proposal?


> There are bound to be performance issues if you have to query
> any kind of global data base shared between threads to obtain
> data local to the thread. The only exception to this is the data
> held in the 'task state', which is typically just the machine
> registers and in particular the stack.

TLS does not require querying a global data base to get thread local
data.

The only way I can interpret your comments is that you are assuming
that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared
library).  But stack based thread local storage won't work for
dlopen'ed shared libraries at all.

I think you need to look at the TLS access code before deciding that
it has bad performance.  Make sure to look at the code in the
executable after the linker has optimized it.
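
As a rough illustration of the two interfaces being compared, the sketch
below keeps one counter in a __thread variable and another behind a pthread
key.  It assumes GCC's __thread extension and POSIX threads; the names
worker, make_key and run_two_workers are made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One copy of this counter exists per thread.  */
static __thread int tls_counter;

/* The POSIX key-based interface, for comparison.  */
static pthread_key_t counter_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void
make_key (void)
{
  int rc = pthread_key_create (&counter_key, NULL);
  assert (rc == 0);
  (void) rc;
}

/* Thread body: bump both counters 1000 times and store their sum
   through ARG.  */
static void *
worker (void *arg)
{
  int *result = arg;
  int key_counter = 0;          /* storage the key points at */

  pthread_once (&key_once, make_key);
  pthread_setspecific (counter_key, &key_counter);

  for (int i = 0; i < 1000; i++)
    {
      tls_counter++;                               /* direct access */
      int *p = pthread_getspecific (counter_key);  /* one call per access */
      (*p)++;
    }

  *result = tls_counter + key_counter;
  return NULL;
}

/* Run two workers and return the sum of what each one saw.  */
static int
run_two_workers (void)
{
  pthread_t t1, t2;
  int r1 = 0, r2 = 0;

  pthread_create (&t1, NULL, worker, &r1);
  pthread_create (&t2, NULL, worker, &r2);
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);
  return r1 + r2;
}
```

Each thread sees its own tls_counter starting from zero, so neither thread
disturbs the other.  Compiling the loop with -S shows how the __thread
access is actually emitted on a given target.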

Ian


RE: Tree-SSA and POST_INC address mode incompatible in GCC4?

2007-11-02 Thread Bingfeng Mei
Hi, Ramana,
I tried the trunk version with and without your patch. It still produces
the same code as GCC 4.2.2 does. In auto-inc-dec.c, the comments say

   *a
   ...
   a <- a + c

becomes

   *(a += c) post

But the problem is that after the Tree-SSA passes there is no
   a <- a + c
but rather something like
   a_1 <- a + c

Unless auto-inc-dec.c can turn a_1 <- a + c back into a <- a + c, I
don't see how this transformation can apply in most scenarios. Any
comments?
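
To make the question concrete, here is a toy model of the check.  It is not
the logic in auto-inc-dec.c; the instruction representation and all names
are invented for illustration.  It accepts the literal a <- a + c shape,
and optionally also a_1 <- a + c when a is not read after the add, so that
a_1 could simply reuse a's register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum insn_kind { INSN_MEM, INSN_ADD };

/* A toy instruction: a memory access through register SRC (INSN_MEM),
   or DST = SRC + constant (INSN_ADD).  */
struct toy_insn
{
  enum insn_kind kind;
  int dst;                      /* only meaningful for INSN_ADD */
  int src;
};

/* Return true if register REG is read by any insn in [FROM, N).  */
static bool
reg_used_from (const struct toy_insn *insns, size_t n, size_t from, int reg)
{
  for (size_t i = from; i < n; i++)
    if (insns[i].src == reg)
      return true;
  return false;
}

/* Can the memory access at MEM_IDX be merged with the next add of its
   address register into a post-increment access?  With ALLOW_RENAME
   false only "a <- a + c" qualifies.  With ALLOW_RENAME true,
   "a_1 <- a + c" also qualifies when A is dead after the add.  */
static bool
post_inc_candidate (const struct toy_insn *insns, size_t n, size_t mem_idx,
                    bool allow_rename)
{
  assert (mem_idx < n && insns[mem_idx].kind == INSN_MEM);
  int reg = insns[mem_idx].src;

  for (size_t i = mem_idx + 1; i < n; i++)
    {
      if (insns[i].kind != INSN_ADD || insns[i].src != reg)
        continue;
      if (insns[i].dst == reg)
        return true;
      return allow_rename && !reg_used_from (insns, n, i + 1, reg);
    }
  return false;
}
```

In the final_cleanup dump each old pointer is dead once the next temporary
has been computed, which is the case the renaming variant is meant to cover.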

Cheers,
Bingfeng



Re: Dependency output

2007-11-02 Thread Tom Tromey
> "timtuun" == timtuun  <[EMAIL PROTECTED]> writes:

timtuun> I was wondering if there is a particular reason why object
timtuun> name in dependency output doesn't include the directory where
timtuun> the output is written?

Just conservatism -- the options have worked this way for a long time.
See PR 30491.

timtuun> Am I completely wrong saying that in some older version it would
timtuun> have been objects/buffer.o: ... instead of just buffer.o:
timtuun> ... ?

Yeah, that would have been a better choice.  I don't know why it was
not done that way.  I'm reluctant to change it, however, for fear of
breaking a script that uses gcc.

Also, it is easy enough to use -MT to get the target name you want.
This is what automake does, for instance.

Tom


Dependency output

2007-11-02 Thread timtuun
Hi.

I was wondering if there is a particular reason why object name in
dependency output doesn't include the directory where the output is
written? For example when compiling vim version 7.1 I get the following
result.

[EMAIL PROTECTED] vim71]$ gcc --version
gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13)

$DEPENDENCIES_OUTPUT is set

gcc -c -I. -Iproto -DHAVE_CONFIG_H -DFEAT_GUI_GTK  -I/usr/include/gtk-2.0
-I/usr/lib/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo
-I/usr/include/pango-1.0 -I/usr/include/glib-2.0
-I/usr/lib/glib-2.0/include -I/usr/include/freetype2
-I/usr/include/libpng12 -g -O2 -o objects/buffer.o buffer.c

This produces the following in the dependency file:

buffer.o: buffer.c vim.h auto/config.h feature.h os_unix.h auto/osdef.h \
  ascii.h keymap.h term.h macros.h option.h structs.h regexp.h gui.h \
  gui_beval.h /usr/include/gtk-2.0/gtk/gtkwidget.h 

Am I completely wrong saying that in some older version it would have been
objects/buffer.o: ... instead of just buffer.o: ... ?

Timo



Tree-SSA and POST_INC address mode incompatible in GCC4?

2007-11-02 Thread Bingfeng Mei
Hello,

I looked at the following code to see what the difference is between
GCC4 and GCC3 in using POST_INC address mode (or other similar modes).

void tst(char * __restrict__ a, char * __restrict__ b){
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a = *b;
}


Using ARM processor as a target, GCC4.2.2 generates the following
assembly:
tst:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mov     r2, r1
        ldrb    ip, [r2], #1    @ zero_extendqisi2
        mov     r3, r0
        strb    ip, [r3], #1
        ldrb    r1, [r1, #1]    @ zero_extendqisi2
        strb    r1, [r0, #1]
        ldrb    r1, [r2, #1]    @ zero_extendqisi2
        strb    r1, [r3, #1]
        add     r2, r2, #1
        ldrb    r1, [r2, #1]    @ zero_extendqisi2
        add     r3, r3, #1
        strb    r1, [r3, #1]
        add     r2, r2, #1
        ldrb    r1, [r2, #1]    @ zero_extendqisi2
        add     r3, r3, #1
        strb    r1, [r3, #1]
        add     r2, r2, #1
        ldrb    r1, [r2, #1]    @ zero_extendqisi2
        add     r3, r3, #1
        strb    r1, [r3, #1]
        ldrb    r2, [r2, #2]    @ zero_extendqisi2
        @ lr needed for prologue
        strb    r2, [r3, #2]
        bx      lr
        .size   tst, .-tst
        .ident  "GCC: (GNU) 4.2.2"

And GCC3.4.6 generates much better code by using POST_INC address mode
extensively

tst:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1, #0]    @ zero_extendqisi2
        @ lr needed for prologue
        strb    r3, [r0, #0]
        mov     pc, lr
        .size   tst, .-tst
        .ident  "GCC: (GNU) 3.4.6"

I looked at the dumped tst.c.102t.final_cleanup:
tst (a, b)
{
  char * restrict a.54;
  char * restrict a.53;
  char * restrict a.52;
  char * restrict a.51;
  char * restrict a.50;
  char * restrict b.48;
  char * restrict b.47;
  char * restrict b.46;
  char * restrict b.45;
  char * restrict b.44;

:
  *a = *b;
  a.50 = a + 1B;
  b.44 = b + 1B;
  *a.50 = *b.44;
  a.51 = a.50 + 1B;
  b.45 = b.44 + 1B;
  *a.51 = *b.45;
  a.52 = a.51 + 1B;
  b.46 = b.45 + 1B;
  *a.52 = *b.46;
  a.53 = a.52 + 1B;
  b.47 = b.46 + 1B;
  *a.53 = *b.47;
  a.54 = a.53 + 1B;
  b.48 = b.47 + 1B;
  *a.54 = *b.48;
  *(a.54 + 1B) = *(b.48 + 1B);
  return;

}
I believe it is a fundamental issue for the Tree-SSA IR. The POST_INC address
mode requires a pattern in which the same variable is used for incrementing
(both USE and DEF), while the SSA form produces a different variable for
each DEF. Therefore, GCC4 cannot efficiently use POST_INC and other
similar address modes. Is there any solution to overcome this problem?
Any suggestion is greatly appreciated.
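
To show the two shapes side by side in plain C, here is a shortened
three-byte version of tst written both ways.  The function names are
invented for the example, and the second function is a hand transcription
of how the dump reads, not compiler output.

```c
#include <assert.h>
#include <string.h>

/* The shape of the original source: one pointer variable per stream,
   used and then redefined by each post-increment.  */
static void
copy3_post_inc (char *restrict a, const char *restrict b)
{
  *a++ = *b++;
  *a++ = *b++;
  *a = *b;
}

/* The same copy written the way the final_cleanup dump reads: each
   increment defines a fresh temporary, so no single name is both used
   and redefined.  */
static void
copy3_ssa_like (char *restrict a, const char *restrict b)
{
  *a = *b;
  char *a1 = a + 1;
  const char *b1 = b + 1;
  *a1 = *b1;
  *(a1 + 1) = *(b1 + 1);
}
```

Both functions copy the same three bytes; only the first one presents the
"use, then redefine the same variable" pattern directly.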


Bingfeng Mei
Broadcom UK



Re: RFC: Creating a live, all-encompassing architectural document for GCC

2007-11-02 Thread Gerald Pfeifer
On Fri, 26 Oct 2007, Diego Novillo wrote:
> So, I think the problem goes a bit beyond mere documentation of how a
> module works at a high level.  I would like to have a navigable
> document that also describes the flow of things, interfaces and
> helpers.  Starting at main.c:main() and ending at toplev.c:finalize().

Something like this is a key element for documentation, but it's hard
to do the way we have been documenting things indeed.  That sounds like
a very good idea.

> - Navigable.

That seems to ask, at least on the module level and below, for
something similar to literate programming, or we'll run out of sync
quickly.

> - Close correspondence to mainline.
> 
> This is where it gets hard.  We need to have a way of enforcing code
> updates that change internal or external API properties to be
> reflected in the document.  With this I don't mean that every single
> patch should be accompanied with a documentation change.  However, if
> a patch refactors a module and its internal interfaces are changed,
> then the patch should be accompanied with a change to the
> documentation.

I guess that's my main concern as well: how can we keep the various
bits of documentation -- comments in the code, texinfo, and your
proposed one -- in sync and do so without adding much effort for
individual contributors?

Another concern is the question of copyright assignments.  We are
requiring these for texinfo changes, but do not have anything in
place for the Wiki.  If we now get significantly improved documentation
on the latter, we may not be able to move that into the regular manuals.
Is this an issue?

Gerald


Re: gomp slowness

2007-11-02 Thread skaller

On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote:
> skaller <[EMAIL PROTECTED]> writes:

> In a C executable, TLS requires one extra machine register.  

You mean gcc?

> TLS
> variables are accessed via offsets from that register.  So what's the
> significant difference between that and your proposal?

I wasn't making a proposal. 

> The only way I can interpret your comments is that you are assuming
> that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared
> library).  But stack based thread local storage won't work for
> dlopen'ed shared libraries at all.

No, I was assuming implementation of a call like:

void *pthread_getspecific(pthread_key_t key);


> I think you need to look at the TLS access code before deciding that
> it has bad performance. 

You already said it costs a register? That's a REALLY high cost
to pay to support badly designed software.

-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: Last argument of lang_hooks_for_callgraph.analyze_expr unused?

2007-11-02 Thread Diego Novillo
On 11/1/07, Jan Hubicka <[EMAIL PROTECTED]> wrote:

> Just go ahead and kill it.  I would prefer to remove the whole hook,
> but we still keep some non-GIMPLE expressions in static initializers :(

Yeah, that's too bad.

Attached is the patch I committed.  This fixes the
latent bug in calls to analyze_expr that were being called with a
cgraph node instead of a decl.

Tested on x86_64.


Diego.
2007-11-02  Diego Novillo  <[EMAIL PROTECTED]>

* langhooks.h (struct lang_hooks_for_callgraph): Remove third
argument from function pointer ANALYZE_EXPR.  Update all
users.
* cgraph.c (debug_cgraph_node): New.
(debug_cgraph): New.

Index: cgraphbuild.c
===
--- cgraphbuild.c   (revision 129823)
+++ cgraphbuild.c   (working copy)
@@ -35,7 +35,7 @@ along with GCC; see the file COPYING3.  
Called via walk_tree: TP is pointer to tree to be examined.  */
 
 static tree
-record_reference (tree *tp, int *walk_subtrees, void *data)
+record_reference (tree *tp, int *walk_subtrees, void *data ATTRIBUTE_UNUSED)
 {
   tree t = *tp;
 
@@ -46,8 +46,7 @@ record_reference (tree *tp, int *walk_su
{
  varpool_mark_needed_node (varpool_node (t));
  if (lang_hooks.callgraph.analyze_expr)
-   return lang_hooks.callgraph.analyze_expr (tp, walk_subtrees,
- data);
+   return lang_hooks.callgraph.analyze_expr (tp, walk_subtrees);
}
   break;
 
@@ -73,7 +72,7 @@ record_reference (tree *tp, int *walk_su
}
 
   if ((unsigned int) TREE_CODE (t) >= LAST_AND_UNUSED_TREE_CODE)
-   return lang_hooks.callgraph.analyze_expr (tp, walk_subtrees, data);
+   return lang_hooks.callgraph.analyze_expr (tp, walk_subtrees);
   break;
 }
 
Index: cgraph.c
===
--- cgraph.c(revision 129823)
+++ cgraph.c(working copy)
@@ -657,7 +657,9 @@ cgraph_node_name (struct cgraph_node *no
 const char * const cgraph_availability_names[] =
   {"unset", "not_available", "overwrittable", "available", "local"};
 
-/* Dump given cgraph node.  */
+
+/* Dump call graph node NODE to file F.  */
+
 void
 dump_cgraph_node (FILE *f, struct cgraph_node *node)
 {
@@ -742,7 +744,17 @@ dump_cgraph_node (FILE *f, struct cgraph
   fprintf (f, "\n");
 }
 
-/* Dump the callgraph.  */
+
+/* Dump call graph node NODE to stderr.  */
+
+void
+debug_cgraph_node (struct cgraph_node *node)
+{
+  dump_cgraph_node (stderr, node);
+}
+
+
+/* Dump the callgraph to file F.  */
 
 void
 dump_cgraph (FILE *f)
@@ -754,7 +766,18 @@ dump_cgraph (FILE *f)
 dump_cgraph_node (f, node);
 }
 
+
+/* Dump the call graph to stderr.  */
+
+void
+debug_cgraph (void)
+{
+  dump_cgraph (stderr);
+}
+
+
 /* Set the DECL_ASSEMBLER_NAME and update cgraph hashtables.  */
+
 void
 change_decl_assembler_name (tree decl, tree name)
 {
Index: cgraph.h
===
--- cgraph.h(revision 129823)
+++ cgraph.h(working copy)
@@ -288,7 +288,9 @@ extern GTY(()) int cgraph_order;
 
 /* In cgraph.c  */
 void dump_cgraph (FILE *);
+void debug_cgraph (void);
 void dump_cgraph_node (FILE *, struct cgraph_node *);
+void debug_cgraph_node (struct cgraph_node *);
 void cgraph_insert_node_to_hashtable (struct cgraph_node *node);
 void cgraph_remove_edge (struct cgraph_edge *);
 void cgraph_remove_node (struct cgraph_node *);
Index: cp/cp-tree.h
===
--- cp/cp-tree.h(revision 129823)
+++ cp/cp-tree.h(working copy)
@@ -4303,7 +4303,7 @@ extern tree cp_build_parm_decl(tree, 
 extern tree get_guard  (tree);
 extern tree get_guard_cond (tree);
 extern tree set_guard  (tree);
-extern tree cxx_callgraph_analyze_expr (tree *, int *, tree);
+extern tree cxx_callgraph_analyze_expr (tree *, int *);
 extern void mark_needed(tree);
 extern bool decl_needed_p  (tree);
 extern void note_vague_linkage_fn  (tree);
Index: cp/decl2.c
===
--- cp/decl2.c  (revision 129823)
+++ cp/decl2.c  (working copy)
@@ -3026,8 +3026,7 @@ generate_ctor_and_dtor_functions_for_pri
Here we must deal with member pointers.  */
 
 tree
-cxx_callgraph_analyze_expr (tree *tp, int *walk_subtrees ATTRIBUTE_UNUSED,
-   tree from ATTRIBUTE_UNUSED)
+cxx_callgraph_analyze_expr (tree *tp, int *walk_subtrees ATTRIBUTE_UNUSED)
 {
   tree t = *tp;
 
Index: langhooks.c
===
--- langhooks.c (revision 129823)
+++ langhooks.c (working copy)
@@ -486,8 +486,7 @@ lhd_print_error_function (diagnostic_con
 
 tree
 

Re: gomp slowness

2007-11-02 Thread skaller

On Fri, 2007-11-02 at 10:46 -0400, Daniel Jacobowitz wrote:
> On Fri, Nov 02, 2007 at 07:39:33AM -0700, Ian Lance Taylor wrote:
> > The only way I can interpret your comments is that you are assuming
> > that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared
> > library).  But stack based thread local storage won't work for
> > dlopen'ed shared libraries at all.
> 
> Actually, from context I assume he's talking about pthread_setspecific
> and does not know about __thread.

Yes, I was talking about pthread_*, i.e. posix threads.

I do know about __thread though. 

My argument is basically: there is no need for any such
feature in a well written program. Each thread already has
its own local stack. Global variables should not be used
in the first place (except for signals etc where
there is no choice).

So any (application) program needing TLS (other than the stack) 
is automatically badly designed.  I've been writing code for 
three decades without using any global variables, ever since
I learned about re-entrancy.

-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread skaller

On Thu, 2007-11-01 at 21:02 -0700, Gary Funck wrote:
> On Thu, Oct 18, 2007 at 11:42:52AM +1000, skaller wrote:
> > 
> > DO you know how thread local variables are handled?
> > [Not using Posix TLS I hope .. that would be a disaster]
> 
> Would you please elaborate? 

Sure ..

>  What's wrong with the POSIX TLS implementation?   

I have no idea, the implementation is irrelevant: the
interface is likely orders of magnitude slower
than the proper way to do thread local storage:
use the stack.

Posix TLS is a hack. New code should NOT use TLS, it is only
for supporting broken legacy code. For example in the C
library the global errno variable. This is a DESIGN FAULT
in ISO C89 which can be 'repaired' by using TLS, but one should
never design such a bad interface in new code.

A really cool (non-Posix) implementation would put TLS globals
on the stack base .. but this does require at least one extra
machine register in languages like C which don't provide
a static display (pointer to parent function). For languages
that do, such as Modula and most FPLs, the display pointer
has to be provided anyhow, so the TLS globals come at
no extra cost.

> Do you know of any studies?

No, but I would guess gcc has some performance regression tests?

> I ask, because we presently use the TLS facility extensively,
> and have suspected that there are significant performance
> problems, but haven't looked into the issue.

There are bound to be performance issues if you have to query
any kind of global data base shared between threads to obtain
data local to the thread. The only exception to this is the data
held in the 'task state', which is typically just the machine
registers and in particular the stack.

If this data is on or reachable from the machine stack in the
first place, there's no performance problem and no need for TLS.
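
A minimal sketch of that style, assuming POSIX threads; the struct and
function names are invented for the example.  The per-thread state lives
in the thread's own stack frame and is passed down explicitly rather than
being looked up.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Per-thread state that might otherwise be a global or TLS variable.  */
struct thread_ctx
{
  int id;
  long work_done;
};

/* What the creator hands to each thread and reads back afterwards.  */
struct job
{
  int id;
  long result;
};

/* A re-entrant helper: everything it needs arrives through CTX.  */
static void
do_step (struct thread_ctx *ctx, long amount)
{
  assert (ctx != NULL);
  ctx->work_done += amount;
}

static void *
ctx_worker (void *arg)
{
  struct job *job = arg;
  /* The context lives on this thread's own stack.  */
  struct thread_ctx ctx = { .id = job->id, .work_done = 0 };

  for (long i = 1; i <= 100; i++)
    do_step (&ctx, i);          /* handed down explicitly */

  job->result = ctx.work_done;
  return NULL;
}

/* Run two jobs in parallel and return the combined result.  */
static long
run_jobs (void)
{
  struct job jobs[2] = { { 1, 0 }, { 2, 0 } };
  pthread_t threads[2];

  for (int i = 0; i < 2; i++)
    pthread_create (&threads[i], NULL, ctx_worker, &jobs[i]);
  for (int i = 0; i < 2; i++)
    pthread_join (threads[i], NULL);
  return jobs[0].result + jobs[1].result;
}
```

The cost of this approach is that every function on the call path has to
carry the ctx parameter.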


-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread Ian Lance Taylor
skaller <[EMAIL PROTECTED]> writes:

> On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote:
> > skaller <[EMAIL PROTECTED]> writes:
> 
> > In a C executable, TLS requires one extra machine register.  
> 
> You mean gcc?

I don't understand the question.  I mean in a C/C++ executable which
uses TLS.  By TLS I mean __thread, not pthread_getspecific.  In the
GNU/Linux world, TLS conventionally means specifically __thread.


> > I think you need to look at the TLS access code before deciding that
> > it has bad performance. 
> 
> You already said it costs a register? That's a REALLY high cost
> to pay to support badly designed software.

It only costs a register for code which accesses a TLS variable, of
course.


> My argument is basically: there is no need for any such
> feature in a well written program. Each thread already has
> its own local stack. Global variables should not be used
> in the first place (except for signals etc where
> there is no choice).

While global variables are rarely required, there are many cases where
a class is most reasonably implemented using static variables.  For
example, it's very hard to implement malloc without using any static
variables.  And for performance those static variables should be
thread local.

It's just not plausible to say that a C/C++ program should be written
without static variables.  Modern programs are written by different
organizations.  There is no way to pass appropriate state through all
the required interfaces.
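
As a sketch of the malloc point (not how any real malloc is written; the
names and sizes are made up), a per-thread cache of freed blocks can be
kept in static __thread variables so that the common path needs no lock:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { BLOCK_SIZE = 64, CACHE_MAX = 8 };

/* Per-thread cache of freed fixed-size blocks.  Each thread has its own
   copy, so no locking is needed to touch it.  */
static __thread void *cache[CACHE_MAX];
static __thread size_t cache_len;

/* Allocate one BLOCK_SIZE block, preferring the calling thread's cache.  */
static void *
block_alloc (void)
{
  if (cache_len > 0)
    return cache[--cache_len];
  return malloc (BLOCK_SIZE);
}

/* Release a block: keep it in this thread's cache if there is room,
   otherwise give it back to the system allocator.  */
static void
block_free (void *p)
{
  assert (p != NULL);
  if (cache_len < CACHE_MAX)
    cache[cache_len++] = p;
  else
    free (p);
}
```

The callers of block_alloc and block_free never see the cache, which is the
situation where passing state through every interface is impractical.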

Ian


Re: Results of 7z-4.55 performance with current GCCs.

2007-11-02 Thread NightStrike
On 11/1/07, Ted Byers <[EMAIL PROTECTED]> wrote:
> --- David Miller <[EMAIL PROTECTED]> wrote:
> > From: NightStrike <[EMAIL PROTECTED]>
> > Date: Thu, 1 Nov 2007 22:34:33 -0400
> >
> > > I think what is more important is the resulting
> > binary -- does it
> > > run faster?
> >
> > The answer to this is situation dependent.
> >
> > For example, for me, the speed of compilation at -O2
> > is very important
> > because I'm constantly doing full tree build
> > regressions.
> >
> > There are large groups of us who pine for
> > compilation to be as fast
> > as the old MIPS compilers were, and they were fully
> > optimizing
> > and even had a more advanced register allocator than
> > GCC has now.
> >
> I find it hard to fathom why the OP would be concerned
> with compile and run times measured in minutes and
> seconds. I don't know how long your full tree build
> regressions take, but for me, a very small application
> will take half an hour to compile, and a large one
> could take all day.  But if by hand tuning my code,
> and pushing my development tools to their limits, I
> can have my application finish a task in minutes where
> my predecessors' versions took hours (something I
> commonly see, perhaps by chance, with the projects I
> find myself working on), the savings of my clients'
> users' time is greater than the cost of my time by
> several orders of magnitude, so I don't mind waiting
> for a build to finish if the end product is provably
> correct.
>
> There is much more to both compile time and run time
> performance than how fast your development tools are.
> I expect more recent tools to take longer than the
> tools I used even five years ago, simply because there
> is much more for them to do; and as they get better, I
> can use more demanding parts of the language (my
> preferred language is C++) that simply weren't
> practical a few years ago.  As I do this, then my
> tools must work harder still.  It isn't only the
> tools, but what you do with them ...
>
> If I may state the obvious, an outstanding programmer
> can easily make a mediocre development tool look good,
> while a mediocre programmer can make even the best
> tools look very bad.  That said, I often download open
> source applications (all good quality), and the GCC
> suite takes longer to build than all the rest combined
> (that is, of the ones I download), and since that
> finishes in but a few hours on my machine, I won't
> worry about how fast gcc compiles code until it takes
> many days to compile itself.  :-)
>
> As you say, performance questions and answers depend
> on the situation.  But I say, the single most
> important question is, "Is the code correct?"  that
> is, does it produce output that is provably correct.
> There is no point in having an insanely fast program
> if it only, or even only generally, produces garbage.
> As important as performance is, the correctness of the
> code is, to my mind, infinitely more important than
> either compile or runtime performance!
>
> I would encourage the good folk who work on GCC to
> focus on making the code correct first, and only after
> that can be proven, worry about making it faster.
> Really bad things can happen to real people if my
> programs give incorrect results (think about things
> like contaminant transport, dose/risk assessments,
> &c., and how someone I have never met may suffer if my
> application gives a consultant or civil servant
> unreliable results).  When you think about the things
> relevant to the work I do, you will understand why I
> don't care if my build times are measured in hours or
> days or even weeks as long as my clients' users can
> work more efficiently and obtain provably correct
> results from my programs.  Computers are cheap these
> days, so if I find myself too often waiting for a
> build to complete, I'll just get another computer to
> work on while I wait for the one doing the build to
> finish.
>
> I don't help develop GCC, but may I express to those
> that do that I appreciate their efforts.

I agree with you 100%.  It has always been my view that if you can't
compile fast enough, then get another machine and use distcc, or get a
quad core and do make -j5, etc etc.  Compile time should never
outweigh code correctness, and if it takes longer to compile more
correct code, then that's just the nature of moving forward into the
future.


Re: gomp slowness

2007-11-02 Thread Olivier Galibert
On Sat, Nov 03, 2007 at 03:31:14AM +1100, skaller wrote:
> On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote:
> > I think you need to look at the TLS access code before deciding that
> > it has bad performance. 
> 
> You already said it costs a register? That's a REALLY high cost
> to pay to support badly designed software.

Not if you have a lot of registers (anything modern but i386) or if
the register can not really be used for anything else (%fs on i386 for
instance).

  OG.


Re: gomp slowness

2007-11-02 Thread Olivier Galibert
On Sat, Nov 03, 2007 at 03:38:51AM +1100, skaller wrote:
> My argument is basically: there is no need for any such
> feature in a well written program. Each thread already has
> its own local stack. Global variables should not be used
> in the first place (except for signals etc where
> there is no choice).

And for libraries where there is no choice.  Happens rather often.

  OG.


Re: Autovectorized HIRLAM - latest results.

2007-11-02 Thread Toon Moene

Sebastian Pop wrote:
> On Oct 29, 2007 10:49 AM, Dorit Nuzman <[EMAIL PROTECTED]> wrote:
>> I wonder if it's versioning-for-aliasing (run-time dependence testing) that
>> was responsible for a lot of the new vectorizable loops
>
> It is then possible that the code size noticeably increased.  Toon
> could you provide more data on the size of the executables with and
> without vectorization, and also:

Unfortunately, the binaries are gone.

> $ grep 'versioning for alias checks' HL_Prepare_00.html | wc -l

This I can do:

1095

or slightly over half of the difference.

Kind regards,

--
Toon Moene - e-mail: [EMAIL PROTECTED] - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.indiv.nluug.nl/~toon/
GNU Fortran's path to Fortran 2003: http://gcc.gnu.org/wiki/Fortran2003


Re: gomp slowness

2007-11-02 Thread Robert Dewar

Olivier Galibert wrote:
> On Sat, Nov 03, 2007 at 03:38:51AM +1100, skaller wrote:
>> My argument is basically: there is no need for any such
>> feature in a well written program. Each thread already has
>> its own local stack. Global variables should not be used
>> in the first place (except for signals etc where
>> there is no choice).
>
> And for libraries where there is no choice.  Happens rather often.

There are lots of cases where global thread specific variables
are useful in practice, ask anyone who has programmed real world
large scale real time embedded programs. One obvious example is
the stack limit for checking stack overflow on subprogram entry.

>   OG.


Re: gomp slowness

2007-11-02 Thread Joel Dice

On Sat, 3 Nov 2007, skaller wrote:
> On Fri, 2007-11-02 at 10:46 -0400, Daniel Jacobowitz wrote:
>> On Fri, Nov 02, 2007 at 07:39:33AM -0700, Ian Lance Taylor wrote:
>>> The only way I can interpret your comments is that you are assuming
>>> that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared
>>> library).  But stack based thread local storage won't work for
>>> dlopen'ed shared libraries at all.
>>
>> Actually, from context I assume he's talking about pthread_setspecific
>> and does not know about __thread.
>
> Yes, I was talking about pthread_*, i.e. posix threads.
>
> I do know about __thread though.
>
> My argument is basically: there is no need for any such
> feature in a well written program. Each thread already has
> its own local stack. Global variables should not be used
> in the first place (except for signals etc where
> there is no choice).
>
> So any (application) program needing TLS (other than the stack)
> is automatically badly designed.  I've been writing code for
> three decades without using any global variables, ever since
> I learned about re-entrancy.

While I agree that global and thread-local variables are to be avoided in
general, I wonder how you would treat the following case:

Function A needs to supply some data to function Z, but only via a call
stack containing functions B through Y.  Must B through Y all take that
data as a parameter and pass it as an argument?  Would it not be more
efficient to store it in a register on a register-rich machine instead of
copying it on every call?

Even putting runtime efficiency aside, the code in the intermediate
functions is made arguably less readable by each bit of context it has to
forward along but never use itself.  OTOH, using thread-locals introduces
"action at a distance", which never helps readability.


Re: Results of 7z-4.55 performance with current GCCs.

2007-11-02 Thread David Miller
From: NightStrike <[EMAIL PROTECTED]>
Date: Fri, 2 Nov 2007 10:42:01 -0400

> I agree with you 100%.  It has always been my view that if you can't
> compile fast enough, then get another machine and use distcc, or get a
> quad core and do make -j5, etc etc.

I have 64 cpu machines and use make -j64, it's still not fast enough
and I know it could be much faster.

Note that a faster GCC would also make GCC development go
significantly quicker, since every developer has to do a full
bootstrap and regression run before checking in non-trivial changes.

The world is not black and white, it is shades of gray.  Compilation
time matters to some people, and I'm not in the slightest suggesting
that it should trump code correctness.  I don't know where anyone got
that idea from what I've been saying.


Re: gcc 4.3.0 revision 129794 on hppa2.0w-hp-hpux11.00

2007-11-02 Thread John David Anglin
> Comments? Should I file a bug report?

Yes.  It would help to know where the reference to ggc_free
in rtl.o comes from (i.e., look at preprocessed source for this
file).  I don't think this file normally uses ggc_free().

Dave
-- 
J. David Anglin  [EMAIL PROTECTED]
National Research Council of Canada  (613) 990-0752 (FAX: 952-6602)


gcc-4.3-20071102 is now available

2007-11-02 Thread gccadmin
Snapshot gcc-4.3-20071102 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.3-20071102/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.3 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/trunk revision 129862

You'll find:

gcc-4.3-20071102.tar.bz2            Complete GCC (includes all of below)

gcc-core-4.3-20071102.tar.bz2       C front end and core compiler

gcc-ada-4.3-20071102.tar.bz2        Ada front end and runtime

gcc-fortran-4.3-20071102.tar.bz2    Fortran front end and runtime

gcc-g++-4.3-20071102.tar.bz2        C++ front end and runtime

gcc-java-4.3-20071102.tar.bz2       Java front end and runtime

gcc-objc-4.3-20071102.tar.bz2       Objective-C front end and runtime

gcc-testsuite-4.3-20071102.tar.bz2  The GCC testsuite

Diffs from 4.3-20071026 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.3
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: gomp slowness

2007-11-02 Thread skaller

On Fri, 2007-11-02 at 10:29 -0700, Ian Lance Taylor wrote:
> skaller <[EMAIL PROTECTED]> writes:
> 
> > On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote:
> > > skaller <[EMAIL PROTECTED]> writes:
> > 
> > > In a C executable, TLS requires one extra machine register.  
> > 
> > You mean gcc?
> 
> I don't understand the question.  I mean in a C/C++ executable which
> uses TLS.  By TLS I mean __thread, not pthread_getspecific.  In the
> GNU/Linux world, TLS conventionally means specifically __thread.

I see why you didn't understand the question .. you're so buried
in gcc world C=gcc .. whereas for me C=ISO Standard, any compiler.
TLS = __thread jargon noted, thanks (for me, TLS meant both
generic 'storage local to a thread=the stack' or 'posix',
not __thread which is a gcc extension).

> It only costs a register for code which accesses a TLS variable, of
> course.

What do you mean by 'code'? Do you mean an individual function? If so,
how is the register loaded? 

> While global variables are rarely required, there are many cases where
> a class is most reasonably implemented using static variables.  For
> example, it's very hard to implement malloc without using any static
> variables.

That's true, and again is an example of a badly designed function
in the C standard, probably a legacy of the days when the OS
supplied each individual heap block (and long before threads existed).
This should really be done with two functions: one that fetches a large
block from the OS, and another that suballocates it. That avoids
any static variables and would permit unsynchronised allocation
in threaded code if the programmer designed it that way.
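A minimal sketch of that two-function split, assuming hypothetical names (arena, arena_init, arena_alloc, arena_release) that do not come from any real library. Here malloc merely stands in for "fetch a large block from the OS"; a real version might call mmap instead. All allocator state lives in a structure the caller owns, so a thread that keeps an arena to itself can allocate from it without locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical arena type: every piece of allocator state is held here,
   not in static variables. */
typedef struct arena {
    unsigned char *base;  /* start of the large block */
    size_t size;          /* total bytes in the block */
    size_t used;          /* bytes handed out so far */
} arena;

/* Function 1: obtain one large block. malloc is a stand-in for an OS
   request. Returns 1 on success, 0 on failure. */
int arena_init(arena *a, size_t size) {
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* Function 2: suballocate n bytes from the caller's arena, rounding the
   start up to a 16-byte boundary. Returns NULL when the arena is full. */
void *arena_alloc(arena *a, size_t n) {
    size_t start = (a->used + 15) & ~(size_t)15;
    if (start > a->size || n > a->size - start)
        return NULL;
    a->used = start + n;
    return a->base + start;
}

/* Give the whole block back in one step. */
void arena_release(arena *a) {
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}
```

A thread would typically create one arena on its own stack and pass a pointer to it down to whatever needs to allocate.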


>   And for performance those static variables should be
> thread local.
> 
> It's just not plausible to say that a C/C++ program should be written
> without static variables.  

Rubbish:) 30 years of programming, never used one in serious code
unless some brain dead API forced it.

> Modern programs are written by different
> organizations.  There is no way to pass appropriate state through all
> the required interfaces.

Of course there is. It's called design by contract.
I do it all the time. I am appalled at code bases like
GTK and interfaces like OpenMP which get such really
basic things wrong. 

-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread skaller

On Fri, 2007-11-02 at 19:56 +0100, Olivier Galibert wrote:
> On Sat, Nov 03, 2007 at 03:31:14AM +1100, skaller wrote:
> > On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote:
> > > I think you need to look at the TLS access code before deciding that
> > > it has bad performance. 
> > 
> > You already said it costs a register? That's a REALLY high cost
> > to pay to support badly designed software.
> 
> Not if you have a lot of registers (anything modern but i386) or if
> the register can not really be used for anything else (%fs on i386 for
> instance).

This is not true. If you use a register for any purpose like this,
it can't be used for anything else and that has a cost.

On x86_64 which I use, every register is valuable. Don't you dare
take one away, it would have a serious performance impact AND
it would stop ME using that register for something which my application
might consider much more important, for example as a pointer to the
minor heap in a copying collector (Ocaml does this, it is the
reason it has very high performance.. I would do this in Felix too
if I could figure out how to reload the variable in a callback
invoked by foreign code). This applies to %fs on i386 too: if
the compiler uses that register, other uses are denied, and the
compiler can't tell which is more important.


-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread skaller

On Fri, 2007-11-02 at 20:00 +0100, Olivier Galibert wrote:
> On Sat, Nov 03, 2007 at 03:38:51AM +1100, skaller wrote:
> > My argument is basically: there is no need for any such
> > feature in a well written program. Each thread already has
> > its own local stack. Global variables should not be used
> > in the first place (except for signals etc where
> > there is no choice).
> 
> And for libraries where there is no choice.  Happens rather often.

Yes, but that is covered by 'well written program' qualification.
If you have to use a badly designed library which used global
variables, then I agree you're stuck. That is what I meant
by saying TLS is a hack to support legacy code, and should
not be required in new code. Clearly, if you mix new code
with code that calls legacy libraries, the problem is not
eliminated until you re-write those libraries, or rewrite
your app to use a better library.

I'd be rather worried if a compiler used up a register just
to support legacy libraries that my code might never call.

-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread Andrew Pinski
> This is not true. If you use a register for any purpose like this,
> it can't be used for anything else and that has a cost.

This is a segment register.  Please go and read about what segment
registers are.  They are not real registers and cannot be used for
anything except memory accesses.  They date back to the 16-bit
processor days, when you could have 24 bits of address space but needed
to switch the segment register to reach it; remember far/near memory?
Now some other ABIs on other processors just use a normal register (on
PPC, it is r13), but then again those processors usually have a larger
register set than x86 (or x86_64).

Thanks,
Andrew Pinski


Re: gomp slowness

2007-11-02 Thread skaller

On Fri, 2007-11-02 at 15:31 -0400, Robert Dewar wrote:
> Olivier Galibert wrote:

> There are lots of cases where global thread specific variables
> are useful in practice, ask anyone who has programmed real world
> large scale real time embedded programs.

No. And I have done just that myself. There is a use for 
hackery, for example profiling, debugging, etc.

Otherwise .. well the major real time code I did work on
had a big effort in place to *eliminate* all singletons
and static variables because they caused huge problems
generalising the code. Millions of lines of badly designed C++.

In fact, it is the other way around: for SMALL programs running
on tiny processors, global storage has to be used sometimes
because of the limitations of the processor. For example
I wrote lots of assembler for the 6802 microcontroller,
which doesn't allow you to *access* the subroutine stack.

>  One obvious example is
> the stack limit for checking stack overflow on subprogram entry.

Not required if the stack limit is stored in the stack itself.
The fact that this is hard to arrange just shows you AGAIN
there is a badly designed interface somewhere along the line.

Felix pthreads record the stack base and stack pointer,
which is used for a conservative scan of the stack by the
garbage collector.. guess what? No static variables.
There's no stack limit check, but that would be easy to
add to the code. (but I think the way you'd do this
in Linux code would be to use mmap() and an invalid
block: AFAIK that's what Ocaml does, so the young
heap allocator is a SINGLE register increment .. this
is rather fast .. :)
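A rough sketch of the mmap-plus-invalid-block idea, assuming a POSIX system with anonymous mappings. The names map_with_guard, unmap_with_guard, region and bump are hypothetical, and this is not a description of how Ocaml or Felix actually do it. The guard page sits at the high end, which suits a region that grows upward; for a stack that grows downward it would go at the low end instead.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `pages` usable pages followed by one page with no access rights.
   Touching the guard page raises SIGSEGV, so no explicit limit
   comparison is needed on the allocation path. */
void *map_with_guard(size_t pages, size_t *usable_len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = (pages + 1) * page;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (mprotect(p + pages * page, page, PROT_NONE) != 0) {
        munmap(p, len);
        return NULL;
    }
    *usable_len = pages * page;
    return p;
}

/* Unmap the usable pages together with the guard page. */
int unmap_with_guard(void *p, size_t usable_len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return munmap(p, usable_len + page);
}

/* Bump allocation with no bounds test: running off the end shows up as
   a fault when the memory is touched, not as a failed comparison. */
typedef struct region { unsigned char *next; } region;

void *bump(region *r, size_t n) {
    void *p = r->next;
    r->next += n;
    return p;
}
```

A single guard page only catches overruns that land inside it, so an allocation larger than a page could step over it; real implementations have to account for that.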

-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread Ian Lance Taylor
skaller <[EMAIL PROTECTED]> writes:

> On Fri, 2007-11-02 at 10:29 -0700, Ian Lance Taylor wrote:
> > skaller <[EMAIL PROTECTED]> writes:
> > 
> > > On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote:
> > > > skaller <[EMAIL PROTECTED]> writes:
> > > 
> > > > In a C executable, TLS requires one extra machine register.  
> > > 
> > > You mean gcc?
> > 
> > I don't understand the question.  I mean in a C/C++ executable which
> > uses TLS.  By TLS I mean __thread, not pthread_getspecific.  In the
> > GNU/Linux world, TLS conventionally means specifically __thread.
> 
> I see why you didn't understand the question .. you're so buried
> in gcc world C=gcc .. whereas for me C=ISO Standard, any compiler.
> TLS = __thread jargon noted, thanks (for me, TLS meant both
> generic 'storage local to a thread=the stack' or 'posix',
> not __thread which is a gcc extension).

I am familiar with other compilers.  TLS (__thread) is not specific to
gcc.  In fact, it was invented by Sun, and first implemented in their
compiler.


> > It only costs a register for code which accesses a TLS variable, of
> > course.
> 
> What do you mean by 'code'? Do you mean an individual function? If so,
> how is the register loaded? 

It depends on the processor.

On the x86 and x86_64 a segment register is used.  The segment
register is set by the OS when context switching to the thread.  (As
mentioned downthread, segment registers are special in the hardware
and are not available for general compiler use.)
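For readers following along, here is a small sketch contrasting the two mechanisms named in this thread: a __thread variable and a POSIX key accessed with pthread_setspecific/pthread_getspecific. The variable and function names (tls_counter, posix_key, worker, tls_demo) are made up for illustration. On x86_64 GNU/Linux an access to tls_counter typically compiles to an %fs-relative load or store, while the POSIX key goes through library calls.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Compiler-supported TLS: each thread sees its own copy. */
static __thread int tls_counter = 0;

/* POSIX alternative: one key, one value slot per thread. */
static pthread_key_t posix_key;
static int main_slot, worker_slot;  /* objects the key values point at */
static int worker_initial = -1;     /* what the worker saw on entry */

static void *worker(void *arg) {
    (void)arg;
    worker_initial = tls_counter;   /* fresh copy, so this reads 0 */
    tls_counter = 42;               /* changes only this thread's copy */
    pthread_setspecific(posix_key, &worker_slot);
    assert(pthread_getspecific(posix_key) == &worker_slot);
    return NULL;
}

/* Returns 1 if the main thread's values are untouched by the worker. */
int tls_demo(void) {
    pthread_t t;
    int ok;
    if (pthread_key_create(&posix_key, NULL) != 0)
        return 0;
    tls_counter = 7;
    pthread_setspecific(posix_key, &main_slot);
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return 0;
    pthread_join(t, NULL);
    ok = worker_initial == 0
      && tls_counter == 7
      && pthread_getspecific(posix_key) == &main_slot;
    pthread_key_delete(posix_key);
    return ok;
}
```

Build with -pthread on systems where the thread library is separate from libc.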


> > It's just not plausible to say that a C/C++ program should be written
> > without static variables.  
> 
> Rubbish:) 30 years of programming, never used one in serious code
> unless some brain dead API forced it.

Those brain dead APIs are the ones the rest of us actually work with
every day.  If you can disregard them, that's fine.  The rest of us
will continue to operate in the real world.  Nobody is forcing you to
use TLS for anything at all, so you can continue to ignore it.

Ian


Re: gomp slowness

2007-11-02 Thread skaller

On Fri, 2007-11-02 at 13:47 -0600, Joel Dice wrote:

> > So any (application) program needing TLS (other than the stack)
> > is automatically badly designed.  I've been writing code for
> > three decades without using any global variables, ever since
> > I learned about re-entrancy.
> 
> While I agree that global and thread-local variables are to be avoided in 
> general, I wonder how you would treat the following case:

> Function A needs to supply some data to function Z, but only via a call 
> stack containing functions B through Y.  Must B through Y all take that 
> data as a parameter and pass it as an argument? 

Yes. Felix does precisely this.
Note the ABI will often pass it in a register,
and the variable is not used in functions B-Y .. 
so gcc+ABI will probably optimise away the explicit pointer 
passing anyhow, in effect leaving exactly the 'store it
in an unused register' anyhow.

Felix trusts gcc to handle this reasonably
well and it does seem to do so in most cases.

>  Would it not be more 
> efficient to store it in a register on a register-rich machine instead of 
> copying it on every call?

No: 'As-if' rule. Copying is efficient because it can be optimised
away. You write your C code with explicit passing and complain
right here in this list if gcc doesn't optimise it efficiently :)
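The explicit-passing style can be sketched as follows; the functions func_a, func_b, func_y, func_z and the ctx structure are hypothetical stand-ins for the A..Z chain in the question. On the x86_64 System V ABI the first pointer argument travels in %rdi, so the intermediate functions may simply leave it in place, and small static functions like these are candidates for inlining. Whether a given gcc version does either is something to check in the generated assembly, not assume.

```c
#include <assert.h>

/* Hypothetical context: created by A, read only by Z. */
typedef struct ctx { int limit; } ctx;

/* Z is the only function that looks inside the context. */
static int func_z(const ctx *c, int x) {
    return x < c->limit ? x : c->limit;
}

/* Y and B forward the pointer without touching it. */
static int func_y(const ctx *c, int x) { return func_z(c, x + 1); }
static int func_b(const ctx *c, int x) { return func_y(c, x * 2); }

/* A owns the context on its own stack, so the chain is re-entrant and
   needs no global or thread-local variable. */
int func_a(int limit, int x) {
    ctx c = { limit };
    return func_b(&c, x);
}
```

The coupling between A and Z is visible in every signature along the way, which is the readability trade-off being argued here.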

> Even putting runtime efficiency aside, the code in the intermediate 
> functions is made arguably less readable by each bit of context it has to 
> forward along but never use itself.  

No, it's the other way around. When you use global variables to 
store state, you cannot easily reason about your code because
the coupling is not represented in the function argument-parameter
bindings explicitly.

> OTOH, using thread-locals introduces 
> "action at a distance", which never helps readability.

Yes.

-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread skaller

On Sat, 2007-11-03 at 12:27 +1100, skaller wrote:
> On Fri, 2007-11-02 at 10:29 -0700, Ian Lance Taylor wrote:

> Of course there is. It's called design by contract.
> I do it all the time. I am appalled at code bases like
> GTK and interfaces like OpenMP which get such really
> basic things wrong. 

Argg .. correction! I meant OpenGL, not OpenMP!
WOOPS!!! Associative memory has its problems :)

-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread skaller

On Fri, 2007-11-02 at 18:45 -0700, Andrew Pinski wrote:
> > This is not true. If you use a register for any purpose like this,
> > it can't be used for anything else and that has a cost.
> 
> This is a segment register.  Please go and read about what segment
> registers are.

I know how the x86 works quite well .. perhaps unfortunately I've
written several major applications in x86 assembler (including
a complete text editor). 

-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread Robert Dewar

skaller wrote:
> On Fri, 2007-11-02 at 18:45 -0700, Andrew Pinski wrote:
>>> This is not true. If you use a register for any purpose like this,
>>> it can't be used for anything else and that has a cost.
>> This is a segment register.  Please go and read about what segment
>> registers are.
>
> I know how the x86 works quite well .. perhaps unfortunately I've
> written several major applications in x86 assembler (including
> a complete text editor).

in which case you should understand that the use of FS is free


Re: gomp slowness

2007-11-02 Thread Robert Dewar

skaller wrote:
> This is not true. If you use a register for any purpose like this,
> it can't be used for anything else and that has a cost.
>
> On x86_64 which I use, every register is valuable. Don't you dare
> take one away, it would have a serious performance impact AND
> it would stop ME using that register for something which my application
> might consider much more important, for example as a pointer to the
> minor heap in a copying collector (Ocaml does this, it is the
> reason it has very high performance.. I would do this in Felix too
> if I could figure out how to reload the variable in a callback
> invoked by foreign code). This applies to %fs on i386 too: if
> the compiler uses that register, other uses are denied, and the
> compiler can't tell which is more important.

You really can't be serious in your comment about fs, if you
understand the architecture ...


Re: gomp slowness

2007-11-02 Thread skaller

On Fri, 2007-11-02 at 23:54 -0400, Robert Dewar wrote:
> skaller wrote:
> > On Fri, 2007-11-02 at 18:45 -0700, Andrew Pinski wrote:
> >>> This is not true. If you use a register for any purpose like this,
> >>> it can't be used for anything else and that has a cost.
> >> This is a segment register.  Please go and read about what segment
> >> registers are.
> > 
> > I know how the x86 works quite well .. perhaps unfortunately I've
> > written several major applications in x86 assembler (including
> > a complete text editor). 
> 
> in which case you should understand that the use of FS is free
> > 

No. Free for whom? Sure, if the ABI doesn't specify it and
gcc doesn't otherwise use it then gcc using it doesn't impact
an ISO C program.

but it *could* impact a program that had already chosen
to conditionally use that register if it happened to be
possible with the C compiler, as it is with gcc,
or if the compiler didn't use it (promise!) and one was
prepared to add some assembler.

I'm sure you know this kind of thing is quite common in
language translators and interpreters, especially with respect
to memory allocation (possibly garbage collection) .. in fact
the same kind of use as suggested for __thread.

Neko, for example, uses a register. AFAIK MLton does the
same kind of thing. If gcc team thinks ANY register is free
to steal they'd be wrong -- that doesn't mean it shouldn't
be used, just that it definitely is NOT free.

I can tell you I definitely considered using FS for the
Felix thread frame pointer to save passing that pointer
between every function.. 

-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread Ian Lance Taylor
skaller <[EMAIL PROTECTED]> writes:

> Neko, for example, uses a register. AFAIK MLton does the
> same kind of thing. If gcc team thinks ANY register is free
> to steal they'd be wrong -- that doesn't mean it shouldn't
> be used, just that it definitely is NOT free.

To be clear, it is not the gcc team which is stealing the register.
As I said earlier, TLS (i.e., __thread) was defined by Sun.  They
defined the implementation for i386 and SPARC.  Other organizations
have carried it forward to other processors.  Here is the Sun
documentation:
http://docs.sun.com/app/docs/doc/817-1984/6mhm7pl2a
TLS is implemented via a combination of the compiler, the system
library, and the kernel.

As I said before, the register is only stolen for code which actually
uses TLS.

Ian


Re: gomp slowness

2007-11-02 Thread skaller

On Fri, 2007-11-02 at 23:56 -0400, Robert Dewar wrote:
> skaller wrote:

> You really can't be serious in your comment about fs, if you
> understand the architecture ...

You're just not thinking the same way I am. A CPU has state,
the compiler and application program manage that state.

If the compiler can use a register, the application can too.
They're therefore competing for the use of the register.

If gcc wants to use it for TLS but I want to use it for
a different purpose, there's a conflict.

If the OS uses the register and denies it to the programmer,
or specifies a particular use, then BOTH the compiler and
application have to respect that.

Bottom line: gcc isn't the only code generator. I can write
assembler by hand. I can also access machine registers
directly with gcc (nice feature!)

GHC Haskell takes the assembler output gcc produces and runs
a Perl script over it to fix it up. It's called 'registering'
the code. GHC uses continuation passing and gcc can't represent the
model very well .. that's GHC's solution: to trick gcc.
You can bet they're interested in what optimisations gcc
does because they have to find the 'fixup' points.

Mercury, Felix and MLton use 'assembler labels' to work around
gcc's own inability to compile large functions, coupled
with C's lack of decent control structures (which can
be implemented in flat code that gcc can't compile).

In MLton, it's known gcc can break it by duplicating the
code (and thus the assembler labels). This has never
happened with Felix, but in theory it might.

All these applications are using gcc with C plus tricks
to establish their own environment. So using a register
isn't free.

-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread skaller

On Fri, 2007-11-02 at 22:35 -0700, Ian Lance Taylor wrote:
> skaller <[EMAIL PROTECTED]> writes:
> 
> > Neko, for example, uses a register. AFAIK MLton does the
> > same kind of thing. If gcc team thinks ANY register is free
> > to steal they'd be wrong -- that doesn't mean it shouldn't
> > be used, just that it definitely is NOT free.
> 
> To be clear, it is not the gcc team which is stealing the register.
> As I said earlier, TLS (i.e., __thread) was defined by Sun.  They
> defined the implementation for i386 and SPARC.  Other organizations
> have carried it forward to other processors.  Here is the Sun
> documentation:
> http://docs.sun.com/app/docs/doc/817-1984/6mhm7pl2a
> TLS is implemented via a combination of the compiler, the system
> library, and the kernel.

Thanks! So the use is defined by a protocol which gcc is following.

> As I said before, the register is only stolen for code which actually
> uses TLS.

So scanning that document, for x86_64, fs is used in startup
code, presumably if, and only if, there is a linker section
containing __thread variables?


-- 
John Skaller 
Felix, successor to C++: http://felix.sf.net


Re: gomp slowness

2007-11-02 Thread Ian Lance Taylor
skaller <[EMAIL PROTECTED]> writes:

> > As I said before, the register is only stolen for code which actually
> > uses TLS.
> 
> So scanning that document, for x86_64, fs is used in startup
> code, presumably if, and only if, there is a linker section
> containing __thread variables?

Yes.

Ian