Re: Extending jumps after block reordering
> Questions:
> * shorten_branches() computes sizes of instructions so I know what the
>   distance is between a jump instr and its target label. But how do I know
>   what is the maximum distance each kind of branch can safely take?
>   bb-reorder.c assumes that it's only when cold/hot partitions are crossed
>   it has to use indirect jumps, which is not the appropriate test in my case.

You cannot easily, it's buried in the architecture back-ends.

> * do I get it right that shorten_branches() does not really modify
>   instructions but it helps to shorten branches by providing more accurate
>   insns lengths?

Yes, but this should work automatically. IOW, as Ian said, you shouldn't
need to do anything special. Maybe it's simply a latent bug in the PPC
back-end.

-- 
Eric Botcazou
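
To make the "maximum distance" question concrete, here is a minimal
sketch in C of the kind of range test a back-end performs. It is not GCC
code: fits_signed_disp, cond_branch_in_range and the two width constants
are hypothetical names, and the 16-bit and 26-bit byte displacements are
assumed as the usual figures for PowerPC conditional and unconditional
branches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of GCC: return true if the byte
   displacement DISP fits in a signed field of BITS bits.  */
bool
fits_signed_disp (int64_t disp, unsigned bits)
{
  assert (bits >= 1 && bits <= 63);
  int64_t lo = -((int64_t) 1 << (bits - 1));
  int64_t hi = ((int64_t) 1 << (bits - 1)) - 1;
  return disp >= lo && disp <= hi;
}

/* Assumed widths, in bits of byte displacement.  */
enum { COND_BRANCH_BITS = 16, UNCOND_BRANCH_BITS = 26 };

/* Return true if a conditional branch at address FROM can reach address
   TO directly.  When it cannot, the usual fallback is an inverted
   conditional branch around an unconditional jump.  */
bool
cond_branch_in_range (int64_t from, int64_t to)
{
  return fits_signed_disp (to - from, COND_BRANCH_BITS);
}
```

With accurate insn lengths from shorten_branches(), FROM and TO are known
well enough for a test of this shape to be reliable.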
Re: Tree-SSA and POST_INC address mode incompatible in GCC4?
Hi Bingfeng,

On 11/2/07, Bingfeng Mei <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I look at the following the code to see what is the difference between
> GCC4 and GCC3 in using POST_INC address mode (or other similar modes).
>
> void tst(char * __restrict__ a, char * __restrict__ b){
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a = *b;
> }

We have seen this in a number of other ports as well - I had hacked up a
patch to sort this precise problem out but that was for trunk / 4.3 and
is not applicable for 4.2.x since the autoincrement detector was
rewritten post 4.2.

http://gcc.gnu.org/ml/gcc-patches/2007-09/msg01060.html

I haven't yet had time to rework this based on the comments but it
surely is on my radar of things to do.

cheers
Ramana

-- 
Ramana Radhakrishnan
GNU Tools
Celunite Inc.
Re: Results of 7z-4.55 performance with current GCCs.
2007/11/2, NightStrike <[EMAIL PROTECTED]>:
> On 11/1/07, Ted Byers <[EMAIL PROTECTED]> wrote:
> > --- David Miller <[EMAIL PROTECTED]> wrote:
> > > ...
>
> I agree with you 100%. It has always been my view that if you can't
> compile fast enough, then get another machine and use distcc, or get a
> quad core and do make -j5, etc etc. Compile time should never
> outweigh code correctness, and if it takes longer to compile more
> correct code, then that's just the nature of moving forward into the
> future.

"Save the planet and don't add more wood to the fire".
Re: gomp slowness
On Fri, Nov 02, 2007 at 07:39:33AM -0700, Ian Lance Taylor wrote:
> The only way I can interpret your comments is that you are assuming
> that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared
> library). But stack based thread local storage won't work for
> dlopen'ed shared libraries at all.

Actually, from context I assume he's talking about pthread_setspecific
and does not know about __thread.

-- 
Daniel Jacobowitz
CodeSourcery
Re: gomp slowness
skaller <[EMAIL PROTECTED]> writes:

> A really cool (non-Posix) implementation would put TLS globals
> on the stack base .. but this does require at least one extra
> machine register in languages like C which don't provide
> a static display (pointer to parent function). For languages
> that do, such as Modula and most FPLs, the display pointer
> has to be provided anyhow, so the TLS globals come at
> no extra cost.

In a C executable, TLS requires one extra machine register. TLS
variables are accessed via offsets from that register. So what's the
significant difference between that and your proposal?

> There are bound to be performance issues if you have to query
> any kind of global data base shared between threads to obtain
> data local to the thread. The only exception to this is the data
> held in the 'task state', which is typically just the machine
> registers and in particular the stack.

TLS does not require querying a global data base to get thread local
data.

The only way I can interpret your comments is that you are assuming
that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared
library). But stack based thread local storage won't work for
dlopen'ed shared libraries at all.

I think you need to look at the TLS access code before deciding that
it has bad performance. Make sure to look at the code in the
executable after the linker has optimized it.

Ian
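
For readers who want something to compile and inspect, here is a minimal
sketch in C of the __thread usage being discussed. The names tls_counter,
worker and run_two_workers are made up for illustration. Each thread gets
its own copy of tls_counter, and in the source the access looks like an
ordinary variable access; what it compiles to depends on the target and
the TLS model, so disassembling the linked executable is the way to check.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One instance of this variable exists per thread.  */
static __thread int tls_counter = 0;

/* Thread body: bump this thread's own copy N times and report it.  */
static void *
worker (void *arg)
{
  int n = *(int *) arg;
  for (int i = 0; i < n; i++)
    tls_counter++;
  *(int *) arg = tls_counter;   /* hand the result back to the caller */
  return NULL;
}

/* Run two workers, store their final counts in OUT[0] and OUT[1], and
   return the calling thread's own copy, which no worker touched.  */
int
run_two_workers (int out[2])
{
  pthread_t t[2];
  int n[2] = { 3, 5 };
  for (int i = 0; i < 2; i++)
    {
      int rc = pthread_create (&t[i], NULL, worker, &n[i]);
      assert (rc == 0);
      (void) rc;
    }
  for (int i = 0; i < 2; i++)
    pthread_join (t[i], NULL);
  out[0] = n[0];
  out[1] = n[1];
  return tls_counter;
}
```

Calling run_two_workers from a thread that has not touched tls_counter
yields 3 and 5 in OUT and a return value of 0, since no copy is shared.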
RE: Tree-SSA and POST_INC address mode incompatible in GCC4?
Hi, Ramana,

I tried the trunk version with/without your patch. It still produces the
same code as gcc4.2.2 does. In auto-inc-dec.c, the comments say

    *a
    ...
    a <- a + c

becomes

    *(a += c) post

But the problem is that after the Tree-SSA pass there is no

    a <- a + c

but something like

    a_1 <- a + c

unless auto-inc-dec.c can reverse a_1 <- a + c to a <- a + c. I don't
see that this transformation is applicable in most scenarios. Any
comments?

Cheers,
Bingfeng

-----Original Message-----
From: Ramana Radhakrishnan [mailto:[EMAIL PROTECTED]
Sent: 02 November 2007 12:39
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Tree-SSA and POST_INC address mode incompatible in GCC4?

Hi Bingfeng,

On 11/2/07, Bingfeng Mei <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I look at the following the code to see what is the difference between
> GCC4 and GCC3 in using POST_INC address mode (or other similar modes).
>
> void tst(char * __restrict__ a, char * __restrict__ b){
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a++ = *b++;
>   *a = *b;
> }

We have seen this in a number of other ports as well - I had hacked up a
patch to sort this precise problem out but that was for trunk / 4.3 and
is not applicable for 4.2.x since the autoincrement detector was
rewritten post 4.2.

http://gcc.gnu.org/ml/gcc-patches/2007-09/msg01060.html

I haven't yet had time to rework this based on the comments but it
surely is on my radar of things to do.

cheers
Ramana

-- 
Ramana Radhakrishnan
GNU Tools
Celunite Inc.
Re: Dependency output
> "timtuun" == timtuun <[EMAIL PROTECTED]> writes:

timtuun> I was wondering if there is a particular reason why object
timtuun> name in dependency output doesn't include the directory where
timtuun> the output is written?

Just conservatism -- the options have worked this way for a long time.
See PR 30491.

timtuun> Am I completely wrong saying that some older version it would
timtuun> have been objects/buffer.o: ... instead of just buffer.o:
timtuun> ... ?

Yeah, that would have been a better choice. I don't know why it was
not done that way. I'm reluctant to change it, however, for fear of
breaking a script that uses gcc.

Also, it is easy enough to use -MT to get the target name you want.
This is what automake does, for instance.

Tom
Dependency output
Hi.

I was wondering if there is a particular reason why object name in
dependency output doesn't include the directory where the output is
written?

For example when compiling vim version 7.1 I get the following result.

[EMAIL PROTECTED] vim71]$ gcc --version
gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13)

$DEPENDENCIES_OUTPUT is set

gcc -c -I. -Iproto -DHAVE_CONFIG_H -DFEAT_GUI_GTK -I/usr/include/gtk-2.0
-I/usr/lib/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo
-I/usr/include/pango-1.0 -I/usr/include/glib-2.0
-I/usr/lib/glib-2.0/include -I/usr/include/freetype2
-I/usr/include/libpng12 -g -O2 -o objects/buffer.o buffer.c

Produces following in the dependency file:

buffer.o: buffer.c vim.h auto/config.h feature.h os_unix.h auto/osdef.h \
  ascii.h keymap.h term.h macros.h option.h structs.h regexp.h gui.h \
  gui_beval.h /usr/include/gtk-2.0/gtk/gtkwidget.h

Am I completely wrong saying that some older version it would have been
objects/buffer.o: ... instead of just buffer.o: ... ?

Timo
Tree-SSA and POST_INC address mode incompatible in GCC4?
Hello,

I looked at the following code to see what the difference is between
GCC4 and GCC3 in using POST_INC address mode (or other similar modes).

void tst(char * __restrict__ a, char * __restrict__ b){
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a = *b;
}

Using ARM processor as a target, GCC4.2.2 generates the following
assembly:

tst:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mov     r2, r1
        ldrb    ip, [r2], #1    @ zero_extendqisi2
        mov     r3, r0
        strb    ip, [r3], #1
        ldrb    r1, [r1, #1]    @ zero_extendqisi2
        strb    r1, [r0, #1]
        ldrb    r1, [r2, #1]    @ zero_extendqisi2
        strb    r1, [r3, #1]
        add     r2, r2, #1
        ldrb    r1, [r2, #1]    @ zero_extendqisi2
        add     r3, r3, #1
        strb    r1, [r3, #1]
        add     r2, r2, #1
        ldrb    r1, [r2, #1]    @ zero_extendqisi2
        add     r3, r3, #1
        strb    r1, [r3, #1]
        add     r2, r2, #1
        ldrb    r1, [r2, #1]    @ zero_extendqisi2
        add     r3, r3, #1
        strb    r1, [r3, #1]
        ldrb    r2, [r2, #2]    @ zero_extendqisi2
        @ lr needed for prologue
        strb    r2, [r3, #2]
        bx      lr
        .size   tst, .-tst
        .ident  "GCC: (GNU) 4.2.2"

And GCC3.4.6 generates much better code by using POST_INC address mode
extensively:

tst:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        strb    r3, [r0], #1
        ldrb    r3, [r1, #0]    @ zero_extendqisi2
        @ lr needed for prologue
        strb    r3, [r0, #0]
        mov     pc, lr
        .size   tst, .-tst
        .ident  "GCC: (GNU) 3.4.6"

I looked at the dumped tst.c.102t.final_cleanup:

tst (a, b)
{
  char * restrict a.54;
  char * restrict a.53;
  char * restrict a.52;
  char * restrict a.51;
  char * restrict a.50;
  char * restrict b.48;
  char * restrict b.47;
  char * restrict b.46;
  char * restrict b.45;
  char * restrict b.44;

:
  *a = *b;
  a.50 = a + 1B;
  b.44 = b + 1B;
  *a.50 = *b.44;
  a.51 = a.50 + 1B;
  b.45 = b.44 + 1B;
  *a.51 = *b.45;
  a.52 = a.51 + 1B;
  b.46 = b.45 + 1B;
  *a.52 = *b.46;
  a.53 = a.52 + 1B;
  b.47 = b.46 + 1B;
  *a.53 = *b.47;
  a.54 = a.53 + 1B;
  b.48 = b.47 + 1B;
  *a.54 = *b.48;
  *(a.54 + 1B) = *(b.48 + 1B);
  return;

}

I believe it is a fundamental issue for Tree-SSA IR. POST_INC address
mode requires a pattern in which the same variable is used for
incrementing (both USE and DEF), while the SSA form produces a different
variable for each DEF. Therefore, GCC4 cannot efficiently use POST_INC
and other similar address modes. Is there any solution to overcome this
problem? Any suggestion is greatly appreciated.

Bingfeng Mei
Broadcom UK
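
The two shapes compared in this message can be written side by side in
plain C. This is only a sketch for experimenting with a compiler: the
function names tst_postinc and tst_ssa and the temporaries a1..a5 and
b1..b5 are made up, with the second function mirroring by hand the
final_cleanup dump. Both perform the same seven byte copies; the
difference is only that the second defines a fresh name per increment.

```c
#include <assert.h>
#include <string.h>

/* Source form: each copy advances the same two pointers.  */
void
tst_postinc (char *restrict a, char *restrict b)
{
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a++ = *b++;
  *a = *b;
}

/* SSA-like form: every increment defines a new pointer, so no single
   variable is both used and redefined by the same statement.  */
void
tst_ssa (char *restrict a, char *restrict b)
{
  *a = *b;
  char *a1 = a + 1, *b1 = b + 1;
  *a1 = *b1;
  char *a2 = a1 + 1, *b2 = b1 + 1;
  *a2 = *b2;
  char *a3 = a2 + 1, *b3 = b2 + 1;
  *a3 = *b3;
  char *a4 = a3 + 1, *b4 = b3 + 1;
  *a4 = *b4;
  char *a5 = a4 + 1, *b5 = b4 + 1;
  *a5 = *b5;
  *(a5 + 1) = *(b5 + 1);
}
```

Compiling both with the same options and comparing the assembly is a
quick way to see whether a given compiler version recovers the
post-increment addressing from the second form.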
Re: RFC: Creating a live, all-encompassing architectural document for GCC
On Fri, 26 Oct 2007, Diego Novillo wrote:
> So, I think the problem goes a bit beyond mere documentation of how a
> module works at a high level. I would like to have a navigable
> document that also describes the flow of things, interfaces and
> helpers. Starting at main.c:main() and ending at toplev.c:finalize().

Something like this is a key element for documentation, but it's hard to
do the way we have been documenting things indeed. That sounds like a
very good idea.

> - Navigable.

That seems to ask, at least on the module level and below, for something
similar to literate programming, or we'll run out of sync quickly.

> - Close correspondence to mainline.
>
> This is where it gets hard. We need to have a way of enforcing code
> updates that change internal or external API properties to be
> reflected in the document. With this I don't mean that every single
> patch should be accompanied with a documentation change. However, if
> a patch refactors a module and its internal interfaces are changed,
> then the patch should be accompanied with a change to the
> documentation.

I guess that's my main concern as well: how can we keep the various bits
of documentation -- comments in the code, texinfo, and your proposed
one -- in sync and do so without adding much effort for individual
contributors?

Another concern is the question of copyright assignments. We are
requiring these for texinfo changes, but do not have anything in place
for the Wiki. If we now get significantly improved documentation on the
latter, we may not be able to move that into the regular manuals. Is
this an issue?

Gerald
Re: gomp slowness
On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote:
> skaller <[EMAIL PROTECTED]> writes:

> In a C executable, TLS requires one extra machine register.

You mean gcc?

> TLS
> variables are accessed via offsets from that register. So what's the
> significant difference between that and your proposal?

I wasn't making a proposal.

> The only way I can interpret your comments is that you are assuming
> that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared
> library). But stack based thread local storage won't work for
> dlopen'ed shared libraries at all.

No, I was assuming implementation of a call like:

    void *pthread_getspecific(pthread_key_t key);

> I think you need to look at the TLS access code before deciding that
> it has bad performance.

You already said it costs a register? That's a REALLY high cost
to pay to support badly designed software.

-- 
John Skaller
Felix, successor to C++: http://felix.sf.net
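
Since the thread keeps switching between the two interfaces, here is a
minimal sketch in C that puts them next to each other. The wrapper names
(set_via_key, get_via_key, set_direct, get_direct) are invented for
illustration; the pthread_* calls are the standard POSIX ones. The POSIX
route goes through a key and a library call on every access, while the
__thread route is written as a plain variable access.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_key_t key;                        /* POSIX key, created once */
static pthread_once_t key_once = PTHREAD_ONCE_INIT;
static __thread int direct_slot;                 /* compiler-supported TLS */

/* Create the key exactly once, with no destructor.  */
static void
make_key (void)
{
  int rc = pthread_key_create (&key, NULL);
  assert (rc == 0);
  (void) rc;
}

/* Store V for the calling thread through the POSIX interface.  */
void
set_via_key (int *v)
{
  pthread_once (&key_once, make_key);
  pthread_setspecific (key, v);
}

/* Fetch the calling thread's pointer through the POSIX interface.  */
int *
get_via_key (void)
{
  pthread_once (&key_once, make_key);
  return pthread_getspecific (key);
}

/* The same two operations through __thread.  */
void
set_direct (int v)
{
  direct_slot = v;
}

int
get_direct (void)
{
  return direct_slot;
}
```

Which of the two is cheaper on a given system is exactly what the
posters disagree about; the sketch only shows the difference in
interface, not a measurement.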
Re: Last argument of lang_hooks_for_callgraph.analzye_tree unused?
On 11/1/07, Jan Hubicka <[EMAIL PROTECTED]> wrote:

> Just go ahead and kill it. I would prefer to remove the whole hook,
> but we still keep some non-GIMPLE expressions in static initializers :(

Yeah, that's too bad.

Attached is the patch I committed. This fixes the latent bug in calls
to analyze_expr that were being called with a cgraph node instead of a
decl.

Tested on x86_64.


Diego.

2007-11-02  Diego Novillo  <[EMAIL PROTECTED]>

	* langhooks.h (struct lang_hooks_for_callgraph): Remove third
	argument from function pointer ANALYZE_EXPR.  Update all users.
	* cgraph.c (debug_cgraph_node): New.
	(debug_cgraph): New.

Index: cgraphbuild.c
===================================================================
--- cgraphbuild.c	(revision 129823)
+++ cgraphbuild.c	(working copy)
@@ -35,7 +35,7 @@ along with GCC; see the file COPYING3.
    Called via walk_tree: TP is pointer to tree to be examined.  */
 
 static tree
-record_reference (tree *tp, int *walk_subtrees, void *data)
+record_reference (tree *tp, int *walk_subtrees, void *data ATTRIBUTE_UNUSED)
 {
   tree t = *tp;
 
@@ -46,8 +46,7 @@ record_reference (tree *tp, int *walk_su
 	{
 	  varpool_mark_needed_node (varpool_node (t));
 	  if (lang_hooks.callgraph.analyze_expr)
-	    return lang_hooks.callgraph.analyze_expr (tp, walk_subtrees,
-						      data);
+	    return lang_hooks.callgraph.analyze_expr (tp, walk_subtrees);
 	}
       break;
 
@@ -73,7 +72,7 @@ record_reference (tree *tp, int *walk_su
 	}
 
       if ((unsigned int) TREE_CODE (t) >= LAST_AND_UNUSED_TREE_CODE)
-	return lang_hooks.callgraph.analyze_expr (tp, walk_subtrees, data);
+	return lang_hooks.callgraph.analyze_expr (tp, walk_subtrees);
       break;
     }
 
Index: cgraph.c
===================================================================
--- cgraph.c	(revision 129823)
+++ cgraph.c	(working copy)
@@ -657,7 +657,9 @@ cgraph_node_name (struct cgraph_node *no
 const char * const cgraph_availability_names[] =
   {"unset", "not_available", "overwrittable", "available", "local"};
 
-/* Dump given cgraph node.  */
+
+/* Dump call graph node NODE to file F.  */
+
 void
 dump_cgraph_node (FILE *f, struct cgraph_node *node)
 {
@@ -742,7 +744,17 @@ dump_cgraph_node (FILE *f, struct cgraph
   fprintf (f, "\n");
 }
 
-/* Dump the callgraph.  */
+
+/* Dump call graph node NODE to stderr.  */
+
+void
+debug_cgraph_node (struct cgraph_node *node)
+{
+  dump_cgraph_node (stderr, node);
+}
+
+
+/* Dump the callgraph to file F.  */
 void
 dump_cgraph (FILE *f)
 {
@@ -754,7 +766,18 @@ dump_cgraph (FILE *f)
     dump_cgraph_node (f, node);
 }
 
+
+/* Dump the call graph to stderr.  */
+
+void
+debug_cgraph (void)
+{
+  dump_cgraph (stderr);
+}
+
+
 /* Set the DECL_ASSEMBLER_NAME and update cgraph hashtables.  */
+
 void
 change_decl_assembler_name (tree decl, tree name)
 {
Index: cgraph.h
===================================================================
--- cgraph.h	(revision 129823)
+++ cgraph.h	(working copy)
@@ -288,7 +288,9 @@ extern GTY(()) int cgraph_order;
 
 /* In cgraph.c  */
 void dump_cgraph (FILE *);
+void debug_cgraph (void);
 void dump_cgraph_node (FILE *, struct cgraph_node *);
+void debug_cgraph_node (struct cgraph_node *);
 void cgraph_insert_node_to_hashtable (struct cgraph_node *node);
 void cgraph_remove_edge (struct cgraph_edge *);
 void cgraph_remove_node (struct cgraph_node *);
Index: cp/cp-tree.h
===================================================================
--- cp/cp-tree.h	(revision 129823)
+++ cp/cp-tree.h	(working copy)
@@ -4303,7 +4303,7 @@ extern tree cp_build_parm_decl(tree,
 extern tree get_guard (tree);
 extern tree get_guard_cond (tree);
 extern tree set_guard (tree);
-extern tree cxx_callgraph_analyze_expr (tree *, int *, tree);
+extern tree cxx_callgraph_analyze_expr (tree *, int *);
 extern void mark_needed (tree);
 extern bool decl_needed_p (tree);
 extern void note_vague_linkage_fn (tree);
Index: cp/decl2.c
===================================================================
--- cp/decl2.c	(revision 129823)
+++ cp/decl2.c	(working copy)
@@ -3026,8 +3026,7 @@ generate_ctor_and_dtor_functions_for_pri
    Here we must deal with member pointers.  */
 
 tree
-cxx_callgraph_analyze_expr (tree *tp, int *walk_subtrees ATTRIBUTE_UNUSED,
-			    tree from ATTRIBUTE_UNUSED)
+cxx_callgraph_analyze_expr (tree *tp, int *walk_subtrees ATTRIBUTE_UNUSED)
 {
   tree t = *tp;
 
Index: langhooks.c
===================================================================
--- langhooks.c	(revision 129823)
+++ langhooks.c	(working copy)
@@ -486,8 +486,7 @@ lhd_print_error_function (diagnostic_con
 
 tree
Re: gomp slowness
On Fri, 2007-11-02 at 10:46 -0400, Daniel Jacobowitz wrote:
> On Fri, Nov 02, 2007 at 07:39:33AM -0700, Ian Lance Taylor wrote:
> > The only way I can interpret your comments is that you are assuming
> > that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared
> > library). But stack based thread local storage won't work for
> > dlopen'ed shared libraries at all.
>
> Actually, from context I assume he's talking about pthread_setspecific
> and does not know about __thread.

Yes, I was talking about pthread_*, i.e. posix threads.
I do know about __thread though.

My argument is basically: there is no need for any such
feature in a well written program. Each thread already has
its own local stack. Global variables should not be used
in the first place (except for signals etc where
there is no choice).

So any (application) program needing TLS (other than the stack)
is automatically badly designed. I've been writing code for
three decades without using any global variables, ever since
I learned about re-entrancy.

-- 
John Skaller
Felix, successor to C++: http://felix.sf.net
Re: gomp slowness
On Thu, 2007-11-01 at 21:02 -0700, Gary Funck wrote:
> On Thu, Oct 18, 2007 at 11:42:52AM +1000, skaller wrote:
> >
> > DO you know how thread local variables are handled?
> > [Not using Posix TLS I hope .. that would be a disaster]
>
> Would you please elaborate?

Sure ..

> What's wrong with the POSIX TLS implementation?

I have no idea, the implementation is irrelevant: the interface
is likely orders of magnitude slower than the proper way to do
thread local storage: use the stack.

Posix TLS is a hack. New code should NOT use TLS, it is only
for supporting broken legacy code. For example in the C library
the global errno variable. This is a DESIGN FAULT in ISO C89
which can be 'repaired' by using TLS, but one should never
design such a bad interface in new code.

A really cool (non-Posix) implementation would put TLS globals
on the stack base .. but this does require at least one extra
machine register in languages like C which don't provide
a static display (pointer to parent function). For languages
that do, such as Modula and most FPLs, the display pointer
has to be provided anyhow, so the TLS globals come at
no extra cost.

> Do you know of any studies?

No, but I would guess gcc has some performance regression tests?

> I ask, because we presently use the TLS facility extensively,
> and have suspected that there are significant performance
> problems, but haven't looked into the issue.

There are bound to be performance issues if you have to query
any kind of global data base shared between threads to obtain
data local to the thread. The only exception to this is the data
held in the 'task state', which is typically just the machine
registers and in particular the stack.

If this data is on or reachable from the machine stack in the
first place, there's no performance problem and no need for TLS.

-- 
John Skaller
Felix, successor to C++: http://felix.sf.net
Re: gomp slowness
skaller <[EMAIL PROTECTED]> writes:

> On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote:
> > skaller <[EMAIL PROTECTED]> writes:
>
> > In a C executable, TLS requires one extra machine register.
>
> You mean gcc?

I don't understand the question. I mean in a C/C++ executable which
uses TLS. By TLS I mean __thread, not pthread_getspecific. In the
GNU/Linux world, TLS conventionally means specifically __thread.

> > I think you need to look at the TLS access code before deciding that
> > it has bad performance.
>
> You already said it costs a register? That's a REALLY high cost
> to pay to support badly designed software.

It only costs a register for code which accesses a TLS variable, of
course.

> My argument is basically: there is no need for any such
> feature in a well written program. Each thread already has
> its own local stack. Global variables should not be used
> in the first place (except for signals etc where
> there is no choice).

While global variables are rarely required, there are many cases where
a class is most reasonably implemented using static variables. For
example, it's very hard to implement malloc without using any static
variables. And for performance those static variables should be
thread local.

It's just not plausible to say that a C/C++ program should be written
without static variables. Modern programs are written by different
organizations. There is no way to pass appropriate state through all
the required interfaces.

Ian
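
The malloc remark can be illustrated with a deliberately tiny sketch in
C. This is not how any real malloc works: arena_alloc, arena_bytes_used
and ARENA_SIZE are hypothetical, and nothing is ever freed. The point is
only that the allocator's state is static, hidden behind the interface,
and per thread, so callers pass no state and no locking is needed.

```c
#include <assert.h>
#include <stddef.h>

enum { ARENA_SIZE = 4096 };

/* Per-thread state hidden behind the interface; callers never pass it.  */
static __thread unsigned char arena[ARENA_SIZE] __attribute__ ((aligned (16)));
static __thread size_t arena_used;

/* Hypothetical bump allocator: return SIZE bytes from the calling
   thread's arena, or NULL when the arena is exhausted.  */
void *
arena_alloc (size_t size)
{
  size_t aligned = (size + 15) & ~(size_t) 15;   /* round up to 16 bytes */
  assert (arena_used <= ARENA_SIZE);
  if (aligned < size || aligned > ARENA_SIZE - arena_used)
    return NULL;
  void *p = arena + arena_used;
  arena_used += aligned;
  return p;
}

/* Bytes handed out so far by the calling thread.  */
size_t
arena_bytes_used (void)
{
  return arena_used;
}
```

Making the two variables plain static instead of __thread would keep the
same interface but force every call to take a lock, which is the
performance point made above.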
Re: Results of 7z-4.55 performance with current GCCs.
On 11/1/07, Ted Byers <[EMAIL PROTECTED]> wrote:
> --- David Miller <[EMAIL PROTECTED]> wrote:
> > From: NightStrike <[EMAIL PROTECTED]>
> > Date: Thu, 1 Nov 2007 22:34:33 -0400
> >
> > > I think what is more important is the resulting binary -- does it
> > > run faster?
> >
> > The answer to this is situational dependent.
> >
> > For example, for me, the speed of compilation at -O2 is very
> > important because I'm constantly doing full tree build regressions.
> >
> > There are large groups of us who pine for compilation to be as fast
> > as the old MIPS compilers were, and they were fully optimizing
> > and even had a more advanced register allocator than GCC has now.
> >
> I find it hard to fathom why the OP would be concerned
> with compile and run times measured in minutes and
> seconds. I don't know how long your full tree build
> regressions take, but for me, a very small application
> will take half an hour to compile, and a large one
> could take all day. But if by hand tuning my code,
> and pushing my development tools to their limits, I
> can have my application finish a task in minutes where
> my predecessors' versions took hours (something I
> commonly see, perhaps by chance, with the projects I
> find myself working on), the savings of my clients'
> users' time is greater than the cost of my time by
> several orders of magnitude, so I don't mind waiting
> for a build to finish if the end product is provably
> correct.
>
> There is much more to both compile time and run time
> performance than how fast your development tools are.
> I expect more recent tools to take longer than the
> tools I used even five years ago, simply because there
> is much more for them to do; and as they get better, I
> can use more demanding parts of the language (my
> preferred language is C++) that simply weren't
> practical a few years ago. As I do this, then my
> tools must work harder still. It isn't only the
> tools, but what you do with them ...
>
> If I may state the obvious, an outstanding programmer
> can easily make a mediocre development tool look good,
> while a mediocre programmer can make even the best
> tools look very bad. That said, I often download open
> source applications (all good quality), and the GCC
> suite takes longer to build than all the rest combined
> (that is, of the ones I download), and since that
> finishes in but a few hours on my machine, I won't
> worry about how fast gcc compiles code until it takes
> many days to compile itself. :-)
>
> As you say, performance questions and answers depend
> on the situation. But I say, the single most
> important question is, "Is the code correct?" that
> is, does it produce output that is provably correct.
> There is no point in having an insanely fast program
> if it only, or even only generally, produces garbage.
> As important as performance is, the correctness of the
> code is, to my mind, infinitely more important than
> either compile or runtime performance!
>
> I would encourage the good folk who work on GCC to
> focus on making the code correct first, and only after
> that can be proven, worry about making it faster.
> Really bad things can happen to real people if my
> programs give incorrect results (think about things
> like contaminant transport, dose/risk assessments,
> &c., and how someone I have never met may suffer if my
> application gives a consultant or civil servant
> unreliable results). When you think about the things
> relevant to the work I do, you will understand why I
> don't care if my build times are measured in hours or
> days or even weeks as long as my clients' users can
> work more efficiently and obtain provably correct
> results from my programs. Computers are cheap these
> days, so if I find myself too often waiting for a
> build to complete, I'll just get another computer to
> work on while I wait for the one doing the build to
> finish.
>
> I don't help develop GCC, but may I express to those
> that do that I appreciate their efforts.

I agree with you 100%. It has always been my view that if you can't
compile fast enough, then get another machine and use distcc, or get a
quad core and do make -j5, etc etc. Compile time should never
outweigh code correctness, and if it takes longer to compile more
correct code, then that's just the nature of moving forward into the
future.
Re: gomp slowness
On Sat, Nov 03, 2007 at 03:31:14AM +1100, skaller wrote:
> On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote:
> > I think you need to look at the TLS access code before deciding that
> > it has bad performance.
>
> You already said it costs a register? That's a REALLY high cost
> to pay to support badly designed software.

Not if you have a lot of registers (anything modern but i386) or if
the register can not really be used for anything else (%fs on i386 for
instance).

  OG.
Re: gomp slowness
On Sat, Nov 03, 2007 at 03:38:51AM +1100, skaller wrote:
> My argument is basically: there is no need for any such
> feature in a well written program. Each thread already has
> its own local stack. Global variables should not be used
> in the first place (except for signals etc where
> there is no choice).

And for libraries where there is no choice. Happens rather often.

  OG.
Re: Autovectorized HIRLAM - latest results.
Sebastian Pop wrote:
> On Oct 29, 2007 10:49 AM, Dorit Nuzman <[EMAIL PROTECTED]> wrote:
>> I wonder if it's versioning-for-aliasing (run-time dependence testing) that was responsible for a lot of the new vectorizable loops
> It is then possible that the code size noticeably increased. Toon could you provide more data on the size of the executables with and without vectorization, and also:

Unfortunately, the binaries are gone.

> $ grep 'versioning for alias checks' HL_Prepare_00.html | wc -l

This I can do: 1095 or slightly over half of the difference.

Kind regards,
--
Toon Moene - e-mail: [EMAIL PROTECTED] - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.indiv.nluug.nl/~toon/
GNU Fortran's path to Fortran 2003: http://gcc.gnu.org/wiki/Fortran2003
Re: gomp slowness
Olivier Galibert wrote:
> On Sat, Nov 03, 2007 at 03:38:51AM +1100, skaller wrote:
>> My argument is basically: there is no need for any such feature in a well written program. Each thread already has its own local stack. Global variables should not be used in the first place (except for signals etc where there is no choice).
> And for libraries where there is no choice. Happens rather often.

There are lots of cases where global thread specific variables are useful in practice, ask anyone who has programmed real world large scale real time embedded programs. One obvious example is the stack limit for checking stack overflow on subprogram entry.

> OG.
Re: gomp slowness
On Sat, 3 Nov 2007, skaller wrote:
> On Fri, 2007-11-02 at 10:46 -0400, Daniel Jacobowitz wrote:
>> On Fri, Nov 02, 2007 at 07:39:33AM -0700, Ian Lance Taylor wrote:
>>> The only way I can interpret your comments is that you are assuming that all TLS is Global Dynamic (e.g., accessed from a dlopen'ed shared library). But stack based thread local storage won't work for dlopen'ed shared libraries at all.
>> Actually, from context I assume he's talking about pthread_setspecific and does not know about __thread.
> Yes, I was talking about pthread_*, i.e. posix threads. I do know about __thread though. My argument is basically: there is no need for any such feature in a well written program. Each thread already has its own local stack. Global variables should not be used in the first place (except for signals etc where there is no choice).
> So any (application) program needing TLS (other than the stack) is automatically badly designed. I've been writing code for three decades without using any global variables, ever since I learned about re-entrancy.

While I agree that global and thread-local variables are to be avoided in general, I wonder how you would treat the following case:

Function A needs to supply some data to function Z, but only via a call stack containing functions B through Y. Must B through Y all take that data as a parameter and pass it as an argument? Would it not be more efficient to store it in a register on a register-rich machine instead of copying in on every call?

Even putting runtime efficiency aside, the code in the intermediate functions is made arguably less readable by each bit of context it has to forward along but never use itself. OTOH, using thread-locals introduces "action at a distance", which never helps readability.
Re: Results of 7z-4.55 performance with current GCCs.
From: NightStrike <[EMAIL PROTECTED]> Date: Fri, 2 Nov 2007 10:42:01 -0400 > I agree with you 100%. It has always been my view that if you can't > compile fast enough, then get another machine and use distcc, or get a > quad core and do make -j5, etc etc. I have 64 cpu machines and use make -j64, it's still not fast enough and I know it could be much faster. Note that a faster GCC would also make GCC development go significantly quicker, since every developer has to do a full bootstrap and regression run before checking in non-trivial changes. The world is not black and white, it is shades of gray. Compilation time matters to some people, and I'm not in the slightest suggesting that it should trump code correctness. I don't know where anyone got that idea from what I've been saying.
Re: gcc 4.3.0 revision 129794 on hppa2.0w-hp-hpux11.00
> Comments? Should I file a bug report? Yes. It would help to know where the reference to ggc_free in rtl.o comes from (i.e., look at preprocessed source for this file). I don't think this file normally uses ggc_free(). Dave -- J. David Anglin [EMAIL PROTECTED] National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
gcc-4.3-20071102 is now available
Snapshot gcc-4.3-20071102 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.3-20071102/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.3 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/trunk revision 129862

You'll find:

gcc-4.3-20071102.tar.bz2            Complete GCC (includes all of below)
gcc-core-4.3-20071102.tar.bz2       C front end and core compiler
gcc-ada-4.3-20071102.tar.bz2        Ada front end and runtime
gcc-fortran-4.3-20071102.tar.bz2    Fortran front end and runtime
gcc-g++-4.3-20071102.tar.bz2        C++ front end and runtime
gcc-java-4.3-20071102.tar.bz2       Java front end and runtime
gcc-objc-4.3-20071102.tar.bz2       Objective-C front end and runtime
gcc-testsuite-4.3-20071102.tar.bz2  The GCC testsuite

Diffs from 4.3-20071026 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.3 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
Re: gomp slowness
On Fri, 2007-11-02 at 10:29 -0700, Ian Lance Taylor wrote: > skaller <[EMAIL PROTECTED]> writes: > > > On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote: > > > skaller <[EMAIL PROTECTED]> writes: > > > > > In a C executable, TLS requires one extra machine register. > > > > You mean gcc? > > I don't understand the question. I mean in a C/C++ executable which > uses TLS. By TLS I mean __thread, not pthread_get_specific. In the > GNU/Linux world, TLS conventionally means specifically __thread. I see why you didn't understand the question .. you're so buried in gcc world C=gcc .. whereas for me C=ISO Standard, any compiler. TLS = __thread jargon noted, thanks (for me, TLS meant both generic 'storage local to a thread=the stack' or 'posix', not __thread which is a gcc extension). > It only costs a register for code which accesses a TLS variable, of > course. What do you mean 'code'? Do you mean an individual function? If so, how is the register loaded? > While global variables are rarely required, there are many cases where > a class is most reasonably implemented using static variables. For > example, it's very hard to implement malloc without using any static > variables. That's true, and again is an example of a badly designed function in the C standard, probably a legacy of the days when the OS supplied each individual heap block (and long before threads existed). This should really be done with two functions: one that fetches a large block from the OS, and another that suballocates it. That avoids any static variables and would permit unsynchronised allocation in threaded code if the programmer designed it that way. > And for performance those static variables should be > thread local. > > It's just not plausible to say that a C/C++ program should be written > without static variables. Rubbish:) 30 years of programming, never used one in serious code unless some brain dead API forced it. > Modern programs are written by different > organizations. There is no way to pass appropriate state through all > the required interfaces. Of course there is. It's called design by contract. I do it all the time. I am appalled at code bases like GTK and interfaces like OpenMP which get such really basic things wrong. -- John Skaller Felix, successor to C++: http://felix.sf.net
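The two-function allocator described above (one call that fetches a large block, another that suballocates from it) might look roughly like the sketch below. This is only an illustration, not code from anyone in the thread: the names arena, arena_init, arena_alloc and arena_release are made up, malloc stands in for whatever OS call would supply the block, and a fixed 16-byte alignment is assumed. All allocator state lives in a struct owned by the caller, so there are no static variables, and a thread that owns its own arena needs no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; nothing here is a real library API. */
enum { ARENA_ALIGN = 16 };   /* assumed alignment, not taken from the thread */

typedef struct arena {
    unsigned char *base;  /* start of the large block */
    size_t         size;  /* total bytes in the block */
    size_t         used;  /* bytes handed out so far  */
} arena;

/* Function 1: fetch one large block. malloc stands in for the OS here. */
int arena_init(arena *a, size_t size)
{
    assert(a != NULL);
    a->base = malloc(size);
    a->size = a->base ? size : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Function 2: suballocate from the block. All state is in *a. */
void *arena_alloc(arena *a, size_t n)
{
    assert(a != NULL);
    /* Round the current offset up to the next multiple of ARENA_ALIGN. */
    size_t start = (a->used + ARENA_ALIGN - 1) / ARENA_ALIGN * ARENA_ALIGN;
    if (start > a->size || n > a->size - start)
        return NULL;          /* block exhausted */
    a->used = start + n;
    return a->base + start;
}

/* Give the whole block back at once; individual frees are not supported. */
void arena_release(arena *a)
{
    assert(a != NULL);
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

A caller would declare an arena on its own stack, call arena_init once, and pass the pointer to whatever code needs to allocate. The trade-off is the one argued over in the thread: the state has to be passed explicitly through every interface that allocates.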
Re: gomp slowness
On Fri, 2007-11-02 at 19:56 +0100, Olivier Galibert wrote: > On Sat, Nov 03, 2007 at 03:31:14AM +1100, skaller wrote: > > On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote: > > > I think you need to look at the TLS access code before deciding that > > > it has bad performance. > > > > You already said it costs a register? That's a REALLY high cost > > to pay to support badly designed software. > > Not if you have a lot of registers (anything modern but i386) or if > the register can not really be used for anything else (%fs on i386 for > instance). This is not true. If you use a register for any purpose like this, it can't be used for anything else and that has a cost. On x86_64 which I use, every register is valuable. Don't you dare take one away, it would have a serious performance impact AND it would stop ME using that register for something which my application might consider much more important, for example as a pointer to the minor heap in a copying collector (Ocaml does this, it is the reason it has very high performance.. I would do this in Felix too if I could figure out how to reload the variable in a callback invoked by foreign code). This applies to %fs on i386 too: if the compiler uses that register, other uses are denied, and the compiler can't tell which is more important. -- John Skaller Felix, successor to C++: http://felix.sf.net
Re: gomp slowness
On Fri, 2007-11-02 at 20:00 +0100, Olivier Galibert wrote: > On Sat, Nov 03, 2007 at 03:38:51AM +1100, skaller wrote: > > My argument is basically: there is no need for any such > > feature in a well written program. Each thread already has > > its own local stack. Global variables should not be used > > in the first place (except for signals etc where > > there is no choice). > > And for libraries where there is no choice. Happens rather often. Yes, but that is covered by 'well written program' qualification. If you have to use a badly designed library which used global variables, then I agree you're stuck. That is what I meant by saying TLS is a hack to support legacy code, and should not be required in new code. Clearly, if you mix new code with code that calls legacy libraries, the problem is not eliminated until you re-write those libraries, or rewrite your app to use a better library. I'd be rather worried if a compiler used up a register just to support legacy libraries that my code might never call. -- John Skaller Felix, successor to C++: http://felix.sf.net
Re: gomp slowness
> This is not true. If you use a register for any purpose like this, > it can't be used for anything else and that has a cost. This is a segment register. Please go and read about what segment registers are. They are not real registers and cannot be used for anything except memory accesses. They date back to the 16-bit processor days when you could have 24 bits of memory but you needed to switch the segment register, remember far/near memory?? Now some other ABIs on other processors just use a normal register (on PPC, it is r13) but then again those processors usually have a larger register set than x86 (or x86_64). Thanks, Andrew Pinski
Re: gomp slowness
On Fri, 2007-11-02 at 15:31 -0400, Robert Dewar wrote: > Olivier Galibert wrote: > There are lots of cases where global thread specific variables > are useful in practice, ask anyone who has programmed real world > large scale real time embedded programs. No. And I have done just that myself. There is a use for hackery, for example profiling, debugging, etc. Otherwise .. well the major real time code I did work on had a big effort in place to *eliminate* all singletons and static variables because they caused huge problems generalising the code. Millions of lines of badly designed C++. In fact, it is the other way around: for SMALL programs running on tiny processors, global storage has to be used sometimes because of the limitations of the processor. For example I wrote lots of assembler for the 6802 microcontroller, which doesn't allow you to *access* the subroutine stack. > One obvious example is > the stack limit for checking stack overflow on subprogram entry. Not required if the stack limit is stored in the stack itself. The fact that this is hard to arrange just shows you AGAIN there is a badly designed interface somewhere along the line. Felix pthreads record the stack base and stack pointer, which is used for a conservative scan of the stack by the garbage collector.. guess what? No static variables. There's no stack limit check, but that would be easy to add to the code. (but I think the way you'd do this in Linux code would be to use mmap() and an invalid block: AFAIK that's what Ocaml does, so the young heap allocator is a SINGLE register increment .. this is rather fast .. :) -- John Skaller Felix, successor to C++: http://felix.sf.net
Re: gomp slowness
skaller <[EMAIL PROTECTED]> writes: > On Fri, 2007-11-02 at 10:29 -0700, Ian Lance Taylor wrote: > > skaller <[EMAIL PROTECTED]> writes: > > > > > On Fri, 2007-11-02 at 07:39 -0700, Ian Lance Taylor wrote: > > > > skaller <[EMAIL PROTECTED]> writes: > > > > > > > In a C executable, TLS requires one extra machine register. > > > > > > You mean gcc? > > > > I don't understand the question. I mean in a C/C++ executable which > > uses TLS. By TLS I mean __thread, not pthread_get_specific. In the > > GNU/Linux world, TLS conventionally means specifically __thread. > > I see why you didn't understand the question .. you're so buried > in gcc world C=gcc .. whereas for me C=ISO Standard, any compiler. > TLS = __thread jargon noted, thanks (for me, TLS meant both > generic 'storage local to a thread=the stack' or 'posix', > not __thread which is a gcc extension). I am familiar with other compilers. TLS (__thread) is not specific to gcc. In fact, it was invented by Sun, and first implemented in their compiler. > > It only costs a register for code which accesses a TLS variable, of > > course. > > What do you mean 'code'? Do mean individual function? If so, > how is the register loaded? It depends on the processor. On the x86 and x86_64 a segment register is used. The segment register is set by the OS when context switching to the thread. (As mentioned downthread, segment registers are special in the hardware and are not available for general compiler use.) > > It's just not plausible to say that a C/C++ program should be written > > without static variables. > > Rubbish:) 30 years of programming, never used one in serious code > unless some brain dead API forced it. Those brain dead APIs are the ones the rest of us actually work with every day. If you can disregard them, that's fine. The rest of us will continue to operate in the real world. Nobody is forcing you to use TLS for anything at all, so you can continue to ignore it. Ian
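For readers who have not used the __thread form of TLS being discussed, here is a minimal sketch of what it looks like at the source level. It is an illustration only: worker and tls_demo are made-up names, and it assumes GCC (or a compatible compiler) with POSIX threads, typically built with gcc -pthread. How the per-thread copy is located (a segment register on x86 and x86_64, as described above) is handled by the compiler, the system library and the kernel, and does not appear in the source.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Each thread gets its own copy of this variable, initialised to 0. */
static __thread int counter = 0;

/* Hypothetical worker: increments its own copy n times, then reports it. */
static void *worker(void *arg)
{
    assert(arg != NULL);
    int n = *(int *)arg;
    for (int i = 0; i < n; i++)
        counter++;            /* no lock needed: the copy is private */
    *(int *)arg = counter;    /* hand this thread's final value back */
    return NULL;
}

/* Returns 1 if every thread saw only its own increments, 0 otherwise. */
int tls_demo(void)
{
    int a = 1000, b = 2000;
    pthread_t ta, tb;

    counter = 7;              /* the calling thread's private copy */
    if (pthread_create(&ta, NULL, worker, &a) != 0)
        return 0;
    if (pthread_create(&tb, NULL, worker, &b) != 0) {
        pthread_join(ta, NULL);
        return 0;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    /* Each worker started from 0, and the caller's 7 is untouched. */
    return a == 1000 && b == 2000 && counter == 7;
}
```

If counter were an ordinary static instead, the two workers would race on one shared variable and the caller's value would be overwritten.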
Re: gomp slowness
On Fri, 2007-11-02 at 13:47 -0600, Joel Dice wrote: > > So any (application) program needing TLS (other than the stack) > > is automatically badly designed. I've been writing code for > > three decades without using any global variables, ever since > > I learned about re-entrancy. > > While I agree that global and thread-local variables are to be avoided in > general, I wonder how you would treat the following case: > Function A needs to supply some data to function Z, but only via a call > stack containing functions B through Y. Must B through Y all take that > data as a parameter and pass it as an argument? Yes. Felix does precisely this. Note the ABI will often pass it in a register, and the variable is not used in functions B-Y .. so gcc+ABI will probably optimise away the explicit pointer passing anyhow, in effect leaving exactly the 'store it in an unused register' anyhow. Felix trusts the gcc to handle this reasonably well and it does seem to do so in most cases. > Would it not be more > efficient to store it in a register on a register-rich machine instead of > copying in on every call? No: 'As-if' rule. Copying is efficient because it can be optimised away. You write your C code with explicit passing and complain right here in this list if gcc doesn't optimise it efficiently :) > Even putting runtime efficiency aside, the code in the intermediate > functions is made arguably less readable by each bit of context it has to > forward along but never use itself. No, it's the other way around. When you use global variables to store state, you cannot easily reason about your code because the coupling is not represented in the function argument-parameter bindings explicitly. > OTOH, using thread-locals introduces > "action at a distance", which never helps readability. Yes. -- John Skaller Felix, successor to C++: http://felix.sf.net
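The two styles being compared in this exchange can be put side by side. The sketch below is purely illustrative: the context struct and every function name are invented for the example, the call chain is shortened to A, B, Y, Z, and the thread-local variant assumes the GCC __thread extension. Whether the forwarded pointer actually stays in a register is up to the compiler and ABI, as the reply above says.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context: created by A, read only by Z. */
typedef struct context {
    int scale;
} context;

/* Style 1: explicit passing. B and Y forward ctx without using it. */
static int z_explicit(const context *ctx, int x) { return x * ctx->scale; }
static int y_explicit(const context *ctx, int x) { return z_explicit(ctx, x + 1); }
static int b_explicit(const context *ctx, int x) { return y_explicit(ctx, x + 1); }

int a_explicit(int scale, int x)
{
    context ctx = { scale };
    return b_explicit(&ctx, x);
}

/* Style 2: thread-local. B and Y are unaware of the context, and the
   coupling between A and Z no longer shows in any signature. */
static __thread const context *current_ctx;

static int z_tls(int x)
{
    assert(current_ctx != NULL);
    return x * current_ctx->scale;
}
static int y_tls(int x) { return z_tls(x + 1); }
static int b_tls(int x) { return y_tls(x + 1); }

int a_tls(int scale, int x)
{
    context ctx = { scale };
    const context *saved = current_ctx;  /* allow nested calls */
    current_ctx = &ctx;
    int result = b_tls(x);
    current_ctx = saved;
    return result;
}
```

Both a_explicit(3, 5) and a_tls(3, 5) compute (5 + 2) * 3. The difference is only in where the dependency is visible: in the first style every signature shows it, in the second it is the "action at a distance" mentioned above.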
Re: gomp slowness
On Sat, 2007-11-03 at 12:27 +1100, skaller wrote: > On Fri, 2007-11-02 at 10:29 -0700, Ian Lance Taylor wrote: > Of course there is. It's called design by contract. > I do it all the time. I am appalled at code bases like > GTK and interfaces like OpenMP which get such really > basic things wrong. Argg .. correction! I meant OpenGL, not OpenMP! WOOPS!!! Associative memory has its problems :) -- John Skaller Felix, successor to C++: http://felix.sf.net
Re: gomp slowness
On Fri, 2007-11-02 at 18:45 -0700, Andrew Pinski wrote: > > This is not true. If you use a register for any purpose like this, > > it can't be used for anything else and that has a cost. > > This is a segment register. Please go and read about what segment > registers. I know how the x86 works quite well .. perhaps unfortunately I've written several major applications in x86 assembler (including a complete text editor). -- John Skaller Felix, successor to C++: http://felix.sf.net
Re: gomp slowness
skaller wrote:
> On Fri, 2007-11-02 at 18:45 -0700, Andrew Pinski wrote:
>>> This is not true. If you use a register for any purpose like this, it can't be used for anything else and that has a cost.
>> This is a segment register. Please go and read about what segment registers.
> I know how the x86 works quite well .. perhaps unfortunately I've written several major applications in x86 assembler (including a complete text editor).

in which case you should understand that the use of FS is free
Re: gomp slowness
skaller wrote:
> This is not true. If you use a register for any purpose like this, it can't be used for anything else and that has a cost. On x86_64 which I use, every register is valuable. Don't you dare take one away, it would have a serious performance impact AND it would stop ME using that register for something which my application might consider much more important, for example as a pointer to the minor heap in a copying collector (Ocaml does this, it is the reason it has very high performance.. I would do this in Felix too if I could figure out how to reload the variable in a callback invoked by foreign code). This applies to %fs on i386 too: if the compiler uses that register, other uses are denied, and the compiler can't tell which is more important.

You really can't be serious in your comment about fs, if you understand the architecture ...
Re: gomp slowness
On Fri, 2007-11-02 at 23:54 -0400, Robert Dewar wrote: > skaller wrote: > > On Fri, 2007-11-02 at 18:45 -0700, Andrew Pinski wrote: > >>> This is not true. If you use a register for any purpose like this, > >>> it can't be used for anything else and that has a cost. > >> This is a segment register. Please go and read about what segment > >> registers. > > > > I know how the x86 works quite well .. perhaps unfortunately I've > > written several major applications in x86 assembler (including > > a complete text editor). > > in which case you should understand that the use of FS is free
No. Free for whom? Sure, if the ABI doesn't specify it and gcc doesn't otherwise use it then gcc using it doesn't impact an ISO C program. But it *could* impact a program that had already chosen to conditionally use that register if it happened to be possible with the C compiler, as it is with gcc, or if the compiler didn't use it (promise!) and one was prepared to add some assembler. I'm sure you know this kind of thing is quite common in language translators and interpreters, especially with respect to memory allocation (possibly garbage collection) .. in fact the same kind of use as suggested for __thread. Neko, for example, uses a register. AFAIK MLton does the same kind of thing. If gcc team thinks ANY register is free to steal they'd be wrong -- that doesn't mean it shouldn't be used, just that it definitely is NOT free. I can tell you I definitely considered using FS for the Felix thread frame pointer to save passing that pointer between every function.. -- John Skaller Felix, successor to C++: http://felix.sf.net
Re: gomp slowness
skaller <[EMAIL PROTECTED]> writes: > Neko, for example, uses a register. AFAIK MLton does the > same kind of thing. If gcc team thinks ANY register is free > to steal they'd be wrong -- that doesn't mean it shouldn't > be used, just that it definitely is NOT free. To be clear, it is not the gcc team which is stealing the register. As I said earlier, TLS (i.e., __thread) was defined by Sun. They defined the implementation for i386 and SPARC. Other organizations have carried it forward to other processors. Here is the Sun documentation: http://docs.sun.com/app/docs/doc/817-1984/6mhm7pl2a TLS is implemented via a combination of the compiler, the system library, and the kernel. As I said before, the register is only stolen for code which actually uses TLS. Ian
Re: gomp slowness
On Fri, 2007-11-02 at 23:56 -0400, Robert Dewar wrote: > skaller wrote: > You really can't be serious in your comment about fs, if you > understand the architecture ... You're just not thinking the same way I am. A CPU has state, the compiler and application program manage that state. If the compiler can use a register, the application can too. They're therefore competing for the use of the register. If gcc wants to use it for TLS but I want to use it for a different purpose, there's a conflict. If the OS uses the register and denies it to the programmer, or specifies a particular use, then BOTH the compiler and application have to respect that. Bottom line: gcc isn't the only code generator. I can write assembler by hand. I can also access machine registers directly with gcc (nice feature!) GHC Haskell takes the assembler output gcc produces and runs a Perl script over it to fix it up. It's called 'registering' the code. GHC uses continuation passing and gcc can't represent the model very well .. that's GHC's solution: to trick gcc. you can bet they're interested in what optimisations gcc does because they have to find the 'fixup' points. Mercury, Felix and Mlton use 'assembler labels' to work around gcc's own inability to compile large functions, coupled with C's lack of decent control structures (which can be implemented in flat code that gcc can't compile). In Mlton, it's known gcc can break it by duplicating the code (and thus the assembler labels). This has never happened with Felix, but in theory it might. All these applications are using gcc with C plus tricks to establish their own environment. So using a register isn't free. -- John Skaller Felix, successor to C++: http://felix.sf.net
Re: gomp slowness
On Fri, 2007-11-02 at 22:35 -0700, Ian Lance Taylor wrote: > skaller <[EMAIL PROTECTED]> writes: > > > Neko, for example, uses a register. AFAIK MLton does the > > same kind of thing. If gcc team thinks ANY register is free > > to steal they'd be wrong -- that doesn't mean it shouldn't > > be used, just that it definitely is NOT free. > > To be clear, it is not the gcc team which is stealing the register. > As I said earlier, TLS (i.e., __thread) was defined by Sun. They > defined the implementation for i386 and SPARC. Other organizations > have carried it forward to other processors. Here is the Sun > documentation: > http://docs.sun.com/app/docs/doc/817-1984/6mhm7pl2a > TLS is implemented via a combination of the compiler, the system > library, and the kernel. Thanks! So the use is defined by a protocol which gcc is following. > As I said before, the register is only stolen for code which actually > uses TLS. So scanning that document, for x86_64, fs is used in startup code, presumably if, and only if, there is a linker section containing __thread variables? -- John Skaller Felix, successor to C++: http://felix.sf.net
Re: gomp slowness
skaller <[EMAIL PROTECTED]> writes: > > As I said before, the register is only stolen for code which actually > > uses TLS. > > So scanning that document, for x86_64, fs is used in startup > code, presumably if, and only if, there is a linker section > containing __thread variables? Yes. Ian