Re: LTO and version scripts
On Mon, Jun 30, 2014 at 2:35 PM, Ulrich Drepper wrote:
> Using LTO to create a DSO works fine (i.e., it performs the expected
> optimizations) for symbols which are marked with visibility
> attributes. It does not work, though, when the symbol is not
> restricted in its visibility in the source file but instead is
> prevented from being exported from the DSO by a version script (ld
> --version-script=FILE).
>
> Is this known? I only found general problems related to linker
> scripts although version script parameters do not cause any other
> failures.

Yes, I've run into this as well. IMHO the issue is that the linker(s)
do not process the linker script "properly" when handing off the
resolution data to the linker plugin. So it's a linker bug AFAIU.

Richard.
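For illustration, the two ways of keeping a symbol out of a DSO's export list that are being contrasted here can be sketched as follows. This is only a sketch: the function names, the file names, and the version script contents are hypothetical and do not come from the thread.

```cpp
// Hypothetical names throughout; extern "C" keeps the symbol names
// unmangled so that a version script could refer to them directly.
extern "C" {

// Variant A: visibility is restricted in the source file. The compiler
// (and LTO) can see that internal_helper never leaves the DSO.
__attribute__((visibility("hidden")))
int internal_helper(int x) { return x * 2; }

// The entry point that is meant to be exported from the DSO.
__attribute__((visibility("default")))
int api_entry(int x) { return internal_helper(x) + 1; }

// Variant B: no attribute in the source. The symbol is kept out of the
// dynamic symbol table only at link time, for example with
//
//   g++ -O2 -flto -fPIC -shared -Wl,--version-script=libfoo.map foo.cc
//
// where the (hypothetical) libfoo.map contains
//
//   { global: api_entry; local: *; };
//
// This is the case for which the thread reports that the LTO
// optimizations are not performed.
int other_helper(int x) { return x + 40; }

}  // extern "C"
```

Both variants produce the same export list; the difference discussed above is only in how much of that information reaches the LTO plugin.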
Re: reverse bitfield patch
On Wed, Jul 2, 2014 at 7:10 AM, DJ Delorie wrote:
>
> Revisiting an old thread, as I still want to get this feature in...
>
> https://gcc.gnu.org/ml/gcc/2012-10/msg00099.html
>
>> >> Why do you need to change varasm.c at all? The hunks seem to be
>> >> completely separate of the attribute.
>> >
>> > Because static constructors have fields in the original order, not the
>> > reversed order. Otherwise code like this is miscompiled:
>>
>> Err - the struct also has fields in the original order - only the bit
>> positions
>> of the fields are different because of the layouting option.
>
> The order of the field decls in the type (stor-layout.c) is not
> changed, only the bit position information. The order here *can't* be
> changed, because the C language assumes that parameters, initializers,
> etc are presented in the same order as the original declaration,
> regardless of the target-specific layout.
>
> When the program includes an initializer:
>
>> > struct foo a = { 1, 2, 3 };
>
> The order of 1, 2, and 3 need to correspond to the order of the
> bitfields in 'a', so we can change neither the order of the bitfields
> in 'a' nor the order of constructor fields.
>
> However, when we stream the initializer out to the .S file, we need to
> pack the bitfields in the right sequence to generate the right bit
> patterns in the final output image. The code in varasm.c exists to
> make sure that the initializers for bitfields are written/packed in
> the correct order, to correspond to the bitfield positions. I.e. the
> 1,2,3 initializer needs to be written to the .S file as either 0x0123
> or 0x3210 depending on the bit positions.

Ok, but as we are dealing exclusively with bitfields there is
already output_constructor_bitfield which uses an intermediate
state to "pack" bits into units that are then emitted. It shouldn't
be hard to change that to make it pack into the appropriate bits
instead.

> In neither case do we change the order of the fields in the type
> itself, i.e. the array/chain order.
>
>> And you expect no other code looks at fields of a structure and its
>> initializer? It's bad to keep this not in-sync. Thus I don't think it's
>> viable to re-order fields just because bit allocation is reversed.
>
> The fields are in sync. The varasm.c change sorts the elements as
> they're being output into the byte stream in the .S, it doesn't sort
> the field definitions themselves.
>
>> > + /* If the bitfield-order attribute has been used on this
>> > +structure, the fields might not be in bit-order. In that
>> > +case, we need a separate representative for each
>> > +field. */
>> > The typical use-case for this feature is memory-mapped hardware, where
>> > pessimum access is preferred anyway.
>>
>> I doubt that, looking at constraints for strict volatile bitfields.
>
> The code that handles representatives requires (via an assert, IIRC)
> that the bit offsets within a representative be in ascending order.

Well, because that's supposed to be an invariant in all record
field-decls ... (which you break - for example fold_ctor_reference
and friends might be unhappy about this as well).

Note that code expects that representatives are byte-aligned so better
would be to not assign representatives or make the code work with
the swapped layout (I see no reason why that shouldn't work - maybe
it works if done before swapping the layout)?

> I.e. gcc ICEs if I don't bypass this. In the case of volatile
> bitfields, which would be the typical use case for a reversed
> bitfield, the access mode is going to match the type size regardless,
> so performance is not changed by this patch.

Representatives are not about performance but about correctness.

I'm still not happy about the idea in general (why is this a bitfield
exclusive thing? If a piece of HW is big/little-endian then even
regular fields would have that property).

Your patch comes with no testcase - testcases should cover all
attribute variants, multiple bitfield (group) sizes and mixed
initializations / reads / writes and ideally be execute testcases.

Richard.
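The packing step under discussion (field order fixed, only bit positions differing) can be illustrated with a small standalone sketch. It is not the varasm.c code and does not use output_constructor_bitfield; `Field` and `pack_unit` are hypothetical names, and the sketch assumes a single 16-bit unit holding four 4-bit fields with values 0, 1, 2, 3, which is one reading that yields the 0x0123 and 0x3210 patterns mentioned above.

```cpp
#include <cstdint>
#include <vector>

// One bitfield of an initializer: its width in bits and its value.
// (Hypothetical type, for illustration only.)
struct Field {
  unsigned width;
  std::uint32_t value;
};

// Pack fields, in declaration order, into one 16-bit unit.
// The order of the fields is never changed; only the bit position
// given to each field depends on the layout:
//   msb_first == true : the first field gets the most significant bits
//   msb_first == false: the first field gets the least significant bits
// Assumes the widths sum to at most 16 and each width is below 32.
std::uint16_t pack_unit(const std::vector<Field>& fields, bool msb_first) {
  std::uint32_t unit = 0;
  unsigned used = 0;  // number of bits allocated so far
  for (const Field& f : fields) {
    std::uint32_t mask = (1u << f.width) - 1u;
    unsigned shift = msb_first ? 16u - used - f.width : used;
    unit |= (f.value & mask) << shift;
    used += f.width;
  }
  return static_cast<std::uint16_t>(unit);
}
```

With the four fields `{4,0}, {4,1}, {4,2}, {4,3}`, `pack_unit(fields, true)` gives 0x0123 and `pack_unit(fields, false)` gives 0x3210. Initializers wider than one unit, where the order of the emitted units also matters, are outside the scope of this sketch.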
Enable EBX for x86 in 32bits PIC code
Hi All,

Currently GCC permanently reserves EBX as the GOT register.

(config/i386/i386.c:4289)

  /* The PIC register, if it exists, is fixed.  */
  j = PIC_OFFSET_TABLE_REGNUM;
  if (j != INVALID_REGNUM)
    fixed_regs[j] = call_used_regs[j] = 1;

This leads to significant performance losses in PIC mode:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54232
According to my measurements ~3% generally and up to 20% in inner loops.

CLANG uses all registers for allocation and therefore now has
competitive advantage in 32bits PIC mode comparing to GCC.
This mode is used in all Android applications and therefore is
important for many compiler customers.

There are at least 2 possible solutions.

1.

While call expand emit SET_GOT -> EBX and MOV EBX -> some local register: LGOT
Prior to each call emit MOV LGOT -> EBX
Use LGOT as new GOT register for globals.

2.

Set EBX as each CALL parameter.
Emit MOV EBX->LGOT in each call.
Use LGOT as new GOT register for globals.

Do you have any comments, ideas?

Thanks,
Evgeny
[GSoC] generation from isl_ast_node_user
Hi Tobias,

I think that, according to the std::map feedback, we could use std::map
now and replace it with hash_map later, if its performance is better.
However, I propose to temporarily postpone this and work on gimple code
generation from isl_ast_node_user, because we already have generation
of loops with empty bodies and generation from isl_ast_node_user can be
a problem. What do you think about this?

Could you please suggest an algorithm for the computation of
substitutions? (CLooG uses its own algorithm for this and stores
substitutions in clast_user_stmt. There is an algorithm, which is used
in polly, but, honestly, I don't understand it.)

Could you please advise me on the best way to bind polly basic blocks
to an isl_ast_node_user? I'm using the following code now, but I'm not
sure if it is the right way:

  bb_schedule = isl_map_intersect_domain (bb_schedule,
                                          isl_set_copy (pbb->domain));
  isl_id *dim_in_id = isl_map_get_tuple_id (bb_schedule, isl_dim_in);
  isl_id *new_dim_in_id = isl_id_alloc (isl_id_get_ctx (dim_in_id),
                                        isl_id_get_name (dim_in_id), pbb);
  bb_schedule = isl_map_set_tuple_id (bb_schedule, isl_dim_in,
                                      new_dim_in_id);

(I'm allocating an isl_id, which contains a pointer to a polly basic
block, while we're generating an isl_schedule.)

  gcc_assert (isl_ast_node_get_type (node) == isl_ast_node_user);
  isl_ast_expr *user_expr = isl_ast_node_user_get_expr (node);
  isl_ast_expr *name_expr = isl_ast_expr_get_op_arg (user_expr, 0);
  gcc_assert (isl_ast_expr_get_type (name_expr) == isl_ast_expr_id);
  isl_id *name_id = isl_ast_expr_get_id (name_expr);
  poly_bb_p pbb = (poly_bb_p) isl_id_get_user (name_id);

(I'm getting this information while we're handling isl_ast_node_user.)

--
Cheers, Roman Gareev
Re: Enable EBX for x86 in 32bits PIC code
On Mon, Jul 7, 2014 at 12:00 PM, Evgeny Stupachenko wrote:
> Hi All,
>
> Currently GCC permanently reserves EBX as the GOT register.
>
> (config/i386/i386.c:4289)
>
> /* The PIC register, if it exists, is fixed. */
> j = PIC_OFFSET_TABLE_REGNUM;
> if (j != INVALID_REGNUM)
>   fixed_regs[j] = call_used_regs[j] = 1;
>
> This leads to significant performance losses in PIC mode:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54232
> According to my measurements ~3% generally and up to 20% in inner loops.
>
> CLANG uses all registers for allocation and therefore now has
> competitive advantage in 32bits PIC mode comparing to GCC.
> This mode is used in all Android applications and therefore is
> important for many compiler customers.
>
> There are at least 2 possible solutions.
>
> 1.
>
> While call expand emit SET_GOT -> EBX and MOV EBX -> some local register: LGOT
> Prior to each call emit MOV LGOT -> EBX
> Use LGOT as new GOT register for globals.
>
> 2.
>
> Set EBX as each CALL parameter.
> Emit MOV EBX->LGOT in each call.
> Use LGOT as new GOT register for globals.
>
> Do you have any comments, ideas?

Use some LCM algorithm for placing %ebx loads, similar to
how we treat vzeroupper?

Compute some simple IPA info on whether %ebx is provided/needed
by callers/callees?

Richard.

> Thanks,
> Evgeny
Re: [GSoC] Question about std::map
On 05/07/2014 00:03, Trevor Saunders wrote:
> On Fri, Jul 04, 2014 at 09:57:11AM +0200, Tobias Grosser wrote:
>> On 04/07/2014 04:16, Trevor Saunders wrote:
>>> On Thu, Jul 03, 2014 at 07:52:59PM +0200, Tobias Grosser wrote:
>>>> On 03/07/2014 19:23, Roman Gareev wrote:
>>>>> Dear gcc contributors,
>>>>>
>>>>> could you please answer a few questions about std::map? Does gcc
>>>>> have a policy that forbids the use of map in the source code of
>>>>> gcc? Can its use create a new installation dependency, which
>>>>> requires libstdc++? I would be very grateful for your comments.
>>>>
>>>> https://gcc.gnu.org/codingconventions.html#Standard_Library
>>>>
>>>> This suggests that using std::map is allowed. Running
>>>> "grep 'std::' *" on gcc, we only find a couple of std::make_pair,
>>>> std::min and std::max calls, but I don't see why we should not use
>>>> std::map. I would say go for it if there are no vetos. It seems to
>>>> be the right tool for what you are aiming for and certainly makes
>>>> the code a lot more readable.
>>>
>>> I'm certainly not opposed to using the stl when it makes sense, and
>>> reinventing our own stl has a fairly high cost. However I think the
>>> question of if you should use std::map is a complicated one. First
>>> remember std::map is a tree, not a hash table, and consider which
>>> performance characteristic you want. Also remember the stl in some
>>> cases trades off possible optimizations to be more generic, if you
>>> implement your own version of something you can say ban arrays
>>> larger than 2^32 and thereby save a bit of space. When you use the
>>> stl you also need to make sure you don't call things that can throw
>>> exceptions since gcc is built with -fno-exceptions and generally
>>> isn't exception safe.
>>>
>>> I think there are some stl things you should basically always
>>> prefer e.g. std::swap, there are some things you should think about
>>> before using, and some things that are best avoided. Personally I
>>> think you should favor hash tables to std::map most of the time
>>> because of their generally better performance.
>>>
>>> btw if you have specific gripes about the stlish bits gcc already
>>> has I'd like to hear them.
>>
>> Hi Trev,
>>
>> thanks for the feedback. You seem to have a good idea for which use
>> cases a std::map is possible. Maybe I can invite you to a patch
>> review currently happening on gcc@gcc.gnu.org (Roman, we should do
>> this kind of stuff on gcc-patches@ I suppose) under the title
>> "[GSoC] generation of GCC expression trees from isl ast expressions".
>>
>> Feel free to ignore most of the code. The question I am unsure about
>> is the use of ast_isl_index_hasher() and related function, which
>> takes 105 lines of code and almost the same functionality could be
>> implemented by a one-line use of std::map<>. The maps generated are
>
> have you tried the hash_map class I introduced a couple days ago?
> (maybe we should teach its generic machinery about strings, but it
> should already be better than hash_table for your purpose).

No. We did not yet. Thanks for pointing it out.

> One thing that confuses me is that you seem to be mapping from ast
> node to ast node, but you do the actual mapping with strings, why is
> that needed?

Actually it is a mapping from isl_id to tree, but there seems to be
some additional information mixed in. All this is hidden in the current
hash map code and makes this rather confusing. I think the best is to
use the std::map to make the code work with the simplest interface
possible. After everything is in place we can then test if the move to
hash_map gives any performance benefits.

>> commonly small and I doubt this is in the performance critical path.
>> I feel very worried about adding so much bloat? What would you
>> suggest.
>
> yeah, that code kind of made my eyes glaze over
>
> on the other hand you're talking about going from average constant
> time to average log time, which is the sort of thing I'm pretty
> hesitant to say is a great idea.

The number of elements in these maps is most likely between 3-10.

> It's too bad unordered_map isn't an option.

Right.

Thanks again for your help!

Cheers,
Tobias
Re: [GSoC] Question about std::map
On 7 July 2014 12:08, Tobias Grosser wrote:
>
> The number of elements in these maps is most likely between 3-10.

Then std::map is the wrong solution.

The overhead of dereferencing all the pointers while walking through a
std::map will be higher than the savings you get from logarithmic
lookup. For ten elements a linear search of a std::vector will
probably be quicker than lookup in a std::map. A binary search of a
sorted vector (which needs no pointer-chasing because it uses
random-access iterators) will definitely be faster.
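The two vector-based lookups mentioned above can be sketched as follows. This is a minimal illustration, not code from the patch under review: `Entry`, `find_linear` and `find_sorted` are hypothetical names, the string key and int value merely stand in for whatever the real map would hold, and the sketch uses C++11 for brevity.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical key/value pair stored contiguously in a vector.
using Entry = std::pair<std::string, int>;

// Linear scan: no ordering requirement on `v`.
// Returns a pointer to the value, or nullptr if the key is absent.
const int* find_linear(const std::vector<Entry>& v, const std::string& key) {
  for (const Entry& e : v)
    if (e.first == key)
      return &e.second;
  return nullptr;
}

// Binary search: `v` must already be sorted by key.
// std::lower_bound only steps through contiguous storage, so there is
// no node-to-node pointer chasing as in a tree-based std::map.
const int* find_sorted(const std::vector<Entry>& v, const std::string& key) {
  auto it = std::lower_bound(
      v.begin(), v.end(), key,
      [](const Entry& e, const std::string& k) { return e.first < k; });
  return (it != v.end() && it->first == key) ? &it->second : nullptr;
}
```

Neither function throws on a missing key, which fits the -fno-exceptions constraint raised earlier in the thread. Which variant is actually faster for 3-10 elements would have to be measured.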
Re: [GSoC] generation from isl_ast_node_user
On 07/07/2014 12:33, Roman Gareev wrote:
> Hi Tobias,
>
> I think that, according to the std::map feedback, we could use
> std::map now and replace it with hash_map later, if its performance
> is better. However, I propose to temporarily postpone this and work
> on gimple code generation from isl_ast_node_user, because we already
> have generation of loops with empty bodies and generation from
> isl_ast_node_user can be a problem. What do you think about this?
>
> Could you please suggest an algorithm for the computation of
> substitutions? (CLooG uses its own algorithm for this and stores
> substitutions in clast_user_stmt. There is an algorithm, which is
> used in polly, but, honestly, I don't understand it.)

You may want to take a look at polly commit r212186, where I reworked
and documented how this works.

> Could you please advise me on the best way to bind polly basic blocks
> to an isl_ast_node_user? I'm using the following code now, but I'm
> not sure if it is the right way:
>
>   bb_schedule = isl_map_intersect_domain (bb_schedule,
>                                           isl_set_copy (pbb->domain));
>   isl_id *dim_in_id = isl_map_get_tuple_id (bb_schedule, isl_dim_in);
>   isl_id *new_dim_in_id = isl_id_alloc (isl_id_get_ctx (dim_in_id),
>                                         isl_id_get_name (dim_in_id), pbb);
>   bb_schedule = isl_map_set_tuple_id (bb_schedule, isl_dim_in,
>                                       new_dim_in_id);
>
> (I'm allocating an isl_id, which contains a pointer to a polly basic
> block, while we're generating an isl_schedule.)

Is this necessary? The id should already be set in
(graphite-sese-to-poly.c):

  static isl_id *
  isl_id_for_pbb (scop_p s, poly_bb_p pbb)
  {
    char name[50];
    snprintf (name, sizeof (name), "S_%d", pbb_index (pbb));
    return isl_id_alloc (s->ctx, name, pbb);
  }

>   gcc_assert (isl_ast_node_get_type (node) == isl_ast_node_user);
>   isl_ast_expr *user_expr = isl_ast_node_user_get_expr (node);
>   isl_ast_expr *name_expr = isl_ast_expr_get_op_arg (user_expr, 0);
>   gcc_assert (isl_ast_expr_get_type (name_expr) == isl_ast_expr_id);
>   isl_id *name_id = isl_ast_expr_get_id (name_expr);
>   poly_bb_p pbb = (poly_bb_p) isl_id_get_user (name_id);
>
> (I'm getting this information while we're handling isl_ast_node_user.)

Perfect! (or at least that's the same approach I have chosen for Polly)

Do you have any problems with this approach? From my perspective it
looks like a good solution.

Tobias
Re: [GSoC] generation from isl_ast_node_user
[Forgot to answer two questions]

On 07/07/2014 12:33, Roman Gareev wrote:
> Hi Tobias,
>
> I think that, according to the std::map feedback, we could use
> std::map now and replace it with hash_map later, if its performance
> is better.

Right.

> However, I propose to temporarily postpone this and work on gimple
> code generation from isl_ast_node_user, because we already have
> generation of loops with empty bodies and generation from
> isl_ast_node_user can be a problem. What do you think about this?

As I am sometimes slow in reviewing, maybe you can do this if you find
free time. I would prefer to move soon to std::map, as this is the last
open piece in your loop generation patch and we can finish the review
after this is done.

Tobias
Re: [GSoC] Question about std::map
On 07/07/2014 13:14, Jonathan Wakely wrote:
> On 7 July 2014 12:08, Tobias Grosser wrote:
>>
>> The number of elements in these maps is most likely between 3-10.
>
> Then std::map is the wrong solution.
>
> The overhead of dereferencing all the pointers while walking through a
> std::map will be higher than the savings you get from logarithmic
> lookup. For ten elements a linear search of a std::vector will
> probably be quicker than lookup in a std::map. A binary search of a
> sorted vector (which needs no pointer-chasing because it uses
> random-access iterators) will definitely be faster.

Very good point. On the other side, we still want to hide this behind a
map-like interface. So starting with a std::map may be a good thing. To
tune this later we can introduce a specialized vector_map. Such a
vector_map may not even want to use a std::vector, but a vector class
that stores its data in stack memory.

Cheers,
Tobias
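One possible shape for such a vector_map is sketched below. No such class is shown in the thread, so everything here is an assumption: the member names `put`, `get` and `size` are invented for illustration, and std::vector is used as the backing store only to keep the sketch short (a stack-allocated vector class could be substituted without changing the interface).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical vector_map: a map-like interface over contiguous
// storage, intended for a handful of elements where a linear scan is
// cheap.
template <typename K, typename V>
class vector_map {
 public:
  // Insert a new key, or overwrite the value of an existing one.
  void put(const K& key, const V& value) {
    if (V* existing = get(key)) {
      *existing = value;
      return;
    }
    entries_.push_back(std::make_pair(key, value));
  }

  // Return a pointer to the mapped value, or a null pointer if the key
  // is absent. A missing key is reported without throwing.
  // The pointer is invalidated by a later put() of a new key.
  V* get(const K& key) {
    for (std::size_t i = 0; i < entries_.size(); ++i)
      if (entries_[i].first == key)
        return &entries_[i].second;
    return 0;
  }

  std::size_t size() const { return entries_.size(); }

 private:
  std::vector<std::pair<K, V> > entries_;
};
```

Because callers only see `put` and `get`, the lookup strategy (linear scan, sorted vector, or a switch to hash_map) can be changed later without touching the calling code, which is the point of hiding the container behind a map-like interface.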
Re: Enable EBX for x86 in 32bits PIC code
The key problem here is that EBX is not used in register allocation.
If we relax the restriction on EBX the performance is back, but there
are several fails.
Some of them could be fixed.
However I don't like that way as EBX register is uninitialized at
register allocation.
Initialization (SET_GOT) appeared only at: "217r.pro_and_epilogue" phase.

The key point in 2 suggestions is to set EBX register only prior to a
call (as it is required by ABI). In all other cases it could be any
other register.

Evgeny

On Mon, Jul 7, 2014 at 2:42 PM, Richard Biener wrote:
> On Mon, Jul 7, 2014 at 12:00 PM, Evgeny Stupachenko
> wrote:
>> Hi All,
>>
>> Currently GCC permanently reserves EBX as the GOT register.
>>
>> (config/i386/i386.c:4289)
>>
>> /* The PIC register, if it exists, is fixed. */
>> j = PIC_OFFSET_TABLE_REGNUM;
>> if (j != INVALID_REGNUM)
>>   fixed_regs[j] = call_used_regs[j] = 1;
>>
>> This leads to significant performance losses in PIC mode:
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54232
>> According to my measurements ~3% generally and up to 20% in inner loops.
>>
>> CLANG uses all registers for allocation and therefore now has
>> competitive advantage in 32bits PIC mode comparing to GCC.
>> This mode is used in all Android applications and therefore is
>> important for many compiler customers.
>>
>> There are at least 2 possible solutions.
>>
>> 1.
>>
>> While call expand emit SET_GOT -> EBX and MOV EBX -> some local register:
>> LGOT
>> Prior to each call emit MOV LGOT -> EBX
>> Use LGOT as new GOT register for globals.
>>
>> 2.
>>
>> Set EBX as each CALL parameter.
>> Emit MOV EBX->LGOT in each call.
>> Use LGOT as new GOT register for globals.
>>
>> Do you have any comments, ideas?
>
> Use some LCM algorithm for placing %ebx loads, similar to
> how we treat vzeroupper?
>
> Compute some simple IPA info on whether %ebx is provided/needed
> by callers/callees?
>
> Richard.
>
>> Thanks,
>> Evgeny
Re: Enable EBX for x86 in 32bits PIC code
On Mon, Jul 07, 2014 at 03:35:06PM +0400, Evgeny Stupachenko wrote:
> The key problem here is that EBX is not used in register allocation.
> If we relax the restriction on EBX the performance is back, but there
> are several fails.
> Some of them could be fixed.
> However I don't like that way as EBX register is uninitialized at
> register allocation.

That is nothing wrong. The magic registers are to be assumed live from the
beginning until the prologue is emitted.

> Initialization (SET_GOT) appeared only at: "217r.pro_and_epilogue" phase.
>
> The key point in 2 suggestions is to set EBX register only prior to a
> call (as it is required by ABI). In all other cases it could be any
> other register.

You could use special call insn patterns for calls that need to have ebx
set, where there would be a
  (use (match_operand:SI NN "register_operand" "b"))
and pass in the lgot pseudo and leave the register allocator to do its job.
You'd need to remember in which hard register (or memory) the register
allocator wants lgot to be at the start of the first basic block (so that
when prologue is expanded you know where to store it).

Jakub
Re: Enable EBX for x86 in 32bits PIC code
On 07/07/14 04:42, Richard Biener wrote:
>> 1.
>>
>> While call expand emit SET_GOT -> EBX and MOV EBX -> some local register: LGOT
>> Prior to each call emit MOV LGOT -> EBX
>> Use LGOT as new GOT register for globals.
>>
>> 2.
>>
>> Set EBX as each CALL parameter.
>> Emit MOV EBX->LGOT in each call.
>> Use LGOT as new GOT register for globals.
>>
>> Do you have any comments, ideas?

IIRC we did this for a while on the PA. Basically the call expanders
would set the hard PIC register from a pseudo holding the right value.
We let the register allocator (of course) do the job of allocating the
pseudo. In rare circumstances the allocator could allocate the pseudo
to the hard PIC register; that was very rare because the hard PIC
register was call-clobbered.

> Use some LCM algorithm for placing %ebx loads, similar to
> how we treat vzeroupper?

Shouldn't be too hard to do this. I suspect most of the benefit is from
being able to use %ebx when it's not being used for the PIC register.
But it probably wouldn't hurt to optimize placements.

> Compute some simple IPA info on whether %ebx is provided/needed
> by callers/callees?

Yea. Knowing if the caller/callee have the same value may also be
helpful.

jeff
Re: Enable EBX for x86 in 32bits PIC code
On Mon, Jul 7, 2014 at 1:47 PM, Jakub Jelinek wrote:
> On Mon, Jul 07, 2014 at 03:35:06PM +0400, Evgeny Stupachenko wrote:
>> The key problem here is that EBX is not used in register allocation.
>> If we relax the restriction on EBX the performance is back, but there
>> are several fails.
>> Some of them could be fixed.
>> However I don't like that way as EBX register is uninitialized at
>> register allocation.
>
> That is nothing wrong. The magic registers are to be assumed live from the
> beginning until the prologue is emitted.
>
>> Initialization (SET_GOT) appeared only at: "217r.pro_and_epilogue" phase.
>>
>> The key point in 2 suggestions is to set EBX register only prior to a
>> call (as it is required by ABI). In all other cases it could be any
>> other register.
>
> You could use special call insn patterns for calls that need to have ebx
> set, where there would be a
>   (use (match_operand:SI NN "register_operand" "b"))
> and pass in the lgot pseudo and leave the register allocator to do its job.
> You'd need to remember in which hard register (or memory) the register
> allocator wants lgot to be at the start of the first basic block (so that
> when prologue is expanded you know where to store it).

You can probably use get_hard_reg_initial_val for this.

Uros.
Re: Enable EBX for x86 in 32bits PIC code
2014-07-07 15:47 GMT+04:00 Jakub Jelinek :
> On Mon, Jul 07, 2014 at 03:35:06PM +0400, Evgeny Stupachenko wrote:
>> The key problem here is that EBX is not used in register allocation.
>> If we relax the restriction on EBX the performance is back, but there
>> are several fails.
>> Some of them could be fixed.
>> However I don't like that way as EBX register is uninitialized at
>> register allocation.
>
> That is nothing wrong. The magic registers are to be assumed live from the
> beginning until the prologue is emitted.

EBX does not need to be so magic. It is used to pass the GOT pointer
according to the ABI, so why not make it a part of the ABI? We may use
a target hook to identify whether a function has an implicit input
parameter with the GOT. Then we handle this parameter in a regular way
and spill it to a virtual register, with the only difference that the
resulting RTL is written into PIC_OFFSET_TABLE_REGNUM. In a similar way
we may have an implicit argument for calls and fill a hard reg
according to the ABI from PIC_OFFSET_TABLE_REGNUM. EBX would become a
regular register then.

If overall it looks reasonable then I may try to make an experimental
patch and check how it affects performance.

Ilya

>
>> Initialization (SET_GOT) appeared only at: "217r.pro_and_epilogue" phase.
>>
>> The key point in 2 suggestions is to set EBX register only prior to a
>> call (as it is required by ABI). In all other cases it could be any
>> other register.
>
> You could use special call insn patterns for calls that need to have ebx
> set, where there would be a
>   (use (match_operand:SI NN "register_operand" "b"))
> and pass in the lgot pseudo and leave the register allocator to do its job.
> You'd need to remember in which hard register (or memory) the register
> allocator wants lgot to be at the start of the first basic block (so that
> when prologue is expanded you know where to store it).
>
> Jakub
Re: reverse bitfield patch
> Ok, but as we are dealing exclusively with bitfields there is
> already output_constructor_bitfield which uses an intermediate
> state to "pack" bits into units that are then emitted. It shouldn't
> be hard to change that to make it pack into the appropriate bits
> instead.

That assumes that the output unit is only emitted once per string of
bitfields. If the total amount of data to output is larger than the
unit size, then the units themselves need to be output in the other
order also.

> Note that code expects that representatives are byte-aligned so better
> would be to not assign representatives or make the code work with
> the swapped layout (I see no reason why that shouldn't work - maybe
> it works if done before swapping the layout)?

I'm OK with not assigning them, but I couldn't figure out from the code
what they were for.

> I'm still not happy about the idea in general (why is this a bitfield
> exclusive thing? If a piece of HW is big/little-endian then even
> regular fields would have that property).

A bi-endian MCU with memory-mapped peripherals needs this to properly
and portably describe the fields within the peripheral's registers.
Without this patch, there's no way (short of two independent
definitions) of assigning a name to, for example, the LSB of such a
device's registers.

> Your patch comes with no testcase - testcases should cover all
> attribute variants, multiple bitfield (group) sizes and mixed
> initializations / reads / writes and ideally be execute testcases.

I wrote testcases, perhaps I just forgot to attach them.
[GSoC] Status - 20140707
Hi Community,

All GCC GSoC students have successfully passed mid-term evaluations,
and are continuing to work on their projects. Congratulations to all
the students!

Furthermore, Linaro has generously provided sponsorship to pay for 1
GCC GSoC student to travel to GNU Tools Cauldron this year. Based on
the results of mid-term evaluations and mentor comments, Prathamesh
Kulkarni was selected. As always, thank you to Google for hosting the
Cauldron and to Diego for procuring an extra registration spot.

Our plan is to continue bringing the top 1-3 GSoC students to GCC
conferences each year. Hopefully, we will get more sponsorship slots
from companies doing GCC development next year. We also plan to earmark
the funds that the GCC project will receive for mentoring the students
($500 per student) toward sponsoring one of next year's students.

Thank you, and see you at the Cauldron!

--
Maxim Kuvyrkov
www.linaro.org