Frame pointer optimization issues
Hi,

Various targets implement -momit-leaf-frame-pointer to avoid using a frame pointer in leaf functions. Currently the GCC mid-end does not provide a way of doing this, so targets have resorted to hacks. Typically this involves forcing flag_omit_frame_pointer to true in the _option_override callback. The issue is that this modifies the actual option variable. As a result the callback is not idempotent, so option save/restore when using function attributes fails, as the callback is called multiple times on the already-modified options. Note this bug exists on all targets which override options in _option_override (and despite claims to the contrary in BZ 60580, it exists on all targets that implement -momit-leaf-frame-pointer). One could hack this a bit further and set flag_omit_frame_pointer = 2 to differentiate between a user setting and the override hack, but that just makes things even worse.

So I see 3 possible solutions:

1. Add a copy of flag_omit_frame_pointer, and only modify the copy in the override. This is the generic correct solution that allows any kind of modification on the copies. It could be done by making all flags separate variables and automating the copying in the option-parsing code. Any code that writes the x_flag_ variables should eventually be fixed to stop doing so to avoid these bugs (i386 does this 22 times and c6x twice).

2. Change the mid-end to call _frame_pointer_required even when !flag_omit_frame_pointer. This is a generic solution which allows targets to decide exactly when to optimize frame pointers. However it does mean all implementations of _frame_pointer_required must be updated (the trivial safe fix is to add "if (!flag_omit_frame_pointer) return true;" at the start).

3. Add a new target callback to avoid having to update all targets. This replaces the existing _frame_pointer_required if implemented and avoids having to update all targets in one go.
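The shape of option 2's "trivial safe fix" can be sketched in plain C. This is only a model: the flag variables and hook name below are stand-ins for the real GCC globals and target hook, and the leaf check is an assumption about how a target would wire up -momit-leaf-frame-pointer without ever writing to flag_omit_frame_pointer.

```c
#include <stdbool.h>

/* Modelled option variables -- hypothetical stand-ins, not the real
   GCC globals.  */
static bool flag_omit_frame_pointer;
static bool flag_omit_leaf_frame_pointer;

/* Sketch of a target frame_pointer_required hook under option 2,
   where the mid-end calls it even when !flag_omit_frame_pointer.
   The first test is the "trivial safe fix" preserving existing
   behaviour; the leaf check then implements leaf frame pointer
   omission without modifying the option variable.  */
bool
example_frame_pointer_required (bool is_leaf_function)
{
  if (!flag_omit_frame_pointer)
    {
      /* Only deviate for leaf functions when the user asked for it.  */
      if (flag_omit_leaf_frame_pointer && is_leaf_function)
        return false;
      return true;
    }
  /* Normal target-specific conditions would go here.  */
  return false;
}
```

Because the decision is recomputed per function from unmodified option state, option save/restore for function attributes remains idempotent.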
A second issue with frame pointers is that update_eliminables() in reload1.c might set frame_pointer_needed to false without any checks. This can't be used to implement -momit-leaf-frame-pointer as it doesn't always happen (i.e. when _can_eliminate always returns true). However, assuming it does trigger in some circumstances, the bug is that it does not check that the frame pointer really isn't required. Even if the frame pointer is completely unused and thus eliminable from the function, the frame pointer setup might still be required by external agents for debugging, unwinding and/or profiling. I believe a more elaborate check is needed, at a minimum a call to _frame_pointer_required.

What do people think? My preference is for option 1, as it fixes all current and future issues with option overrides, plus option 3 to make the frame pointer callback more generic.

Wilco
RE: Frame pointer optimization issues
> Richard Henderson wrote:
> On 08/20/2014 08:22 AM, Wilco Dijkstra wrote:
> > 2. Change the mid-end to call _frame_pointer_required even when
> > !flag_omit_frame_pointer.
>
> Um, it does that already. At least as far as I can see from
> ira_setup_eliminable_regset and update_eliminables.

No, in ira_setup_eliminable_regset the frame pointer is always forced if !flag_omit_frame_pointer, without allowing frame_pointer_required to override it:

  frame_pointer_needed = (! flag_omit_frame_pointer
                          ...
                          || targetm.frame_pointer_required ());

This would allow targets to choose whether to do leaf frame pointer optimization:

  frame_pointer_needed = ((! flag_omit_frame_pointer
                           && targetm.frame_pointer_required ())

> It turns out to be much easier to re-enable a frame pointer for a given
> function than to disable a frame pointer. Thus I believe that you should
> approach -momit_leaf_frame_pointer as setting flag_omit_frame_pointer, and
> then re-enabling it in frame_pointer_required. This requires more than one
> line in common/config/arch/arch.c, but it shouldn't be much more than ten.

As I explained, it is not correct to force flag_omit_frame_pointer to be true. This is what is done today and it fails in various cases. So unless the way options are handled is changed, this possibility is out.

> > A second issue with frame pointers is that update_eliminables() in
> > reload1.c might set frame_pointer_needed to false without any checks.
>
> How? I don't see that path, since the very first thing update_eliminables
> does is call frame_pointer_required -- even before it calls can_eliminate.

update_eliminables() does indeed call frame_pointer_required at the start, however this only blocks elimination *from* HARD_FRAME_POINTER_REGNUM, while the code at the end clears frame_pointer_needed if FRAME_POINTER_REGNUM can be eliminated into any register other than HARD_FRAME_POINTER_REGNUM.
The middle bit of the function is not relevant, as HARD_FRAME_POINTER_REGNUM should only be eliminable into SP (and even if it could, say, be eliminated into another register X, that would only block eliminations of X into SP). So frame_pointer_needed can be cleared even when frame_pointer_required is true...

In principle, if this function worked reliably, we could implement leaf frame pointer omission using this mechanism. Unfortunately it doesn't: update_eliminables is not called in trivial leaf functions even when can_eliminate always returns true, so the frame pointer is never removed. Additionally I'd be worried about compilation performance, as it would introduce extra register allocation passes for ~50% of functions.

Wilco
Register allocation: caller-save vs spilling
Hi,

I'm investigating various register allocation inefficiencies. The first thing that stands out is that GCC supports both caller-saves and spilling. Spilling seems to spill all definitions and all uses of a live range. This means you often end up with multiple reloads close together, while it would be more efficient to do a single load and then reuse the loaded value several times. Caller-save does better in that case, but it is inefficient in that it repeatedly stores registers across every call even if unchanged. If both were fixed to minimise the number of loads/stores, I can't see how one could beat the other, so you'd no longer need both.

Anyway, due to the current implementation there are clearly cases where caller-save is best and cases where spilling is best. However I do not see it making the correct decision despite trying to account for the costs - some code is significantly faster with -fno-caller-saves, other code wins with -fcaller-saves. As an example, I see code like this on AArch64:

	ldr	s4, .LC20
	fmul	s0, s0, s4
	str	s4, [x29, 104]
	bl	f
	ldr	s4, [x29, 104]
	fmul	s0, s0, s4

With -fno-caller-saves it spills and rematerializes the constant as you'd expect:

	ldr	s1, .LC20
	fmul	s0, s0, s1
	bl	f
	ldr	s5, .LC20
	fmul	s0, s0, s5

So given this, is the cost calculation correct, and does it include rematerialization? The spill code understands how to rematerialize, so it should take this into account in the costs. I did find some code in ira-costs.c in scan_one_insn() that attempts something that looks like an adjustment for rematerialization, but it doesn't appear to handle all cases (simple immediates, 2-instruction immediates, address constants, non-aliased loads such as literal pool and const data loads).
Also the hook CALLER_SAVE_PROFITABLE appears to have disappeared - overall performance improves significantly if I add this (basically the default heuristic used on instruction frequencies):

--- a/gcc/ira-costs.c
+++ b/gcc/ira-costs.c
@@ -2230,6 +2230,8 @@ ira_tune_allocno_costs (void)
 		    * ALLOCNO_FREQ (a)
 		    * IRA_HARD_REGNO_ADD_COST_MULTIPLIER (regno) / 2);
 #endif
+	  if (ALLOCNO_FREQ (a) < 4 * ALLOCNO_CALL_FREQ (a))
+	    cost = INT_MAX;
 	}
       if (INT_MAX - cost < reg_costs[j])
 	reg_costs[j] = INT_MAX;

If such a simple heuristic can beat the costs, they can't be quite right. Is there anyone who understands the cost calculations?

Wilco
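The heuristic in the patch can be modelled as a tiny stand-alone cost filter. This is a sketch for illustration only: the function name is hypothetical, and ALLOCNO_FREQ / ALLOCNO_CALL_FREQ are represented as plain ints rather than the real IRA accessors.

```c
#include <limits.h>

/* Model of the reinstated CALLER_SAVE_PROFITABLE-style heuristic:
   if an allocno's use frequency is less than 4x the frequency of the
   calls it lives across, caller-saving cannot pay off, so make the
   hard register cost prohibitive and force a memory spill instead.  */
int
tune_allocno_cost (int cost, int allocno_freq, int call_freq)
{
  if (allocno_freq < 4 * call_freq)
    return INT_MAX;
  return cost;
}
```

The factor 4 is the threshold from the patch above; as noted in the follow-up, a factor of 2 turned out to be best overall in practice.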
RE: Register allocation: caller-save vs spilling
Hi Vlad, I added you directly in case you hadn't spotted my original post.

A simple example for AArch64 trunk is as follows:

// Compile with: -O2 -fomit-frame-pointer -ffixed-d8 -ffixed-d9 -ffixed-d10
// -ffixed-d11 -ffixed-d12 -ffixed-d13 -ffixed-d14 -ffixed-d15 -f(no-)caller-saves
void g(void);
float f(float x)
{
  x += 3.0;
  g();
  x *= 3.0;
  return x;
}

It seems that reload only ever considers rematerialization of spilled live ranges, not caller-saved ones. That means the caller-save code should either reject constants outright, or the memory spill cost for these should always be lower than that of a caller-save (given memory_move_cost=4 and register_move_cost=2 as commonly used by targets, anything that can be rematerialized should have less than half the cost of being spilled or caller-saved).

Wilco

> -----Original Message-----
> From: Wilco Dijkstra [mailto:wdijk...@arm.com]
> Sent: 27 August 2014 17:25
> To: 'gcc@gcc.gnu.org'
> Subject: Register allocation: caller-save vs spilling
>
> [...]
> However I do not see it making the correct decision despite trying to
> account for the costs - some code is significantly faster with
> -fno-caller-saves, other code wins with -fcaller-saves.
> [...]
> +	  if (ALLOCNO_FREQ (a) < 4 * ALLOCNO_CALL_FREQ (a))
> +	    cost = INT_MAX;
> [...]
> If such a simple heuristic can beat the costs, they can't be quite right.

Note: if (ALLOCNO_FREQ (a) < 2 * ALLOCNO_CALL_FREQ (a)) turns out to be best overall.

> Is there anyone who understands the cost calculations?
>
> Wilco
IRA preferencing issues
Hi,

While investigating why the IRA preferencing algorithm often chooses incorrect preferences from the costs, I noticed this thread: https://gcc.gnu.org/ml/gcc/2011-05/msg00186.html

I am seeing the exact same issue on AArch64 - during the final preference selection, ira-costs takes the union of any register classes that happen to have equal cost. As a result many registers get ALL_REGS as the preferred class even though its cost is much higher than either GENERAL_REGS or FP_REGS. So we end up with lots of scalar SIMD instructions and expensive int<->FP moves in integer code when register pressure is high. When the preference is computed correctly as in the proposed patch (choosing the first class with the lowest cost, ie. GENERAL_REGS), the resulting code is much more efficient, and there are no spurious SIMD instructions.

Choosing a preferred class when it doesn't have the lowest cost is clearly incorrect. So is there a good reason why the proposed patch should not be applied? I actually wonder why we'd ever need to do a union - if there are 2 classes with equal cost, you'd use the 2nd as the alternative class.

The other question I had is whether there is a good way to improve the preference in cases like this and avoid classes with equal cost altogether. The costs are clearly not equal: scalar SIMD instructions have higher latency and require extra int<->FP moves. It is possible to mark variants in the MD patterns using '?' to discourage them, but that seems like a hack, just like '*'. Is there a general way to say that GENERAL_REGS is preferred over FP_REGS for SI/DI mode?

Wilco
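The selection rule the patch proposes is simply "first class with the lowest cost". A minimal model of the idea, with register classes represented as indices into a cost array (this is an illustration, not the ira-costs code):

```c
/* Return the index of the first class whose cost equals the minimum.
   With costs {GENERAL_REGS = 4, FP_REGS = 4, ALL_REGS = 8} this picks
   GENERAL_REGS rather than forming a union that behaves like ALL_REGS;
   a second equal-cost class would instead become the alternative
   class.  */
int
first_cheapest_class (const int *cost, int num_classes)
{
  int best = 0;
  for (int i = 1; i < num_classes; i++)
    if (cost[i] < cost[best])
      best = i;
  return best;
}
```

The strict `<` comparison is what makes the choice deterministic: ties keep the earlier (cheaper-in-practice) class instead of widening the preference.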
RE: IRA preferencing issues
> Matthew Fortune wrote:
> Wilco Dijkstra writes:
> > While investigating why the IRA preferencing algorithm often chooses
> > incorrect preferences from the costs, I noticed this thread:
> > https://gcc.gnu.org/ml/gcc/2011-05/msg00186.html
> > [...]
>
> MIPS has the same problem here and we have been looking at ways to address
> it purely via costings rather than changing IRA.
> What we have done so far is to make the cost of a move from GENERAL_REGS
> to FP_REGS more expensive than memory if the move has an integer mode.
> The goal for MIPS is to never allocate an FP register to an integer mode
> unless it was absolutely necessary owing to an integer to fp conversion
> where the integer has to be put in an FP register. Ideally I'd like a
> guarantee that FP registers will never be used unless a floating point
> type is present in the source, but I haven't found a way to do that given
> the FP-int conversion issue requiring SImode to be allowed in FP regs.

I adjusted the costs like that already on AArch64, but while this reduced the crazy spilling of integer values to FP registers and vice versa, it doesn't fix it completely. However it should not be necessary to lie about the move cost and use an unrealistically high value to get decent code... There are other issues in ira-costs that cause preferences to be incorrect: you'll find that after you increase the move costs, explicit int<->fp moves start to go via memory, due to the memory cost being hardcoded as 1 if an instruction pattern contains 'm' somewhere - oops... I also posted a patch that fixes the preference for new registers created by live-range splitting.

> The patch for MIPS is not submitted yet but has eliminated the final
> two uses of FP registers when building the whole Linux kernel with
> hard-float enabled. I am however still not confident enough to say
> you can build integer only code with hard-float and never touch an FP
> register.

Correct; as long as the preference calculations are not correct and there is no good way to influence the costs reliably, GCC will continue to use FP registers in cases when it shouldn't. It's obvious that integer operations should prefer integer registers and FP operations FP registers, so why is there no easy way to tell GCC?!?
> Since there are multiple architectures suffering from this I guess we
> should look at properly addressing it in generic code.

Agreed.

Wilco
RE: IRA preferencing issues
Interestingly, even when the preferences are accurate, lra_constraints completely ignores the preferred/allocno class. If the cost of 2 alternatives is equal in every way (which will be the case if they are both legal matches, as the standard cost functions are not used at all), the wrong one may be chosen depending on the order in the MD file. This is particularly bad when the spill optimization pass later removes some of the spills and correctly allocates them to their preferred allocno class...

Forcing win = true if the register class of the alternative intersects with the preferred class generates significantly better spill code for cases where the preference is accurate (ie. not just ALL_REGS), resulting in far less confusion between integer and FP registers.

So shouldn't get_reg_class return the preferred/allocno class like below rather than NO_REGS?

diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
index 0ddd842..f38914a 100644
--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -263,7 +263,10 @@ get_reg_class (int regno)
     }
   if (regno >= new_regno_start)
     return lra_get_allocno_class (regno);
-  return NO_REGS;
+  return reg_preferred_class (regno);
 }

Wilco
RFC: Creating a more efficient sincos interface
Hi,

The existing sincos functions use 2 pointers to return the sine and cosine results. In most cases 4 memory accesses are necessary per call. This is inefficient and often significantly slower than returning values in registers. I ran a few experiments on the new optimized sincosf implementation in GLIBC using the following interface:

__complex__ float sincosf2 (float);

This has 50% higher throughput and a 25% reduction in latency on Cortex-A72 for random inputs in the range +-PI/4. Larger inputs take longer and thus have lower gains, but there is still a 5% gain on the (rarely used) path with full range reduction. Given sincos is used in various HPC applications, this can give a worthwhile speedup.

LLVM already supports something similar for OSX using a struct of 2 floats. Using complex float is better since not all targets may support returning structures in floating point registers, and GCC generates very inefficient code on targets that do (PR86145).

What do people think? Ideally I'd like to support this in a generic way so all targets can benefit, but it's also feasible to enable it on a per-target basis. Also, since not all libraries will support the new interface, there would have to be a flag or configure option to switch the new interface off if not supported (maybe automatically based on the math.h header).

Wilco
Re: "multiple definition of symbols" when linking executables on ARM32 and AArch64
On 06.01.20 11:03, Andrew Pinski wrote:
> +GCC
>
> On Mon, Jan 6, 2020 at 1:52 AM Matthias Klose wrote:
>>
>> In an archive test rebuild with binutils and GCC trunk, I see a lot of build
>> failures on both aarch64-linux-gnu and arm-linux-gnueabihf failing with
>> "multiple definition of symbols" when linking executables, e.g.
>
> THIS IS NOT A BINUTILS OR GCC BUG.
> GCC changed the default to -fno-common.
> It seems like for some reason, your non-aarch64/arm builds had changed
> the default back to being with -fcommon turned on.

> what would that be? I'm not aware of any active change doing that. Packages
> build on x86, ppc64el and s390x at least.

Well, if you want to build old archived code using the latest GCC then you may need to force -fcommon, just like you need to add many warning disables. Maybe you were using an older GCC for the other targets? As Andrew notes, this isn't Arm-specific.

Wilco
Re: "multiple definition of symbols" when linking executables on ARM32 and AArch64
Hi,

>> However, this is an undocumented change in the current NEWS, and seeing
>> literally hundreds of package failures, I doubt that's the right thing to
>> do, at least without any deprecation warning first. Could that be handled,
>> deprecating in GCC 10 first, and then changing that for GCC 11?

This change was first proposed for GCC 8, and rejected because of failures in the distros. Two years have passed, and there are still failures... Would this change if we postpone it even longer? My feeling is that nobody is going to actively fix their code if the default isn't changed first.

> It is hard to get a warning for things like this.

Could the linker warn whenever it merges common symbols, or would that give many false positives?

Wilco
Re: [ARM] LLVM's -arm-assume-misaligned-load-store equivalent in GCC?
Hi Christophe,

> Actually I got a confirmation of what I suspected: the offending function
> foo() is part of the ARM CMSIS libraries; although the users are able to
> recompile them, they don't want to modify that source code. Having a
> compilation option to avoid generating problematic code sequences would
> be OK for them.

Well, if LDRD instructions are incorrectly generated in those libraries for unaligned accesses, then why not report it as a CMSIS bug? Adding a complex new option like this (with high impact on code quality) to work around what seems a simple bug is way overkill.

> So from the user's perspective, the wrong code is part of a 3rd party
> library which they can recompile but do not want to modify.

Would we expect every user of CMSIS to find this bug, find its cause, figure out that a future GCC may have a new special option to avoid it, wait until that GCC is released, and then recompile? Really, the best option is to report it and modify the source until it is fixed.

Cheers,
Wilco
Re: help with PR78809 - inline strcmp for small constant strings
Richard Henderson wrote:
> On 08/04/2017 05:59 AM, Prathamesh Kulkarni wrote:
> > For i386, it seems strcmp is expanded inline via the cmpstr optab by
> > expand_builtin_strcmp if one of the strings is constant. Could we
> > similarly define a cmpstr pattern for AArch64?
>
> Certainly that's possible.

I'd suggest doing it in a target-independent way: this is not a target-specific optimization and shouldn't be done in the target unless there are special strcmp instructions.

> For constant strings of small length (up to 3?), I was wondering if it'd
> be a good idea to manually unroll the strcmp loop, similar to the
> __strcmp_* macros in bits/string.h?
>
> For eg in gimple-fold, transform
> x = __builtin_strcmp(s, "ab")
> to
> x = s[0] - 'a';
> if (x == 0)
>   {
>     x = s[1] - 'b';
>     if (x == 0)
>       x = s[2];
>   }

If there is already code that does something similar (see comment #1 in PR78809), it could easily be adapted to handle more cases.

> if (memcmp(s, "ab", 3) != 0)
>
> to be implemented with cmp+ccmp+ccmp and one branch.

Even better would be wider loads if you know either the alignment of s or its max size (although given the overhead of creating the return value, that works best for equality).

Wilco
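The expansion quoted above can be written out as plain C to check that its result keeps strcmp's sign convention (the function name here is illustrative, not part of any proposed API):

```c
/* Plain-C model of the suggested gimple-fold expansion of
   __builtin_strcmp (s, "ab"): compare byte by byte against the
   constant, falling through to the terminator check.  The sign of
   the result matches strcmp: 0 for "ab", negative if s sorts before
   it, positive if it sorts after.  */
int
strcmp_ab (const char *s)
{
  int x = s[0] - 'a';
  if (x == 0)
    {
      x = s[1] - 'b';
      if (x == 0)
        x = s[2];   /* effectively s[2] - '\0' */
    }
  return x;
}
```

Each comparison can exit early, so at most three loads are done and no call overhead remains.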
RFC: Improving GCC8 default option settings
Hi all,

At the GNU Cauldron I was inspired by several interesting talks about improving GCC in various ways. While GCC has many great optimizations, a common theme is that its default settings are rather conservative. As a result, users are required to enable several additional optimizations by hand to get good code. Other compilers enable more optimizations at -O2 (loop unrolling in LLVM was mentioned repeatedly) which GCC could/should do as well.

Here are a few concrete proposals to improve GCC's option settings which will enable better code generation for most targets:

* Make -fno-math-errno the default - this mostly affects the code generated for sqrt, which should be treated just like floating-point division and not set errno by default (unless you explicitly select C89 mode).

* Make -fno-trapping-math the default - another obvious one. From the docs: "Compile code assuming that floating-point operations cannot generate user-visible traps." There isn't a lot of code that actually uses user-visible traps (if any - many CPUs don't even support user traps, as it's an optional IEEE feature). So assuming trapping math by default is way too conservative, since there is no obvious benefit to users.

* Make -fno-common the default - this was originally needed for pre-ANSI C, but is optional in C (not sure whether it is still in C99/C11). This can significantly improve code generation on targets that use anchors for globals (note the linker could report a more helpful message when ancient code that requires -fcommon fails to link).

* Make -fomit-frame-pointer the default - various targets already do this at higher optimization levels, but it could easily be done for all targets. Frame pointers haven't been needed for debugging for decades; however, if there are still good reasons to keep them enabled at -O0 or -O1 (I can't think of any, unless it is for last-resort backtraces when there is no unwind info at a crash), we could just disable the frame pointer from -O2 onwards.
These are just a few ideas to start. What do people think? I'd welcome discussion and other proposals for similar improvements. Wilco
Re: [RFC] type promotion pass
Hi Prathamesh, I've tried out the latest version and it works really well. It built and ran SPEC2017 without any issues or regressions (I didn't do a detailed comparison which would mean multiple runs, however a single run showed performance is pretty much the same on INT and 0.1% faster on FP). Codesize reduces in almost all cases (only xalancbmk increases by 600 bytes), sometimes by a huge amount. For example in gcc_r around 20% of all AND immediate instructions are removed, clear proof it removes many redundant zero/sign extensions. So consider this a big +1 from me! GCC is behind other compilers with respect to this kind of optimization and it looks like this phase does a major catchup. Like I mentioned, it doesn't have to be 100% perfect, once it has been committed, we can fine tune it and add more optimizations. Wilco
Re: [RFC] type promotion pass
David Edelsohn wrote:
> Why does AArch64 define PROMOTE_MODE as SImode? GCC ports for other
> RISC targets mostly seem to use a 64-bit mode. Maybe SImode is the
> correct definition based on the current GCC optimization
> infrastructure, but this seems like a change that should be applied to
> all 64-bit RISC targets.

The reason is that AArch64 supports both 32-bit and 64-bit register operations, so when using char/short you want 32-bit operations. There is an issue in that WORD_REGISTER_OPERATIONS isn't set on AArch64, but it should be. Maybe that requires some cleanups to ensure it correctly interacts with PROMOTE_MODE. There are way too many confusing target defines like this and no general mechanism that just works like you'd expect. Promoting to an orthogonal set of registers is not particularly unusual, so it's something GCC should support well by default...

Wilco
Re: Possible gcc 4.8.5 bug about RELOC_HIDE marcro in latest kernel code
Hi Justin,

> I tried centos 7.4 gcc 4.8.5-16, which seems to announce a fix for this
> issue. And I checked the source code; the patch had been included.
> But no luck, the bug is still there.
>
> Could you please give me any advice? E.g. is there any way to disable
> such a reload compilation procedure?

Reload is an intrinsic part of register allocation in GCC; it cannot be disabled. My advice would be to use GCC7 - there are many more issues in GCC4.8 which you will run into sooner or later. I've done many backports for AArch64 and generally stopped at GCC6, so please don't consider using anything older. A recent GCC will also generate MUCH more efficient code for AArch64. The same is true for GLIBC. So why use something as ancient as GCC4.8???

Wilco
Re: Possible gcc 4.8.5 bug about RELOC_HIDE marcro in latest kernel code
Hi Justin, > The 4.8.5 is default gcc version for centos 7.x If there is no newer version available you should talk to your distro. It is worth reporting this bug to them as more of their users may be affected by it. Wilco
Re: "GOT" under aarch64
Hi,

You'll get GOT relocations to globals when you use -fpic:

int x;
int f(void) { return x; }

> gcc -O2 -S -o- -fpic

f:
	adrp	x0, :got:x
	ldr	x0, [x0, #:got_lo12:x]
	ldr	w0, [x0]
	ret

So it doesn't depend on the compiler but on which options you compile with. There may be an issue with your setup; -fpic shouldn't be on by default. Use gcc -v -Q -c testfile.c to list all the default settings - there could be more non-standard or inefficient options enabled.

Wilco
Re: Potential bug on Cortex-M due to used registers/interrupts.
Hi,

> These other registers - r4 to r12 - are "callee saved".

To be precise, R4-R11 are callee-saved; R0-R3, R12 and LR are caller-saved; and LR and PSR are clobbered by calls. LR is slightly odd in that it is callee-saved in the prolog but not in the epilog (since LR is assumed clobbered after a call, it doesn't need to be restored, so you can use pop {regs,PC} to return).

Cortex-M hardware will automatically save/restore R0-R3, R12, LR, PC and PSR on interrupts. That perfectly matches the caller-saved and clobbered registers, so there is no potential bug.

Wilco
Fortran array slices and -frepack-arrays
Hi,

I looked at a few performance anomalies between gfortran and Flang - it appears array slices are treated differently. Using -frepack-arrays fixed a performance issue in gfortran and didn't cause any regressions. Making input array slices contiguous helps both locality and enables more vectorization. So I wonder whether it should be made the default (at -O3 or just -Ofast)? Alternatively, would it be feasible in Fortran to version functions or loops when all arguments are contiguous slices?

Wilco
Re: Fortran array slices and -frepack-arrays
Bin.Cheng wrote:
> I don't know the implementation of the option, so two questions:
> 1) When is the repacking done during compilation? Is new code
> manipulating the data layout added by the frontend? If yes, better to
> do it during optimization so it can be done on demand. This looks like
> one case of data layout transformation. Not sure if there is enough
> information to do that in the optimizer.

Yes, it adds a runtime check at function entry and packs array slices which have a non-unity step. Currently it uses a call to _gfortran_internal_pack, however this could be inlined and use alloca rather than malloc for small slices. It might be possible to check which parameters are used a lot (or benefit from vectorization) and only pack those.

> 2) For now, does this option force array repacking unconditionally? I
> think it won't be too hard to model when such a data layout
> transformation is beneficial by looking at the loop (nest) accessing
> the array and comparing against the overhead.

Yes, it ensures all slices are packed, but that isn't strictly necessary.

>> Alternatively, would it be feasible in Fortran to version functions or
>> loops if all arguments are contiguous slices?
>
> I think a cost model is still needed for function/loop versioning.

Absolutely. If you statically know at the call site that all slices are contiguous, you could compile a version of the function using the contiguous attribute and skip all the runtime checks. Such function versioning would require LTO to work well.

Wilco
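The packing itself is just a strided copy. A heavily simplified C model of what -frepack-arrays arranges at function entry (the real _gfortran_internal_pack works on array descriptors, handles multiple dimensions and does the allocation; this sketch only shows the core operation for one dimension):

```c
/* Copy a non-contiguous array slice (element stride > 1) into a
   contiguous buffer so the rest of the function sees unit-stride
   data, improving locality and enabling vectorization.  */
void
pack_slice (double *dst, const double *src, int n, int stride)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i * stride];
}
```

The runtime check mentioned above simply skips this copy (and the matching unpack for modified arguments) when the incoming slice already has stride 1.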
Re: How to get GCC on par with ICC?
Martin wrote:
> Keep in mind that when discussing FP benchmarks, the math library used
> can be (almost) as important as the compiler. In the case of 481.wrf,
> we found that the GCC 8 + glibc 2.26 (so the "out-of-the-box" GNU)
> performance is about 70% of ICC's. When we just linked against AMD's
> libm, we got to 83%. When we instructed GCC to generate calls to Intel's
> SVML library and linked against it, we got to 91%. Using both SVML and
> AMD's libm, we achieved 93%.
>
> That means that there likely still is 7% to be gained from more clever
> optimizations in GCC, but the real problem is in GNU libm. And 481.wrf
> is perhaps the most extreme example but definitely not the only one.

You really should retry with GLIBC 2.27, since several key math functions were rewritten from scratch by Szabolcs Nagy (all in generic C code), resulting in huge performance gains on all targets (eg. wrf improved over 50%). I fixed several double-precision functions in current GLIBC to avoid extremely bad performance which had been complained about for years. There are more math functions on the way, so the GNU libm will not only catch up but become the fastest math library available.

Wilco
Missing optimization: mempcpy(3) vs memcpy(3)
Hi,

I don't believe there is a missing optimization here: compilers expand mempcpy by default into memcpy, since that is the standard library call. That means even if your source code contains mempcpy, there will never be any calls to mempcpy. The reason is obvious: most targets provide an optimized memcpy in the C library, while very few optimize mempcpy. The same is true for bzero, bcmp and bcopy. Targets can do it differently; IIRC x86 is the only target that emits calls to both memcpy and mempcpy.

Cheers,
Wilco
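The default expansion is equivalent to the following (a sketch; the function name is illustrative, showing the transformation rather than GCC's internal code):

```c
#include <string.h>

/* How a mempcpy call is expanded by default: call the (usually much
   better optimized) memcpy, then add the length to produce mempcpy's
   past-the-end return value.  */
void *
mempcpy_via_memcpy (void *dst, const void *src, size_t n)
{
  return (char *) memcpy (dst, src, n) + n;
}
```

Since memcpy returns dst unchanged, the only extra work is one add, which is almost always cheaper than carrying a second, rarely optimized library entry point.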