Re: [AArch64] A question about Cortex-A57 pipeline description
Thanks for the reply! I see your point. Indeed, I've also seen cases where the load pipeline was overused at the beginning of a basic block, while at the end the code got stuck with a bunch of stores and no other instructions to run in parallel. And indeed, relaxing the restrictions makes things even worse in some cases. Still, I don't believe this is the best we can do; I'm going to take a closer look at the scheduler and see what can be done to improve the situation.

Nikolai

On 09/11/2015 07:21 PM, James Greenhalgh wrote:
> On Fri, Sep 11, 2015 at 04:31:37PM +0100, Nikolai Bozhenov wrote:
>> Hi!
>>
>> Recently I got somewhat confused by the Cortex-A57 pipeline description
>> in GCC and I would be grateful if you could help me understand a few
>> unclear points.
>
> Sure,
>
>> Particularly I am interested in how memory operations (loads/stores)
>> are scheduled. It seems that, according to the cortex-a57.md file,
>> firstly, two memory operations may never be scheduled at the same cycle
>> and, secondly, two loads may never be scheduled at two consecutive
>> cycles:
>>
>>   ;; 5. Two pipelines for load and store operations: LS1, LS2. The most
>>   ;;    valuable thing we can do is force a structural hazard to split
>>   ;;    up loads/stores.
>>   (define_cpu_unit "ca57_ls_issue" "cortex_a57")
>>   (define_cpu_unit "ca57_ldr, ca57_str" "cortex_a57")
>>   (define_reservation "ca57_load_model" "ca57_ls_issue,ca57_ldr*2")
>>   (define_reservation "ca57_store_model" "ca57_ls_issue,ca57_str")
>>
>> However, the Cortex-A57 Software Optimization Guide states that the
>> core is able to execute one load operation and one store operation
>> every cycle. And that agrees with my experiments. Indeed, a loop
>> consisting of 10 loads, 10 stores and several arithmetic operations
>> takes on average about 10 cycles per iteration, provided that the
>> instructions are intermixed properly. So, what is the purpose of the
>> additional restrictions imposed on the scheduler in the cortex-a57.md
>> file? It doesn't look like an error. Rather, it looks like a
>> deliberate decision.
When designing the model for the Cortex-A57 processor, I was primarily trying to build a model which would increase the blend of pipelines utilized on each cycle across a range of benchmarks, rather than one which accurately reflects the constraints listed in the Cortex-A57 Software Optimisation Guide [1]. My reasoning is that the Cortex-A57 is a high-performance processor, and an accurate model of it would be infeasible to build. Because of this, the model in GCC is unlikely to be representative of the true state of the processor, and consequently GCC may make decisions which result in an instruction stream biased towards one execution pipeline. In particular, given a less restrictive model, GCC will try to hoist more loads towards the beginning of a basic block, which can result in poorer utilization of the other execution pipelines.

In my experiments, I found this model to be more beneficial across a range of benchmarks than one in which the additional restrictions I imposed were relaxed. I'd be happy to consider counter-examples where this modelling produces suboptimal results, and where the changes you suggest are sufficient to resolve the issue.

Thanks,
James

---

[1]: Cortex-A57 Software Optimisation Guide
http://infocenter.arm.com/help/topic/com.arm.doc.uan0015a/cortex_a57_software_optimisation_guide_external.pdf
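For concreteness, the kind of "relaxed" model being discussed might look like the following machine-description fragment. This is a hypothetical sketch only, not the model GCC ships in cortex-a57.md; the unit and reservation names mirror the fragment quoted above.

```lisp
;; Hypothetical relaxation (NOT the shipped model): give each load/store
;; pipeline its own unit with no shared issue resource, so one load and
;; one store can issue in the same cycle -- matching the
;; one-load-plus-one-store-per-cycle throughput described in the
;; Software Optimisation Guide.
(define_cpu_unit "ca57_ls1, ca57_ls2" "cortex_a57")
(define_reservation "ca57_load_model"  "ca57_ls1")
(define_reservation "ca57_store_model" "ca57_ls2")
```

As James notes, a model like this tends to make the scheduler hoist loads aggressively, which is exactly the basic-block-start load clustering Nikolai observed.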
Re: Replacing malloc with alloca.
On 13/09/15 20:19, Florian Weimer wrote:
> * Jeff Law:
>
>> On 09/13/2015 12:28 PM, Florian Weimer wrote:
>>> * Ajit Kumar Agarwal:
>>>
>>>> The replacement of malloc with alloca can be done on the following
>>>> analysis: if the lifetime of an object does not stretch beyond its
>>>> immediate scope, the malloc can be replaced with alloca. This
>>>> increases performance to a great extent.
>>>
>>> You also need to make sure that the object is small (less than a
>>> page) and that there is no deep recursion going on. Otherwise, the
>>> program may no longer work after the transformation with real-world
>>> restricted stack sizes. It may even end up with additional security
>>> issues.
>
>> You also have to make sure you're not inside a loop. Even a small
>> allocation inside a loop is problematic from a security standpoint.
>>
>> You also need to look at what other objects might be on the stack,
>> and you have to look at the function scope, not the immediate scope,
>> as alloca space isn't returned until the end of the function.
>
> Ah, right, alloca is unscoped (except when there are variable-length
> arrays).
>
> Using a VLA might be the better approach (but the size concerns
> remain). Introducing VLAs could alter program behavior in the presence
> of a pre-existing alloca call, leading to premature deallocation.

You also have to consider that the code generated for functions containing alloca calls can be less efficient than for functions that do not call it (frame pointers cannot be eliminated, for example). So I'm not convinced this would necessarily be a performance win either.

R.
Re: Replacing malloc with alloca.
On Sun, 13 Sep 2015, Ajit Kumar Agarwal wrote:
> The replacement of malloc with alloca can be done on the following
> analysis: if the lifetime of an object does not stretch beyond its
> immediate scope, the malloc can be replaced with alloca. This
> increases performance to a great extent.
>
> Inlining helps to a great extent where the lifetime of an object does
> not stretch beyond its immediate scope, and the scope for replacing
> malloc with alloca can then be identified. I am wondering in which
> phases of our optimization pipeline malloc is replaced with alloca,
> and what analysis is done to perform the transformation. Does this
> greatly increase the performance of the benchmarks? Is the analysis
> done through escape analysis? If yes, what data structure is used for
> the abstract interpretation?

Did you try it? I don't think gcc ever replaces malloc with alloca. The only optimization we do with malloc/free is removing the pair when it is obviously unused. There are several PRs open about possible optimizations (PR 19831, for instance). I posted a WIP patch a couple of years ago to replace some malloc+free pairs with fixed-length local arrays, but never had time to finish it.

https://gcc.gnu.org/ml/gcc-patches/2013-11/msg03108.html

--
Marc Glisse
Re: Replacing malloc with alloca.
On 09/14/2015 02:14 AM, Richard Earnshaw wrote:
> On 13/09/15 20:19, Florian Weimer wrote:
>> * Jeff Law:
>>
>>> On 09/13/2015 12:28 PM, Florian Weimer wrote:
>>>> * Ajit Kumar Agarwal:
>>>>
>>>>> The replacement of malloc with alloca can be done on the following
>>>>> analysis: if the lifetime of an object does not stretch beyond its
>>>>> immediate scope, the malloc can be replaced with alloca. This
>>>>> increases performance to a great extent.
>>>>
>>>> You also need to make sure that the object is small (less than a
>>>> page) and that there is no deep recursion going on. Otherwise, the
>>>> program may no longer work after the transformation with real-world
>>>> restricted stack sizes. It may even end up with additional security
>>>> issues.
>>>
>>> You also have to make sure you're not inside a loop. Even a small
>>> allocation inside a loop is problematic from a security standpoint.
>>>
>>> You also need to look at what other objects might be on the stack,
>>> and you have to look at the function scope, not the immediate scope,
>>> as alloca space isn't returned until the end of the function.
>>
>> Ah, right, alloca is unscoped (except when there are variable-length
>> arrays).
>>
>> Using a VLA might be the better approach (but the size concerns
>> remain). Introducing VLAs could alter program behavior in the
>> presence of a pre-existing alloca call, leading to premature
>> deallocation.
>
> You also have to consider that the code generated for functions
> containing alloca calls can be less efficient than for functions that
> do not call it (frame pointers cannot be eliminated, for example). So
> I'm not convinced this would necessarily be a performance win either.

Yes, but I suspect that eliminating a single malloc/free pair dwarfs the cost of needing a frame pointer. The problem is proving when it's safe to turn a malloc/free pair into an alloca. As folks have shown, it's non-trivial once the security aspects are considered.

I've speculated that, from a security standpoint, projects ought to just ban alloca, particularly glibc. It's been shown over and over again that folks just don't get it right, and it's ripe for exploitation. It'd be a whole lot easier to convince folks to go in this direction if GCC were good about that kind of optimization.

Jeff
Re: reload question about unmet constraints
> You would need some way to indicate that while Y does accept a mem,
> this particular mem can't be reloaded to match. We don't have a way
> to do that.

As a test, I added this API. It seems to work. I suppose there could be a better API where we determine whether a constraint matches various memory spaces, then compare with the memory space of the operand, but I can't prove that's sufficiently flexible for all targets that support memory spaces. Heck, I'm not even sure what to call the macro, and "TARGET_IS_THIS_MEMORY_ADDRESS_RELOADABLE_TO_MATCH_THIS_CONSTRAINT_P()" is a little long ;-)

What do we think of this direction?

Index: reload.c
===================================================================
RCS file: /cvs/cvsfiles/gnupro/gcc/reload.c,v
retrieving revision 1.33
diff -p -U 5 -r1.33 reload.c
--- reload.c	20 Feb 2014 16:40:26 -0000	1.33
+++ reload.c	15 Sep 2015 05:38:24 -0000
@@ -3517,20 +3517,26 @@ find_reloads (rtx insn, int replace, int
 		  && ((reg_equiv_mem (REGNO (operand)) != 0
 		       && EXTRA_CONSTRAINT_STR (reg_equiv_mem (REGNO (operand)), c, p))
 		      || (reg_equiv_address (REGNO (operand)) != 0)))
 		win = 1;
 
+#ifndef COMPATIBLE_CONSTRAINT_P
+#define COMPATIBLE_CONSTRAINT_P(c,p,op) 1
+#endif
+	      if (!MEM_P (operand) || COMPATIBLE_CONSTRAINT_P (c, p, operand))
+		{
 	      /* If we didn't already win, we can reload constants
 		 via force_const_mem, and other MEMs by reloading the
 		 address like for 'o'.  */
 	      if (CONST_POOL_OK_P (operand_mode[i], operand)
 		  || MEM_P (operand))
 		badop = 0;
 	      constmemok = 1;
 	      offmemok = 1;
 	      break;
 	      }
+	      }
 
 	      if (EXTRA_ADDRESS_CONSTRAINT (c, p))
 		{
 		  if (EXTRA_CONSTRAINT_STR (operand, c, p))
 		    win = 1;
Index: config/rl78/rl78.c
===================================================================
RCS file: /cvs/cvsfiles/gnupro/gcc/config/rl78/rl78.c,v
retrieving revision 1.12.6.16
diff -p -U 5 -r1.12.6.16 rl78.c
--- config/rl78/rl78.c	5 Aug 2015 13:43:59 -0000	1.12.6.16
+++ config/rl78/rl78.c	15 Sep 2015 05:39:04 -0000
@@ -1041,10 +1041,18 @@ rl78_far_p (rtx x)
     return 0;
   return GET_MODE_BITSIZE (rl78_addr_space_address_mode (MEM_ADDR_SPACE (x))) == 32;
 }
 
+int
+rl78_compatible_constraint_p (char c, const char *p, rtx r)
+{
+  if (c == 'Y' && rl78_far_p (r))
+    return 0;
+  return 1;
+}
+
 /* Return the appropriate mode for a named address pointer.  */
 #undef  TARGET_ADDR_SPACE_POINTER_MODE
 #define TARGET_ADDR_SPACE_POINTER_MODE rl78_addr_space_pointer_mode
 static enum machine_mode
Index: config/rl78/rl78.h
===================================================================
RCS file: /cvs/cvsfiles/gnupro/gcc/config/rl78/rl78.h,v
retrieving revision 1.7.8.3
diff -p -U 5 -r1.7.8.3 rl78.h
--- config/rl78/rl78.h	17 Mar 2015 14:54:35 -0000	1.7.8.3
+++ config/rl78/rl78.h	15 Sep 2015 05:39:28 -0000
@@ -500,5 +500,7 @@ typedef unsigned int CUMULATIVE_ARGS;
 
 /* NOTE: defined but zero means dwarf2 debugging, but sjlj EH.  */
 #define DWARF2_UNWIND_INFO 0
 #define REGISTER_TARGET_PRAGMAS() rl78_register_pragmas()
+
+#define COMPATIBLE_CONSTRAINT_P(C,P,OP) rl78_compatible_constraint_p (C,P,OP)
Re: reload question about unmet constraints
On Mon, Sep 14, 2015 at 11:05 PM, DJ Delorie wrote:
> As a test, I added this API. It seems to work. I suppose there could
> be a better API where we determine whether a constraint matches
> various memory spaces, then compare with the memory space of the
> operand, but I can't prove that's sufficiently flexible for all
> targets that support memory spaces. Heck, I'm not even sure what to
> call the macro, and
> "TARGET_IS_THIS_MEMORY_ADDRESS_RELOADABLE_TO_MATCH_THIS_CONSTRAINT_P()"
> is a little long ;-)
>
> What do we think of this direction?

We already have define_constraint and define_memory_constraint. We could perhaps add a define_special_memory_constraint that returns CT_SPECIAL_MEMORY, which mostly operates like CT_MEMORY except that it doesn't assume any MEM can be reloaded to match.

We already have constraint_satisfied_p, which is generated from define*_constraint. We could have a constraint_reloadable_to_match_p function parallel to that, for operands that don't match but can be reloaded to match.

Perhaps we don't even need a distinction between define_memory_constraint and define_special_memory_constraint. We could have constraint_reloadable_to_match_p default to the current code for memory constraints, which assumes any mem is reloadable to match, if a special reloadable condition isn't specified. Perhaps define_memory_constraint could be extended with an optional field at the end, used to generate the constraint_reloadable_to_match_p function.

Otherwise, I think you are headed in the right direction. I would worry a bit about whether we are making reload even more complicated for folks, but given that we already have the concept of address spaces, there should be some way to expose this info to reload.

Jim
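For illustration, a target description using Jim's proposed directive might look roughly like the following. This is a sketch of the proposed syntax, not an existing construct at the time of this thread; the 'Y'/far-memory condition is borrowed from DJ's rl78 patch, and rl78_far_p is the target function it defines.

```lisp
;; Hypothetical use of the proposed define_special_memory_constraint:
;; like define_memory_constraint, but reload may NOT assume that an
;; arbitrary MEM can be reloaded to satisfy it -- here, far-memory
;; references can never be made to match 'Y' by address reloading.
(define_special_memory_constraint "Y"
  "Any near (non-far) memory reference."
  (and (match_code "mem")
       (match_test "!rl78_far_p (op)")))
```

The point of the special form is exactly the find_reloads behavior patched by hand above: for an ordinary memory constraint, reload sets badop = 0 for any MEM on the theory that it can reload the address; a special memory constraint would suppress that assumption when the operand lives in the wrong address space.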