Re: [AArch64] A question about Cortex-A57 pipeline description

2015-09-14 Thread Nikolai Bozhenov
Thanks for the reply! I see your point. Indeed, I've also seen cases
where the load pipeline was overused at the beginning of a basic block,
whereas at the end the code got stuck with a bunch of stores and no
other instructions to run in parallel. And indeed, relaxing the
restrictions makes things even worse in some cases. Anyway, I don't
believe it's the best we can do; I'm going to have a closer look at the
scheduler and see what can be done to improve the situation.


Nikolai


On 09/11/2015 07:21 PM, James Greenhalgh wrote:

On Fri, Sep 11, 2015 at 04:31:37PM +0100, Nikolai Bozhenov wrote:

Hi!

Recently I got somewhat confused by Cortex-A57 pipeline description in
GCC and
I would be grateful if you could help me understand a few unclear points.

Sure,


Particularly I am interested in how memory operations (loads/stores) are
scheduled. It seems that, according to the cortex-a57.md file, first, two
memory operations may never be scheduled in the same cycle and, second,
two loads may never be scheduled in two consecutive cycles:

  ;; 5.  Two pipelines for load and store operations: LS1, LS2. The most
  ;; valuable thing we can do is force a structural hazard to split
  ;; up loads/stores.

  (define_cpu_unit "ca57_ls_issue" "cortex_a57")
  (define_cpu_unit "ca57_ldr, ca57_str" "cortex_a57")
  (define_reservation "ca57_load_model" "ca57_ls_issue,ca57_ldr*2")
  (define_reservation "ca57_store_model" "ca57_ls_issue,ca57_str")
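
For comparison, reservations without these restrictions, reusing the unit
names above, would let one load and one store issue every cycle. This is
an illustrative sketch only, not the contents of the actual file:

```
  ;; Hypothetical relaxed model (sketch): one load and one store may
  ;; issue every cycle, with no shared issue unit and no multi-cycle
  ;; reservation of the load pipe.
  (define_cpu_unit "ca57_ldr, ca57_str" "cortex_a57")
  (define_reservation "ca57_load_model" "ca57_ldr")
  (define_reservation "ca57_store_model" "ca57_str")
```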

However, the Cortex-A57 Software Optimization Guide states that the core is
able to execute one load operation and one store operation every cycle. And
that agrees with my experiments. Indeed, a loop consisting of 10 loads, 10
stores and several arithmetic operations takes on average about 10 cycles per
iteration, provided that the instructions are intermixed properly.

So, what is the purpose of the additional restrictions imposed on the
scheduler in the cortex-a57.md file? It doesn't look like an error; rather,
it looks like a deliberate decision.

When designing the model for the Cortex-A57 processor, I was primarily
trying to build a model which would increase the blend of utilized
pipelines on each cycle across a range of benchmarks, rather than to
accurately reflect the constraints listed in the Cortex-A57 Software
Optimisation Guide [1].

My reasoning here is that the Cortex-A57 is a high-performance processor,
and an accurate model would be infeasible to build. Because of this, it is
unlikely that the model in GCC will be representative of the true state of
the processor, and consequently GCC may make decisions which bias the
instruction stream towards one execution pipeline. In particular, given a
less restrictive model, GCC will try to hoist more loads towards the
beginning of the basic block, which can result in poorer utilization of
the other execution pipelines.

In my experiments, I found this model to be more beneficial across a range
of benchmarks than a model with those additional restrictions relaxed.
I'd be happy to consider counter-examples where this modeling produces
suboptimal results - and where the changes you suggest are sufficient to
resolve the issue.

Thanks,
James

---
[1]: Cortex-A57 Software Optimisation Guide
  
http://infocenter.arm.com/help/topic/com.arm.doc.uan0015a/cortex_a57_software_optimisation_guide_external.pdf





Re: Replacing malloc with alloca.

2015-09-14 Thread Richard Earnshaw
On 13/09/15 20:19, Florian Weimer wrote:
> * Jeff Law:
> 
>> On 09/13/2015 12:28 PM, Florian Weimer wrote:
>>> * Ajit Kumar Agarwal:
>>>
 The replacement of malloc with alloca can be done based on the following
 analysis.

 If the lifetime of an object does not stretch beyond the immediate
 scope, then the malloc can be replaced with alloca.  This
 increases the performance to a great extent.
>>>
>>> You also need to make sure that the object is small (less than a page)
>>> and that there is no deep recursion going on.  Otherwise, the program
>>> may no longer work after the transformation with real-world restricted
>>> stack sizes.  It may even end up with additional security issues.
> 
>> You also have to make sure you're not inside a loop.  Even a small
>> allocation inside a loop is problematical from a security standpoint.
>>
>> You also need to look at what other objects might be on the stack and
>> you have to look at the functional scope, not the immediate scope as
>> alloca space isn't returned until the end of a function.
> 
> Ah, right, alloca is unscoped (except when there are variable-length
> arrays).
> 
> Using a VLA might be the better approach (but the size concerns
> remain).  Introducing VLAs could alter program behavior in the
> presence of a pre-existing alloca call, leading to premature
> deallocation.
> 

You also have to consider that code generated for functions containing
alloca calls can also be less efficient than for functions that do not
call it (cannot eliminate frame pointers, for example).  So I'm not
convinced this would necessarily be a performance win either.
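
The scoping difference Florian mentions above can be illustrated with a
short C sketch (function name illustrative): a VLA's storage is reclaimed
when its enclosing block exits, whereas alloca's storage persists until
the containing function returns.

```c
/* Illustrative sketch: tmp is a VLA whose storage is released when the
   inner block exits; an alloca in the same position would keep its
   storage live until sum_squares returns.  */
long
sum_squares (const int *v, int n)
{
  long total = 0;
  {
    int tmp[n];                       /* VLA, scoped to this block */
    for (int i = 0; i < n; i++)
      tmp[i] = v[i] * v[i];
    for (int i = 0; i < n; i++)
      total += tmp[i];
  }                                   /* tmp's storage reclaimed here */
  return total;
}
```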

R.


Re: Replacing malloc with alloca.

2015-09-14 Thread Marc Glisse

On Sun, 13 Sep 2015, Ajit Kumar Agarwal wrote:


The replacement of malloc with alloca can be done based on the following
analysis.

If the lifetime of an object does not stretch beyond the immediate scope,
then the malloc can be replaced with alloca.  This increases the
performance to a great extent.
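
As a concrete sketch of the transformation being proposed (names are
illustrative, and this is not something GCC currently does), a heap buffer
whose lifetime ends within the function could become a stack allocation:

```c
#include <stdlib.h>
#include <alloca.h>

/* Before: tmp never escapes and is freed before return, so it is a
   candidate for the proposed transformation.  */
long
sum_doubled_malloc (const int *src, int n)
{
  int *tmp = malloc (n * sizeof *tmp);
  long total = 0;
  if (!tmp)
    return 0;
  for (int i = 0; i < n; i++)
    tmp[i] = src[i] * 2;
  for (int i = 0; i < n; i++)
    total += tmp[i];
  free (tmp);
  return total;
}

/* After: the same buffer on the stack; it is reclaimed automatically
   when the function returns, so the free disappears.  */
long
sum_doubled_alloca (const int *src, int n)
{
  int *tmp = alloca (n * sizeof *tmp);
  long total = 0;
  for (int i = 0; i < n; i++)
    tmp[i] = src[i] * 2;
  for (int i = 0; i < n; i++)
    total += tmp[i];
  return total;
}
```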

Inlining helps to a great extent here: after inlining, the lifetime of an
object often does not stretch beyond the immediate scope, and the
opportunities for replacing malloc with alloca can be identified.

I am wondering in which phase of our optimization pipeline malloc is
replaced with alloca, and what analysis is done to transform the malloc
into alloca, as this greatly increases the performance of the benchmarks.
Is the analysis done through escape analysis?

If yes, then what data structure is used for the abstract
interpretation?


Did you try it? I don't think gcc ever replaces malloc with alloca. The 
only optimization we do with malloc/free is removing it when it is 
obviously unused. There are several PRs open about possible optimizations 
(19831 for instance).


I posted a WIP patch a couple years ago to replace some malloc+free with 
local arrays (fixed length) but never had time to finish it.

https://gcc.gnu.org/ml/gcc-patches/2013-11/msg03108.html

--
Marc Glisse


Re: Replacing malloc with alloca.

2015-09-14 Thread Jeff Law

On 09/14/2015 02:14 AM, Richard Earnshaw wrote:

On 13/09/15 20:19, Florian Weimer wrote:

* Jeff Law:


On 09/13/2015 12:28 PM, Florian Weimer wrote:

* Ajit Kumar Agarwal:


The replacement of malloc with alloca can be done based on the following
analysis.

If the lifetime of an object does not stretch beyond the immediate
scope, then the malloc can be replaced with alloca.  This
increases the performance to a great extent.


You also need to make sure that the object is small (less than a page)
and that there is no deep recursion going on.  Otherwise, the program
may no longer work after the transformation with real-world restricted
stack sizes.  It may even end up with additional security issues.



You also have to make sure you're not inside a loop.  Even a small
allocation inside a loop is problematical from a security standpoint.

You also need to look at what other objects might be on the stack and
you have to look at the functional scope, not the immediate scope as
alloca space isn't returned until the end of a function.


Ah, right, alloca is unscoped (except when there are variable-length
arrays).

Using a VLA might be the better approach (but the size concerns
remain).  Introducing VLAs could alter program behavior in the presence
of a pre-existing alloca call, leading to premature deallocation.



You also have to consider that code generated for functions containing
alloca calls can also be less efficient than for functions that do not
call it (cannot eliminate frame pointers, for example).  So I'm not
convinced this would necessarily be a performance win either.
Yes, but I suspect that eliminating a single malloc/free pair dwarfs the 
cost of needing a frame pointer.  The problem is proving when it's safe 
to turn a malloc/free into an alloca.  As folks have shown, it's 
non-trivial when the security aspects are considered.
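
One common manual hedge against these risks is the idiom below: take the
stack path only for small sizes and fall back to the heap otherwise. This
is a hand-written pattern, not a GCC transformation, and the cutoff value
is an assumption for illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <alloca.h>

#define STACK_ALLOC_LIMIT 1024  /* assumed "small" cutoff, well below a page */

/* Sum the bytes of DATA using a scratch copy; small buffers go on the
   stack, large ones fall back to malloc.  */
long
byte_sum (const unsigned char *data, size_t n)
{
  unsigned char *heap = NULL;
  unsigned char *buf;
  long sum = 0;

  if (n <= STACK_ALLOC_LIMIT)
    buf = alloca (n);               /* stack: reclaimed at function return */
  else
    buf = heap = malloc (n);        /* heap: must be freed explicitly */
  if (!buf)
    return -1;

  memcpy (buf, data, n);
  for (size_t i = 0; i < n; i++)
    sum += buf[i];

  free (heap);                      /* no-op when the stack path was taken */
  return sum;
}
```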


I've speculated that from a security standpoint projects ought to 
just ban alloca, particularly glibc.  It's been shown over and over 
again that folks just don't get it right and it's ripe for exploitation. 
It'd be a whole lot easier to convince folks to go this direction if 
GCC was good about that kind of optimization.


Jeff


Re: reload question about unmet constraints

2015-09-14 Thread DJ Delorie

> You would need some way to indicate that while Y does accept a mem,
> this particular mem can't be reloaded to match.  We don't have a way
> to do that.

As a test, I added this API.  It seems to work.  I suppose there could
be a better API where we determine if a constraint matches various
memory spaces, then compare with the memory space of the operand, but
I can't prove that's sufficiently flexible for all targets that
support memory spaces.  Heck, I'm not even sure what to call the
macro, and 
"TARGET_IS_THIS_MEMORY_ADDRESS_RELOADABLE_TO_MATCH_THIS_CONTRAINT_P()"
is a little long ;-)

What do we think of this direction?


Index: reload.c
===
RCS file: /cvs/cvsfiles/gnupro/gcc/reload.c,v
retrieving revision 1.33
diff -p -U 5 -r1.33 reload.c
--- reload.c20 Feb 2014 16:40:26 -  1.33
+++ reload.c15 Sep 2015 05:38:24 -
@@ -3517,20 +3517,26 @@ find_reloads (rtx insn, int replace, int
                     && ((reg_equiv_mem (REGNO (operand)) != 0
                          && EXTRA_CONSTRAINT_STR (reg_equiv_mem (REGNO (operand)), c, p))
                         || (reg_equiv_address (REGNO (operand)) != 0)))
                  win = 1;
 
+#ifndef COMPATIBLE_CONSTRAINT_P
+#define COMPATIBLE_CONSTRAINT_P(c,p,op) 1
+#endif
+               if (!MEM_P (operand) || COMPATIBLE_CONSTRAINT_P (c, p, operand))
+                 {
                /* If we didn't already win, we can reload
                   constants via force_const_mem, and other
                   MEMs by reloading the address like for 'o'.  */
                if (CONST_POOL_OK_P (operand_mode[i], operand)
                    || MEM_P (operand))
                  badop = 0;
                constmemok = 1;
                offmemok = 1;
                break;
              }
+         }
if (EXTRA_ADDRESS_CONSTRAINT (c, p))
  {
if (EXTRA_CONSTRAINT_STR (operand, c, p))
  win = 1;
 

Index: config/rl78/rl78.c
===
RCS file: /cvs/cvsfiles/gnupro/gcc/config/rl78/rl78.c,v
retrieving revision 1.12.6.16
diff -p -U 5 -r1.12.6.16 rl78.c
--- config/rl78/rl78.c  5 Aug 2015 13:43:59 -   1.12.6.16
+++ config/rl78/rl78.c  15 Sep 2015 05:39:04 -
@@ -1041,10 +1041,18 @@ rl78_far_p (rtx x)
 return 0;
 
   return GET_MODE_BITSIZE (rl78_addr_space_address_mode (MEM_ADDR_SPACE (x))) == 32;
 }
 
+int
+rl78_compatible_constraint_p (char c, const char *p, rtx r)
+{
+  if (c == 'Y' && rl78_far_p (r))
+    return 0;
+  return 1;
+}
+
 /* Return the appropriate mode for a named address pointer.  */
 #undef  TARGET_ADDR_SPACE_POINTER_MODE
 #define TARGET_ADDR_SPACE_POINTER_MODE rl78_addr_space_pointer_mode
 
 static enum machine_mode

Index: config/rl78/rl78.h
===
RCS file: /cvs/cvsfiles/gnupro/gcc/config/rl78/rl78.h,v
retrieving revision 1.7.8.3
diff -p -U 5 -r1.7.8.3 rl78.h
--- config/rl78/rl78.h  17 Mar 2015 14:54:35 -  1.7.8.3
+++ config/rl78/rl78.h  15 Sep 2015 05:39:28 -
@@ -500,5 +500,7 @@ typedef unsigned int CUMULATIVE_ARGS;
 
 /* NOTE: defined but zero means dwarf2 debugging, but sjlj EH.  */
 #define DWARF2_UNWIND_INFO 0
 
 #define REGISTER_TARGET_PRAGMAS() rl78_register_pragmas()
+
+#define COMPATIBLE_CONSTRAINT_P(C,P,OP) rl78_compatible_constraint_p (C,P,OP)


Re: reload question about unmet constraints

2015-09-14 Thread Jim Wilson
On Mon, Sep 14, 2015 at 11:05 PM, DJ Delorie  wrote:
> As a test, I added this API.  It seems to work.  I suppose there could
> be a better API where we determine if a constraint matches various
> memory spaces, then compare with the memory space of the operand, but
> I can't prove that's sufficiently flexible for all targets that
> support memory spaces.  Heck, I'm not even sure what to call the
> macro, and 
> "TARGET_IS_THIS_MEMORY_ADDRESS_RELOADABLE_TO_MATCH_THIS_CONTRAINT_P()"
> is a little long ;-)
>
> What do we think of this direction?

We already have define_constraint and define_memory_constraint.  We
could perhaps add a define_special_memory_constraint that returns
CT_SPECIAL_MEMORY which mostly operates like CT_MEMORY, except that it
doesn't assume any MEM can be reloaded to match.

We already have constraint_satisfied_p, which is generated from
define*_constraint.  We could have a constraint_reloadable_to_match_p
function parallel to that, which is for operands that don't match, but
can be reloaded to match.  Perhaps we don't even need a distinction
between define_memory_constraint and define_special_memory_constraint.
We could have constraint_reloadable_to_match_p default to the current
code for memory constraints, which assumes any mem is reloadable to
match, if a special reloadable condition isn't specified.

Perhaps define_memory_constraint can be extended with an optional
field at the end, that is used to generate the
constraint_reloadable_to_match_p function.
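
A sketch of what that extended form might look like, using the rl78 'Y'
constraint from earlier in the thread. The final field and its semantics
are illustrative only; this is not existing GCC syntax:

```
;; Hypothetical extension (sketch): a trailing optional expression
;; stating when a non-matching MEM can be reloaded to match.
(define_memory_constraint "Y"
  "Near (non-far) memory operand."
  (and (match_code "mem")
       (match_test "!rl78_far_p (op)"))
  ;; reloadable-to-match condition (new, illustrative field):
  (match_test "!rl78_far_p (op)"))
```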

Otherwise, I think you are headed in the right direction.  I would
worry a bit about whether we are making reload even more complicated
for folks.  But given that we already have the concept of address
spaces, there should be some way to expose this info to reload.

Jim