Re: [RFD] Simplifying subregs in LRA

2017-02-03 Thread Eric Botcazou
> (r243782 git:856bd6f)
> This is another case of multiple changes where some were not critical and
> overall there is a dangerous one here I believe.  The primary aim of this
> change is to reload the address before reloading the inner subreg.  This
> appears to be a dormant bug since day1 as the original logic would have
> failed when reloading an inner mem if its address was not already valid.
> 
> The potentially fatal part of this change is the introduction of a
> "return false" in the MEM subreg simplification which is placed immediately
> after restoring the original subreg expression.  I believe that if control
> ever actually reaches this statement then LRA would infinite loop as the
> MEM subreg would never be simplified. 

How can a change that is a no-op be fatal exactly?

> So what are the next steps!
> 
> 1) [BUG] Add an exclusion for WORD_REGISTER_OPERATIONS because MIPS is
> currently broken by the existing code. PR78660

That seems the way to go, with the appropriate check on the mode sizes.

> 2) [BUG] Remove the return false introduced in (r243782 git:856bd6f).

!???

> 3) [CLEANUP] Remove reg_mode argument and replace all uses of reg_mode with
> innermode.  Rename 'reg' to 'inner' and 'operand' to 'outer' and 'mode'
> to 'outermode'.
> 4) [OPTIMISATION] Change double-reload logic so that it just deals with the
> special outermode reload without adjusting the subreg.
> 5) [??] Determine if big endian still needs a special case like in reload?
> Comments anyone?

I agree that a cleanup of the code would probably be in order, with an eye on 
the reload code as a model, but that's probably not appropriate for GCC 7.

> In an attempt to make a minimal change I propose the following as it allows
> WORD_REGISTER_OPERATIONS targets to benefit from the invalid address
> reloading fix. I think the check would be more appropriately placed on the
> outer-most if (MEM_P (reg)) but this would affect the handling of many more
> subregs which seems too dangerous at this point in release.
> 
> diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
> index 393cc69..771475a 100644
> --- a/gcc/lra-constraints.c
> +++ b/gcc/lra-constraints.c
> @@ -1512,10 +1512,11 @@ simplify_operand_subreg (int nop, machine_mode reg_mode)
>  equivalences in function lra_constraints) and because for spilled
>  pseudos we allocate stack memory enough for the biggest
>  corresponding paradoxical subreg.  */
> -   if (!(MEM_ALIGN (subst) < GET_MODE_ALIGNMENT (mode)
> - && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (subst)))
> -   || (MEM_ALIGN (reg) < GET_MODE_ALIGNMENT (innermode)
> -   && SLOW_UNALIGNED_ACCESS (innermode, MEM_ALIGN (reg))))
> +   if (!WORD_REGISTER_OPERATIONS
> +   && (!(MEM_ALIGN (subst) < GET_MODE_ALIGNMENT (mode)
> + && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (subst)))
> +   || (MEM_ALIGN (reg) < GET_MODE_ALIGNMENT (innermode)
> +   && SLOW_UNALIGNED_ACCESS (innermode, MEM_ALIGN (reg)))))
>   return true;
> 
> *curr_id->operand_loc[nop] = operand;
> 
> 
> The change will affect at least arc,mips,rx,sh,sparc though I haven't
> checked which of these default on for LRA just that they can turn on LRA.

Only MIPS and SPARC (see https://gcc.gnu.org/backends.html).

-- 
Eric Botcazou


Intel Phi co-processor support

2017-02-03 Thread Angel Dimitrov

 Hello,

 Can I compile code with gfortran on Linux and run it on a Phi 
co-processor? Or is it better to use the Intel Fortran compiler?


 Angel



Re: Intel Phi co-processor support

2017-02-03 Thread Jakub Jelinek
On Fri, Feb 03, 2017 at 02:50:37PM +0200, Angel Dimitrov wrote:
>  Can I compile code with gfortran on Linux and run it on a Phi
> co-processor? Or is it better to use the Intel Fortran compiler?

Depends on which Xeon Phi you have.  GCC doesn't support Knights Ferry
or Knights Corner, but does support Knights Landing.
That said, for KNL I've so far only seen standalone KNL processors, for
which I'm not sure if offloading is possible or desirable; IMHO if
KNL is the main processor in the computer, then everything is host
for you and thus just using non-target OpenMP code should be sufficient,
so the KNL offloading should be (mainly or solely) for the case when
KNL is a coprocessor.  Does such a thing really exist, or is it planned?
Can somebody from Intel please clarify?

Jakub


RE: [RFD] Simplifying subregs in LRA

2017-02-03 Thread Matthew Fortune
Eric Botcazou writes:
> > (r243782 git:856bd6f)
> > This is another case of multiple changes where some were not critical
> > and overall there is a dangerous one here I believe.  The primary aim
> > of this change is to reload the address before reloading the inner
> > subreg.  This appears to be a dormant bug since day1 as the original
> > logic would have failed when reloading an inner mem if its address was
> > not already valid.
> >
> > The potentially fatal part of this change is the introduction of a
> > "return false" in the MEM subreg simplification which is placed
> > immediately after restoring the original subreg expression.  I believe
> > that if control ever actually reaches this statement then LRA would
> > infinite loop as the MEM subreg would never be simplified.
> 
> How can a change that is a no-op be fatal exactly?

It's not a no-op. Any MEM_P not handled by the first "if (MEM_P(reg))"
will have previously been handled by the block guarded by the following
later in the function; note the "|| MEM_P (reg)":

  /* Force a reload of the SUBREG_REG if this is a constant or PLUS or
 if there may be a problem accessing OPERAND in the outer
 mode.  */
  if ((REG_P (reg)
   && REGNO (reg) >= FIRST_PSEUDO_REGISTER
   && (hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0
   /* Don't reload paradoxical subregs because we could be looping
  having repeatedly final regno out of hard regs range.  */
   && (hard_regno_nregs[hard_regno][innermode]
   >= hard_regno_nregs[hard_regno][mode])
   && simplify_subreg_regno (hard_regno, innermode,
 SUBREG_BYTE (operand), mode) < 0
   /* Don't reload subreg for matching reload.  It is actually
  valid subreg in LRA.  */
   && ! LRA_SUBREG_P (operand))
  || CONSTANT_P (reg) || GET_CODE (reg) == PLUS || MEM_P (reg))
{
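
As a rough illustration of the control-flow point, here is a toy C model.
It is not GCC code: the enums and model_simplify are made up for the
example, and the REG case is reduced to "left unchanged".  It only shows
that an early return in the MEM block bypasses the later block whose
condition lists MEM_P (reg).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical and do not correspond
   to real GCC data structures.  */
enum inner_kind { INNER_REG, INNER_MEM, INNER_CONST };
enum outcome { OUT_UNCHANGED, OUT_SIMPLIFIED, OUT_RELOADED };

/* mem_simplifiable: whether the first MEM block manages to turn the
   subreg into an outer-mode MEM.
   early_return: models the "return false" added to that block.  */
enum outcome
model_simplify (enum inner_kind inner, bool mem_simplifiable,
                bool early_return)
{
  if (inner == INNER_MEM)
    {
      if (mem_simplifiable)
        return OUT_SIMPLIFIED;
      /* The original subreg expression is restored at this point.  */
      if (early_return)
        return OUT_UNCHANGED;
    }

  /* Later block: "... || CONSTANT_P (reg) || ... || MEM_P (reg)".  */
  if (inner == INNER_CONST || inner == INNER_MEM)
    return OUT_RELOADED;

  return OUT_UNCHANGED;
}
```

In this model, a MEM that the first block cannot simplify is reloaded
when early_return is false and is left untouched when it is true.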

> > So what are the next steps!
> >
> > 1) [BUG] Add an exclusion for WORD_REGISTER_OPERATIONS because MIPS is
> > currently broken by the existing code. PR78660
> 
> That seems the way to go, with the appropriate check on the mode sizes.

I'm not sure what check to do on mode sizes. Do you think an innermode
reload is only required when both modes have the same number of words?

> > 2) [BUG] Remove the return false introduced in (r243782 git:856bd6f).
> 
> !???

If a MEM subreg is neither simplified to an outermode MEM nor reloaded
in innermode then I believe LRA will never resolve the subreg.  Even if that
is not true I'm fairly certain the addition of the code has changed
behaviour and that the change is not well understood, as explained above.

> > 3) [CLEANUP] Remove reg_mode argument and replace all uses of reg_mode
> > with innermode.  Rename 'reg' to 'inner' and 'operand' to 'outer' and
> > 'mode' to 'outermode'.
> > 4) [OPTIMISATION] Change double-reload logic so that it just deals
> > with the special outermode reload without adjusting the subreg.
> > 5) [??] Determine if big endian still needs a special case like in
> > reload?
> > Comments anyone?
> 
> I agree that a cleanup of the code would probably be in order, with an
> eye on the reload code as a model, but that's probably not appropriate
> for GCC 7.

Indeed, definitely want to wait for GCC 8.

> > In an attempt to make a minimal change I propose the following as it
> > allows WORD_REGISTER_OPERATIONS targets to benefit from the invalid
> > address reloading fix. I think the check would be more appropriately
> > placed on the outer-most if (MEM_P (reg)) but this would affect the
> > handling of many more subregs which seems too dangerous at this point
> > in release.
> >
> > diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
> > index 393cc69..771475a 100644
> > --- a/gcc/lra-constraints.c
> > +++ b/gcc/lra-constraints.c
> > @@ -1512,10 +1512,11 @@ simplify_operand_subreg (int nop, machine_mode reg_mode)
> >  equivalences in function lra_constraints) and because for
> >  spilled pseudos we allocate stack memory enough for the biggest
> >  corresponding paradoxical subreg.  */
> > - if (!(MEM_ALIGN (subst) < GET_MODE_ALIGNMENT (mode)
> > -   && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (subst)))
> > - || (MEM_ALIGN (reg) < GET_MODE_ALIGNMENT (innermode)
> > - && SLOW_UNALIGNED_ACCESS (innermode, MEM_ALIGN (reg))))
> > + if (!WORD_REGISTER_OPERATIONS
> > + && (!(MEM_ALIGN (subst) < GET_MODE_ALIGNMENT (mode)
> > +   && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (subst)))
> > + || (MEM_ALIGN (reg) < GET_MODE_ALIGNMENT (innermode)
> > + && SLOW_UNALIGNED_ACCESS (innermode, MEM_ALIGN (reg)))))
> > return true;
> >
> >   *curr_id->operand_loc[nop] = operand;
> >
> >
> > The change will affect at least arc,mips,rx,sh,sparc though I haven't
> > checked which of these default on for LRA just that they can turn on
> > LRA.
> 
> Only MIPS and SPARC (see https://gcc.gnu.org/bac

Re: Intel Phi co-processor support

2017-02-03 Thread Ilya Verbin
2017-02-03 16:00 GMT+03:00 Jakub Jelinek:
>
> On Fri, Feb 03, 2017 at 02:50:37PM +0200, Angel Dimitrov wrote:
> >  Can I compile code with gfortran on Linux and run it on a Phi
> > co-processor? Or is it better to use the Intel Fortran compiler?
>
> Depends on which Xeon Phi you have.  GCC doesn't support Knights Ferry
> or Knights Corner, but does support Knights Landing.
> That said, for KNL I've so far only seen standalone KNL processors, for
> which I'm not sure if offloading is possible or desirable; IMHO if

It is possible using so called "offload over fabric".
Here is a how-to [1], which can be adapted just by replacing "icc
-qopenmp" with "gcc -fopenmp", I guess.

[1] 
https://software.intel.com/en-us/articles/how-to-use-offload-over-fabric-with-knights-landing-intel-xeon-phi-processor
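
For reference, a minimal sketch of what an OpenMP target region looks
like, written in C rather than Fortran purely for illustration.  The
function name offload_sum is made up for the example.  Whether the
region really runs on a KNL depends on how GCC and the system are set
up, which is not covered here; when the pragma is not acted on (for
example, building without -fopenmp), the loop simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n doubles inside an OpenMP target region.  Built without
   -fopenmp, the pragma is ignored and this is a plain serial loop.  */
double
offload_sum (const double *x, size_t n)
{
  double sum = 0.0;

#pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: sum) reduction(+: sum)
  for (size_t i = 0; i < n; i++)
    sum += x[i];

  return sum;
}
```

The map clauses say which data is copied to the device and back; the
reduction clause combines the per-thread partial sums.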

> KNL is the main processor in the computer, then everything is host
> for you and thus just using non-target OpenMP code should be sufficient,
> so the KNL offloading should be (mainly or solely) for the case when
> KNL is a coprocessor.  Does such a thing really exist, or is it planned?
> Can somebody from Intel please clarify?
>
> Jakub

  -- Ilya


Re: [RFD] Simplifying subregs in LRA

2017-02-03 Thread Eric Botcazou
> If a MEM subreg is neither simplified to an outermode MEM nor reloaded
> in innermode then I believe LRA will never resolve the subreg.  Even if that
> is not true I'm fairly certain the addition of the code has changed
> behaviour and that the change is not well understood, as explained above.

Fair enough, let's remove the "return false" then.

> I hadn't spotted that table, very helpful, thanks. The other architectures
> I listed may be helped in their transition to LRA.  I guess they are
> attempting to move given optional LRA support.

You can send me the patch(es) in advance, I'll give it a try on SPARC.

-- 
Eric Botcazou


Re: [RFD] Simplifying subregs in LRA

2017-02-03 Thread Vladimir Makarov



On 02/01/2017 06:52 PM, Matthew Fortune wrote:

> Hi all,
>
> I've copied you as you have each made some significant change to a function
> in LRA which I guess makes you de-facto experts.
>
> I've spent a while researching the history of simplify_operand_subreg and
> in particular the behaviour for subregs of memory.  For my sake if no-one
> else here is a rundown of its evolution; corrections welcome.

Thanks for doing the research, Matt.

minated on the next iteration

> (r198344 git:ea99c7a)
> A special case for an LRA introduced subreg was added (LRA_SUBREG_P) that
> should always be considered valid.  This I believe is to cope with cases
> where there are operands required to match but with different modes and,
> presumably, one of the modes is not actually allowed.  Not 100% sure what
> this is though!

As I remember, x86 has such tricky x87 fp stack insn definitions with 
matching operands of different modes.  As reload does all RTL changes at 
once at the end of its work, it was not a problem.  LRA transforms RTL 
permanently during its work, and RTL that is illegal at some point might 
be a problem.

...

> So what are the next steps!
>
> 1) [BUG] Add an exclusion for WORD_REGISTER_OPERATIONS because MIPS is currently
> broken by the existing code. PR78660
> 2) [BUG] Remove the return false introduced in (r243782 git:856bd6f).
> 3) [CLEANUP] Remove reg_mode argument and replace all uses of reg_mode with
> innermode.  Rename 'reg' to 'inner' and 'operand' to 'outer' and 'mode' to
> 'outermode'.
> 4) [OPTIMISATION] Change double-reload logic so that it just deals with the
> special outermode reload without adjusting the subreg.
> 5) [??] Determine if big endian still needs a special case like in reload?
> Comments anyone?

Like Eric, I prefer changes which affect a minimum of targets and cases 
but still fix the bug.  Almost any change in this part of LRA has required 
some stabilization changes.  It is dangerous to do a big cleanup at this 
development stage of GCC.


A bigger cleanup could be done in stage 1 after the GCC 7 release.  I hope 
you and Eric will do this.  Reload is a good reference point.  Historically, 
LRA was originally written for a few targets without taking into account 
the corner cases of all other targets.  Trying to implement all reload 
cases without a good understanding of them or their necessity would have 
made LRA as a project impossible.

> In an attempt to make a minimal change I propose the following as it allows
> WORD_REGISTER_OPERATIONS targets to benefit from the invalid address reloading
> fix. I think the check would be more appropriately placed on the outer-most
> if (MEM_P (reg)) but this would affect the handling of many more subregs which
> seems too dangerous at this point in release.
>
> diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
> index 393cc69..771475a 100644
> --- a/gcc/lra-constraints.c
> +++ b/gcc/lra-constraints.c
> @@ -1512,10 +1512,11 @@ simplify_operand_subreg (int nop, machine_mode reg_mode)
>  equivalences in function lra_constraints) and because for spilled
>  pseudos we allocate stack memory enough for the biggest
>  corresponding paradoxical subreg.  */
> - if (!(MEM_ALIGN (subst) < GET_MODE_ALIGNMENT (mode)
> -   && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (subst)))
> - || (MEM_ALIGN (reg) < GET_MODE_ALIGNMENT (innermode)
> - && SLOW_UNALIGNED_ACCESS (innermode, MEM_ALIGN (reg))))
> + if (!WORD_REGISTER_OPERATIONS
> + && (!(MEM_ALIGN (subst) < GET_MODE_ALIGNMENT (mode)
> +   && SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (subst)))
> + || (MEM_ALIGN (reg) < GET_MODE_ALIGNMENT (innermode)
> + && SLOW_UNALIGNED_ACCESS (innermode, MEM_ALIGN (reg)))))
> return true;
>
>   *curr_id->operand_loc[nop] = operand;
>
>
> The change will affect at least arc,mips,rx,sh,sparc though I haven't checked
> which of these default on for LRA just that they can turn on LRA.
>
> I'll post this as a patch with appropriate updates to comments unless anyone
> raises some issues.


OK.  Thanks.
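
For readers following the patch, the guard it proposes can be modelled
with plain values.  This is a sketch only: struct align_info and both
functions are hypothetical stand-ins for MEM_ALIGN, GET_MODE_ALIGNMENT
and SLOW_UNALIGNED_ACCESS, and "accepting" here just means taking the
"return true" path shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one (MEM, mode) pair.  */
struct align_info
{
  unsigned mem_align;   /* models MEM_ALIGN, in bits */
  unsigned mode_align;  /* models GET_MODE_ALIGNMENT, in bits */
  bool slow_unaligned;  /* models SLOW_UNALIGNED_ACCESS for the pair */
};

static bool
misaligned_and_slow (struct align_info a)
{
  return a.mem_align < a.mode_align && a.slow_unaligned;
}

/* True when the patched condition takes the "return true" path:
   never on WORD_REGISTER_OPERATIONS targets; otherwise when the
   outer-mode access is not a slow misaligned one, or the inner-mode
   access is a slow misaligned one as well.  */
bool
accept_outer_mode_mem (bool word_register_operations,
                       struct align_info subst_outer,
                       struct align_info reg_inner)
{
  return !word_register_operations
         && (!misaligned_and_slow (subst_outer)
             || misaligned_and_slow (reg_inner));
}
```

With word_register_operations set the function always returns false,
which is the whole effect of the added !WORD_REGISTER_OPERATIONS test.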



[RFC] [i386] Test program for ms_abi to sysv_abi function calls

2017-02-03 Thread Daniel Santos
This is a test program designed to test 64-bit Microsoft ABI functions 
that call System V functions in a multitude of permutations to attempt 
to discover flaws in the generation of prologues and epilogues and the 
optimizations and features that can affect them, specifically 
shrink-wrapping, sibling calls and DRAP.  This is an accompaniment to 
the below patch sets (and was instrumental in finding and fixing flaws!), 
but is not an ancestor.


Use aligned SSE movs for re-aligned MS ABI pro/epilogues - 
https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01859.html
Use out-of-line stubs for ms_abi pro/epilogues - (v3 to be posted 
shortly!  Old v2: https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02293.html)



Summary
=======

A C++ generator program (gen.cc) is built and run to generate an 
unguarded C header file, which is included by msabi.c.  This header 
defines a number of ms_abi functions that call sysv_abi functions under 
various conditions.  The generator has a CLI for test selection, and the 
currently used arguments (in msabi.exp) generate about 19k tests.  The 
results of msabi.c and do_test.S are linked to create the test program.  
Building msabi.c takes 200-ish seconds on an old Phenom, but 
execution is nearly instantaneous.


Each test function is called via an assembly stub (do_test_aligned or 
do_test_unaligned) that:


1. Saves non-volatile registers to a global buffer,
2. Fills non-volatile registers with random data,
3. Calls the test function,
4. Saves resulting non-volatile registers for later comparison, and
5. Restores non-volatile registers to original values.

Upon completion, the resulting register values are compared to the 
original random values to verify correctness.  The return value is also 
verified against what is expected to validate the correctness of both 
the generated code and the test itself.
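
The five steps and the final comparison can be sketched in plain C by
modelling the non-volatile registers as an array.  This is only a model
of the idea, not the assembly stub: NREGS, test_fn, the two example
callees and run_one_test are all made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NREGS 8   /* hypothetical size of the modelled register file */

typedef long (*test_fn) (long *regs);

/* A callee that restores the register it uses.  */
long
well_behaved (long *regs)
{
  long saved = regs[3];
  regs[3] = 42;
  regs[3] = saved;
  return 7;
}

/* A callee that fails to restore one register.  */
long
clobbering (long *regs)
{
  regs[3] ^= 1;
  return 7;
}

/* Returns true if FN preserved every modelled register and returned
   RET_EXPECTED.  */
bool
run_one_test (test_fn fn, long ret_expected, unsigned seed)
{
  long regs[NREGS] = { 0 };   /* the "live" register file */
  long original[NREGS], random_in[NREGS], result[NREGS];

  memcpy (original, regs, sizeof regs);        /* 1. save registers */
  srand (seed);
  for (int i = 0; i < NREGS; i++)              /* 2. fill with random data */
    regs[i] = random_in[i] = rand ();
  long ret = fn (regs);                        /* 3. call the test function */
  memcpy (result, regs, sizeof regs);          /* 4. save resulting values */
  memcpy (regs, original, sizeof regs);        /* 5. restore originals */

  return ret == ret_expected
         && memcmp (result, random_in, sizeof result) == 0;
}
```

A callee that leaves a register changed, or returns an unexpected
value, makes run_one_test return false.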



What Is Tested
==============

The test permutations consist of:

A. A number of extra long parameters for the function (0-5 being used now).
B. A mask of additional non-volatile registers to explicitly clobber 
(aside from RDI, RSI and XMM6-15, which are always clobbered):

  enum optional_regs
  {
    OPTIONAL_REG_RBX = 0x01,
    OPTIONAL_REG_RBP = 0x02,
    OPTIONAL_REG_R12 = 0x04,
    OPTIONAL_REG_R13 = 0x08,
    OPTIONAL_REG_R14 = 0x10,
    OPTIONAL_REG_R15 = 0x20,

    OPTIONAL_REG_ALL = 0x3f,
    OPTIONAL_REG_HFP_ALL = OPTIONAL_REG_ALL & (~OPTIONAL_REG_RBP)
  };

C. A collection of variants:

  enum fn_variants
  {
    FN_VAR_MSABI          = 0x01, /* This value is an implementation detail
                                     and NOT a test permutation.  */
    FN_VAR_HFP            = 0x02,
    FN_VAR_REALIGN        = 0x04,
    FN_VAR_ALLOCA         = 0x08,
    FN_VAR_VARARGS        = 0x10,
    FN_VAR_SIBCALL        = 0x20,
    FN_VAR_SHRINK_WRAP    = 0x40,

    FN_VAR_HFP_OR_REALIGN = FN_VAR_HFP | FN_VAR_REALIGN,
    FN_VAR_MASK           = 0x7f,
    FN_VAR_COUNT          = 7
  };

The variants deserve a little more explanation.

* FN_VAR_MSABI   (implementation detail) Adds __attribute__((ms_abi)).
* FN_VAR_HFP Adds 
__attribute__((optimize("no-omit-frame-pointer"))). FN_VAR_HFP and 
FN_VAR_REALIGN are mutually exclusive.
* FN_VAR_REALIGN Adds __attribute__((__force_align_arg_pointer__)).  The 
test will call this function twice -- with an aligned and misaligned stack.
* FN_VAR_ALLOCA  The ms_abi function calls alloca and passes the 
pointer to the sysv_abi function.
* FN_VAR_VARARGS The ms_abi function takes varargs, but only passes 
the argptr to the sysv_abi function.
* FN_VAR_SIBCALL The ms_abi function returns in a way that enables 
the sibling call optimization (skipped if FN_VAR_REALIGN | FN_VAR_HFP 
are enabled).
* FN_VAR_SHRINK_WRAP Tests a global variable and uses a branch that 
enables the use of shrink wrapping.  Both the fast and slow paths are tested.
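
The exclusions above can be written as a small predicate over the
variant mask.  This is a sketch, not code from gen.cc: the helper name
variant_is_generated is made up, and it assumes "skipped if
FN_VAR_REALIGN | FN_VAR_HFP are enabled" means that either bit being
set is enough to skip the sibcall variant.

```c
#include <assert.h>
#include <stdbool.h>

enum fn_variants
{
  FN_VAR_MSABI          = 0x01,
  FN_VAR_HFP            = 0x02,
  FN_VAR_REALIGN        = 0x04,
  FN_VAR_ALLOCA         = 0x08,
  FN_VAR_VARARGS        = 0x10,
  FN_VAR_SIBCALL        = 0x20,
  FN_VAR_SHRINK_WRAP    = 0x40,

  FN_VAR_HFP_OR_REALIGN = FN_VAR_HFP | FN_VAR_REALIGN,
  FN_VAR_MASK           = 0x7f
};

/* Hypothetical helper: true if the variant combination V is one the
   description above allows.  */
bool
variant_is_generated (unsigned v)
{
  /* Reject bits outside the defined variants.  */
  if (v & ~(unsigned) FN_VAR_MASK)
    return false;

  /* FN_VAR_HFP and FN_VAR_REALIGN are mutually exclusive.  */
  if ((v & FN_VAR_HFP_OR_REALIGN) == FN_VAR_HFP_OR_REALIGN)
    return false;

  /* Sibling calls are skipped when either of those is enabled
     (assumed reading of the text).  */
  if ((v & FN_VAR_SIBCALL) && (v & FN_VAR_HFP_OR_REALIGN))
    return false;

  return true;
}
```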


The following nomenclature is used for function names:
(msabi|sysv)__[r|f][a][v][s][w]
     |       |  |   |  |  |  | |
     |       |  |   |  |  |  | Number of extra parameters (longs)
     |       |  |   |  |  |  shrink wrap
     |       |  |   |  |  sibling call
     |       |  |   |  varargs
     |       |  |   alloca
     |       |  Forced realignment or hard frame pointer
     |       Explicit clobbers (hexadecimal mask)
     Calling Convention
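
As an illustration of the naming scheme, a name such as msabi_25_ra2
could be assembled from its fields as follows.  This is a guess at the
scheme based on the field list and that one example name, not code from
gen.cc: make_fn_name is a made-up helper and the two-digit hex format
for the clobber mask is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum fn_variants
{
  FN_VAR_HFP         = 0x02,
  FN_VAR_REALIGN     = 0x04,
  FN_VAR_ALLOCA      = 0x08,
  FN_VAR_VARARGS     = 0x10,
  FN_VAR_SIBCALL     = 0x20,
  FN_VAR_SHRINK_WRAP = 0x40
};

/* Hypothetical helper: writes "<abi>_<clobbers>_<flags><n>" into BUF
   and returns what snprintf returns.  */
int
make_fn_name (char *buf, size_t size, const char *abi,
              unsigned clobbers, unsigned variants, int extra_params)
{
  const char *frame = "";

  if (variants & FN_VAR_REALIGN)
    frame = "r";
  else if (variants & FN_VAR_HFP)
    frame = "f";

  return snprintf (buf, size, "%s_%02x_%s%s%s%s%s%d",
                   abi, clobbers, frame,
                   (variants & FN_VAR_ALLOCA) ? "a" : "",
                   (variants & FN_VAR_VARARGS) ? "v" : "",
                   (variants & FN_VAR_SIBCALL) ? "s" : "",
                   (variants & FN_VAR_SHRINK_WRAP) ? "w" : "",
                   extra_params);
}
```

Calling it with "msabi", mask 0x25, FN_VAR_REALIGN | FN_VAR_ALLOCA and
2 extra parameters yields "msabi_25_ra2".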


Examples
========

The function msabi_25_ra2 looks like this:

__attribute__ ((noinline, ms_abi, __force_align_arg_pointer__))
long msabi_25_ra2 (long a, long b)
{
  void *alloca_mem;
  alloca_mem = alloca (8 + a);
  *(long*)alloca_mem = FLAG_ALLOCA;
  __asm__ __volatile__ ("" :::"rbx", "r12", "r15");
  return sysv_a2_noinfo (alloca_mem, a, b);
}
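
The clobber list in the asm statement follows from the mask in the
function name: 0x25 is OPTIONAL_REG_RBX | OPTIONAL_REG_R12 |
OPTIONAL_REG_R15.  A small sketch of that decoding (clobber_list is a
made-up helper, not part of the generator):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Register names in bit order, matching enum optional_regs.  */
static const char *const optional_reg_names[] =
  { "rbx", "rbp", "r12", "r13", "r14", "r15" };

/* Hypothetical helper: writes the comma-separated names of the
   registers selected by MASK into BUF.  */
void
clobber_list (unsigned mask, char *buf, size_t size)
{
  size_t used = 0;

  if (size == 0)
    return;
  buf[0] = '\0';

  for (unsigned bit = 0; bit < 6; bit++)
    if (mask & (1u << bit))
      {
        int n = snprintf (buf + used, size - used, "%s%s",
                          used ? ", " : "", optional_reg_names[bit]);
        if (n < 0 || (size_t) n >= size - used)
          return;   /* output truncated; stop here */
        used += (size_t) n;
      }
}
```

For mask 0x25 this produces "rbx, r12, r15", the same registers as in
the clobber list of msabi_25_ra2.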


And the tests (both aligned and misaligned) look something like this:

void init_test (void *fn, const char *name, enum alignment_option alignment,
enum shrink_wrap_option shrink_wrap, long ret_expected);
void do_tests ()
{
  long ret;

  long a = 1;
  lo