Re: Dealing with default recursive procedures in Fortran

2018-04-13 Thread N.M. Maclaren

On Apr 12 2018, Thomas König wrote:


> with Fortran 2018, recursive is becoming the default. This will likely
> have a serious impact on many user codes, which often declare large
> arrays which could then overflow stacks, leading to segfaults without
> further explanation.


Yes.  Been there - seen that :-)  What's worse, segfaults because of
stack overflow very often confuse debuggers, so you can't even get a
traceback of where it failed!


> We could extend -fmax-stack-var-size so it allocates memory
> from the heap in recursive procedures, too, and set this to
> some default value.  Of course, it would also have to free them
> afterwards, but we manage that for allocatable arrays already.


Yes, but I think it's a horrible idea.  See below for a better one.


> We could warn when large arrays with sizes known at compile time
> are translated.


Yes, but I think that's the wrong criterion.  It should be above a
certain size, probably aggregate per procedure - and controllable,
of course.  Who cares about a couple of 3x3 matrices?


> We could use -fsplit-stack by default. How reliable is that
> option? Can it be used, for example, with -fopenmp?
> Is it available on all (relevant) platforms?
> One drawback would be that this would allow for infinite
> recursions to go on for much longer.


Yes.  And I don't think it's the right mechanism, anyway, except for
OpenMP.  Again, see below.


> A -fcheck=stack option could be added (implemented similar to
> -fsplit-stack), to be included in -fcheck=all, which would abort
> with a sensible error message instead of a segfault.


Absolutely.  Or simply always check!  I haven't studied the actual code
generated by gfortran recently, but my experience of performing stack
checking is that its cost is negligible.  It got a bad name because of
the utter incompetence of the way it was usually implemented.  There is
also a very simple optimisation that often avoids it:

Leave a fixed amount of space beyond the check point and omit the check
for any leaf procedure that uses no more than that.  And, obviously,
that can be extended to non-leaf procedures with known stack use, such
as most of the intrinsics.
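
To make the bookkeeping behind that optimisation concrete, here is a minimal
C sketch. It is not what gfortran emits: the simulated stack, the SLACK
constant and the names sim_stack, needs_check, proc_enter and proc_leave are
all hypothetical, chosen only for illustration. Checked entries always leave
SLACK bytes free, which is what lets small leaf procedures skip the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a downward-growing stack described by offsets.
   Checked entries must leave SLACK bytes free above `limit`. */
#define SLACK ((size_t)256)

typedef struct {
    size_t sp;     /* current stack pointer, as an offset */
    size_t limit;  /* lowest usable offset */
} sim_stack;

/* A leaf procedure whose frame fits in the slack needs no check. */
static bool needs_check(size_t frame_size, bool is_leaf)
{
    return !(is_leaf && frame_size <= SLACK);
}

/* Enter a procedure.  Returns false where a real runtime would stop
   with a "stack overflow" message instead of a segfault. */
static bool proc_enter(sim_stack *s, size_t frame_size, bool is_leaf)
{
    if (needs_check(frame_size, is_leaf)) {
        size_t check_point = s->limit + SLACK;
        if (s->sp < check_point || s->sp - check_point < frame_size)
            return false;
    }
    /* Unchecked leaves are safe: every checked caller left SLACK bytes. */
    assert(s->sp >= s->limit && s->sp - s->limit >= frame_size);
    s->sp -= frame_size;
    return true;
}

/* Leave a procedure: release its frame. */
static void proc_leave(sim_stack *s, size_t frame_size)
{
    s->sp += frame_size;
}
```

The sketch assumes the stack starts with at least SLACK bytes free, so the
invariant holds before the first call.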

There is another option, which I was thinking of experimenting with in
my retirement but probably won't: a double stack (as in GNU Ada, and
the better Algol 68 systems).  Small, fixed objects go on the primary
stack, as usual, and large or variable-sized ones go on the secondary
stack.  Allocatable objects should go there if and only if they are not
reallocated.  My experience (a long time back) was that the improved
locality of the primary stack (often needed to control branching) by
removing large arrays from it speeded up most calls with such arrays by
several tens of percent.

Now, there is an interesting interaction with split stacks.  The only
reason to have a contiguous stack is for fast procedure call for simple
procedures.  But that doesn't apply to the secondary stack, so it can
always be split - hence there is no need for a configuration or run-time
option.  That doesn't stop it being checked against a maximum size,
either, because accumulating, decrementing and checking a count isn't a
major overhead for the use the secondary stack gets.
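
As a rough illustration of such a secondary stack, here is a C sketch under
simplifying assumptions: every object gets its own heap chunk (the simplest
possible "split"), and the names sec_stack, sec_push and sec_pop are
hypothetical. The running byte count is what gets checked against the
maximum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One non-contiguous piece of the secondary stack. */
typedef struct chunk {
    struct chunk *prev;  /* chunk pushed before this one */
    size_t size;         /* payload bytes in this chunk */
} chunk;

typedef struct {
    chunk *top;   /* most recently pushed chunk, NULL when empty */
    size_t used;  /* running total of payload bytes */
    size_t max;   /* controllable ceiling */
} sec_stack;

/* Push an object of n bytes.  Returns NULL where a real runtime would
   report "secondary stack overflow" (or out of memory). */
static void *sec_push(sec_stack *s, size_t n)
{
    if (n > s->max - s->used)  /* accumulate-and-check against the limit */
        return NULL;
    chunk *c = malloc(sizeof *c + n);
    if (c == NULL)
        return NULL;
    c->prev = s->top;
    c->size = n;
    s->top = c;
    s->used += n;
    return c + 1;  /* payload starts right after the header */
}

/* Pop the most recent object, e.g. on procedure exit. */
static void sec_pop(sec_stack *s)
{
    assert(s->top != NULL);
    chunk *c = s->top;
    s->top = c->prev;
    s->used -= c->size;
    free(c);
}
```

A real implementation would carve objects out of larger chunks rather than
call malloc per object; the sketch only shows that the stack need not be
contiguous and that the size check is a single comparison.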


Regards,
Nick Maclaren.




Re: Dealing with default recursive procedures in Fortran

2018-04-13 Thread Ramana Radhakrishnan
On Thu, Apr 12, 2018 at 10:50 PM, Thomas König wrote:
> Hello world,
>
> with Fortran 2018, recursive is becoming the default. This will likely
> have a serious impact on many user codes, which often declare large
> arrays which could then overflow stacks, leading to segfaults without
> further explanation.
>
> What could we do? A few options, not all mutually exclusive.
>
> We could extend -fmax-stack-var-size so it allocates memory
> from the heap in recursive procedures, too, and set this to
> some default value.  Of course, it would also have to free them
> afterwards, but we manage that for allocatable arrays already.
>
> We could warn when large arrays with sizes known at compile time
> are translated.
>
> We could use -fsplit-stack by default. How reliable is that
> option? Can it be used, for example, with -fopenmp?
> Is it available on all (relevant) platforms?

Not available on AArch64 (yet, though there are some patches) and Arm
(no current plans that I know of). Probably works best only on x86_64.
I don't think you can rely on that being available everywhere.
Additionally, depending on the implementation, IIRC that will have a
dependency on a newer glibc as well, so split-stack would only work
reliably on platforms that have one.

Ramana


GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction

2018-04-13 Thread Vivek Kinhekar
Hi,


We are trying to create a memory barrier with following testcase.

===
#include <stdio.h>

void Test()
{
    float fDivident = 0.1f;
    float fResult = 0.0f;

    fResult = ( fDivident / fResult );

    __asm volatile ("mfence" ::: "memory");

    printf("\nResult: %f\n", fResult);
}
===



'mfence' performs a serializing operation on all load-from-memory and
store-to-memory instructions that were issued prior to the MFENCE instruction.
This serializing operation guarantees that every load and store instruction 
that precedes the MFENCE instruction in program order becomes globally visible 
before any load or store instruction that follows the MFENCE instruction.

The mfence instruction with memory clobber asm instruction should create a 
barrier between division and printf instructions.



When the testcase is compiled with optimization options O1 and above, it can be
observed that the mfence instruction is reordered and precedes the division
instruction.

We expected that the two sets of assembly instructions, one pertaining to 
division operation and another pertaining to the printf operation, would not 
get mixed up on reordering by the GCC compiler optimizer because of the 
presence of the __asm volatile ("mfence" ::: "memory"); line between them.



But, the generated assembly, which is inlined below for reference, isn't quite 
right as per our expectation.



        pushl   %ebp            # 23    *pushsi2        [length = 1]
        movl    %esp, %ebp      # 24    *movsi_internal/1       [length = 2]
        subl    $24, %esp       # 25    pro_epilogue_adjust_stack_si_add/1      [length = 3]
        mfence
        fldz                    # 20    *movxf_internal/3       [length = 2]
        fdivrs  .LC0            # 13    *fop_xf_4_i387/1        [length = 6]

You may note that the mfence instruction is generated before the fdivrs 
instruction.

Can you please let us know if the usage of the "asm (mfence)" instruction as 
given in the above testcase is the right way of creating the expected memory 
barrier between the two sets of instructions pertaining to the division and 
printf operations, respectively or not?

If yes, then we think it is a bug in the compiler. Could you please confirm?

If no, then what is the correct usage of "asm (mfence)" so as to get/ achieve 
the memory barrier functionality as expected in the above testcase?

Thanks,
Vivek Kinhekar


Re: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction

2018-04-13 Thread Alexander Monakov
On Fri, 13 Apr 2018, Vivek Kinhekar wrote:
> The mfence instruction with memory clobber asm instruction should create a
> barrier between division and printf instructions.

No, floating-point division does not touch memory, so the asm does not (and
need not) restrict its motion.
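
If the goal is to keep the compiler from moving the division past the asm,
one approach is to make the result an operand of the asm. The following is a
minimal sketch, not a statement about what the hardware does with x87
instructions around MFENCE; the function name divide_then_fence and its
out-parameter are hypothetical. On non-x86 targets the sketch falls back to
an empty asm, which acts as a compiler barrier only.

```c
#include <assert.h>
#include <stddef.h>

/* Divide, then issue the barrier, with the quotient tied to the asm. */
static void divide_then_fence(float dividend, float divisor, float *out)
{
    assert(out != NULL);

    float result = dividend / divisor;

    /* "+m"(result) tells the compiler the asm reads and writes `result`
       in memory, so the division producing it must come first. */
#if defined(__i386__) || defined(__x86_64__)
    __asm__ volatile ("mfence" : "+m" (result) : : "memory");
#else
    __asm__ volatile ("" : "+m" (result) : : "memory");
#endif

    *out = result;
}
```

With the original "memory"-only clobber, nothing connects fResult to the
asm, which is why the division was free to move.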

Alexander


RE: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction

2018-04-13 Thread Vivek Kinhekar
Thanks for the quick response, Alexander!

Regards,
Vivek Kinhekar
+91-7709046470


Re: Dealing with default recursive procedures in Fortran

2018-04-13 Thread Janne Blomqvist
On Fri, Apr 13, 2018 at 12:50 AM, Thomas König wrote:

> Hello world,
>
> with Fortran 2018, recursive is becoming the default. This will likely
> have a serious impact on many user codes, which often declare large
> arrays which could then overflow stacks, leading to segfaults without
> further explanation.
>

Yes. For reference, we had some previous discussion about this at
https://gcc.gnu.org/ml/gcc-patches/2017-12/msg01417.html and
https://gcc.gnu.org/ml/fortran/2017-12/msg00082.html .


>
> What could we do? A few options, not all mutually exclusive.
>
> We could extend -fmax-stack-var-size so it allocates memory
> from the heap in recursive procedures, too, and set this to
> some default value.  Of course, it would also have to free them
> afterwards, but we manage that for allocatable arrays already.
>

+1. I think this is the pragmatic approach. It ought to work everywhere,
and should be implementable with modest effort (I did try to have a go at
it, but I'm not that familiar with that part of the frontend, so I didn't
really make any progress).
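
For illustration only, this is roughly the shape of code such a scheme would
amount to, written as a C sketch rather than anything the gfortran frontend
generates. The threshold STACK_LIMIT_BYTES and the function mean_of_ramp are
hypothetical: small work arrays stay on the stack, larger ones go to the
heap and are freed on exit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical threshold, playing the role of -fmax-stack-var-size. */
#define STACK_LIMIT_BYTES ((size_t)4096)

/* Fill a local work array with 0..n-1 and return its mean. */
static double mean_of_ramp(size_t n)
{
    assert(n > 0);

    double stack_buf[STACK_LIMIT_BYTES / sizeof(double)];
    size_t bytes = n * sizeof(double);
    int on_heap = bytes > sizeof stack_buf;

    /* Large arrays go to the heap, small ones stay on the stack. */
    double *work = on_heap ? malloc(bytes) : stack_buf;
    if (work == NULL) {
        fputs("error: cannot allocate work array\n", stderr);
        abort();
    }

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        work[i] = (double)i;
    for (size_t i = 0; i < n; i++)
        sum += work[i];

    if (on_heap)
        free(work);  /* released on exit, as for allocatable arrays */
    return sum / (double)n;
}
```

Because each invocation owns its own array, the procedure stays safe to call
recursively, which static storage would not allow.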


> We could warn when large arrays with sizes known at compile time
> are translated.
>

+1. Although this shouldn't perhaps be part of -Wall, as it's not a
correctness issue, only a *potential* performance one.


> We could use -fsplit-stack by default. How reliable is that
> option? Can it be used, for example, with -fopenmp?
> Is it available on all (relevant) platforms?
> One drawback would be that this would allow for infinite
> recursions to go on for much longer.
>

Seems not all targets support it, like Ramana said.

Also, IIRC Go (the language, that is) originally used split stacks, but at
some point they switched to larger stacks, due to the overhead of checking
the stack usage at every procedure call. And Go isn't even a particularly
performance-oriented language (compared to C/C++/Fortran).


>
> Other ideas / options?
>

Nick's suggestion to have a separate split stack for large and
variable-sized arrays sounds good, although I suspect it would run into the
same portability issues as -fsplit-stack and then some.

So in the short term, I think what ought to be done, in rough order of
importance, is:

1) Make -fmax-stack-var-size use the heap instead of static memory.

2) Make -frecursive the default.

3) Warn about any array that is large enough to be allocated on the heap.

4) Run-time error for automatic heap allocation.

Longer term, if somebody has the energy to deal with the (potential)
portability issues, Nick's secondary split stack approach could be good.

-- 
Janne Blomqvist


RE: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction

2018-04-13 Thread Vivek Kinhekar
Hello Alexander,

In the given testcase, the generated fdivrs instruction performs the division 
of a symbol ref (memory value) by FPU Stack Register and stores the value in 
FPU Stack Register. 

Please find below the RTL dump of the generated fdivrs instruction. It
clearly accesses memory for a read!
===
#(insn:TI 13 20 16 2 (set (reg:XF 8 st)
#        (div:XF (float_extend:XF (mem/u/c:SF (symbol_ref/u:SI ("*.LC0") [flags 0x2]) [4 S4 A32]))
#            (reg:XF 8 st)))  {*fop_xf_4_i387}
#     (nil))
        fdivrs  .LC0            # 13    *fop_xf_4_i387/1        [length = 6]
===

Are we missing anything subtle here?

Regards,
Vivek Kinhekar


RE: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction

2018-04-13 Thread Vivek Kinhekar
Oh! Thanks for the quick response, Jakub.

Regards,
Vivek Kinhekar


Fortran array slices and -frepack-arrays

2018-04-13 Thread Wilco Dijkstra
Hi,

I looked at a few performance anomalies between gfortran and Flang - it
appears array slices are treated differently. Using -frepack-arrays fixed a
performance issue in gfortran and didn't cause any regressions. Making input
array slices contiguous helps both locality and enables more vectorization.

So I wonder whether it should be made the default (-O3 or just -Ofast)?
Alternatively would it be feasible in Fortran to version functions or loops
if all arguments are contiguous slices?

Wilco

Re: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction

2018-04-13 Thread Jakub Jelinek
On Fri, Apr 13, 2018 at 01:34:21PM +, Vivek Kinhekar wrote:
> Hello Alexander,
> 
> In the given testcase, the generated fdivrs instruction performs the
> division of a symbol ref (memory value) by FPU Stack Register and stores
> the value in FPU Stack Register.

The stack registers are not memory.

> Please find below the RTL dump of the generated fdivrs instruction.
> It clearly accesses memory for a read!

That is a read of a constant, which doesn't count either.  The constant is in
memory only because the instruction doesn't support constant immediates, and
that memory is read-only.

Jakub


Re: Fortran array slices and -frepack-arrays

2018-04-13 Thread Bin.Cheng
On Fri, Apr 13, 2018 at 3:32 PM, Wilco Dijkstra wrote:
> Hi,
>
> I looked at a few performance anomalies between gfortran and Flang - it
> appears array slices are treated differently. Using -frepack-arrays fixed a
> performance issue in gfortran and didn't cause any regressions. Making input
> array slices contiguous helps both locality and enables more vectorization.
>
> So I wonder whether it should be made the default (-O3 or just -Ofast)?
> Alternatively would

I don't know the implementation of the option, so two questions:
1) When is the repack done during compilation?  Is new code manipulating
   the data layout added by the frontend?  If yes, would it be better to do
   it during optimization, so that it can be done on demand?  This looks
   like one case of data layout transformation.  Not sure if there is
   enough information to do that in the optimizer.
2) For now, does this option force array repacking unconditionally?  I
   think it won't be too hard to model when such a data layout
   transformation is beneficial by looking at the loop (nest) accessing
   the array and comparing against the overhead.


> it be feasible in Fortran to version functions or loops if all arguments are 
> contiguous slices?
I think a cost model is still needed for function/loop versioning.

Thanks,
bin
>
> Wilco


Re: Fortran array slices and -frepack-arrays

2018-04-13 Thread Wilco Dijkstra
Bin.Cheng wrote:
  
> I don't know the implementation of the option, so two questions:
> 1) When is the repack done during compilation?  Is new code manipulating
>    the data layout added by the frontend?  If yes, would it be better to do
>    it during optimization, so that it can be done on demand?  This looks
>    like one case of data layout transformation.  Not sure if there is
>    enough information to do that in the optimizer.

Yes, it adds a runtime check at function entry and packs array slices which
have a non-unit stride. Currently it uses a call to _gfortran_internal_pack;
however, this could be inlined and use alloca rather than malloc for small
slices.

It might be possible to check which parameters are used a lot (or benefit
from vectorization) and only pack those.
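
To show what the entry check and packing amount to, here is a small C sketch
for a one-dimensional slice. It is not the libgfortran implementation: the
slice descriptor and the helpers pack, unpack and scale are hypothetical
stand-ins, and the temporary always comes from malloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A 1-D slice: n elements starting at base, stride elements apart. */
typedef struct {
    double *base;
    size_t n;
    ptrdiff_t stride;
} slice;

/* Entry check: reuse contiguous data, copy anything else. */
static double *pack(const slice *s)
{
    if (s->stride == 1)
        return s->base;
    double *tmp = malloc(s->n * sizeof *tmp);
    assert(tmp != NULL || s->n == 0);
    for (size_t i = 0; i < s->n; i++)
        tmp[i] = s->base[(ptrdiff_t)i * s->stride];
    return tmp;
}

/* Exit: copy results back and free the temporary, if one was made. */
static void unpack(const slice *s, double *packed)
{
    if (packed == s->base)
        return;
    for (size_t i = 0; i < s->n; i++)
        s->base[(ptrdiff_t)i * s->stride] = packed[i];
    free(packed);
}

/* The body only ever sees unit-stride data, so the loop is easy to
   vectorize. */
static void scale(slice s, double factor)
{
    double *x = pack(&s);
    for (size_t i = 0; i < s.n; i++)
        x[i] *= factor;
    unpack(&s, x);
}
```

The two extra copies are the overhead a cost model would have to weigh
against the faster loop.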

> 2) For now, does this option force array repacking unconditionally?  I
>    think it won't be too hard to model when such a data layout
>    transformation is beneficial by looking at the loop (nest) accessing
>    the array and comparing against the overhead.

Yes, it ensures all slices are packed, but that isn't strictly necessary.

>> it be feasible in Fortran to version functions or loops if all arguments are 
>> contiguous slices?
> I think a cost model is still needed for function/loop versioning.

Absolutely. If you statically know at the call that all slices are contiguous,
you could compile a version of the function using the contiguous attribute and
skip all runtime checks. Such function versioning would require LTO to work
well.
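
As a sketch of what versioning means here, in C rather than Fortran and with
hypothetical names (sum_contiguous, sum_strided, sum_slice): one version
assumes unit stride, one handles the general case, and a dispatcher picks
between them at run time. A call site that knows the stride statically could
call the contiguous version directly and skip the check.

```c
#include <assert.h>
#include <stddef.h>

/* Version for contiguous data: a plain unit-stride loop. */
static double sum_contiguous(const double *restrict x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* General version: works for any stride, including negative ones. */
static double sum_strided(const double *x, size_t n, ptrdiff_t stride)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[(ptrdiff_t)i * stride];
    return sum;
}

/* Runtime dispatch between the two versions. */
static double sum_slice(const double *x, size_t n, ptrdiff_t stride)
{
    assert(x != NULL || n == 0);
    return stride == 1 ? sum_contiguous(x, n)
                       : sum_strided(x, n, stride);
}
```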

Wilco