Re: Dealing with default recursive procedures in Fortran
On Apr 12 2018, Thomas König wrote:

> with Fortran 2018, recursive is becoming the default. This will likely
> have a serious impact on many user codes, which often declare large
> arrays which could then overflow stacks, leading to segfaults without
> further explanation.

Yes. Been there - seen that :-)

What's worse, segfaults because of stack overflow very often confuse
debuggers, so you can't even get a traceback of where it failed!

> We could extend -fmax-stack-var-size so it allocates memory
> from the heap in recursive procedures, too, and set this to
> some default value. Of course, it would also have to free them
> afterwards, but we manage that for allocatable arrays already.

Yes, but I think it's a horrible idea. See below for a better one.

> We could warn when large arrays with sizes known at compile time
> are translated.

Yes, but I think that's the wrong criterion. It should be above a
certain size, probably aggregate per procedure - and controllable, of
course. Who cares about a couple of 3x3 matrices?

> We could use -fsplit-stack by default. How reliable is that
> option? Can it be used, for example, with -fopenmp?
> Is it available on all (relevant) platforms?
> One drawback would be that this would allow for infinite
> recursions to go on for much longer.

Yes. And I don't think it's the right mechanism, anyway, except for
OpenMP. Again, see below.

> A -fcheck=stack option could be added (implemented similar to
> -fsplit-stack), to be included in -fcheck=all, which would abort
> with a sensible error message instead of a segfault.

Absolutely. Or simply always check! I haven't studied the actual code
generated by gfortran recently, but my experience of performing stack
checking is that its cost is negligible. It got a bad name because of
the utter incompetence of the way it was usually implemented. There is
also a very simple optimisation that often avoids it: leave a fixed
amount of space beyond the check point and omit the check for any leaf
procedure that uses no more than that. And, obviously, that can be
extended to non-leaf procedures with known stack use, such as most of
the intrinsics.

There is another option, which I was thinking of experimenting with in
my retirement (but probably won't): a double stack (as in GNU Ada, and
the better Algol 68 systems). Small, fixed objects go on the primary
stack, as usual, and large or variable-sized ones go on the secondary
stack. Allocatable objects should go there if and only if they are not
reallocated. My experience (a long time back) was that removing large
arrays from the primary stack improved its locality (often needed to
control branching) and speeded up most calls with such arrays by
several tens of percent.

Now, there is an interesting interaction with split stacks. The only
reason to have a contiguous stack is fast procedure call for simple
procedures. But that doesn't apply to the secondary stack, so it can
always be split - hence there is no need for a configuration or
run-time option. That doesn't stop it being checked against a maximum
size, either, because accumulating, decrementing and checking a count
isn't a major overhead for the use the secondary stack gets.

Regards,
Nick Maclaren.
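A minimal sketch of the failure mode under discussion (the procedure
name and the array size are made up for illustration): a local array
whose size is known at compile time is exactly one of the "large
arrays" Thomas mentions. Once the procedure is (implicitly) recursive
it cannot be given static storage and must live on the stack, so a
single large frame - or a deep call chain - can overflow the stack and
kill the program with an unexplained segfault:

subroutine work(n)
  implicit none
  integer, intent(in) :: n
  ! Roughly 16 MB with default 4-byte reals - more than a typical 8 MB
  ! stack limit once the array can no longer be placed in static storage.
  real :: scratch(4000000)
  scratch = real(n)
  print *, sum(scratch)
end subroutine work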
Re: Dealing with default recursive procedures in Fortran
On Thu, Apr 12, 2018 at 10:50 PM, Thomas König wrote:

> Hello world,
>
> with Fortran 2018, recursive is becoming the default. This will likely
> have a serious impact on many user codes, which often declare large
> arrays which could then overflow stacks, leading to segfaults without
> further explanation.
>
> What could we do? A few options, not all mutually exclusive.
>
> We could extend -fmax-stack-var-size so it allocates memory
> from the heap in recursive procedures, too, and set this to
> some default value. Of course, it would also have to free them
> afterwards, but we manage that for allocatable arrays already.
>
> We could warn when large arrays with sizes known at compile time
> are translated.
>
> We could use -fsplit-stack by default. How reliable is that
> option? Can it be used, for example, with -fopenmp?
> Is it available on all (relevant) platforms?

Not available on AArch64 (yet, though there are some patches) and Arm
(no current plans that I know of). It probably works best only on
x86_64, so I don't think you can rely on it being available everywhere.
Additionally, depending on the implementation, IIRC split-stack also
has a dependency on a newer glibc, so that is a further constraint on
the platforms where it works reliably.

Ramana
GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction
Hi,

We are trying to create a memory barrier with the following testcase.

======================================================================
#include <stdio.h>

void Test()
{
    float fDivident = 0.1f;
    float fResult = 0.0f;

    fResult = ( fDivident / fResult );
    __asm volatile ("mfence" ::: "memory");

    printf("\nResult: %f\n", fResult);
}
======================================================================

'mfence' performs a serializing operation on all load-from-memory and
store-to-memory instructions that were issued prior to the MFENCE
instruction. This serializing operation guarantees that every load and
store instruction that precedes the MFENCE instruction in program order
becomes globally visible before any load or store instruction that
follows the MFENCE instruction.

The mfence instruction with memory clobber asm instruction should
create a barrier between division and printf instructions. However,
when the testcase is compiled with optimization options -O1 and above,
it can be observed that the mfence instruction is reordered and
precedes the division instruction.

We expected that the two sets of assembly instructions, one pertaining
to the division operation and the other pertaining to the printf
operation, would not get mixed up by the GCC optimizer because of the
presence of the __asm volatile ("mfence" ::: "memory"); line between
them. But the generated assembly, inlined below for reference, isn't
quite right as per our expectation:

        pushl   %ebp            # 23    *pushsi2                [length = 1]
        movl    %esp, %ebp      # 24    *movsi_internal/1       [length = 2]
        subl    $24, %esp       # 25    pro_epilogue_adjust_stack_si_add/1      [length = 3]
        mfence
        fldz                    # 20    *movxf_internal/3       [length = 2]
        fdivrs  .LC0            # 13    *fop_xf_4_i387/1        [length = 6]

You may note that the mfence instruction is generated before the fdivrs
instruction.

Can you please let us know whether the usage of the "asm (mfence)"
statement as given in the above testcase is the right way of creating
the expected memory barrier between the two sets of instructions
pertaining to the division and printf operations? If yes, then we
think it is a bug in the compiler - could you please confirm? If no,
then what is the correct usage of "asm (mfence)" to achieve the memory
barrier behaviour expected in the above testcase?

Thanks,
Vivek Kinhekar
Re: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction
On Fri, 13 Apr 2018, Vivek Kinhekar wrote:

> The mfence instruction with memory clobber asm instruction should create a
> barrier between division and printf instructions.

No, floating-point division does not touch memory, so the asm does not
(and need not) restrict its motion.

Alexander
RE: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction
Thanks for the quick response, Alexander!

Regards,
Vivek Kinhekar
+91-7709046470

-----Original Message-----
From: Alexander Monakov
Sent: Friday, April 13, 2018 5:58 PM
To: Vivek Kinhekar
Cc: gcc@gcc.gnu.org
Subject: Re: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction

On Fri, 13 Apr 2018, Vivek Kinhekar wrote:

> The mfence instruction with memory clobber asm instruction should
> create a barrier between division and printf instructions.

No, floating-point division does not touch memory, so the asm does not
(and need not) restrict its motion.

Alexander
Re: Dealing with default recursive procedures in Fortran
On Fri, Apr 13, 2018 at 12:50 AM, Thomas König wrote:

> Hello world,
>
> with Fortran 2018, recursive is becoming the default. This will likely
> have a serious impact on many user codes, which often declare large
> arrays which could then overflow stacks, leading to segfaults without
> further explanation.

Yes. For reference, we had some previous discussion about this at
https://gcc.gnu.org/ml/gcc-patches/2017-12/msg01417.html and
https://gcc.gnu.org/ml/fortran/2017-12/msg00082.html .

> What could we do? A few options, not all mutually exclusive.
>
> We could extend -fmax-stack-var-size so it allocates memory
> from the heap in recursive procedures, too, and set this to
> some default value. Of course, it would also have to free them
> afterwards, but we manage that for allocatable arrays already.

+1. I think this is the pragmatic approach. It ought to work
everywhere, and should be implementable with modest effort (I did try
to have a go at it, but I'm not that familiar with that part of the
frontend, so I didn't really make any progress).

> We could warn when large arrays with sizes known at compile time
> are translated.

+1. Although this shouldn't perhaps be part of -Wall, as it's not a
correctness issue, only a *potential* performance one.

> We could use -fsplit-stack by default. How reliable is that
> option? Can it be used, for example, with -fopenmp?
> Is it available on all (relevant) platforms?
> One drawback would be that this would allow for infinite
> recursions to go on for much longer.

Seems not all targets support it, like Ramana said. Also, IIRC Go (the
language, that is) originally used split stacks, but at some point they
switched to larger stacks, due to the overhead of checking the stack
usage at every procedure call. And Go isn't even a particularly
performance-oriented language (compared to C/C++/Fortran).

> Other ideas / options?

Nick's suggestion to have a separate split stack for large and
variable-sized arrays sounds good, although I suspect it would run into
the same portability issues as -fsplit-stack and then some.

So in the short term, I think this is what ought to be done, in rough
order of importance:

1) Make -fmax-stack-var-size use the heap instead of static memory.
2) Make -frecursive the default.
3) Warning for an array that is large enough to be allocated on the heap.
4) Run-time error for automatic heap allocation.

Longer term, if somebody has the energy to deal with the (potential)
portability issues, Nick's secondary split stack approach could be
good.

--
Janne Blomqvist
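For comparison, a sketch of the user-level workaround that item 1 would
effectively automate (same made-up names as in the earlier sketch):
declaring the work array allocatable moves it to the heap, and a local
allocatable without the SAVE attribute is deallocated automatically on
return, so nothing else in the code has to change.

subroutine work(n)
  implicit none
  integer, intent(in) :: n
  real, allocatable :: scratch(:)
  allocate(scratch(4000000))   ! heap allocation instead of a large stack frame
  scratch = real(n)
  print *, sum(scratch)
end subroutine work            ! scratch is deallocated automatically here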
RE: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction
Hello Alexander,

In the given testcase, the generated fdivrs instruction performs the
division of a symbol ref (a memory value) by an FPU stack register and
stores the result in an FPU stack register.

Please find below the RTL dump of the generated fdivrs instruction. It
clearly accesses the memory for read access!

======================================================================
#(insn:TI 13 20 16 2 (set (reg:XF 8 st)
#        (div:XF (float_extend:XF (mem/u/c:SF (symbol_ref/u:SI ("*.LC0") [flags 0x2]) [4 S4 A32]))
#            (reg:XF 8 st))) {*fop_xf_4_i387}
#     (nil))
        fdivrs  .LC0            # 13    *fop_xf_4_i387/1        [length = 6]
======================================================================

Are we missing anything subtle here?

Regards,
Vivek Kinhekar

-----Original Message-----
From: Alexander Monakov
Sent: Friday, April 13, 2018 5:58 PM
To: Vivek Kinhekar
Cc: gcc@gcc.gnu.org
Subject: Re: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction

On Fri, 13 Apr 2018, Vivek Kinhekar wrote:

> The mfence instruction with memory clobber asm instruction should
> create a barrier between division and printf instructions.

No, floating-point division does not touch memory, so the asm does not
(and need not) restrict its motion.

Alexander
RE: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction
Oh! Thanks for the quick response, Jakub.

Regards,
Vivek Kinhekar

-----Original Message-----
From: Jakub Jelinek
Sent: Friday, April 13, 2018 7:08 PM
To: Vivek Kinhekar
Cc: Alexander Monakov ; gcc@gcc.gnu.org
Subject: Re: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction

On Fri, Apr 13, 2018 at 01:34:21PM +, Vivek Kinhekar wrote:

> Hello Alexander,
>
> In the given testcase, the generated fdivrs instruction performs the
> division of a symbol ref (memory value) by FPU Stack Register and
> stores the value in FPU Stack Register.

The stack registers are not memory.

> Please find the following RTL Dump of the fdivrs instruction generated.
> It clearly accesses the memory for read access!

That is a constant read; that doesn't count either. It is in memory
only because the instruction doesn't support constant immediates, and
the memory is read-only.

Jakub
Fortran array slices and -frepack-arrays
Hi,

I looked at a few performance anomalies between gfortran and Flang - it
appears array slices are treated differently. Using -frepack-arrays
fixed a performance issue in gfortran and didn't cause any regressions.
Making input array slices contiguous helps both locality and enables
more vectorization.

So I wonder whether it should be made the default (at -O3, or just
-Ofast)? Alternatively, would it be feasible in Fortran to version
functions or loops if all arguments are contiguous slices?

Wilco
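For context, a sketch of the situation -frepack-arrays is aimed at
(module, routine and variable names here are made up): an assumed-shape
dummy can receive a non-contiguous actual argument such as a strided
section, which is normally passed by descriptor without copying; the
option packs such arguments into contiguous temporaries at procedure
entry.

module work_mod
contains
  subroutine scale_slice(a)
    ! Assumed-shape dummy: may arrive with a non-unit stride.
    real, intent(inout) :: a(:)
    a = 2.0 * a
  end subroutine scale_slice
end module work_mod

program demo
  use work_mod
  real :: x(1000000)
  x = 1.0
  ! Every-other-element section: non-contiguous, normally passed by
  ! descriptor without a copy; -frepack-arrays packs it on entry instead.
  call scale_slice(x(1:size(x):2))
  print *, x(1), x(2)
end program demo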
Re: GCC Compiler Optimization ignores or mistreats MFENCE memory barrier related instruction
On Fri, Apr 13, 2018 at 01:34:21PM +, Vivek Kinhekar wrote:

> Hello Alexander,
>
> In the given testcase, the generated fdivrs instruction performs the
> division of a symbol ref (memory value) by FPU Stack Register and stores
> the value in FPU Stack Register.

The stack registers are not memory.

> Please find the following RTL Dump of the fdivrs instruction generated.
> It clearly accesses the memory for read access!

That is a constant read; that doesn't count either. It is in memory
only because the instruction doesn't support constant immediates, and
the memory is read-only.

Jakub
Re: Fortran array slices and -frepack-arrays
On Fri, Apr 13, 2018 at 3:32 PM, Wilco Dijkstra wrote:

> Hi,
>
> I looked at a few performance anomalies between gfortran and Flang - it
> appears array slices are treated differently. Using -frepack-arrays fixed
> a performance issue in gfortran and didn't cause any regressions. Making
> input array slices contiguous helps both locality and enables more
> vectorization.
>
> So I wonder whether it should be made the default (-O3 or just -Ofast)?
> Alternatively would

I don't know the implementation of the option, so two questions:

1) Is the repacking done during compilation? Is new code manipulating
the data layout added by the frontend? If yes, it might be better to do
it during optimization so that it can be done on demand. This looks
like one case of data layout transformation; I'm not sure there is
enough information to do that in the optimizer.

2) For now, does this option force array repacking unconditionally? I
think it wouldn't be too hard to model when such a data layout
transformation is beneficial by looking at the loop (nest) accessing
the array and comparing against the overhead.

> it be feasible in Fortran to version functions or loops if all arguments
> are contiguous slices?

I think a cost model is still needed for function/loop versioning.

Thanks,
bin

> Wilco
Re: Fortran array slices and -frepack-arrays
Bin.Cheng wrote:

> I don't know the implementation of the option, so two questions:
> 1) Is the repacking done during compilation? Is new code manipulating
>    the data layout added by the frontend? If yes, it might be better
>    to do it during optimization so that it can be done on demand. This
>    looks like one case of data layout transformation; I'm not sure
>    there is enough information to do that in the optimizer.

Yes, it adds a runtime check at function entry and packs array slices
which have a non-unit stride. Currently it uses a call to
_gfortran_internal_pack; however, this could be inlined and use alloca
rather than malloc for small slices. It might be possible to check
which parameters are used a lot (or benefit from vectorization) and
only pack those.

> 2) For now, does this option force array repacking unconditionally? I
>    think it wouldn't be too hard to model when such a data layout
>    transformation is beneficial by looking at the loop (nest) accessing
>    the array and comparing against the overhead.

Yes, it ensures all slices are packed, but that isn't strictly
necessary.

>> it be feasible in Fortran to version functions or loops if all
>> arguments are contiguous slices?

> I think a cost model is still needed for function/loop versioning.

Absolutely. If you statically know at the call site that all slices are
contiguous, you could compile a version of the function using the
contiguous attribute and skip all runtime checks. Such function
versioning would require LTO to work well.

Wilco
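As an illustrative sketch of that last point (hand-written here, not
something the compiler emits today): the Fortran 2008 contiguous
attribute on an assumed-shape dummy is what lets the compiler assume
unit stride inside the body; if a caller passes a non-contiguous
section, the processor must arrange a contiguous temporary
(copy-in/copy-out) at the call site. A contiguity-specialised clone
like this alongside the generic routine is roughly what such function
versioning would generate.

module work_mod_contig
contains
  subroutine scale_slice_contig(a)
    ! With the CONTIGUOUS attribute the body can be compiled assuming
    ! unit stride; a non-contiguous actual argument is handled via a
    ! temporary copy at the call site.
    real, contiguous, intent(inout) :: a(:)
    a = 2.0 * a
  end subroutine scale_slice_contig
end module work_mod_contig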