Re: Loop fusion.
On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene wrote:
> A few days ago there was a rant on the Fortran Standardization Committee's
> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
>
> An example picked at random from our weather forecasting code:
>
> ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
> ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
> ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
> ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>
> The reaction from one of the members of the committee (about "their"
> compiler):
>
> 'And multiple consecutive array statements with the same shape are “fused”
> exactly so that the compiler can generate good cache use. This sort of
> optimization is pretty low hanging fruit.'
>
> As far as I can see loop fusion as a stand-alone optimization is not
> supported as yet, although some mention is made in the context of graphite.
>
> Is this something that should be pursued ?

Hi,
I don't know the current status of fusion in graphite. As for
traditional fusion transformation, I think it's not very difficult to
be implemented along with existing distribution, actually, quite lot
of code should be shared. What we do need are something like: more
motivation cases, good/conservative cost model.

Thanks,
bin

>
> Kind regards,
>
> --
> Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
> Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
> At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
> Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
Re: Loop fusion.
On Sun, Apr 22, 2018 at 4:27 PM, Toon Moene wrote:
> A few days ago there was a rant on the Fortran Standardization Committee's
> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
>
> An example picked at random from our weather forecasting code:
>
> ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
> ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
> ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
> ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>
> The reaction from one of the members of the committee (about "their"
> compiler):
>
> 'And multiple consecutive array statements with the same shape are “fused”
> exactly so that the compiler can generate good cache use. This sort of
> optimization is pretty low hanging fruit.'
>
> As far as I can see loop fusion as a stand-alone optimization is not
> supported as yet, although some mention is made in the context of graphite.
>
> Is this something that should be pursued ?

In principle GRAPHITE can handle loop fusion but yes, standalone fusion
is sth useful. Note that while it looks "obvious" in the above source
fragment the IL that is presented to optimizers may make things a lot
less "low-hanging".

Richard.

> Kind regards,
>
> --
> Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
> Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
> At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
> Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
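For readers who want to see the transformation under discussion, here is a minimal sketch in C++ rather than Fortran. It assumes each whole-array assignment is lowered to its own loop, and it flattens each 2-D slice to a 1-D vector for brevity. The names `Plane`, `Fields`, `copy_unfused` and `copy_fused` are hypothetical and do not come from the thread or from GCC; whether the fused form is actually faster depends on the cost-model questions raised above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one slice of PGFL, flattened to 1-D.
using Plane = std::vector<double>;

struct Fields {
    Plane qice, qli, qrain, qsnow;
};

// Unfused form: one loop per array statement.
// Assumes all four inputs have the same length.
Fields copy_unfused(const Plane& yi, const Plane& yl,
                    const Plane& yr, const Plane& ys) {
    const std::size_t n = yi.size();
    Fields f{Plane(n), Plane(n), Plane(n), Plane(n)};
    for (std::size_t i = 0; i < n; ++i) f.qice[i]  = yi[i];
    for (std::size_t i = 0; i < n; ++i) f.qli[i]   = yl[i];
    for (std::size_t i = 0; i < n; ++i) f.qrain[i] = yr[i];
    for (std::size_t i = 0; i < n; ++i) f.qsnow[i] = ys[i];
    return f;
}

// Fused form: the four loops share an iteration space and have no
// dependences between them, so their bodies can be merged into one loop.
Fields copy_fused(const Plane& yi, const Plane& yl,
                  const Plane& yr, const Plane& ys) {
    const std::size_t n = yi.size();
    Fields f{Plane(n), Plane(n), Plane(n), Plane(n)};
    for (std::size_t i = 0; i < n; ++i) {
        f.qice[i]  = yi[i];
        f.qli[i]   = yl[i];
        f.qrain[i] = yr[i];
        f.qsnow[i] = ys[i];
    }
    return f;
}
```

Both functions produce identical results; only the loop structure differs, which is what a fusion pass would have to prove legal before merging.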
Re: Loop fusion.
On Mon, Apr 23, 2018 at 12:59 PM, Bin.Cheng wrote:
> On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene wrote:
>> A few days ago there was a rant on the Fortran Standardization Committee's
>> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
>>
>> An example picked at random from our weather forecasting code:
>>
>> ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
>> ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
>> ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
>> ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>>
>> The reaction from one of the members of the committee (about "their"
>> compiler):
>>
>> 'And multiple consecutive array statements with the same shape are “fused”
>> exactly so that the compiler can generate good cache use. This sort of
>> optimization is pretty low hanging fruit.'
>>
>> As far as I can see loop fusion as a stand-alone optimization is not
>> supported as yet, although some mention is made in the context of graphite.
>>
>> Is this something that should be pursued ?
> Hi,
> I don't know the current status of fusion in graphite. As for
> traditional fusion transformation, I think it's not very difficult to
> be implemented along with existing distribution, actually, quite lot
> of code should be shared. What we do need are something like: more
> motivation cases, good/conservative cost model.

Yes, I guess before distribution you want to do maximum fusion and then
apply (re-)distribution on the fused loop. The cost model should be the
very same for distribution/fusion.

Richard.

> Thanks,
> bin
>>
>> Kind regards,
>>
>> --
>> Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
>> Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
>> At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
>> Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
Re: Loop fusion.
On Mon, Apr 23, 2018 at 2:02 PM, Richard Biener wrote:
> On Mon, Apr 23, 2018 at 12:59 PM, Bin.Cheng wrote:
> > On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene wrote:
> >> A few days ago there was a rant on the Fortran Standardization Committee's
> >> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
> >>
> >> An example picked at random from our weather forecasting code:
> >>
> >> ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
> >> ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
> >> ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
> >> ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
> >>
> >> The reaction from one of the members of the committee (about "their"
> >> compiler):
> >>
> >> 'And multiple consecutive array statements with the same shape are “fused”
> >> exactly so that the compiler can generate good cache use. This sort of
> >> optimization is pretty low hanging fruit.'
> >>
> >> As far as I can see loop fusion as a stand-alone optimization is not
> >> supported as yet, although some mention is made in the context of graphite.
> >>
> >> Is this something that should be pursued ?
> > Hi,
> > I don't know the current status of fusion in graphite. As for
> > traditional fusion transformation, I think it's not very difficult to
> > be implemented along with existing distribution, actually, quite lot
> > of code should be shared. What we do need are something like: more
> > motivation cases, good/conservative cost model.
>
> Yes, I guess before distribution you want to do maximum fusion and then
> apply (re-)distribution on the fused loop. The cost model should be the
> very same for distribution/fusion.
>
> Richard.
>

I recall Fujitsu bragging that the key to them getting good application
performance (read: outside linpack) on the K computer is extensive use of
loop FISSION + software pipelining. Though I guess sw-pipelining is only
useful if you have lots of architectural registers, which disqualifies
x86-64..

--
Janne Blomqvist
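Loop fission (also called loop distribution) is the inverse of fusion: one loop whose body contains independent statements is split into several loops over the same range. The sketch below is only an illustration under simple assumptions, namely two statements with no dependence between them and equally sized arrays. The function names `combined` and `distributed` are hypothetical and are not taken from any compiler.

```cpp
#include <cstddef>
#include <vector>

// Single loop with two independent statements in its body.
// Assumes a, b and x all have the same length.
void combined(std::vector<double>& a, std::vector<double>& b,
              const std::vector<double>& x) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        a[i] = x[i] * 2.0;
        b[i] = x[i] + 1.0;
    }
}

// Distributed (fissioned) form: each statement gets its own loop.
// This is legal here because neither statement reads what the other writes.
void distributed(std::vector<double>& a, std::vector<double>& b,
                 const std::vector<double>& x) {
    for (std::size_t i = 0; i < x.size(); ++i) a[i] = x[i] * 2.0;
    for (std::size_t i = 0; i < x.size(); ++i) b[i] = x[i] + 1.0;
}
```

Smaller loop bodies keep fewer values live at once, which is one reason fission is often paired with software pipelining; whether it pays off on a given target is a cost-model question.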
Re: Loop fusion.
On Mon, Apr 23, 2018 at 2:31 PM, Janne Blomqvist wrote:
> On Mon, Apr 23, 2018 at 2:02 PM, Richard Biener wrote:
>>
>> On Mon, Apr 23, 2018 at 12:59 PM, Bin.Cheng wrote:
>> > On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene wrote:
>> >> A few days ago there was a rant on the Fortran Standardization Committee's
>> >> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
>> >>
>> >> An example picked at random from our weather forecasting code:
>> >>
>> >> ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
>> >> ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
>> >> ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
>> >> ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>> >>
>> >> The reaction from one of the members of the committee (about "their"
>> >> compiler):
>> >>
>> >> 'And multiple consecutive array statements with the same shape are “fused”
>> >> exactly so that the compiler can generate good cache use. This sort of
>> >> optimization is pretty low hanging fruit.'
>> >>
>> >> As far as I can see loop fusion as a stand-alone optimization is not
>> >> supported as yet, although some mention is made in the context of graphite.
>> >>
>> >> Is this something that should be pursued ?
>> > Hi,
>> > I don't know the current status of fusion in graphite. As for
>> > traditional fusion transformation, I think it's not very difficult to
>> > be implemented along with existing distribution, actually, quite lot
>> > of code should be shared. What we do need are something like: more
>> > motivation cases, good/conservative cost model.
>>
>> Yes, I guess before distribution you want to do maximum fusion and then
>> apply (re-)distribution on the fused loop. The cost model should be the
>> very same for distribution/fusion.
>>
>> Richard.
>
> I recall Fujitsu bragging that the key to them getting good application
> performance (read: outside linpack) on the K computer is extensive use of
> loop FISSION + software pipelining. Though I guess sw-pipelining is only
> useful if you have lots of architectural registers, which disqualifies
> x86-64..

FISSION we can do quite well (though we lack a cost model here), that's
what loop distribution does.

Richard.

>
> --
> Janne Blomqvist
bug ? : -Wpedantic -Wconversion 'short a=1; a-=1;' complaint
I really do not think a '-Wpedantic -Wconversion' warning should
be generated for the following code, but it is
(with GCC 6.4.1 and 7.3.1 on RHEL-7.5 Linux) :

  $ echo ' typedef unsigned short U16_t; static void f(void) { U16_t a = 1; a-=1; }' > t.C;
  $ g++ -std=c++14 -Wall -Wextra -pedantic -Wc++11-compat \
        -Wconversion -Wcast-align -Wcast-qual -Wfloat-equal \
        -Wmissing-declarations -Wlogical-op -Wpacked -Wundef \
        -Wuninitialized -Wvariadic-macros -Wwrite-strings \
        -c t.C -o /dev/null
  t.C:4:8: conversion to 'U16_t' {aka short unsigned int} from 'int' may \
  alter its value.

I don't control the warning flags, as shown above, that my code must
compile against without warnings.

But I think the warning issued above is a GCC bug - when I look at
the code generated, (compile with -S -o t.S), I see it actually does
generate a 16-bit subtraction, which is what I wanted:

        .file   "t.C"
        .text
        .globl  _Z1fv
        .type   _Z1fv, @function
_Z1fv:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movw    $1, -2(%rbp)
        subw    $1, -2(%rbp)    # this is a two-byte word subtract
        nop
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   _Z1fv, .-_Z1fv
        .ident  "GCC: (GNU) 6.4.1 20180321"
        .section        .note.GNU-stack,"",@progbits

So why is it generating a warning about conversion from 'int' ?
I don't see any conversion going on in assembler output -
(cast 'a' to a temporary int, do 32-bit integer subtraction of '1'
from it, and store sign-extended low 16 bits in 'a' - this is NOT
what is going on here ) .

I'd like to remove either '-pedantic' or '-Wconversion' from the
warning flags, but this is not an option .

Please can GCC fix this warning bug eventually - I have to wade
through code that generates thousands of them per compilation.

Thanks & Best Regards,
Jason
Re: bug ? : -Wpedantic -Wconversion 'short a=1; a-=1;' complaint
On 23 April 2018 at 15:11, Jason Vas Dias wrote:
> Please can GCC fix this warning bug eventually - I have to wade
> through code that generates thousands of them per compilation.

gcc@gcc.gnu.org is for discussing development of GCC, not bugs.

gcc-b...@gcc.gnu.org is for automated emails generated from Bugzilla.

Cross-posting to those lists is never appropriate.

If you want to report a bug then please use Bugzilla, as per
https://gcc.gnu.org/bugs/

If you want to ask questions about using GCC use the
gcc-h...@gcc.gnu.org list.

Bug reports to gcc@gcc.gnu.org don't get filed in Bugzilla and so
don't get fixed.
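As background on why the diagnostic mentions 'int': in C++, `a -= 1` is defined as `a = a - 1` with `a` evaluated once, and the operands of `-` undergo integral promotion first. On a platform where `int` is wider than `short` (assumed here, as on x86-64 Linux), `a - 1` therefore has type `int`, and storing it back into a `U16_t` is a narrowing conversion at the language level. The compiler may still emit a 16-bit subtract, because the truncated result is the same. The sketch below shows the promoted type and one commonly used way to state the narrowing explicitly; the helper name `decrement` is hypothetical.

```cpp
#include <type_traits>

typedef unsigned short U16_t;

// Assumption: int can represent every unsigned short value, so the
// operands of '-' are promoted to int and the result type is int.
static_assert(std::is_same<decltype(U16_t{} - 1), int>::value,
              "U16_t - 1 is computed in int after integral promotion");

// Hypothetical helper: the explicit cast spells out the narrowing that
// 'a -= 1' performs implicitly.
U16_t decrement(U16_t a) {
    a = static_cast<U16_t>(a - 1);
    return a;
}
```

The cast does not change the computed value; it only makes the int-to-U16_t conversion explicit in the source.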
Re: style of code examples in changes.html
On Mon, 2018-04-16 at 20:34 -0600, Martin Sebor wrote:
> Hi David & Gerald,

(sorry for the late response; I was offline on vacation last week)

> I noticed that the coding examples in the updates I committed
> to changes.html use a different formatting style than David's.
> I just copied mine from GCC 7 changes.html, and those I copied
> from David's for that version :)

There are at least two kinds of example in the website:
(a) source code examples, and
(b) "screenshots" of gcc output, which can themselves contain code
    output as part of a diagnostic.

I got sick of hand-converting (b) to our HTML tags, so I wrote a script
to do it, which I used for my gcc-8/changes.html. The script is in the
website's CVS repository as:
  bin/gcc-color-to-html.py
and can be run like this:

  LANG=C \
    gcc $@ \
      -fdiagnostics-color=always 2>&1 \
    | ./bin/gcc-color-to-html.py

See https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00186.html

I also added a around the output, though this isn't done by the above
script.

I actually had a fair bit more scripting than this, based on the
scripting I did for my blogpost here:
  https://github.com/davidmalcolm/gcc-8-blogpost/blob/master/blog.html.in
where lines like:
  INVOKE_GCC unclosed.c
in a foo.html.in get turned into a "screenshot" of the pertinent gcc
invocation in the foo.html. But given that we don't want to require
running gcc itself to build the website (and indeed, specific gcc
versions), I just used this to generate the patch.

> Should we make an effort to
> make them all look the same?

Naturally, for (b), I favor the new style I used :) (using the black
background, which may be enough to get the same look).

I'm not sure if we want to use it for (a).

> FWIW, I didn't notice the difference until my changes published.
> I'm guessing that's because the style sheet the page uses isn't
> referenced from the original document and the reference is only
> added by Gerald's script. Is there a simple way to set things
> up so we can see our changes as they will appear when published?

I've been adding these lines to the of the page: while testing the
content.

Hope this is helpful
Dave
Re: Loop fusion.
On 04/23/2018 01:00 PM, Richard Biener wrote:
> On Sun, Apr 22, 2018 at 4:27 PM, Toon Moene wrote:
>> A few days ago there was a rant on the Fortran Standardization Committee's
>> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
>>
>> An example picked at random from our weather forecasting code:
>>
>> ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
>> ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
>> ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
>> ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>>
>> The reaction from one of the members of the committee (about "their"
>> compiler):
>>
>> 'And multiple consecutive array statements with the same shape are “fused”
>> exactly so that the compiler can generate good cache use. This sort of
>> optimization is pretty low hanging fruit.'
>>
>> As far as I can see loop fusion as a stand-alone optimization is not
>> supported as yet, although some mention is made in the context of graphite.
>>
>> Is this something that should be pursued ?
>
> In principle GRAPHITE can handle loop fusion but yes, standalone fusion
> is sth useful. Note that while it looks "obvious" in the above source
> fragment the IL that is presented to optimizers may make things a lot
> less "low-hanging".

Well, the loops are generated by the front end, so I *assume* they are
basically the same ...

Probably the largest problem to address is the heuristic for preventing
register pressure going through the roof ...

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
r9 on ARM32?
I'm wondering what is the role of r9 on ARM32, on Linux and Android.

On Apple it is documented as long ago reserved, these days available
for scratch.

I've looked around a bit but haven't gotten the full answer.

It is "the PIC register", I see. What does that imply?
Volatile? Non-volatile?

In particular I'm looking for a spare register, to pass an extra
"special" parameter in, that can be considered volatile and never
otherwise has a parameter.

Most ABIs have a few candidates, but arm32 comes up relatively short.

The intra-procedure-call scratch register (r12) probably cannot work
for me. I know gcc uses it for nested function context and that is
laudable. I wish I could guarantee no code between me setting it and
it being consumed.

And if it is volatile, I'd want the dynamic linker stubs to still
preserve it incoming.

Thank you,
- Jay
-g and -fvar-tracking
Can somebody remind me why using -g doesn't also enable -fvar-tracking by default? At least for -g2, which is supposed to emit debug information about local variables? It seems kind of counterintuitive to me that specifying a -O option enables a pass to collect better debug information but specifying -g to request debuggable code doesn't. :-S -Sandra