Re: Worse code after bbro?
On Wed, 21 Dec 2016, Senthil Kumar Selvaraj wrote:

> Hi,
>
> For this C code (slightly modified from PR 30908)
>
> void wait(int i)
> {
>   while (i-- > 0)
>     asm volatile("nop" ::: "memory");
> }
>
> gcc 4.8 at -Os produces
>
>         jmp     .L2
> .L3:
>         nop
>         decl    %edi
> .L2:
>         testl   %edi, %edi
>         jg      .L3
>         ret
>
> whereas gcc trunk (and 4.9 onwards, from a quick check) produces
>
> .L2:
>         testl   %edi, %edi
>         jle     .L5
>         nop
>         decl    %edi
>         jmp     .L2
> .L5:
>         ret
>
> The code size is identical, but the trunk version executes one more
> instruction every time the loop runs (explicit jump to .L5 with trunk vs
> fallthrough with 4.8) - it's faster only if the loop never runs. This
> happens irrespective of the memory clobber inline assembler statement.
>
> Digging into the dump files, I found that the transformation occurs in
> the bb reorder pass, when it calls cfg_layout_initialize, which
> eventually calls try_redirect_by_replacing_jump with in_cfglayout set to
> true. That function then removes the jump and causes the RTL
> transformation that eventually results in slower code.
>
> Is this intentional? If not, what would be the best way to fix this?

I believe that doing BB reorder in CFG layout mode is fundamentally
flawed but I guess it's wired up so that out-of-CFG layout honors
EDGE_FALLTHRU.

In any case, why does BB reorder not "fix" the "bogus" reorder that
into-CFG-layout performs?

Richard.

> Regards
> Senthil
>
> RTL before and after bbro.
>
> Before:
>
> (jump_insn 24 6 25 2 (set (pc)
>         (label_ref 15)) "pr30908.c":3 678 {jump}
>      (nil)
>  -> 15)
> (barrier 25 24 17)
> (code_label 17 25 12 3 3 "" [1 uses])
> (note 12 17 13 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
> (insn 13 12 14 3 (parallel [
>             (asm_operands/v ("nop") ("") 0 []
>                  []
>                  [] pr30908.c:4)
>             (clobber (mem:BLK (scratch) [0 A8]))
>             (clobber (reg:CCFP 18 fpsr))
>             (clobber (reg:CC 17 flags))
>         ]) "pr30908.c":4 -1
>      (expr_list:REG_UNUSED (reg:CCFP 18 fpsr)
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> (insn 14 13 15 3 (parallel [
>             (set (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
>                 (plus:SI (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
>                     (const_int -1 [0x])))
>             (clobber (reg:CC 17 flags))
>         ]) 210 {*addsi_1}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
> (code_label 15 14 16 4 2 "" [1 uses])
> (note 16 15 18 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 18 16 19 4 (set (reg:CCNO 17 flags)
>         (compare:CCNO (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
>             (const_int 0 [0]))) "pr30908.c":3 3 {*cmpsi_ccno_1}
>      (nil))
> (jump_insn 19 18 30 4 (set (pc)
>         (if_then_else (gt (reg:CCNO 17 flags)
>                 (const_int 0 [0]))
>             (label_ref 17)
>             (pc))) "pr30908.c":3 646 {*jcc_1}
>      (expr_list:REG_DEAD (reg:CCNO 17 flags)
>         (int_list:REG_BR_PROB 8500 (nil)))
>  -> 17)
> (note 30 19 28 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
> (note 28 30 29 5 NOTE_INSN_EPILOGUE_BEG)
> (jump_insn 29 28 31 5 (simple_return) "pr30908.c":5 708 {simple_return_internal}
>      (nil)
>  -> simple_return)
>
> After:
>
> (code_label 15 6 16 3 2 "" [1 uses])
> (note 16 15 18 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
> (insn 18 16 19 3 (set (reg:CCNO 17 flags)
>         (compare:CCNO (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
>             (const_int 0 [0]))) "pr30908.c":3 3 {*cmpsi_ccno_1}
>      (nil))
> (jump_insn 19 18 12 3 (set (pc)
>         (if_then_else (le (reg:CCNO 17 flags)
>                 (const_int 0 [0]))
>             (label_ref:DI 34)
>             (pc))) "pr30908.c":3 646 {*jcc_1}
>      (expr_list:REG_DEAD (reg:CCNO 17 flags)
>         (int_list:REG_BR_PROB 1500 (nil)))
>  -> 34)
> (note 12 19 13 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 13 12 14 4 (parallel [
>             (asm_operands/v ("nop") ("") 0 []
>                  []
>                  [] pr30908.c:4)
>             (clobber (mem:BLK (scratch) [0 A8]))
>             (clobber (reg:CCFP 18 fpsr))
>             (clobber (reg:CC 17 flags))
>         ]) "pr30908.c":4 -1
>      (expr_list:REG_UNUSED (reg:CCFP 18 fpsr)
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> (insn 14 13 35 4 (parallel [
>             (set (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
>                 (plus:SI (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
>                     (const_int -1 [0x])))
>             (clobber (reg:CC 17 flags))
>         ]) 210 {*addsi_1}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
> (jump_insn 35 14 36 4 (set (pc)
>         (label_ref 15)) -1
>      (nil)
>  -> 15)
> (barrier 36 35 34)
> (code_label 34 36 30 5 5 "" [1 uses])
> (note 30 34 28 5 [bb 5] NOTE_INSN_BASIC_
Re: LTO remapping/deduction of machine modes of types/decls
On Mon, 2 Jan 2017, Jakub Jelinek wrote:

> On Fri, Dec 30, 2016 at 08:40:11PM +0300, Alexander Monakov wrote:
> > Hello, Richard, Jakub, community,
> >
> > May I join/restart the old discussion about machine mode remapping at LTO
> > stream-in time. To recap, when offloading to NVPTX was introduced, there
> > was a problem due to differences in the set of supported modes (e.g. there
> > was no 'XFmode' on NVPTX that would correspond to 'long double' tree type
> > node in GIMPLE LTO streams produced by x86 host compiler).
> >
> > The current solution in GCC is to additionally stream a 'mode table' and
> > use it to remap numeric mode identifiers during LTO stream-in in all trees
> > that have modes. This is the solution initially outlined by Jakub in the
> > message https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00226.html . In
> > response to that, Richard said,

My suggestion at that time is unlikely to work in practice due to
the limitations Jakub outlines below. The situation is a bit unfortunate,
but expect to run into more host(!) dependences in the LTO bytecode.
Yes, while it would be nice to LTO x86_64->arm and ppc64le->arm
LTO bytecode, it very likely isn't going to work.

> In my view mode is an essential part of the type system. It (sadly, but still)
> participates in many ABI decisions, but more importantly especially for
> floating point types it is the main source of information of what the type
> actually is, as just size and precision are nowhere near enough.
> The precision/size isn't able to carry information like whether the type is
> decimal or binary floating, what padding it has and where, what NaN etc.
> conventions it uses. So trying to throw away modes and reconstruct them
> looks conceptually wrong to me. One can also just use
> float __attribute__((mode (XFmode))) or float __attribute__((mode (TFmode)))
> or float __attribute__((mode (KFmode))) or IFmode etc., how do you want to
> differentiate between those?  And I don't see how this can help with the
> long double stuff for NVPTX offloading. If user uses 80-bit long double
> (or mode(XFmode) floats/doubles) in his source, then as PTX only has SFmode
> and DFmode (perhaps also HFmode?), the only way to get it working is through
> emulation (whether soft-fp, or writing some emulation using double,
> whatever). Pretending long double on the host is DFmode on the PTX side
> just won't work, they have different representation.
>
> 	Jakub

-- 
Richard Biener
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
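Jakub's point that size alone cannot identify a floating-point type can be checked with a small C sketch. It assumes GCC targeting x86_64, where both modes exist; the attribute argument is spelled `XF` and `TF` here, and the typedef and function names are made up for illustration. On any other compiler or target the helpers return -1 rather than guess.

```c
/* Sketch only: two float types with the same storage size but
   different formats. Assumes GCC on x86_64; elsewhere the helpers
   return -1 ("not applicable"). All names are hypothetical. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
typedef float xf_float __attribute__((mode(XF))); /* x87 extended, 64-bit significand */
typedef float tf_float __attribute__((mode(TF))); /* IEEE binary128, 113-bit significand */
#define HAVE_XF_AND_TF 1
#endif

/* 1 if both types occupy the same number of bytes. */
int same_storage_size(void)
{
#ifdef HAVE_XF_AND_TF
    return sizeof(xf_float) == sizeof(tf_float);
#else
    return -1;
#endif
}

/* 1 if both types agree on whether 1 + 2^-64 differs from 1. */
int same_precision(void)
{
#ifdef HAVE_XF_AND_TF
    xf_float xe = 1;
    tf_float te = 1;
    for (int i = 0; i < 64; i++) {  /* build 2^-64 in each type */
        xe /= 2;
        te /= 2;
    }
    int xf_keeps_bit = (1 + xe) != 1;  /* tie rounds back to 1 in XF */
    int tf_keeps_bit = (1 + te) != 1;  /* exactly representable in TF */
    return xf_keeps_bit == tf_keeps_bit;
#else
    return -1;
#endif
}
```

Under those assumptions same_storage_size() reports that the two types are the same size while same_precision() reports that they behave differently, which is the information a mode carries and a byte size does not.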
Re: LTO remapping/deduction of machine modes of types/decls
On Mon, 2 Jan 2017, Jakub Jelinek wrote:

> On Mon, Jan 02, 2017 at 09:49:55PM +0300, Alexander Monakov wrote:
> > On Mon, 2 Jan 2017, Jakub Jelinek wrote:
> > > If the host has long double the same as double, sure, PTX can use its
> > > native DFmode even for long double. But otherwise, the storage must be
> > > transferable between accelerator and host.
> >
> > Hm, sorry, the 'must' is not obvious to me: is it known that the OpenMP ARB
> > would find only this implementation behavior acceptable?
>
> long double is not a non-mappable type in the spec, so it is supposed to work.
> The implementation may choose not to offload whenever it sees long
> double/__float128/_Float128/_Float128x etc.
>
> > Apart from floating-point types, are there other situations where modes
> > carry information not deducible from the rest of the tree node?
>
> Dunno about fixed types, partial ints etc., but it is mostly floating point
> types, sure.

Mostly floats I guess. But just to say it would be very nice to have
enough information in the trees so layout_type can re-construct the
mode. It already does for 99% of the types... (just grep for
SET_TYPE_MODE).

Richard.
Re: Worse code after bbro?
On Wed, Jan 04, 2017 at 10:05:49AM +0100, Richard Biener wrote:
> > The code size is identical, but the trunk version executes one more
> > instruction every time the loop runs (explicit jump to .L5 with trunk vs
> > fallthrough with 4.8) - it's faster only if the loop never runs. This
> > happens irrespective of the memory clobber inline assembler statement.

With -Os you've asked for smaller code, not faster code.

All of the block reordering is based on heuristics -- there is no
polynomial time and space algorithm to do it optimally, let alone the
linear time and space we need in GCC -- so there always will be cases
we do not handle optimally. -Os also does not get as much attention as
-O2 etc.

OTOH this seems to be a pretty common case that we could handle. Could
you please open a PR to keep track of this?

> I believe that doing BB reorder in CFG layout mode is fundamentally
> flawed but I guess it's wired up so that out-of-CFG layout honors
> EDGE_FALLTHRU.

Why is this fundamentally flawed? The reordering is much easier this
way.

> In any way, why does BB reorder not "fix" the "bogus" reorder
> into-CFG-layout performs?

I'm not sure what bogus reorder you're talking about here?
cfg_layout_initialize should not reorder anything (other than the usual
things cleanup_cfg does)?


Segher
Re: GCC libatomic ABI specification draft
On 22/12/16 17:37, Segher Boessenkool wrote:
> We do not always have all atomic instructions. Not all processors have
> all, and it depends on the compiler flags used which are used. How would
> libatomic know what compiler flags are used to compile the program it is
> linked to?
>
> Sounds like a job for multilibs?

x86_64 uses ifunc dispatch to always use atomic
instructions if available (which is bad because
ifunc is not supported on all platforms).

either such runtime feature detection and dispatch
is needed in libatomic or different abis have to
be supported (with the usual hassle).
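The compile-time half of this question can be sketched in C with the `__atomic_always_lock_free` builtin, which GCC documents as reporting whether objects of a given size always get lock-free atomic instructions on the target being compiled for. The macro and function names below are made up for illustration, and the sketch shows only what one translation unit can know; it does not show the runtime detection or ifunc dispatch discussed above.

```c
#include <stddef.h>

/* Sketch only: what the compiler can promise for THIS translation
   unit with the flags in effect. A 0 result means the operation may
   be left to the library at run time. Names are hypothetical. */
#define INLINE_LOCK_FREE(size) __atomic_always_lock_free((size), 0)

/* Expected to be 1 on common targets: int-sized atomics are inlined. */
int int_atomics_inline(void)
{
    return INLINE_LOCK_FREE(sizeof(int));
}

/* 16-byte objects: the answer depends on target and compiler flags. */
int wide_atomics_inline(void)
{
    return INLINE_LOCK_FREE(16);
}
```

Because the result is fixed when each object file is compiled, code built with different flags can reach different answers for the same size, and the library has no view of the flags used for the program it is linked into.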
Re: Worse code after bbro?
On 01/04/2017 03:46 AM, Segher Boessenkool wrote:
> On Wed, Jan 04, 2017 at 10:05:49AM +0100, Richard Biener wrote:
>>> The code size is identical, but the trunk version executes one more
>>> instruction every time the loop runs (explicit jump to .L5 with trunk vs
>>> fallthrough with 4.8) - it's faster only if the loop never runs. This
>>> happens irrespective of the memory clobber inline assembler statement.
>
> With -Os you've asked for smaller code, not faster code.
>
> All of the block reordering is based on heuristics -- there is no
> polynomial time and space algorithm to do it optimally, let alone the
> linear time and space we need in GCC -- so there always will be cases
> we do not handle optimally. -Os does not get as much attention as -O2
> etc., as well.
>
> OTOH this seems to be a pretty common case that we could handle. Please
> open a PR to keep track of this?

I superficially looked at this a little while ago and concluded that it's
something we ought to be able to handle. However, it wasn't critical
enough to me to get familiar enough with the bbro code to deeply
analyze -- thus I put it into my gcc-8 queue.

>> I believe that doing BB reorder in CFG layout mode is fundamentally
>> flawed but I guess it's wired up so that out-of-CFG layout honors
>> EDGE_FALLTHRU.
>
> Why is this fundamentally flawed? The reordering is much easier this
> way.

Agreed (that we ought to be doing reordering in CFG layout mode).

Jeff
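The cost difference between the two layouts quoted in this thread can be made concrete with a small C model. This is a sketch that counts executed instructions only, not cycles or branch prediction effects; the function names are made up, and the labels mirror the ones in the quoted assembly.

```c
/* Sketch only: count the instructions each quoted layout executes
   for a given starting value of i. Function names are hypothetical. */

/* gcc 4.8 layout: jmp .L2; .L3: nop; decl; .L2: testl; jg .L3; ret */
long count_gcc48_layout(int i)
{
    long n = 0;
    n++;                    /* jmp .L2 */
    goto L2;
L3:
    n++;                    /* nop */
    i--; n++;               /* decl %edi */
L2:
    n++;                    /* testl %edi, %edi */
    n++;                    /* jg .L3 */
    if (i > 0)
        goto L3;
    n++;                    /* ret */
    return n;
}

/* trunk layout: .L2: testl; jle .L5; nop; decl; jmp .L2; .L5: ret */
long count_trunk_layout(int i)
{
    long n = 0;
L2:
    n++;                    /* testl %edi, %edi */
    n++;                    /* jle .L5 */
    if (i <= 0)
        goto L5;
    n++;                    /* nop */
    i--; n++;               /* decl %edi */
    n++;                    /* jmp .L2 */
    goto L2;
L5:
    n++;                    /* ret */
    return n;
}
```

For k iterations the model gives 4k + 4 instructions for the 4.8 layout and 5k + 3 for the trunk layout: trunk wins by one instruction when the loop never runs, the two tie at one iteration, and trunk pays one extra instruction for each further iteration.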