Aditya, the hot/cold partitioning is currently organized as follows:

1) we have static branch prediction code in predict.c, and profile feedback which we store into the cfg and callgraph
2) we have predicates
optimize_*_for_speed_p/size_p, where * can be function, basic block, cfg edge, callgraph edge, loop or loop nest
3) all optimization passes trading code size for speed should check the corresponding predicates to decide whether to do the transform
4) the ipa-split pass is mostly there to enable partial inlining
5) hot/cold partitioning in bb-reorder offlines cold code into a separate section
6) we have the ipa-icf pass to merge identical functions, and some code in FRE to unify identical code within a single function
7) we do shrink wrapping to mitigate register pressure problems caused in cold regions of code

I think this is a bit stronger than what llvm currently does, which relies on outlining SESE regions earlier rather than going through the pain of implementing support for partitioning.

Building clang10 with GCC and FDO leads to a 37MB .text section.
Building clang10 with GCC and LTO+FDO leads to a 34MB .text section.
Building clang10 with clang10 and FDO leads to a 53MB .text section.
Building clang10 with clang10 and thinlto+FDO leads to a 67MB .text section.
GCC-built clang is about 2-3% faster building Firefox.

There are many things which I think could/should improve in our framework.

1) our size optimization is very aggressive (llvm's -Oz), and enabling it based on heuristics may lead to huge regressions. We probably want the optimize_for_size_p predicate to have two levels and use a less aggressive mode in places where we are not sure the code is really very cold.
2) ipa-split is very simplistic and only splits when there is no value computed in the header of a function and used in its tail. We should support adding extra parameters for such values and do more general SESE outlining. Note that we do SESE outlining for OpenMP, but this code is not interfaced generically enough to be easily used by ipa-split. The original implementation of ipa-split was a kind of "first cut", trying to clean up interfaces to the rest of the compiler, with more fancy features to be implemented later. That never happened, so there is certainly space for improvements here.
We also do all splitting before the actual IPA optimizations, while it may be more reasonable to identify potential split points and let the IPA optimizers decide on the transforms (currently we rely on the inliner to inline back useless splits).
3) function partitioning is enabled only for x86. I never had time to get it working on other targets and no one picked up this task.
4) ipa-icf and in-function code merging are currently very conservative when comparing metadata like type-based aliasing info (I plan to work on this next stage1).
5) we have only very limited logic to detect cold regions without profile feedback, and thus the amount of offlined code is very small (this is also because of 1). We basically know that code leading to abort, exception handling etc. is cold, and consider everything else hot.
6) we lack a code placement pass (though Martin has a WIP implementation of one).
7) we do no partitioning of the data segment, which may also be interesting to do.
8) most non-x86 backends do not implement the hot/cold code models and instruction choice very well.
9) shrink wrapping and register allocation are not always able to move spilling to cold code paths, but this is generally a very hard problem to tackle.

So there are a lot of places for improvement (and I am sure more can be found), and I would be happy to help you with them.

Honza

> Hi Martin,
> Thank you for explaining the status quo. After reading the code of
> bb-reorder.c, it looks pretty good and seems it doesn't need any
> significant improvements.
> In that case, the only value GIMPLE-level hot/cold splitting could bring
> is to enable aggressive code-size optimization by merging
> similar/identical functions: after outlining cold regions, they may be
> suitable candidates for function merging.
> ipa-split might be enabling some of that; having a region-based function
> splitting could improve ipa-split.
> -Aditya
>
> --
> From: Martin Liška <mli...@suse.cz>
> Sent: Tuesday, March 3, 2020 2:47 AM
> To: Aditya K <hiradi...@msn.com>; gcc@gcc.gnu.org <gcc@gcc.gnu.org>
> Cc: Jan Hubicka <hubi...@ucw.cz>
> Subject: Re: GSoC topic: Implement hot cold splitting at GIMPLE IR level
>
> Hello.
> Thank you for the idea. I would like to provide some comments about what
> GCC can currently do, and I'm curious whether we need something extra on
> top of that.
> Right now, GCC can do hot/cold partitioning based on functions and basic
> blocks. With a PGO profile, the optimization is quite aggressive and can
> save quite some code by placing it into the cold partition and optimizing
> it for size. Without a profile, we do a static profile guess (predict.c),
> where we also propagate information about cold blocks
> (determine_unlikely_bbs). Later in RTL, we utilize the information and do
> the real reordering (bb-reorder.c).
>
> Martin