On Fri, Aug 5, 2011 at 7:40 AM, Jan Hubicka <j...@suse.de> wrote:
> On Fri 05 Aug 2011 09:32:05 AM CEST, Richard Guenther
> <richard.guent...@gmail.com> wrote:
>
>> On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka <j...@suse.de> wrote:
>>>>>
>>>>> Did you try using FDO with -Os? FDO should make hot code parts
>>>>> optimized similar to -O3 but leave other pieces optimized for
>>>>> size. Using FDO with -O3 gives you the opposite, cold portions
>>>>> optimized for size while the rest is optimized for speed.
>>>
>>> FDO with -Os still optimizes for size, even in hot parts.
>>
>> I don't think so. Or at least that would be a bug. Shouldn't 'hot'
>> BBs/functions be optimized for speed even at -Os? Hm, I see
>> predict.c indeed always returns false for optimize_size :(
>
> It was the outcome of a discussion held some time ago. I think it was
> Mark promoting the point that users optimize for size when they use
> -Os, period.
>
> I thought we had just the parts that are neither cold nor hot
> optimized according to optimize_size. I originally wanted attribute
> HOT to override -Os, so that well-annotated sources (i.e. the kernel)
> could compile with -Os by default and explicitly declare the hot
> parts hot to get them compiled appropriately.
>
> With profile feedback, however, the current logic is binary: blocks
> are either hot, when their count is bigger than the threshold, or
> cold. We don't really have an "I don't know" state there. In some
> cases it would make sense - i.e. there are optimizations that we want
> to do only in the hottest parts of the code, but we don't have any
> logic for that.
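[A minimal sketch of the annotation scheme Honza describes above,
assuming GCC's hot/cold function attributes. Whether attribute hot
actually overrides -Os is exactly the question under discussion here;
the function names and bodies are invented for illustration:]

  /* Rarely executed: error handling, fine to keep small.  */
  __attribute__((cold)) void report_error (const char *msg);

  /* Known-hot inner loop: the intent is that this gets optimized for
     speed even when the translation unit is built with -Os.  */
  __attribute__((hot)) long
  checksum (const unsigned char *buf, long n)
  {
    long sum = 0;
    for (long i = 0; i < n; i++)
      sum += buf[i];
    return sum;
  }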
For the profile summary at the function/cgraph_node level, there are
three states: hot, unlikely, and normal. At the BB/EDGE level there are
three states too, but the implementation turns them into two (by
querying only 'maybe_hot_bb'): hot and not hot --- instead of 'hot',
'neither hot nor cold', and 'cold'.

David

> My plan is to extend ipa-profile to do better hot/cold partitioning
> first: at the moment we decide based on a fixed fraction of the
> maximal count in the program. This is unnecessarily conservative for
> programs with not terribly flat profiles. At the IPA level we could
> collect a histogram of instruction counts (i.e. figure out how much
> time we spend on instructions executed N times) and then figure out
> where the threshold is so that 99% of executed instructions belong to
> the hot region. This should give noticeably smaller binaries.
>
>>> So to get reasonable speedups you need -O3+FDO. -O3+FDO effectively
>>> defaults to -Os in cold portions of the program.
>>
>> Well, but unless your training coverage is 100%, all parts with no
>> coverage get optimized with -O3 instead of -Os. And I bet coverage
>> for mozilla isn't even close to 100%. Thus I think recommending -O3
>> for FDO is usually a bad idea.
>
> Code with no coverage is cold in our model (as is code executed only
> once or so) and thus optimized for size even at -O3+FDO. This is a
> bit aggressive on the optimizing-for-size side. We might consider
> changing this policy, but so far I didn't see any complaints about
> this...
>
> Honza
>
>> So - did you try FDO with -O2? ;)
>>
>>> Still, -Os+FDO should be somewhat faster than -Os alone, so a
>>> slowdown is a bug. It is not tested very thoroughly since it is not
>>> really used in practice.
>>>
>>>>> Also, do you get any warnings on profile mismatches? Perhaps
>>>>> something is wrong to the degree that the relevant part of the
>>>>> profile gets misapplied.
>>>>
>>>> I don't get any warnings on profile mismatches. I only get a "few"
>>>> missing gcda files warnings, but that's expected.
>>>
>>> Perhaps you could compile one of the less trivial files you are
>>> sure is covered by the train run and send me -fdump-tree-all-blocks
>>> -fdump-ipa-all dumps of the compilation so I can double-check that
>>> the profile seems sane. This could be a good start to rule out
>>> something stupid.
>>>
>>> Honza
>>>>
>>>> Cheers,
>>>>
>>>> Mike
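[A hedged sketch of the histogram-based threshold selection Honza
proposes above - not actual ipa-profile code; all names and types are
invented for illustration. Given the execution count of each static
instruction, it finds the smallest count such that instructions at or
above it account for 99% of all dynamically executed instructions:]

  #include <stdlib.h>

  static int
  cmp_desc (const void *a, const void *b)
  {
    unsigned long x = *(const unsigned long *) a;
    unsigned long y = *(const unsigned long *) b;
    return (x < y) ? 1 : (x > y) ? -1 : 0;
  }

  /* counts[i] is the execution count of static instruction i.
     Sorts COUNTS in place; overflow of the total * 99 product is
     ignored for the sake of the sketch.  */
  unsigned long
  hot_threshold (unsigned long *counts, size_t n)
  {
    unsigned long total = 0, covered = 0;
    for (size_t i = 0; i < n; i++)
      total += counts[i];

    qsort (counts, n, sizeof *counts, cmp_desc);

    /* Walk from the hottest instructions down until 99% of the
       dynamic execution is covered; the last count consumed is the
       hot/cold threshold.  */
    for (size_t i = 0; i < n; i++)
      {
        covered += counts[i];
        if (covered * 100 >= total * 99)
          return counts[i];
      }
    return 0;
  }

[Unlike the current fixed-fraction-of-maximal-count heuristic, such a
threshold adapts to how flat the profile is, which appears to be the
point of the plan.]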