On Wed, Apr 24, 2013 at 6:37 AM, Teresa Johnson <tejohn...@google.com> wrote:
> On Mon, Apr 22, 2013 at 11:16 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> Hi,
>> sorry for getting back to this late.
>>> >> That's a larger error than I had expected from the merging, although
>>> >> as you note it is an approximation, so there is going to be some amount
>>> >> of error. If the error is that large then maybe there is a better
>>> >> merging algorithm, since in the non-LTO case we won't be able to
>>> >> recompute it exactly. For cc1, what was your test case -
>>> >> profiledbootstrap or something simpler? I can try to reproduce this
>>> >> and see if it is another bug or just due to the approximation.
>>>
>>> I've been using Rong's tool to compute the exactly merged histogram
>>> from the gcov-merged histogram for perlbmk. I tried a couple of test
>>> cases - with the 5 train inputs, and with the 3 ref inputs. In both
>>> cases I am seeing up to 200% or so difference in some of the working
>>> set min counter values, although the error is not as high for the
>>> higher working set percentages. But it is large enough to cause a
>>> potential performance issue nonetheless.
>>
>> One thing that confuses me is why the error tends to be in the positive
>> direction. Since we are minimizing the counter during merging, I would
>> expect us to more or less consistently underestimate the counters. Do
>> you have any intuition here?
>
> Hi Honza,
>
> Yes, I think I know what is going on. What to do about it is a
> different story. =)
>
> From comparing the histograms merged by libgcov to the exactly merged
> histograms produced by Rong's tool at larger histogram sizes, I think
> the main issue is that the hotspots are slightly different between
> different runs, and this causes inaccuracies in the merged histogram
> unless you have global counter information.
>
> I tried increasing the histogram up to a much larger size - 16128.
> This is 256 linear subranges per bit instead of the default 4.
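[The histogram sizes discussed here follow from a log-linear bucketing scheme: counter values are grouped by the position of their most-significant bit, and each power-of-two range is split into a fixed number of linear subranges. The sketch below illustrates that indexing in simplified form; the function name, arguments, and the handling of small values are illustrative choices, not gcov's exact implementation.]

```python
def histo_index(value, subranges_per_bit=4):
    """Log-linear bucket index (simplified sketch, not gcov's code):
    bucket by most-significant bit, then split each power-of-two
    range [2^msb, 2^(msb+1)) into linear subranges."""
    if value == 0:
        return 0
    msb = value.bit_length() - 1
    if (1 << msb) < subranges_per_bit:
        # Ranges too small to subdivide: one bucket per value.
        return value
    # Linear offset of value within its power-of-two range.
    offset = (value - (1 << msb)) * subranges_per_bit >> msb
    return msb * subranges_per_bit + offset

# With only 4 subranges per bit, large nearby values collapse into
# one bucket; with 256 subranges per bit they stay distinguishable.
print(histo_index(1024, 4) == histo_index(1279, 4))      # -> True
print(histo_index(1024, 256) == histo_index(1279, 256))  # -> False
```

This shows the precision loss Teresa mentions at the high end: with the default subdivision, all counters in a wide range share one bucket, so the merge cannot tell them apart.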
> This increased the accuracy at the high end of the profile range, but
> I was still left with a fair amount of inaccuracy in the working set
> (except the 99.9% bucket, which is very close now).
>
> I confirmed that some of the hottest counters between different
> training inputs (of perlbmk) are from different functions. The hottest
> 2 counters are the same between all 3 runs, but after that there are
> some differences. When the hotspots are different, those large counter
> values actually correspond to smaller counter values in the other run.
> Therefore, at the high end of the counter range, the gcov-merged
> counter values are artificially high, because we are summing large
> counter values that shouldn't be correlated. Similarly, some of the
> smaller counter values are being summed together that shouldn't be,
> producing artificially low merged counters. For example, in the
> libgcov-merged profile, the number of 0-valued counters will be the
> min of the number of 0-valued counters in the individual runs. But in
> the exactly-merged profile, there are fewer 0-valued counters, because
> some of the 0 counters from individual runs were non-zero in other
> runs.
>
> I confirmed by comparing graphs of the counter values produced by each
> merge that there are more very high and very low counter values in the
> libgcov-merged profile. Graphing the counter values accumulated from
> highest to lowest (similar to what the working set computation does)
> shows that the accumulated sum grows faster in the libgcov-merged case
> as a result, and this is causing the various working set cutoff values
> to be hit earlier, resulting in higher min counter values.
>
> I haven't come up with any great ideas on how to handle this issue
> though. Possibly artificially reduce the min counter value for the
> working sets when there was more than one run?
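[The inflation mechanism described above can be reproduced with a small model. The sketch below is illustrative, not libgcov code: the "approximate" merge pairs counters by rank to mimic a histogram merge that has lost per-site identity, while the exact merge sums per site; the counter values are invented for the example.]

```python
def working_set_min_counter(counters, pct):
    """Working-set min counter (sketch of the computation described
    above): walk counters from hottest to coldest, accumulating, and
    return the counter value at which the running sum first reaches
    pct percent of the total execution count."""
    cutoff = sum(counters) * pct / 100.0
    acc = 0
    for c in sorted(counters, reverse=True):
        acc += c
        if acc >= cutoff:
            return c
    return 0

# Two runs whose hotspots differ; list position identifies the counter site.
run_a = [1000, 900, 10, 0]
run_b = [10, 0, 1000, 900]

# Exact merge: sum per site, then compute the working set globally.
exact = [a + b for a, b in zip(run_a, run_b)]  # [1010, 900, 1010, 900]

# A histogram merge has no site identity, so it effectively pairs
# counters by rank: hot values from different sites get summed together,
# and so do cold ones (a simplified model of the libgcov behavior).
approx = [x + y for x, y in zip(sorted(run_a, reverse=True),
                                sorted(run_b, reverse=True))]
# approx == [2000, 1800, 20, 0]: artificially high hot, low cold counters

print(working_set_min_counter(exact, 50))   # -> 1010
print(working_set_min_counter(approx, 50))  # -> 2000 (inflated min counter)
```

The inflated hot counters make the accumulated sum reach each cutoff sooner, so every working-set min counter comes out too high, matching the direction of the error observed in the thread.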
>
>> Also if you have a setup with your tool, it may be nice to double check that
>> the histograms produced by the LTO pass actually match the histograms produced
>> by Rong's external tool. I am not sure if his tool takes into account the
>> estimated execution times of basic blocks. If not, it may be an interesting
>> experiment by itself, since we would get a sense of how well the counts alone
>> estimate the times. (I would expect it to be rather good, but it is always
>> better to sanity check.)
>
> Rong sent his tool for review for the gcc google 4_7 branch for now,
> but it is usable for trunk too. See:
> http://gcc.gnu.org/ml/gcc-patches/2012-11/msg02141.html
Woops, wrong link. Correct one is:
http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00607.html

Teresa

> It doesn't take any estimated times into account - it just merges the
> counters and recomputes the histogram based on all the counters. We
> could take an exactly merged profile produced by his tool and feed it
> into LTO with your patch and get the time estimate comparison your
> patch dumps out. But I would expect it to be the same, since like LTO
> it has the same global set of counters as input?
>
> Teresa
>
>>>
>>> It looks like the histogram min counters are always off in the larger
>>> direction with the gcov-merged histogram, so one possibility is to
>>> reduce the min counter value by some divisor when the number of runs
>>> is >1. Unfortunately, at least in the perlbmk case, the magnitude of
>>> the error doesn't seem to correlate obviously with the # runs (in the
>>> 5 run case the error is actually a little less than in the 3 run case
>>> for perlbench). Honza, how many runs were merged in your 10x error
>>> case above?
>>
>> It was the standard profiledbootstrap. I am not sure how many runs exactly
>> are merged, but it will be a couple hundred. There is also an issue with
>> the fact that libbackend is linked into more than one binary.
>>>
>>> One thing I noticed in the perlbmk case was that there were a number
>>> of large counter values sharing histogram buckets at the high end of
>>> the histogram in some of the individual run profiles, so there would
>>> be a large loss of precision when merging, since the range of counter
>>> values sharing a single histogram bucket is larger at the high end of
>>> the histogram. I'll experiment with increasing the size of the histogram
>>> to see how much that would reduce the error.
>>
>> Thanks! I guess increasing the histogram to a size matching approximately
>> the number of counters in the binary should more or less eliminate the
>> precision errors.
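[Teresa notes above that Rong's tool "just merges the counters and recomputes the histogram based on all the counters." That exact merge needs only the per-run counter vectors, and the following is a minimal sketch of it under those assumptions; the function name and the bit-length bucketing are illustrative, not the actual tool's code.]

```python
from collections import Counter

def exact_merged_histogram(runs, bucket_of):
    """Exact merge in the style described in the thread (sketch, not
    Rong's code): sum each counter site across all runs first, then
    bucket the merged per-site totals. bucket_of maps a counter value
    to a histogram bucket index."""
    merged = [sum(site) for site in zip(*runs)]   # per-site sums across runs
    hist = Counter(bucket_of(v) for v in merged)  # histogram recomputed
    return merged, hist

# Two runs, four counter sites each; bucket by bit length for simplicity.
runs = [[1000, 900, 10, 0], [10, 0, 1000, 900]]
merged, hist = exact_merged_histogram(runs, bucket_of=lambda v: v.bit_length())
# merged == [1010, 900, 1010, 900]: no artificially inflated hot counters
```

Because the sums are taken per site before any bucketing, no precision is lost to bucket granularity, which is why the exactly merged histogram serves as the reference against which the libgcov merge is compared.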
>>
>> Honza
>
>
> --
> Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413

--
Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413