On Wed, Apr 24, 2013 at 6:37 AM, Teresa Johnson <tejohn...@google.com> wrote:
> On Mon, Apr 22, 2013 at 11:16 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> Hi,
>> sorry for getting back to this late.
>>> >> That's a larger error than I had expected from the merging, although
>>> >> as you note it is an approximation, so there is going to be some amount
>>> >> of error. If the error is that large then maybe there is a better
>>> >> merging algorithm, since in the non-LTO case we won't be able to
>>> >> recompute it exactly. For cc1, what was your test case -
>>> >> profiledbootstrap or something simpler? I can try to reproduce this
>>> >> and see if it is another bug or just due to the approximation.
>>>
>>> I've been using Rong's tool to compute the exactly merged histogram
>>> from the gcov-merged histogram for perlbmk. I tried a couple of test
>>> cases - with the 5 train inputs, and with the 3 ref inputs. In both
>>> cases I am seeing up to 200% or so difference in some of the working
>>> set min counter values, although the error is not as high for the
>>> higher working set percentages. But it is large enough to cause a
>>> potential performance issue nonetheless.
>>
>> One thing that confuses me is why the error tends to be in the positive
>> direction. Since we are minimizing the counter during merging, I would
>> expect us to more or less consistently underestimate the counters. Do
>> you have any intuition here?
>
> Hi Honza,
>
> Yes, I think I know what is going on. What to do about it is a
> different story. =)
>
> From comparing the histograms merged by libgcov to the exactly merged
> histograms produced by Rong's tool at larger histogram sizes, I think
> the main issue is that the hotspots are slightly different between
> different runs, and this causes inaccuracies in the merged histogram
> unless you have global counter information.
>
> I tried increasing the histogram up to a much larger size - 16128.
> This is 256 linear subranges per bit instead of the default 4.
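[The histogram sizes discussed here follow from a log-linear bucketing scheme: counter values are grouped by the position of their most-significant bit, and each power-of-two range is split into a fixed number of linear subranges. The sketch below illustrates that indexing in simplified form; the function name, arguments, and the handling of small values are illustrative choices, not gcov's exact implementation.]

```python
def histo_index(value, subranges_per_bit=4):
    """Log-linear bucket index (simplified sketch, not gcov's code):
    bucket by most-significant bit, then split each power-of-two
    range [2^msb, 2^(msb+1)) into linear subranges."""
    if value == 0:
        return 0
    msb = value.bit_length() - 1
    if (1 << msb) < subranges_per_bit:
        # Ranges too small to subdivide: one bucket per value.
        return value
    # Linear offset of value within its power-of-two range.
    offset = (value - (1 << msb)) * subranges_per_bit >> msb
    return msb * subranges_per_bit + offset

# With only 4 subranges per bit, large nearby values collapse into
# one bucket; with 256 subranges per bit they stay distinguishable.
print(histo_index(1024, 4) == histo_index(1279, 4))      # -> True
print(histo_index(1024, 256) == histo_index(1279, 256))  # -> False
```

This shows the precision loss Teresa mentions at the high end: with the default subdivision, all counters in a wide range share one bucket, so the merge cannot tell them apart.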
> This increased the accuracy at the high end of the profile range, but
> I was still left with a fair amount of inaccuracy in the working set
> (except the 99.9% bucket, which is very close now).
>
> I confirmed that some of the hottest counters between different
> training inputs (of perlbmk) are from different functions. The hottest
> 2 counters are the same between all 3 runs, but after that there are
> some differences. When the hotspots are different, those large counter
> values actually correspond to smaller counter values in the other run.
> Therefore, at the high end of the counter range, the gcov-merged
> counter values are artificially high, because we are summing large
> counter values that shouldn't be correlated. Similarly, some of the
> smaller counter values are being summed together that shouldn't be,
> producing artificially low merged counters. For example, in the
> libgcov-merged profile, the number of 0-valued counters will be the
> min of the number of 0-valued counters in the individual runs. But in
> the exactly-merged profile, there are fewer 0-valued counters, because
> some of the 0 counters from individual runs were non-zero in other
> runs.
>
> I confirmed by comparing graphs of the counter values produced by each
> merge that there are more very high and very low counter values in the
> libgcov-merged profile. Graphing the counter values accumulated from
> highest to lowest (similar to what the working set computation does)
> shows that the accumulated sum grows faster in the libgcov-merged case
> as a result, and this is causing the various working set cutoff values
> to be hit earlier, resulting in higher min counter values.
>
> I haven't come up with any great ideas on how to handle this issue
> though. Possibly artificially reduce the min counter value for the
> working sets when there was more than one run?
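[The inflation mechanism described above can be reproduced with a small model. The sketch below is illustrative, not libgcov code: the "approximate" merge pairs counters by rank to mimic a histogram merge that has lost per-site identity, while the exact merge sums per site; the counter values are invented for the example.]

```python
def working_set_min_counter(counters, pct):
    """Working-set min counter (sketch of the computation described
    above): walk counters from hottest to coldest, accumulating, and
    return the counter value at which the running sum first reaches
    pct percent of the total execution count."""
    cutoff = sum(counters) * pct / 100.0
    acc = 0
    for c in sorted(counters, reverse=True):
        acc += c
        if acc >= cutoff:
            return c
    return 0

# Two runs whose hotspots differ; list position identifies the counter site.
run_a = [1000, 900, 10, 0]
run_b = [10, 0, 1000, 900]

# Exact merge: sum per site, then compute the working set globally.
exact = [a + b for a, b in zip(run_a, run_b)]  # [1010, 900, 1010, 900]

# A histogram merge has no site identity, so it effectively pairs
# counters by rank: hot values from different sites get summed together,
# and so do cold ones (a simplified model of the libgcov behavior).
approx = [x + y for x, y in zip(sorted(run_a, reverse=True),
                                sorted(run_b, reverse=True))]
# approx == [2000, 1800, 20, 0]: artificially high hot, low cold counters

print(working_set_min_counter(exact, 50))   # -> 1010
print(working_set_min_counter(approx, 50))  # -> 2000 (inflated min counter)
```

The inflated hot counters make the accumulated sum reach each cutoff sooner, so every working-set min counter comes out too high, matching the direction of the error observed in the thread.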
>
>> Also if you have a setup with your tool, it may be nice to double check that
>> the histograms produced by the LTO pass actually match the histograms produced
>> by Rong's external tool. I am not sure if his tool takes into account the
>> estimated execution times of basic blocks. If not, it may be an interesting
>> experiment by itself, since we would get a sense of how well the counts alone
>> estimate the times. (I would expect it to be rather good, but it is always
>> better to sanity check.)
>
> Rong sent his tool for review for the gcc google 4_7 branch for now,
> but it is usable for trunk too. See:
> http://gcc.gnu.org/ml/gcc-patches/2012-11/msg02141.html
Woops, wrong link. Correct one is:
http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00607.html

Teresa

> It doesn't take any estimated times into account - it just merges the
> counters and recomputes the histogram based on all the counters. We
> could take an exactly merged profile produced by his tool and feed it
> into LTO with your patch and get the time estimate comparison your
> patch dumps out. But I would expect it to be the same, since like LTO
> it has the same global set of counters as input?
>
> Teresa
>
>>>
>>> It looks like the histogram min counters are always off in the larger
>>> direction with the gcov-merged histogram, so one possibility is to
>>> reduce the min counter value by some divisor when the number of runs
>>> is >1. Unfortunately, at least in the perlbmk case, the magnitude of
>>> the error doesn't seem to correlate obviously with the # runs (in the
>>> 5 run case the error is actually a little less than in the 3 run case
>>> for perlbench). Honza, how many runs were merged in your 10x error
>>> case above?
>>
>> It was the standard profiledbootstrap. I am not sure how many runs exactly
>> are merged, but it will be a couple hundred. There is also an issue with
>> the fact that libbackend is linked into more than one binary.
>>>
>>> One thing I noticed in the perlbmk case was that there were a number
>>> of large counter values sharing histogram buckets at the high end of
>>> the histogram in some of the individual run profiles, so there would
>>> be a large loss of precision when merging, since the range of counter
>>> values sharing a single histogram bucket is larger at the high end of
>>> the histogram. I'll experiment with increasing the size of the histogram
>>> to see how much that would reduce the error.
>>
>> Thanks! I guess increasing the histogram to a size matching approximately
>> the number of counters in the binary should more or less eliminate the
>> precision errors.
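[Teresa notes above that Rong's tool "just merges the counters and recomputes the histogram based on all the counters." That exact merge needs only the per-run counter vectors, and the following is a minimal sketch of it under those assumptions; the function name and the bit-length bucketing are illustrative, not the actual tool's code.]

```python
from collections import Counter

def exact_merged_histogram(runs, bucket_of):
    """Exact merge in the style described in the thread (sketch, not
    Rong's code): sum each counter site across all runs first, then
    bucket the merged per-site totals. bucket_of maps a counter value
    to a histogram bucket index."""
    merged = [sum(site) for site in zip(*runs)]   # per-site sums across runs
    hist = Counter(bucket_of(v) for v in merged)  # histogram recomputed
    return merged, hist

# Two runs, four counter sites each; bucket by bit length for simplicity.
runs = [[1000, 900, 10, 0], [10, 0, 1000, 900]]
merged, hist = exact_merged_histogram(runs, bucket_of=lambda v: v.bit_length())
# merged == [1010, 900, 1010, 900]: no artificially inflated hot counters
```

Because the sums are taken per site before any bucketing, no precision is lost to bucket granularity, which is why the exactly merged histogram serves as the reference against which the libgcov merge is compared.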
>>
>> Honza
>
>
> --
> Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413

--
Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413