Re: [google]Add support for sampled profile collection (issue4438083)

2011-04-28 Thread Silvius Rus
On Thu, Apr 28, 2011 at 4:58 PM, Xinliang David Li  wrote:
>
> + Honza
>
> This patch may be a candidate for trunk as well. This feature not
> only allows profile collection with much lower overhead (for
> multi-threaded programs with hot regions, the slowdown can be
> significant due to the cache ping-pong effect of counter updates),
> it does so without sacrificing too much performance.
>

At the extreme I saw overhead drop from 30x to 2x on actual server
applications, but 10x to 2x was more common.  A 10x slowdown may not
be an issue for some workloads; SPEC, for instance, has its "train"
input.  But when you have a server that needs to warm up for 3 hours
before the function profile becomes relevant, that drop from 10x to 2x
makes a qualitative difference.

I'm stating the obvious, but, for the record, note that turning this
on for single-threaded applications would actually add overhead
(about 30%), as the sampling code is more expensive than a plain
counter update on a single core.  That's why it's not turned on by
default.
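To make the trade-off concrete, here is a toy sketch (illustrative
names, not GCC's actual instrumentation) of the two counter-update
styles.  The sampled version touches the shared counter only once per
sampling period, which avoids the cache ping-pong across threads, but
adds a compare-and-branch that a single-threaded program pays on every
execution:

```c
#include <stdint.h>

#define SAMPLE_RATE 100

static int64_t edge_counter;                        /* gcov-style arc counter */
static __thread int64_t sample_countdown = SAMPLE_RATE;

/* Regular instrumentation: one shared-counter update per execution;
   this is the update that causes cache ping-pong across threads.  */
static inline void edge_profile_full (void)
{
  edge_counter++;
}

/* Sampled instrumentation: the shared counter is bumped only once
   every SAMPLE_RATE executions; the countdown itself is thread-local,
   so the extra cost on a single core is the per-call check.  */
static inline void edge_profile_sampled (void)
{
  if (--sample_countdown <= 0)
    {
      edge_counter += SAMPLE_RATE;  /* scale up to approximate the true count */
      sample_countdown = SAMPLE_RATE;
    }
}
```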

>
> Another use of this support is that it allows profile collection to
> be turned on/off asynchronously for long-running server programs,
> for which profile data from the warm-up period is often unimportant
> and should be excluded.
>

For completeness, I tried at some point to add two wrappers: the first
an on/off switch, and the second the sampling wrapper proposed here.
But code size almost doubled and overhead went up significantly, so I
ditched the on/off switch.  The workaround is to start with a very
large sampling rate and then call into libgcov at runtime to reset the
rate just when you're ready to measure.
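The workaround looks roughly like the following sketch.  Both the rate
variable and the entry-point name (__gcov_set_sampling_rate) are
illustrative stand-ins, not symbols from the actual patch; in the real
implementation the rate lives inside libgcov:

```c
/* Stand-in for the rate maintained by libgcov.  Building with an
   effectively infinite rate means no samples fire during warm-up.  */
static unsigned gcov_sampling_rate = 1000000000u;

/* Hypothetical libgcov entry point to change the rate at runtime.  */
void __gcov_set_sampling_rate (unsigned rate)
{
  gcov_sampling_rate = rate;
}

/* Called by the server once warm-up is over and measurement should
   actually begin.  */
void on_warmup_done (void)
{
  __gcov_set_sampling_rate (101);  /* switch to a useful (prime) rate */
}
```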

> A known limitation is that value profiling is not yet sampled, but it
> does not seem to cause problems.
>
> David

Thank you, Easwaran and David, for bringing this upstream.  Mea culpa.
Silvius


Re: [google]Add support for sampled profile collection (issue4438083)

2011-04-29 Thread Silvius Rus
> How is code-size affected with this patch, non-instrumented vs.
> regular-instrumented vs. sample-instrumented?

I don't have the numbers, but the increase in code size from
regular-instrumented to sample-instrumented is larger than that from
non-instrumented to regular-instrumented.  Easwaran, could you please
build a few binaries at -O2, at -O2 -fprofile-generate, and at -O2
-fprofile-generate -fprofile-generate-sampling, and send out the text
size ratios with -O2 as the baseline?

> > @@ -292,6 +474,15 @@ gimple_gen_edge_profiler (int edgeno, edge e)
> >                                        gimple_assign_lhs (stmt1), one);
> >   gimple_assign_set_lhs (stmt2, make_ssa_name (gcov_type_tmp_var, stmt2));
> >   stmt3 = gimple_build_assign (unshare_expr (ref), gimple_assign_lhs 
> > (stmt2));
> > +
> > +  if (flag_profile_generate_sampling)
> > +    {
> > +      slot = htab_find_slot_with_hash (instrumentation_to_be_sampled, 
> > stmt1,
> > +                                       htab_hash_pointer (stmt1), INSERT);
> > +      gcc_assert (!*slot);
> > +      *slot = stmt1;
> > +    }
> > +
>
> What's the reason to delay the sampling transform and not apply it here?
> At least a comment should be added for this.

Sorry, I just don't remember.

> > +DEFPARAM (PARAM_PROFILE_GENERATE_SAMPLING_RATE,
> > +         "profile-generate-sampling-rate",
> > +         "sampling rate with -fprofile-generate-sampling",
> > +         100, 0, 20)
>
> Zero is really ok?

It wouldn't break anything, but a rate of 0 doesn't make sense; the
minimum should be 1.  And the default should be 101 instead of 100
(and, in general, a prime).
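The preference for a prime can be seen with a toy model.  Assuming
(purely as an illustration, not GCC's actual code) a single shared
countdown decremented on every instrumented edge, a composite rate can
alias with periodic control flow so that some edges are never sampled
at all, while a prime rate spreads the samples across them:

```c
/* Toy model: one shared countdown, two edges A and B executed in
   strict alternation.  All names are made up for this sketch.  */
static int countdown;
static int rate;
static long count_a, count_b;

static void hit (long *counter)
{
  if (--countdown <= 0)
    {
      *counter += rate;   /* take a sample, scaled up by the rate */
      countdown = rate;
    }
}

/* Run a loop that alternates strictly between edge A and edge B.  */
static void run (int r, long iterations)
{
  rate = r;
  countdown = r;
  count_a = count_b = 0;
  for (long i = 0; i < iterations; i++)
    {
      hit (&count_a);
      hit (&count_b);
    }
}
```

With rate 100 (even) every sample lands on the same edge, so edge A is
reported as never executed; with rate 101 the samples alternate and
both edges get accurate counts.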

While we're at it, let me bring up an important practical FDO issue
related to this.  As David mentioned, besides reducing overhead we
have used this to implement selective collection.  This essentially
requires libgcov to provide a basic public interface:
reset()
start()
stop()
save()
I implemented these by playing with the sampling rate, but a clearer,
supported interface would be useful.

Also, there was one annoying detail that I could not figure out.  When
you build with -fprofile-generate=./fdoprof, the output gets dumped
under ./fdoprof/..., but there seems to be no easy way to discover
this path from within the profiling process.  For Google servers this
makes collection fragile: the user generally does not have access to
the server file system.  I implemented a collection mechanism so the
user can tell the server "copy the profiles from ./fdoprof to some
shared location", but then the exact path has to be passed both when
building *and* when collecting profiles, which may be separated in
time, so the process is fragile.  It would be good to keep a copy of
the path prefix and make it accessible through a public interface as
well.  Then, once an instrumented binary is built, it would know the
root of the profile output directory, relieving the user of the
responsibility of knowing it.
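The interface I implemented on top of the sampling rate amounts to
something like the following sketch.  None of these names exist in
libgcov; "stop" is just an effectively infinite rate, and "start"
restores a useful one:

```c
#include <limits.h>
#include <string.h>

enum { NCOUNTERS = 16 };

static unsigned sampling_rate = UINT_MAX;   /* start "stopped" */
static long long counters[NCOUNTERS];       /* stand-in for arc counters */

/* Discard whatever was collected so far (e.g. warm-up data).  */
void profile_reset (void)
{
  memset (counters, 0, sizeof counters);
}

/* Begin collecting: switch to a rate at which samples actually fire.  */
void profile_start (void)
{
  sampling_rate = 101;     /* a useful prime rate */
}

/* Stop collecting: samples effectively never fire again.  */
void profile_stop (void)
{
  sampling_rate = UINT_MAX;
}

/* save() would dump the counters to the .gcda files; elided here.  */
```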

Silvius