Hello all,
[Sorry for the excessively long mail. It contains introduction, problem
explanation, solution and a set of how-to questions].
== Introduction ==
Because gprof is completely useless in some cases (see below), I had to
develop
(1999) a new profiling algorithm. Unfortunately, my implementation
requires manual
instrumentation of all "interesting" functions with RAII objects (C++
only).
Aside form the manual instrumentation, the new profiler works perfectly,
with
remarkable accuracy and low overhead (a bit faster than gcc -O2 -pg -g).
Instead of doing this manually, I would like to tell GCC instrument the
code such that:
1. Create a static variable per profiled function (could be void*).
2. Call a profiling-function upon entry (and pass this pointer, and
__FUNCTION__ - or whatever its name is in c99/c++0x).
3. Call a conclude-function upon exit from the profiled function.
== Problem description ==
gcc -pg and gprof can't accumulate time of function's children. Instead,
gprof
would interpolate based on call count. For example, look at the attached
file.
In practice 'fast' takes less than a millisecond, and slow takes around
a second,
yet gprof manages to reverse their order (when compiling gcc -pg -g t.c,
and running gprof a.out):
| 0.00 0.00 1/1001 slow [4]
1.05 0.00 1000/1001 fast [3]
[1] 100.0 1.06 0.00 1001 f [1]
-----------------------------------------------
[2] 100.0 0.00 1.06 main [2]
0.00 1.05 1000/1000 fast [3]
0.00 0.00 1/1 slow [4]
|The only correct information is on the time that the leaf "f" takes.
Measuring the time manually (see t.c) give (contradicting gprof):
|Slow took 1.15 seconds
Sum of fast took 0.00 seconds
|
(You can tune t.c to make the result as inaccurate as you like)
=== Solution ===
Record approximate time of entry and exit of each function (volatile tick
counter is updated by a SIGPROF handler). Accumulate time for each:
1. Time for a function (unlike -pg this will be function+sibling time).
2. Time for a pair of caller-callee functions. (The time that the
callee contributes to the caller's total)
In my experience, this scheme works much better than the current
-pg+gprof (the latter is totally useless for my needs, I'd better flip a
coin to decide which function I guess to be expensive).
To do:
I would like to add the instrumentation (as explained above). To avoid
billing the overhead of the profiling on instrumented functions, I need
to maintain
volatile unsigned in_profiler;
and avoid counting time when in_profiler!=0. in_profiler is updated as
follows:
1. in_profiler=1 right before parameters are passed to the call of
start_profile_function(&state, __FUNCTION__); and
finalize_profile_function(&state);
2. in_profiler=0 immediately after *_profile_function() cleans up
the remains of the call (depending on the ABI, it is possible that
no such clean up occurs).
== How to: ==
I would like to find how to force gcc (and its optimizers):
1. Not to move stuff across in_profiler=* assignments, for all
optimizers. This was measured to skew profiling by 10-20% on x86
Linux, and more than x2 on tiny functions on PPC-AIX (adding an
__asm__ register+memory barrier lowered this to 1-4% on x86, but
not for PPC-AIX's ABI)
2. Instrument start_profile_function as early as possible in the
profiled function (if possible before most of the preamble).
3. Instrument finalize _profile_function as late as possible in the
profiled function (if possible after most of the preamble).
4. Take exceptions into account
5. Extract the __FUNCTION__ information from inside
start_profile_function() instead of passing it explicitly (debug
info?)
1. If __FUNCTION__ can be extracted from stack then maybe ditch
in_profiler=* altogether (this way the profiler will be
billed correctly for its time, without such variables).
2. Use specialized ABI to call *_profile_function, saving up
the need for register save/restore for profiler's function
calls.
It took me about a week to implement it in C++ using STL (using RAII and
manual instrumentation). I hope that it would take me no more than two
weeks doing this for C. Anything more than will shift my benefit/cost
ration to finding a weaker solution.
I would be happy to get any comments or suggestions (especially "don't
waste your time, it is not needed because...").
Note that I am not 100% if I am allowed to assign copyright to FSF. I am
doing this with my student hat (need it for accurate profiling of my
research code), but my IBM researcher hat may (or may not) interfere,
and I'll need advice from our legal department.
Thanks
Michael
--
Michael Veksler
http:///tx.technion.ac.il/~mveksler
#include <stdio.h>
#include <time.h>
volatile long long total_count=0;
void f(unsigned repeat_count)
{
unsigned i;
for (i=0; i < repeat_count ; ++i)
++total_count;
}
void fast()
{
f(1);
}
void slow()
{
f(300*1000*1000);
}
int main()
{
unsigned i=0;
clock_t start;
start=clock();
slow();
printf("Slow took %.2f seconds \n",
(clock()-start)*1.0/CLOCKS_PER_SEC);
start=clock();
for (i=0 ; i < 1000 ; ++i)
fast();
printf("Sum of fast took %.2f seconds \n",
(clock()-start)*1.0/CLOCKS_PER_SEC);
return 0;
}