Instrumenting for a different profiling algorithm

Michael Veksler Wed, 28 Mar 2007 06:33:45 -0800

Hello all,

[Sorry for the excessively long mail. It contains introduction, problemexplanation, solution and a set of how-to questions].


== Introduction ==

Because gprof is completely useless in some cases (see below), I had todevelop(1999) a new profiling algorithm. Unfortunately, my implementationrequires manualinstrumentation of all "interesting" functions with RAII objects (C++only).Aside form the manual instrumentation, the new profiler works perfectly,with

remarkable accuracy and low overhead (a bit faster than gcc -O2 -pg -g).

Instead of doing this manually, I would like to tell GCC instrument thecode such that:


  1. Create a static variable per profiled function (could be void*).
  2. Call a profiling-function upon entry (and pass this pointer, and
     __FUNCTION__ - or whatever its name is in c99/c++0x).
  3. Call a conclude-function upon exit from the profiled function.

== Problem description ==

gcc -pg and gprof can't accumulate time of function's children. Instead,gprofwould interpolate based on call count. For example, look at the attachedfile.In practice 'fast' takes less than a millisecond, and slow takes arounda second,

yet gprof manages to reverse their order (when compiling gcc -pg -g t.c,
and running gprof a.out):

|                0.00    0.00       1/1001        slow [4]
               1.05    0.00    1000/1001        fast [3]
[1]    100.0    1.06    0.00    1001         f [1]
-----------------------------------------------
[2]    100.0    0.00    1.06                 main [2]
               0.00    1.05    1000/1000        fast [3]
               0.00    0.00       1/1           slow [4]
|The only correct  information is on the time that the leaf "f" takes.

Measuring the time manually (see t.c) give (contradicting gprof):

   |Slow took 1.15 seconds
   Sum of fast took 0.00 seconds
   |

(You can tune t.c to make the result as inaccurate as you like)

=== Solution ===

Record approximate time of entry and exit of each function (volatile tick
counter is updated by a SIGPROF handler). Accumulate time for each:

  1. Time for a function (unlike -pg this will be function+sibling time).
  2. Time for a pair of caller-callee functions. (The time that the
     callee  contributes to the caller's total)

In my experience, this scheme works much better than the current-pg+gprof (the latter is totally useless for my needs, I'd better flip acoin to decide which function I guess to be expensive).


To do:

I would like to add the instrumentation (as explained above). To avoidbilling the overhead of the profiling on instrumented functions, I needto maintain

volatile unsigned in_profiler;

and avoid counting time when in_profiler!=0. in_profiler is updated asfollows:


  1. in_profiler=1 right before parameters are passed to the call of
     start_profile_function(&state, __FUNCTION__); and
     finalize_profile_function(&state);
  2. in_profiler=0 immediately  after *_profile_function() cleans up
     the remains of the call (depending on the ABI, it is possible that
     no such clean up occurs).

== How to: ==

I would like to find how to force gcc (and its optimizers):

  1. Not to move stuff across in_profiler=*  assignments, for all
     optimizers. This was measured to skew profiling by 10-20% on x86
     Linux, and more than x2 on tiny functions on PPC-AIX (adding an
     __asm__ register+memory barrier lowered this to 1-4% on x86, but
     not for PPC-AIX's ABI)
  2. Instrument start_profile_function as early as possible in the
     profiled function (if possible before most of the preamble).
  3. Instrument finalize _profile_function as late as possible in the
     profiled function (if possible after most of the preamble).
  4. Take exceptions into account
  5. Extract the __FUNCTION__ information from inside
     start_profile_function() instead of passing it explicitly (debug
     info?)
        1. If __FUNCTION__ can be extracted from stack then maybe ditch
           in_profiler=* altogether (this way the profiler will be
           billed correctly for its time, without such variables).
        2. Use specialized ABI to call *_profile_function, saving up
           the need for register save/restore for profiler's function
           calls.

It took me about a week to implement it in C++ using STL (using RAII andmanual instrumentation). I hope that it would take me no more than twoweeks doing this for C. Anything more than will shift my benefit/costration to finding a weaker solution.

I would be happy to get any comments or suggestions (especially "don'twaste your time, it is not needed because...").

Note that I am not 100% if I am allowed to assign copyright to FSF. I amdoing this with my student hat (need it for accurate profiling of myresearch code), but my IBM researcher hat may (or may not) interfere,and I'll need advice from our legal department.


Thanks
 Michael

--
Michael Veksler
http:///tx.technion.ac.il/~mveksler

#include <stdio.h>
#include <time.h>
volatile long long total_count=0;
void f(unsigned repeat_count)
{
        unsigned i;
        for (i=0; i < repeat_count ; ++i)
                ++total_count;
}
void fast()
{
        f(1);
}
void slow()
{
        f(300*1000*1000);
}
int main()
{
        unsigned i=0;
        clock_t start;
        start=clock();
        slow();
        printf("Slow took %.2f seconds \n",
                 (clock()-start)*1.0/CLOCKS_PER_SEC);
        start=clock();
        for (i=0 ; i < 1000 ; ++i)
                fast();
        printf("Sum of fast took %.2f seconds \n",
                 (clock()-start)*1.0/CLOCKS_PER_SEC);
        return 0;
}

Instrumenting for a different profiling algorithm

Reply via email to