http://lwn.net/Articles/336542/
We are pleased to announce version 8 of the performance counters
subsystem for Linux.
This new subsystem adds a new system call (sys_perf_counter_open())
and it provides the new 'perf' tool that makes use of these new
kernel capabilities.
This subsystem and this tool are new in that they try a new approach
at integrating all things performance analysis under one roof.
There have been many changes since -v7 - see the shortlog below for
details.
There are a lot of new contributors to this code. Many thanks go to:
Peter Zijlstra, Paul Mackerras, Robert Richter, Arnaldo Carvalho de
Melo, Mike Galbraith, Thomas Gleixner, Wu Fengguang, Jaswinder Singh
Rajput, Yong Wang, Frederic Weisbecker, Yinghai Lu, Luis Henriques,
Eric Paris, Arjan van de Ven, Tim Blechmann, Steven Whitehouse,
Jaswinder Singh, H. Peter Anvin, Hidetoshi Seto, Erdem Aktas and
Andrew Morton.
The biggest change in -v8 is a re-focusing of our effort towards
building tools to help various user-space development workflows. The
latest code and perfcounter-tools deal with all sorts of user-space
profiling usage models. They are very fast, they are able to look up
DSO symbols regardless of where they are loaded, and they try to be
easy to use and easy to configure.
Per-application and system-wide profiling modes are supported, plus
a number of intermediate modes via the use of inherited counters
that traverse into child-task hierarchies automatically and
transparently.
With perfcounters there is no daemon needed: if a perfcounters
kernel is booted on a supported CPU (all AMD models and Core2 /
Corei7 / Atom Intel CPUs - both 64-bit and 32-bit user-space are
supported) then profiling can be done straight away.
Profiling sessions are recorded into local files, which can then be
analyzed. There are also high-level overview tools, 'perf stat' and
'perf top', which help one get a quick impression of what to profile
and in what way.
New in -v8 is the 'perf' utility which has merged all the
perfcounters utilities and which exposes all the functionality of
the kernel subsystem, in one uniform and unified way:
mercury:~/tip/tools/perf> perf
usage: perf [--version] [--help] COMMAND [ARGS]
The most commonly used perf commands are:
   annotate   Read perf.data (created by perf record) and display annotated code
   list       List all symbolic event types
   record     Run a command and record its profile into perf.data
   report     Read perf.data (created by perf record) and display the profile
   stat       Run a command and gather performance counter statistics
   top        Run a command and profile it
See 'perf help COMMAND' for more information on a specific command.
There's also a new, separated "record + report" profiling workflow:
use "perf record ./my-app" to record its profile, then use "perf
report" and all its --sort options to get various high-level and
low-level details. Oprofile users will find this workflow familiar.
On the lowest level, 'perf annotate' will annotate the source code
alongside profiling information and assembly code:
$ perf annotate decode_tree_entry
------------------------------------------------
Percent | Source code & Disassembly of /home/mingo/git/git
------------------------------------------------
:
: /home/mingo/git/git: file format elf64-x86-64
:
:
: Disassembly of section .text:
:
: 00000000004a0da0 <decode_tree_entry>:
: *modep = mode;
: return str;
: }
:
: static void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned long size)
: {
3.82 : 4a0da0: 41 54 push %r12
: const char *path;
: unsigned int mode, len;
:
: if (size < 24 || buf[size - 21])
0.17 : 4a0da2: 48 83 fa 17 cmp $0x17,%rdx
: *modep = mode;
: return str;
: }
:
: static void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned long size)
: {
0.00 : 4a0da6: 49 89 fc mov %rdi,%r12
0.00 : 4a0da9: 55 push %rbp
3.37 : 4a0daa: 53 push %rbx
: const char *path;
: unsigned int mode, len;
:
: if (size < 24 || buf[size - 21])
0.08 : 4a0dab: 76 73 jbe 4a0e20 <decode_tree_entry+0x80>
0.00 : 4a0dad: 80 7c 16 eb 00 cmpb $0x0,-0x15(%rsi,%rdx,1)
3.48 : 4a0db2: 75 6c jne 4a0e20 <decode_tree_entry+0x80>
: static const char *get_mode(const char *str, unsigned int *modep)
: {
: unsigned char c;
: unsigned int mode = 0;
:
: if (*str == ' ')
1.94 : 4a0db4: 0f b6 06 movzbl (%rsi),%eax
0.39 : 4a0db7: 3c 20 cmp $0x20,%al
0.00 : 4a0db9: 74 65 je 4a0e20 <decode_tree_entry+0x80>
: return NULL;
:
: while ((c = *str++) != ' ') {
0.06 : 4a0dbb: 89 c2 mov %eax,%edx
: if (c < '0' || c > '7')
1.99 : 4a0dbd: 31 ed xor %ebp,%ebp
: unsigned int mode = 0;
:
: if (*str == ' ')
: return NULL;
:
: while ((c = *str++) != ' ') {
1.74 : 4a0dbf: 48 8d 5e 01 lea 0x1(%rsi),%rbx
: if (c < '0' || c > '7')
0.00 : 4a0dc3: 8d 42 d0 lea -0x30(%rdx),%eax
0.17 : 4a0dc6: 3c 07 cmp $0x7,%al
0.00 : 4a0dc8: 76 0d jbe 4a0dd7 <decode_tree_entry+0x37>
0.00 : 4a0dca: eb 54 jmp 4a0e20 <decode_tree_entry+0x80>
0.00 : 4a0dcc: 0f 1f 40 00 nopl 0x0(%rax)
16.57 : 4a0dd0: 8d 42 d0 lea -0x30(%rdx),%eax
0.14 : 4a0dd3: 3c 07 cmp $0x7,%al
0.00 : 4a0dd5: 77 49 ja 4a0e20 <decode_tree_entry+0x80>
: return NULL;
: mode = (mode << 3) + (c - '0');
3.12 : 4a0dd7: 0f b6 c2 movzbl %dl,%eax
: unsigned int mode = 0;
:
: if (*str == ' ')
: return NULL;
:
: while ((c = *str++) != ' ') {
0.00 : 4a0dda: 0f b6 13 movzbl (%rbx),%edx
16.74 : 4a0ddd: 48 83 c3 01 add $0x1,%rbx
: if (c < '0' || c > '7')
: return NULL;
: mode = (mode << 3) + (c - '0');
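The source fragments interleaved with the disassembly above belong to
a small octal-mode parser, get_mode(), whose body shows up inside
decode_tree_entry (presumably because the compiler inlined it). The
two hottest instructions in the listing (16.57% and 16.74%) sit in
its character loop. Pieced together from those fragments, and only
as a standalone sketch, the function reads roughly as follows. The
'static' qualifier is dropped here so that it can be called from a
test program; everything else is taken from the listing.

```c
#include <stddef.h>

/*
 * Parse an octal file mode terminated by a single space, e.g. "100644 ".
 * On success, store the mode in *modep and return a pointer to the
 * character after the space. Return NULL on an empty mode or on any
 * character outside '0'..'7' (including an unexpected end of string).
 */
const char *get_mode(const char *str, unsigned int *modep)
{
	unsigned char c;
	unsigned int mode = 0;

	/* An immediately leading space means there is no mode at all. */
	if (*str == ' ')
		return NULL;

	/* This loop is where the annotate output shows most of the samples. */
	while ((c = *str++) != ' ') {
		if (c < '0' || c > '7')
			return NULL;
		mode = (mode << 3) + (c - '0');
	}
	*modep = mode;
	return str;
}
```

Calling get_mode("100644 README", &mode) would leave 0100644 in mode
and return a pointer to "README".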
Those who already use Git will (hopefully) find 'perf' intuitive, as
we've picked up a number of internal libraries from Git to build
this tool, so the look-and-feel will be familiar. It's very
extensible: new subcommands can be added easily, while there's still
just a single new binary in the system.
'perf report' supports multi-key histograms and a rich set of views
of the same performance data - per task or per dso, or a fine-grained
per symbol view (and all permutations of these keys).
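To illustrate what a multi-key histogram means here, the sketch
below groups samples by an ordered list of keys (task, dso, symbol)
and counts how many samples fall into each distinct key tuple. It is
not the way 'perf report' is implemented: the struct, the enum and
the function names are all hypothetical, and a plain qsort() stands
in for whatever data structure the tool really uses.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sample record; field names are illustrative only. */
struct sample {
	const char *comm;	/* task name */
	const char *dso;	/* binary or shared object */
	const char *symbol;	/* function name */
};

enum sort_key { KEY_COMM, KEY_DSO, KEY_SYMBOL };

/* qsort() has no context argument, so the active keys live here. */
static const enum sort_key *active_keys;
static size_t nr_active_keys;

static const char *field(const struct sample *s, enum sort_key k)
{
	switch (k) {
	case KEY_COMM:	return s->comm;
	case KEY_DSO:	return s->dso;
	default:	return s->symbol;
	}
}

/* Compare two samples key by key, in the order the keys were given. */
static int sample_cmp(const void *a, const void *b)
{
	const struct sample *sa = a, *sb = b;
	size_t i;

	for (i = 0; i < nr_active_keys; i++) {
		int c = strcmp(field(sa, active_keys[i]),
			       field(sb, active_keys[i]));
		if (c)
			return c;
	}
	return 0;
}

/*
 * Sort samples in place by the given keys and write one count per
 * distinct key tuple into counts[] (which must hold n entries).
 * Returns the number of histogram buckets.
 */
size_t histogram(struct sample *samples, size_t n,
		 const enum sort_key *keys, size_t nr_keys, size_t *counts)
{
	size_t i, buckets = 0;

	active_keys = keys;
	nr_active_keys = nr_keys;
	qsort(samples, n, sizeof(*samples), sample_cmp);

	for (i = 0; i < n; i++) {
		if (i > 0 && sample_cmp(&samples[i - 1], &samples[i]) == 0)
			counts[buckets - 1]++;	/* same tuple as before */
		else
			counts[buckets++] = 1;	/* new tuple starts here */
	}
	return buckets;
}
```

Passing a single key such as { KEY_DSO } gives the per-dso view, and
passing { KEY_COMM, KEY_DSO, KEY_SYMBOL } gives the most fine-grained
one; other orderings correspond to the other permutations.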
Most of the user-visible action in -v8 was in the tooling, but the
kernel side code has been revamped all around as well:
- Sampling support for inherited counters
- Performance optimizations to lazy-switch PMU contexts
- Enhanced PowerPC and x86 support
- Generic tracepoints can be used via perfcounters too
- Fixed-frequency, auto-sampling counters (they can be used via
the '-F' option in perf record and perf top)
- Generic "hardware cache" event enumeration method - for those
who want more than just a handful of essential hardware
counters.
- Automatic "fool-proof" event-throttling code to protect against
accidentally too short sampling periods.
- The 'raw events' configuration space has been extended -
every event type that oprofile is able to handle can be
specified via raw perfcounter events as well.
- ... and lots of other changes.
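As a rough illustration of the fixed-frequency item in the list
above: when the user asks for a sampling frequency rather than a
period, something has to turn the observed event rate into a period.
The sketch below assumes the simplest possible rule (events per
second divided by the requested samples per second). It is not the
kernel's algorithm, and next_period() and its parameters are
hypothetical names.

```c
#include <stdint.h>

/*
 * Pick the next sampling period so that roughly target_hz samples are
 * taken per second, given that 'events' events were counted during the
 * last 'interval_ns' nanoseconds. Never returns a period below 1.
 */
uint64_t next_period(uint64_t events, uint64_t interval_ns,
		     uint64_t target_hz)
{
	double events_per_sec, period;

	if (interval_ns == 0 || target_hz == 0)
		return 1;

	/* Floating point avoids overflow in events * 1e9. */
	events_per_sec = (double)events * 1e9 / (double)interval_ns;
	period = events_per_sec / (double)target_hz;

	return period < 1.0 ? 1 : (uint64_t)period;
}
```

For example, 2,000,000 events seen in one millisecond with a target
of 1000 samples per second yields a period of 2,000,000 events.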
To try/test/check this code, the latest perfcounters tree can be
pulled/cloned from:
git pull \
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git \
perfcounters/core
Or the following patch can be applied to the latest
(v2.6.30-rc8-git3) upstream -git Linux kernel:
http://redhat.com/~mingo/perfcounters/perfcounters-v8-v2....
The 'perf' utility can be built by pulling that tree and by doing:
cd tools/perf/
make
make install
( The combo patch is too large to be posted to lkml - and all the
v7->v8 patches have been posted to lkml already. )
As usual, test feedback, patches, comments and suggestions are
welcome!
Ingo