Hi,

Building on our discussions from the last Cauldron, we propose creating a native, simplified AutoFDO tool for GCC to replace our current reliance on external Google tools, which are not actively maintained. I'll follow up with a detailed design document as soon as we have consensus on the proposal.

Thanks,
Kugan

Summary
=======

We propose a standalone, minimal tool for generating AutoFDO profiles that can be consumed by the GCC AutoFDO toolchain, with the goal of integrating it into the GCC repository. The tool would support three modes:

(1) offline: read an existing perf.data (single-process or system-wide) and produce a profile for a target binary;
(2) direct: attach to a process via the PMU (LBR or BRBE / SPE), bypassing perf record and building the profile from the live sample stream;
(3) system-wide: read perf.data from a system-wide collection (e.g. perf record -a), filter samples by the target application, and generate gcov/profile for that application.

This keeps the design simple, dependencies minimal, and the tool easier to maintain in step with GCC.

Motivation
==========

- Current AutoFDO tools (e.g. from Google) are not widely used with GCC. LLVM has a similar profile-creation tool integrated with the compiler. A tightly coupled tool for GCC would make development and upgrades easier.
- A lightweight tool that generates AutoFDO profiles for the GCC AutoFDO toolchain (with minimal perf parsing and minimal DWARF) can be memory efficient.
- An optional mode that reads PMU data directly via perf_event_open (LBR or BRBE, and SPE) makes the tool more memory efficient.

Goals
=====

1. Simplicity
   - One job: turn samples (from file or live) into AutoFDO profiles that can be consumed by the GCC AutoFDO toolchain.
2. Minimal dependencies
   - Dependent only on libraries such as libdwarf for DWARF parsing (no large frameworks).
3. Input modes
   - Offline (perf.data, single-process or system-wide), direct (tool runs the workload and reads LBR or BRBE (or SPE) via the PMU, bypassing perf record), and system-wide (perf.data from system-wide collection, then generate gcov for a chosen application).
4. Easier maintenance
   - Ideally part of or released with GCC, for fast iteration when the profile format or DWARF expectations change.

Requirements
============

- Read perf.data and parse branch stack (LBR or BRBE) records to obtain (address, count) for the target binary; support SPE (ARM Statistical Profiling Extension) as an extension point (parse SPE/AUX records when present, stub or full implementation).
- Parse MMAP2 (and MMAP) records to map runtime addresses to the profiled binary and file offsets. Support system-wide perf.data: filter samples by target binary (using MMAP2/COMM/pid) and produce gcov (or AutoFDO profile) for that application only.
- Direct mode: run a user-supplied command, attach via perf_event_open, read LBR or BRBE (or SPE) from the kernel ring buffer, and parse MMAP2/MMAP from the same stream; produce the same profile format as offline mode without writing perf.data.
- Use a minimal DWARF subset to map instruction addresses to (source file, line, discriminator) for the target binary (e.g. line table, address ranges, minimal subprogram info).
- Emit profile output in the format consumed by the GCC AutoFDO toolchain (e.g. gcov-style or the format used by -fauto-profile).
- Portability: support Linux (perf_event_open, LBR or BRBE, SPE); other hosts/PMUs can be added later without changing the core design.

Use Cases
=========

- Offline from existing perf.data: the user has perf.data from a single-process run; the tool produces a profile for the target binary.
- Direct, one-shot "run and profile": the user runs the tool with a command; the tool executes it, attaches via the PMU, and collects samples (no perf.data file).
- System-wide: the user has perf.data from "perf record -a"; the tool filters by target binary (-b <binary>) and produces a profile for that application only.
- CI / automated builds: a script runs perf record and then the tool, or the tool in direct mode, or system-wide perf and then the tool with -b <binary>.

Dependencies
============

The tool is dependent only on libraries such as libdwarf for DWARF parsing (and libelf as typically required by libdwarf for ELF access).
Perf/PMU data is read via standard system interfaces (e.g. perf_event_open for direct mode; perf.data file for offline/system-wide). There is no dependency on the perf userspace tool for direct mode; for offline mode, the input is a perf.data file (produced by perf record or any writer of that format).

Scope of the Tool
=================

- Parse only branch stack (LBR or BRBE), MMAP2 (and MMAP), and optionally SPE.
- Map addresses to source (file, line, discriminator) via a minimal DWARF subset (line table, address ranges; libdwarf sufficient).
- Output is the profile format consumed by the GCC AutoFDO toolchain.
- Offline: read perf.data (single-process or system-wide; if system-wide, filter by -b <binary>).
- Direct: run user command, attach via perf_event_open, read from kernel ring buffer (no perf.data).

Benefits
========

- Small codebase and minimal dependencies ease review.
- In-tree with GCC allows fast iteration when the profile format or DWARF expectations change.
- A single workflow supports offline, direct, and system-wide use.

Integration
===========

The tool will be kept as part of the GCC repository (e.g. contrib/ or a dedicated directory), built and installed with GCC, so it stays in step with the compiler and profile format.

Technical Outline
=================

- Input: perf.data (offline or system-wide) or live via perf_event_open (direct).
- Processing: (address, count) from LBR/BRBE/SPE; MMAP2 for address->binary and filtering by target when system-wide; minimal DWARF for address->(file, line, discriminator); aggregate for GCC AutoFDO toolchain.
- Output: profile in toolchain format.

Extensions
==========

Add support for gathering branch profiles and memory profiles as an extension to the current gcov format.
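
Illustrative sketch
===================

To make the "Processing" step of the Technical Outline concrete, here is a minimal Python sketch of the MMAP2-based address mapping and the system-wide filtering by target binary. It is an illustration only, not the proposed implementation. The names Mmap2, Branch, to_file_offset and aggregate are hypothetical; the start/length/pgoff fields stand in for the addr/len/pgoff fields of the kernel's MMAP2 record. It assumes the records have already been decoded from perf.data or the ring buffer, and it omits the later DWARF lookup (file offset -> ELF address -> file, line, discriminator).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Mmap2:
    """Hypothetical container for the parts of an MMAP2 record used here."""
    pid: int
    start: int     # runtime start address of the mapping
    length: int    # size of the mapping in bytes
    pgoff: int     # file offset that `start` corresponds to
    filename: str  # path of the mapped file


@dataclass(frozen=True)
class Branch:
    """Hypothetical decoded branch-stack entry (LBR/BRBE): src -> dst."""
    pid: int
    src: int
    dst: int


def to_file_offset(maps: Iterable[Mmap2], pid: int, addr: int,
                   target: str) -> Optional[int]:
    """Map a runtime address to a file offset in `target`, or None."""
    for m in maps:
        in_range = m.start <= addr < m.start + m.length
        if m.pid == pid and m.filename == target and in_range:
            return addr - m.start + m.pgoff
    return None


def aggregate(maps: Iterable[Mmap2], branches: Iterable[Branch],
              target: str) -> Counter:
    """Count (src_offset, dst_offset) pairs that lie inside `target`.

    Branches from other processes or other binaries are dropped, which is
    the filtering needed for system-wide input.
    """
    maps = list(maps)  # allow several passes over the mappings
    counts: Counter = Counter()
    for b in branches:
        src = to_file_offset(maps, b.pid, b.src, target)
        dst = to_file_offset(maps, b.pid, b.dst, target)
        if src is not None and dst is not None:
            counts[(src, dst)] += 1
    return counts
```

For example, with one mapping of the target at runtime address 0x400000 and pgoff 0x2000, a branch from 0x400010 to 0x400020 in that process is counted under the offset pair (0x2010, 0x2020), while branches from other pids or outside the mapping are ignored. A real tool would index the mappings by pid and address range rather than scanning them linearly.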
