> Hi all,
> 
> Per prior discussions, here is the specification we are proposing for
> version 4 of the AutoFDO GCOV profile format. This is a complete re-design
> focused around being a backwards-compatible and extensible format which
> can be partially read by the compiler.
> 
> =================
> 
> Table of Contents
> 
>    1.  Introduction
>    2.  Motivation
>    3.  Open Questions
>    4.  Binary Format
>        4.1.  File Header
>        4.2.  Section Layout
>        4.3.  String Table Section (GCOV_STRING_TABLE)
>        4.4.  Summary Section (GCOV_SUMMARY_INFO)
>        4.5.  File Names Section (GCOV_FILE_NAMES)
>        4.6.  Symbol Names Section (GCOV_SYMBOL_NAMES)
>        4.7.  Symbol Info Section (GCOV_SYMBOL_INFO)
>        4.8.  Location Encoding
>        4.9.  Compact Encoding Mode
>    5.  Textual Format
>        5.1.  Overview
>        5.2.  Grammar
>        5.3.  Example
>    6.  Extensibility
>        6.1.  Binary Format
>        6.2.  Textual Format
>        6.3.  Forward Compatibility
>    7.  Experimental Results
> 
> 
> 1.  Introduction
> 
>    The existing GCOV AutoFDO profile formats (versions 2 and 3) use a
>    fixed, non-extensible tag-length sequential layout. Version 4 redesigns
>    the format to be a section-based one with the following goals:
> 
>      - The ability to partially read the profile, i.e. unrequired sections are
>        simply skipped by a reader.
> 
>      - Forward compatibility through typed location records with a
>        trailing size field for unknown types. This is useful (for example) for
>        future work regarding branch profiling for better if-conversion
>        heuristics.
> 
>      - Reduced file size through trie-compressed string tables,
>        optional discriminators, zero-count elision, and variable-width
>        sample counts, as detailed in later sections.
> 
>      - Optional compact encoding using Protocol Buffers-style varints
>        for further size reduction.
> 
>      - A companion human-readable textual format for inspection.
> 
>    Measured against a GCC bootstrap profile (58 MB in v3), the v4
>    binary format achieves -43% (33 MB) in normal mode and -72%
>    (16 MB) in compact mode. Against SPEC CPU 2017 profiles, it
>    also achieves -43% in normal mode and -72% in compact mode on average.
> 
>    All multi-byte integers in the binary format are unsigned and
>    stored in big-endian byte order.
> 
>    "Varint" refers to the variable-length integer encoding defined by Protocol
>    Buffers: each byte uses 7 data bits with the MSB as a continuation flag,
>    least significant group first
>    (see https://protobuf.dev/programming-guides/encoding/#varints).
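For reference, the varint encoding described above can be sketched as follows. This is a minimal illustration of the Protocol Buffers scheme only; the function names encode_varint and decode_varint are hypothetical and are not part of the proposal.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, least significant 7-bit group first."""
    if value < 0:
        raise ValueError("varints in this sketch are unsigned")
    out = bytearray()
    while True:
        group = value & 0x7F          # low 7 data bits
        value >>= 7
        if value:
            out.append(group | 0x80)  # MSB set: more bytes follow
        else:
            out.append(group)         # MSB clear: last byte
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint at data[pos:]; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

For example, 300 encodes to the two bytes 0xAC 0x02, while any value below 128 fits in a single byte.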
> 
> 2.  Motivation
> 
>    The primary issue with the existing GCOV format is the lack of extensibility
>    and backwards-compatibility. Adding features to the format necessitates
>    bumping the version number, which also requires updating all of the tooling
>    working with the format.
> 
>    Secondly, the profile is required to be streamed in its entirety from disk
>    whenever it is read, which leads to duplicate work when multiple TUs are
>    compiled as the compiler has to read the entire profile per TU. Having the
>    ability to read the profile partially allows getting rid of most of this
>    duplicate work.
> 
> 3.  Open Questions
> 
>    - Can the GCOV profile format be renamed to something better? The
>      current name is confusing and collides with the existing GCC PGO
>      format. Perhaps "AFDO" could work.

I would vote for renaming. Gcov is already confusing for -fprofile-use,
but that name has historical reasons. With auto-fdo it is even more
confusing since the file format is not compatible.  .afdo works for me
(or perhaps something with gcc in the name).
> 
> 4.  Binary Format
> 
> 4.1.  File Header
> 
>    The file begins with a fixed header followed by a section table.
> 
>      GCOV_MAGIC            4 bytes   ASCII 'gcov' (0x67636F76)
If we rename it, perhaps afdo.  gcov tools can be extended to recognize
it meaningfully.
>      GCOV_VERSION          4 bytes   0x00000004
>      GCOV_HEADER_BITMASK   1 byte
>        Bit 7    : compact flag
>        Bits 0-6 : reserved
>      GCOV_NUM_SECTIONS     7 bytes   Number of entries in the
>                                      section table (excludes the
>                                      two fixed sections below)
> 
>    Immediately following are offset/size pairs for two fixed
>    sections, then the section table:
> 
>      GCOV_SUMMARY_OFFSET        8 bytes
>      GCOV_SUMMARY_SIZE          8 bytes
>      GCOV_FILE_NAMES_OFFSET     8 bytes
>      GCOV_FILE_NAMES_SIZE       8 bytes
> 
>      GCOV_SECTION (repeated GCOV_NUM_SECTIONS times):
>        GCOV_SECTION_OFFSET      8 bytes
>        GCOV_SECTION_SIZE        8 bytes
> 
>    All offsets are byte offsets from the start of the file.
> 
>    The offsets for the summary and file names sections are provided
>    for fast access. There is no specified ordering of any of the
>    sections. The indexing is done based on the order they are placed
>    within the file, and is inclusive of the two sections mentioned
>    before.
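To make the header layout above concrete, here is a rough sketch of a reader for it. It assumes the fields are packed back to back with no padding, exactly in the order listed, and that a reader rejects any version other than 4; both points are assumptions rather than something the proposal states. The name parse_header and the dictionary keys are hypothetical.

```python
import struct


def parse_header(buf: bytes) -> dict:
    """Parse the fixed header and section table (big-endian, unsigned)."""
    magic, version = struct.unpack_from(">4sI", buf, 0)
    if magic != b"gcov":
        raise ValueError("bad magic")
    if version != 4:
        # Assumption: this sketch only understands version 4.
        raise ValueError("unsupported version %d" % version)

    bitmask = buf[8]
    # GCOV_NUM_SECTIONS is 7 bytes wide, so struct has no code for it.
    num_sections = int.from_bytes(buf[9:16], "big")

    # Offset/size pairs for the two fixed sections.
    sum_off, sum_size, fn_off, fn_size = struct.unpack_from(">4Q", buf, 16)

    # Section table: GCOV_NUM_SECTIONS offset/size pairs of 16 bytes each.
    sections = [
        struct.unpack_from(">2Q", buf, 48 + 16 * i)
        for i in range(num_sections)
    ]
    return {
        "compact": bool(bitmask & 0x80),   # bit 7 of GCOV_HEADER_BITMASK
        "summary": (sum_off, sum_size),
        "file_names": (fn_off, fn_size),
        "sections": sections,
    }
```

With this layout the fixed part of the header is 48 bytes, so a reader can locate the summary and file names sections after a single small read.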
> 
> 4.2.  Section Layout
> 
>    Each section begins with a one-byte bitmask:
> 
>      GCOV_SECTION_BITMASK   1 byte
>        Bit 7    : compact flag for this section
>        Bits 0-6 : section type
> 
>    Defined section types:
> 
>      0x01   GCOV_STRING_TABLE
>      0x02   GCOV_SUMMARY_INFO
>      0x03   GCOV_FILE_NAMES
>      0x04   GCOV_SYMBOL_NAMES
>      0x05   GCOV_SYMBOL_INFO
> 
>    The section bitmask is followed immediately by the section data as
>    defined in the subsections below.
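As a small illustration of the section bitmask, the following sketch splits the byte into the compact flag and the section type. Returning "unknown" for an undefined type is an assumption made here for illustration; the name decode_section_bitmask is hypothetical.

```python
from typing import Tuple

# Section types as listed in the proposal.
SECTION_TYPES = {
    0x01: "GCOV_STRING_TABLE",
    0x02: "GCOV_SUMMARY_INFO",
    0x03: "GCOV_FILE_NAMES",
    0x04: "GCOV_SYMBOL_NAMES",
    0x05: "GCOV_SYMBOL_INFO",
}


def decode_section_bitmask(bitmask: int) -> Tuple[bool, str]:
    """Return (compact flag, section type name) for a section's first byte."""
    compact = bool(bitmask & 0x80)   # bit 7
    section_type = bitmask & 0x7F    # bits 0-6
    # Assumption: types not listed above are reported as "unknown".
    return compact, SECTION_TYPES.get(section_type, "unknown")
```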
> 
> 4.3.  String Table Section (GCOV_STRING_TABLE)
> 
>    Each translation unit has its own string table section, storing
>    its symbol name strings in a path-compressed trie. This exploits
>    the shared prefixes common in C++ mangled names (e.g.
>    "_ZNSt6vector...").
> 
>      GCOV_NUM_STRINGS   4 bytes   Total number of strings
> 
>    Followed by a serialized trie starting at the root node:
> 
>      GCOV_TRIE_NODE:
>        TRIE_NODE_BITMASK        1 byte
>          Bit 7:  terminal flag (a complete string ends here)
>          Bits 0-6: number of children (0-127)
>        TRIE_TERMINAL_ID         4 bytes (present only if terminal)
>                                 String index for this terminal
>        TRIE_CHILD (repeated for each child):
>          TRIE_EDGE_LABEL_LENGTH   2 bytes
>          TRIE_EDGE_LABEL          variable (TRIE_EDGE_LABEL_LENGTH
>                                   bytes)
>          GCOV_TRIE_NODE           (recursively)
> 
>    The trie is path-compressed: chains of nodes where each node has
>    exactly one child and is not a terminal are collapsed into a single
>    edge with a multi-character label. To reconstruct a string,
>    edge labels are concatenated from the root to the terminal node, like a
>    typical trie.
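The trie layout above can be read back with a short recursive walk. The sketch below covers the normal (non-compact) encoding only and assumes that the given position already points past the one-byte section bitmask. Checking the number of recovered strings against GCOV_NUM_STRINGS is an extra sanity check added here, not something the proposal requires. The name read_string_table is hypothetical.

```python
import struct
from typing import Dict, Tuple


def read_string_table(buf: bytes, pos: int = 0) -> Tuple[Dict[int, bytes], int]:
    """Decode a serialized trie; return ({string index: string}, end position)."""
    (num_strings,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    strings: Dict[int, bytes] = {}

    def read_node(pos: int, prefix: bytes) -> int:
        bitmask = buf[pos]
        pos += 1
        if bitmask & 0x80:                       # terminal flag
            (string_id,) = struct.unpack_from(">I", buf, pos)
            pos += 4
            strings[string_id] = prefix          # labels from root to here
        for _ in range(bitmask & 0x7F):          # number of children
            (label_len,) = struct.unpack_from(">H", buf, pos)
            pos += 2
            label = buf[pos:pos + label_len]
            pos += label_len
            pos = read_node(pos, prefix + label)
        return pos

    pos = read_node(pos, b"")
    if len(strings) != num_strings:
        # Assumption: a mismatch is treated as a corrupt table.
        raise ValueError("string count mismatch")
    return strings, pos
```

For instance, a table holding "_ZN3foo" and "_ZN3bar" stores the shared "_ZN3" prefix once, as a single edge from the root, followed by two child edges "foo" and "bar".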
Path compression looks like a good idea that I had not thought of previously.
Do we want independent string tables in order to support independent
reading?
Note that with profile-use we will end up with possibly massive TUs
anyway (which is a problem e.g. for dwarf).
> 
> 4.4.  Summary Section (GCOV_SUMMARY_INFO)
> 
>    The summary section contains aggregate statistics for the entire
>    profile.
> 
>      SUMMARY_TOTAL_COUNT            8 bytes
>      SUMMARY_MAX_COUNT              8 bytes
>      SUMMARY_MAX_FN_COUNT           8 bytes
>      SUMMARY_NUM_COUNTS             8 bytes
>      SUMMARY_NUM_FUNCTIONS          8 bytes
>      SUMMARY_NUM_DETAILED_ENTRIES   8 bytes
> 
>    Followed by detailed histogram entries:
> 
>      SUMMARY_DETAILED_ENTRY (repeated SUMMARY_NUM_DETAILED_ENTRIES
>                              times):
>        ENTRY_CUTOFF       4 bytes   Cutoff value
>        ENTRY_MIN_COUNT    8 bytes   Minimum count at this percentile
>        ENTRY_NUM_COUNTS   8 bytes   Number of counters at or above this minimum
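A reader for the summary layout above might look like the sketch below. It handles the normal (non-compact) encoding only, assumes the fields are packed without padding, and assumes the position points past the section bitmask. The name parse_summary and the dictionary keys are hypothetical.

```python
import struct
from typing import Tuple


def parse_summary(buf: bytes, pos: int = 0) -> Tuple[dict, int]:
    """Parse the six 8-byte totals and the detailed histogram entries."""
    (total_count, max_count, max_fn_count,
     num_counts, num_functions, num_entries) = struct.unpack_from(">6Q", buf, pos)
    pos += 48

    entries = []
    for _ in range(num_entries):
        # ENTRY_CUTOFF (4 bytes), ENTRY_MIN_COUNT (8), ENTRY_NUM_COUNTS (8)
        entries.append(struct.unpack_from(">IQQ", buf, pos))
        pos += 20

    summary = {
        "total_count": total_count,
        "max_count": max_count,
        "max_fn_count": max_fn_count,
        "num_counts": num_counts,
        "num_functions": num_functions,
        "entries": entries,
    }
    return summary, pos
```

Because every field has a fixed width here, the section size is 48 + 20 * SUMMARY_NUM_DETAILED_ENTRIES bytes plus the bitmask byte, which is also why adding a field later would change the layout for existing readers.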

We may want to add more fields later.  I wonder if we want to have some
method of declaring which fields are present and what their sizes are.
> 
> 7.1.  SPEC CPU 2017
> 
>      Profile            v3        v4       v4c     v4    v4c
>      -------            --        --       ---   -----  -----
>      cpugcc_r          7.1M      4.1M     1.9M    -43%   -74%
>      cpuxalan_r        1.3M      790K     516K    -40%   -61%
>      deepsjeng_r       103K       59K      30K    -44%   -71%
>      exchange2_r        27K       16K     7.7K    -43%   -71%
>      leela_r           299K      195K     113K    -35%   -62%
>      mcf_r              27K       16K     7.8K    -43%   -71%
>      omnetpp_r         1.2M      764K     425K    -37%   -65%
>      perlbench_r       412K      229K     105K    -44%   -75%
>      specrand_ir       3.3K      1.6K      705    -53%   -79%
>      x264_r            434K      247K     113K    -43%   -74%
>      xz_r               75K       46K      23K    -39%   -69%
> 
>      TOTAL              11M      6.4M     3.2M    -42%   -71%
> 
> 7.2.  GCC Bootstrap
> 
>      Profile            v3        v4       v4c     v4    v4c
>      -------            --        --       ---   -----  -----
>      all.fda            56M       32M      16M    -43%   -71%

This does look nice.  I am looking forward to discussing it tomorrow.
I wonder if you did any comparison to LLVM's implementation?
Honza
> 
> -- 
> Regards,
> Dhruv
> 
