On 15/10/25 14:34, Jan Hubicka wrote:
Hi Honza,
The assumption was that streaming the statistics in from the file would have
less overhead than computing them while reading in the counters. If that
isn't the case, then it is better not to have them.
Computing the cutoff is a linear-time algorithm, so we can do it cheaply
enough while streaming in: push all counts into one large array and use
quickselect. However, we need to stream in all the data, which is
inherently nonlinear (#invocations of compiler * #of samples) and I
think not scalable enough.
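For illustration, a minimal sketch of that quickselect step, assuming a
plain per-count percentile (std::nth_element runs in expected linear
time; the function name is made up for this example):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  /* Return the sample count at the given per-mille rank, i.e. a cutoff
     such that roughly PERMILLE/1000 of all counts fall below it.
     Expected linear time; reorders COUNTS in place.  */
  static uint64_t
  cutoff_at_permille (std::vector<uint64_t> &counts, unsigned permille)
  {
    if (counts.empty ())
      return 0;
    size_t k = counts.size () * (size_t) permille / 1000;
    if (k >= counts.size ())
      k = counts.size () - 1;
    std::nth_element (counts.begin (), counts.begin () + k, counts.end ());
    return counts[k];
  }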
Yes, that is a good idea. But as far as I can see, LLVM does read the
entirety of the file while streaming the profile information (at least
going off of the SampleProfileReaderBinary::readImpl () implementation).
Andi is right that the current auto profile reader ignores the fact that
the gcov file format has well-defined sections and a mechanism to add new
ones without breaking old tools (so it is the equivalent of LLVM's
extBinary; I do not see a need for two binary formats). I will fix that
incrementally. Currently dump-gcov is confused by the autoprofile "gcov",
so I guess the file format is not implemented correctly. We also want the
reader to skip unknown sections and do some sanity checking.
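As a hedged sketch of such a reader (loosely modeled on how gcov.cc
walks (tag, length) records; the AFDO handling and the helper below are
assumptions, not the actual reader):

  #include "gcov-io.h"

  extern void read_function_instance ();  /* hypothetical body reader */

  static void
  read_afdo_sections ()
  {
    gcov_unsigned_t tag;
    while ((tag = gcov_read_unsigned ()))
      {
        gcov_unsigned_t length = gcov_read_unsigned ();
        gcov_position_t base = gcov_position ();
        switch (tag)
          {
          case GCOV_TAG_AFDO_FUNCTION:
            read_function_instance ();
            break;
          default:
            /* Unknown section: fall through and skip it rather than
               failing, so new sections do not break old readers.  */
            break;
          }
        /* Resynchronize to the start of the next record.  */
        gcov_sync (base, length);
      }
  }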
Given the current readers, we indeed need to bump up the gcov version.
I checked SampleProfileReaderExtBinaryBase::readOneSection. It has:
- SecProfSummary, which is what you added;
- SecNameTable, which we have;
- SecLBRProfile, which seems to be the actual profile counts;
- SecFuncOffsetTable, which seems to be the helper allowing random
  access to function profiles (which we miss but will need to add to
  scale to very large projects);
- SecFuncMetadata, which records some info about the profile (e.g.
  whether unique names are used);
- SecProfileSymbolList, which I did not look into;
plus the ability to skip "custom" sections.
I think what we want is something like the SampleProfileReaderExtBinaryBase
implementation, which uses a section header and is designed to be extensible.
Yes, it seems the current code is halfway there. While it attempts to
make the actual file format extensible and compatible with normal gcov
tools, the reader is broken.
Yeah, I had designed this following the implementation in LLVM, which
uses both hot and cold counts, and potentially calls with different
percentiles (instead of a global threshold), which requires caching. I
wasn't sure if there were places in GCC where this is done, so this was
a bit of an open-ended decision.
We have three types of counts:
- hot: those above the threshold
- likely never executed: those where we can prove that they are likely
  not executed, e.g. code leading to abort ()
- rest
We can subdivide the rest, but so far I have not seen a good motivation
for that.
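As a tiny sketch of that three-way split (names are hypothetical; the
zero-count rule follows the description of never-executed counts below):

  #include <cstdint>

  enum class afdo_count_class { HOT, NORMAL, LIKELY_NEVER_EXECUTED };

  /* Zero counts (from feedback or static estimation, e.g. paths
     leading to abort ()) are likely never executed; counts at or above
     the hot cutoff are hot; everything else is the "rest".  */
  static afdo_count_class
  classify_count (uint64_t count, uint64_t hot_cutoff)
  {
    if (count == 0)
      return afdo_count_class::LIKELY_NEVER_EXECUTED;
    if (count >= hot_cutoff)
      return afdo_count_class::HOT;
    return afdo_count_class::NORMAL;
  }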
With LTO, GCC has analogous functionality in ipa-profile.cc:ipa_profile,
which computes a histogram of all basic block counts in the program,
weighted by their expected execution counts, and derives the hot BB
cutoff from param_hot_bb_count_ws_permille.
We consider hot counts, cold counts and probably-never-executed counts
(which is a 0 count from profile feedback or static estimation).
So I would drop --param afdo-profile-summary-cutoff-hotcold_count and
use --param hot-bb-count-ws-permille for afdo as well.
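For comparison with the quickselect sketch above, a working-set style
cutoff in the spirit of ipa-profile's histogram could look like this
(a sketch, not the actual ipa-profile.cc code; each count is weighted
by itself, approximating time spent, and sorting is used for clarity):

  #include <algorithm>
  #include <cstdint>
  #include <functional>
  #include <vector>

  /* Sort counts in descending order and return the count at which the
     cumulative weighted sum first reaches PERMILLE/1000 of the total,
     i.e. the smallest count still considered hot.  */
  static uint64_t
  hot_cutoff_ws (std::vector<uint64_t> counts, unsigned permille)
  {
    std::sort (counts.begin (), counts.end (), std::greater<uint64_t> ());
    uint64_t total = 0;
    for (uint64_t c : counts)
      total += c;
    uint64_t threshold = total / 1000 * permille;
    uint64_t cumulative = 0;
    for (uint64_t c : counts)
      {
        cumulative += c;
        if (cumulative >= threshold)
          return c;
      }
    return 0;
  }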
The current design streams out 16 hardcoded percentiles to the GCOV file,
then looks up the closest percentile to the --param. Would it then be
better to just calculate one percentile (based on hot-bb-count-ws-permille)
while streaming in the information? This would remove the need to modify
the GCOV format at all, if the other summary items can also be cheaply
calculated.
I think we can go with your changes and incrementally implement the
function offset table. We need a table that links a symbol name index to
all its function instances (offline and inline). This seems similar to
FuncOffsetTable.
Reading in can then be organized by streaming in the symbol name table,
deciding which bodies we need, and then streaming those in while
skipping the bodies we do not need.
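A minimal sketch of that table and the two-phase read (all names and
the stream helpers are hypothetical assumptions):

  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  /* Hypothetical stand-ins for gcov-io stream positioning.  */
  extern void seek_to (uint64_t offset);
  extern void read_function_body ();

  /* Maps a symbol name index to the stream offsets of every profile
     body (offline and inlined instances) recorded for that symbol.  */
  struct afdo_func_offset_table
  {
    std::unordered_map<uint32_t, std::vector<uint64_t>> instances;
  };

  /* Phase 2: read only the bodies needed by this translation unit,
     seeking past everything else.  */
  static void
  read_needed_bodies (const afdo_func_offset_table &table,
                      const std::vector<uint32_t> &needed)
  {
    for (uint32_t name_index : needed)
      {
        auto it = table.instances.find (name_index);
        if (it == table.instances.end ())
          continue;
        for (uint64_t offset : it->second)
          {
            seek_to (offset);
            read_function_body ();
          }
      }
  }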
Hi Honza,
I've addressed your comments and sent out a v2 of the patch. Given your
comments, I think it may be useful to only have two things in the
summary - namely the detailed summaries and the maximum count - as those
require streaming the whole profile to compute.
I think it may also be useful to avoid the summary section entirely in
the GCOV for profiles which are small enough that computing the
information within the compiler itself will be cheaper than streaming
it - perhaps this could help avoid the compile-time impact you were
seeing?
I tried to measure clang's overhead for reading the whole profile, but
did not manage to convince clang to read any file format other than
extbinary. However, running
llvm-profdata merge -o a llvm.profdata
takes 1.2s and
llvm-profdata overlap llvm.profdata llvm.profdata
takes 0.55s
llvm-profdata merge --binary and --extbinary produce the same data, so I
guess the llvm folks replaced the binary format by extbinary and the
binary readers/writers are now kind of dead.
It seems reading the whole profile takes around 0.2-0.3s, which is still
a lot when building small files (and would get worse with a bigger
training run than the one I used).
Honza
--
Regards,
Dhruv