Hello-
I did not quite complete this series in time for stage 1, but I thought it
might still be worth sending now, given there has been an uptick in PRs about
this topic lately.
I started with a simple goal of streaming `#pragma GCC diagnostic'
information for LTO, so that diagnostic suppression could work in the LTO
front end. The main difficulty is that the linemap structure that LTO
creates while streaming in the data is not globally ordered; there is no
definite relation between the numerical ordering of two location_t values in
its linemap and the order in which source lines were originally
processed. There is an ordering local to each function, but this is not
enough to handle general diagnostic pragmas, since for a given source
location, it needs to be unambiguous which diagnostic pragmas were in force
at that point specifically.
The solution I went with was to stream out all of the relevant linemap data
structures into a new LTO section, so that the linemap reconstructed on the
other side could reflect the global ordering. With that done, diagnostic
pragmas work automatically without needing any LTO-specific logic.
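
To illustrate why a global ordering matters, here is a minimal sketch of a pragma-state lookup keyed on location values. It is a toy model, not GCC's implementation: the class name, method names, and the integer locations are all hypothetical. The point is only that a "last change at or before this location" query is well defined when numerical order matches source order.

```python
import bisect

# Hypothetical model: each diagnostic pragma is recorded as a state change
# at a location. Names and structure are illustrative, not GCC's.
class PragmaHistory:
    def __init__(self):
        self._locs = []      # location values, in increasing order
        self._changes = []   # parallel list of (option, enabled)

    def record(self, loc, option, enabled):
        # Pragmas are recorded in the order they are seen; with a globally
        # ordered linemap the list therefore stays sorted.
        assert not self._locs or loc >= self._locs[-1]
        self._locs.append(loc)
        self._changes.append((option, enabled))

    def enabled_at(self, loc, option, default=True):
        # Find all changes at or before loc, then scan backwards for the
        # most recent one that touches this option.
        i = bisect.bisect_right(self._locs, loc)
        for opt, enabled in reversed(self._changes[:i]):
            if opt == option:
                return enabled
        return default

history = PragmaHistory()
history.record(100, "-Wunused", False)  # e.g. "#pragma GCC diagnostic ignored"
history.record(200, "-Wunused", True)   # e.g. state restored later on
print(history.enabled_at(150, "-Wunused"))
```

If the location values were only ordered within each function, the binary search above would have no consistent meaning across functions, which is the problem described earlier.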
Testing was done on the following platforms, where I did bootstrap + regtest
of the indicated languages, both for a normal bootstrap and for one with
--with-build-config=bootstrap-lto.
x86_64-linux-gnu: all languages
ppc64le-redhat-linux (cfarm135): all languages
aarch64-redhat-linux (cfarm185): c,c++,fortran
sparcv9-sun-solaris2.11 (cfarm216): c,c++,fortran,objc
powerpc64-linux-gnu (cfarm121): all languages *
*I was not able to get bootstrap-lto to work here. compare-lto fails on
all object files; it seems that objcopy and strip both silently decline to
strip the LTO options section here? But I ran with bootstrap-lto-lean,
which skips the compare step, and the regtest at least was OK.
x86_64-apple-darwin24 - c,c++ *
*For this platform, I had to use BOOT_CFLAGS+=-g0 in order to bootstrap
with LTO, or else dsymutil was trying to use 200 GB of RAM; not sure if
that's something about my system or a known issue. (It was not related to
these patches.) It also seemed that compare-lto does not work here either,
so I did bootstrap-lto-lean.
I figured it would be of interest how the change to the streaming format
affects the object file sizes. I don't think it is very significant either
way. It is possible for the object files to be either smaller or larger than
before, depending on the nature of the locations being streamed out, but on
balance they tend to be a little smaller. The previous format streamed the
file name, line number, and column number for a location each time it was
output (avoiding duplication of the file name when possible), while the new
format streams an integer index into the table of linemaps and an integer
offset to compute the location_t (plus, separately, the linemap table
itself). The new format is preferable when the same location is streamed
multiple times. The old format does better when locations appear mostly once
and in roughly chronological order, since it requires fewer bits to store
small deltas in the 4-bit uleb format being streamed. I tried
to recapture some of that benefit in the new format by streaming the
location indices as deltas from the prior one when possible.
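
As a rough illustration of why deltas help, the sketch below counts 4-bit chunks for a variable-length encoding. It assumes each chunk carries 3 payload bits plus 1 continuation bit, and it uses a zigzag mapping for negative deltas; both are assumptions made for this example and are not taken from the patches. The index values are made up.

```python
def nibbles_needed(value):
    """Count 4-bit chunks for an unsigned value, assuming each chunk holds
    3 payload bits plus 1 continuation bit (an assumption of this sketch)."""
    count = 1
    value >>= 3
    while value:
        count += 1
        value >>= 3
    return count

def zigzag(delta):
    # Map signed deltas to unsigned so that small negative steps stay small.
    # This mapping is an illustrative choice, not necessarily what LTO does.
    return (delta << 1) if delta >= 0 else ((-delta << 1) - 1)

def cost_absolute(indices):
    # Cost of streaming every index in full.
    return sum(nibbles_needed(i) for i in indices)

def cost_delta(indices):
    # Cost of streaming each index as a delta from the previous one.
    total, prev = 0, 0
    for i in indices:
        total += nibbles_needed(zigzag(i - prev))
        prev = i
    return total

indices = [1000, 1001, 1003, 1002, 1010]  # made-up, mostly increasing
print(cost_absolute(indices), cost_delta(indices))
```

With nearby indices, only the first value pays the full cost; the rest fit in one or two chunks each.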
Here are some real-world examples that I tried:
0) GCC LTO bootstrap on x86-64
- Size of stage 3 object files under gcc/ after bootstrap-lto build of all
languages.
- Total size of *.o before this patch: 2773040k
- Total size of *.o after this patch: 2745372k (-1.00%)
1) Python (C mostly)
- Built all of Python 3.13 objects with -fno-fat-lto-objects.
- (The Python build doesn't support -fno-fat-lto-objects, but
it gets as far as creating all of the object files.)
- Total size of *.o before this patch: 93560k
- Total size of *.o after this patch: 91476k (-2.23%)
2) Boost (C++)
- Built boost 1.90.0 with b2 args:
- --build-type=complete --layout=versioned link=static lto=on
- Total size of *.o before this patch: 643372k
- Total size of *.o after this patch: 631716k (-1.81%)
3) Quantum Espresso (Fortran)
- Built commit 797f00f1d3f390f642411209b167af6668f3cb83 with -flto.
- Total size of *.o before this patch: 89972k
- Total size of *.o after this patch: 89236k (-0.82%)
Regarding the temporary files written out by WPA for the LTRANS phase, all
of the linemap sections are written just once into their own file, so the
space usage is comparable to that needed for the regular object files. In
general, the space saved vs the old streaming format could be a little more
for these than for the object files, because the same location is often
streamed into multiple partitions. For example, here are the sizes of all
the LTRANS files produced when compiling cc1plus with -flto:
- Total size of ltrans*.o before this patch: 736988k
- Total size of ltrans*.o after this patch: 714700k (-3.02%)
- The new size consists of 710616k for the 512 partitions,
plus 4084k for the linemaps object.
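
For anyone who wants to recheck the arithmetic, the percentages can be recomputed from the totals quoted above with a few lines of Python. The numbers are copied from this message; the dictionary keys and the helper name are just labels for this sketch.

```python
# (before, after) totals in kilobytes, copied from the figures above.
sizes_k = {
    "GCC stage 3 gcc/*.o": (2773040, 2745372),
    "Python 3.13": (93560, 91476),
    "Boost 1.90.0": (643372, 631716),
    "Quantum Espresso": (89972, 89236),
    "cc1plus ltrans*.o": (736988, 714700),
}

def percent_change(before, after):
    """Relative size change in percent; negative means smaller."""
    return 100.0 * (after - before) / before

for name, (before, after) in sizes_k.items():
    print(f"{name:22s} {percent_change(before, after):+.2f}%")

# The LTRANS total is the partitions plus the separate linemaps object.
ltrans_total_k = 710616 + 4084
```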
Another performance-related concern would be the number of line maps and
location_t's used by the LTO front end. This was the subject of PR65536
(described more in the commit message for [2/5]). To measure this, I looked
at the output of -fmem-report from the WPA stage of building cc1plus from
LTO-enabled object files. (This is the stage that reads in all of the files
at once, before optionally producing LTRANS partitions, so it's the stage
that uses the largest number of locations.)
Here is the relevant portion of it, first for the case -flto-partition=none. In
this mode, there is no WPA + LTRANS split; rather, all the functions are read in
and processed as one compilation.
Old way:
--------
Number of ordinary maps used: 1386k
Ordinary map used size: 43M
Number of ordinary maps allocated: 4096k
Ordinary maps allocated size: 128M
Total allocated maps size: 128M
Total used maps size: 43M
Ad-hoc table size: 1280M
Ad-hoc table entries used: 16M
optimized_ranges: 540k
unoptimized_ranges: 0
max location: 2912904036992
(Note: the last row, "max location", is not normally part of the output; I
temporarily modified -fmem-report to add it because it is relevant to PR65536.)
New way:
--------
Number of ordinary maps used: 139k
Ordinary map used size: 4473k
Number of ordinary maps allocated: 256k
Ordinary maps allocated size: 8192k
Total allocated maps size: 8192k
Total used maps size: 4473k
Ad-hoc table size: 640M
Ad-hoc table entries used: 14M
optimized_ranges: 540k
unoptimized_ranges: 0
max location: 9030944385
So this much is satisfactory. There are about 10X fewer maps created with
the new approach. The max location_t is reduced even more, by a factor of
300X or so. It got so much smaller because the new approach also includes
an optimization for the linemap that could have been done at any time, even
without this change (and, if we don't go with my patches, then it should
probably just be done as a one-line change). The linemap has a configurable
number of bits reserved inside each
location_t to store range information for location ranges. With 64-bit
location_t, the default number of bits for ranges is set in toplev.cc to 7
bits. The LTO front end does not use range information, so it could change
this to 0 with no change in behavior other than having 128X more location_t
space available. In the new approach to the linemap, I do use 0 range bits
for the maps. [As an aside, if it were desired to include range information
in the locations, this could be done as a future enhancement; it is
conceptually very straightforward, but it does meaningfully increase the
size of the streamed location data and the number of adhoc locations.]
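
The 128X figure follows directly from the 7 reserved bits. The toy packing below shows the idea; the field layout and function names are hypothetical and only mirror the concept, not libcpp's actual location_t encoding.

```python
def make_location(position, range_bits, range_value=0):
    """Toy packing: the low `range_bits` bits hold range info and the
    remaining bits hold the position. Illustrative only."""
    assert 0 <= range_value < (1 << range_bits)
    return (position << range_bits) | range_value

def positions_available(location_bits, range_bits):
    # Number of distinct positions that fit once range bits are reserved.
    return 1 << (location_bits - range_bits)

# The same position consumes 128X more of the value space with 7 range bits.
with_ranges = make_location(1000, range_bits=7)
without_ranges = make_location(1000, range_bits=0)

# 64 is an illustrative width; the ratio does not depend on it.
ratio = positions_available(64, 0) // positions_available(64, 7)
print(with_ranges, without_ranges, ratio)
```

The measured reduction in max location (roughly 300X) is larger than 128X because the new approach also creates fewer maps.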
I feel the above results are a good outcome for the new approach and are
sufficient also to close out PR65536. There is (at least) one downside that
should be mentioned, however. Here is the same data, but using the default LTO
partition rather than -flto-partition=none. In this mode, the WPA phase just
reads the decls section and not the function bodies; then it partitions
everything into a number of LTRANS files that are subsequently processed
separately.
Old way:
--------
Number of ordinary maps used: 96k
Ordinary map used size: 3100k
Number of ordinary maps allocated: 256k
Ordinary maps allocated size: 8192k
Total allocated maps size: 8192k
Total used maps size: 3100k
Ad-hoc table size: 5120k
Ad-hoc table entries used: 100k
optimized_ranges: 0
unoptimized_ranges: 0
max location: 261878903168
New way:
--------
Number of ordinary maps used: 139k
Ordinary map used size: 4473k
Number of ordinary maps allocated: 256k
Ordinary maps allocated size: 8192k
Total allocated maps size: 8192k
Total used maps size: 4473k
Ad-hoc table size: 5120k
Ad-hoc table entries used: 80k
optimized_ranges: 0
unoptimized_ranges: 0
max location: 9030944385
What this shows is that with the old approach, fewer maps are used when doing
just WPA than when doing the full compilation; but with
the new approach, the number of maps and locations used is the same in both
cases. This is because with the old approach, most locations (other than those
read from the decls section) are only created when the function bodies are read,
and they are not read during WPA; with the new approach, it is necessary to read
all the maps from the linemap section during WPA (although the function bodies
are still not read, as before). As a result, when doing just WPA, the new
approach creates around 45% more maps than the old way (139k vs. 96k). I don't think this is a
concern from the perspective of PR65536, because what really matters there is
the number of location_t used, and that is still significantly smaller than
before. It does mean there is potentially more memory used with the new approach
during WPA. In this particular case, the memory usage did not actually increase
at all, because the map array is over-allocated anyway (256k maps allocated in
both cases). I don't expect
this to be a deal-breaker for the new approach; the memory used by linemaps is a
small fraction of the total.
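
The old/new ratios discussed above can be recomputed from the -fmem-report excerpts as follows. The values are copied from the reports in this message; the dictionary layout and the mode labels are just for this sketch.

```python
# Figures copied from the -fmem-report excerpts above ("k" = thousands).
maps_used_k = {
    "partition=none": {"old": 1386, "new": 139},
    "default (WPA only)": {"old": 96, "new": 139},
}
max_location = {
    "partition=none": {"old": 2912904036992, "new": 9030944385},
    "default (WPA only)": {"old": 261878903168, "new": 9030944385},
}

def ratio(table, mode):
    """Old value divided by new value for the given mode."""
    return table[mode]["old"] / table[mode]["new"]

for mode in maps_used_k:
    print(f"{mode}: maps old/new = {ratio(maps_used_k, mode):.2f}, "
          f"max location old/new = {ratio(max_location, mode):.1f}")
```

A maps ratio below 1 in the WPA-only row means the new approach uses more maps there, while the max location is still much smaller than before in both modes.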
[1/5] libcpp: Preparation for LTO linemap changes
[2/5] lto: Overhaul approach to location streaming [PR65536]
[3/5] diagnostics: Preparation for LTO diagnostic pragma support
[4/5] testsuite: Add dg-lto-additional-options directive
[5/5] lto: Support #pragma GCC diagnostic [PR80922] [PR106823] [PR107936]
[2/5] implements the new streaming format and is the largest one; [1/5],
[3/5], and [4/5] are small preparatory patches, and [5/5], which actually
implements diagnostic pragma streaming, is a small extension of [2/5].
Thanks in advance for taking a look at it; I hope this looks like a useful
direction, whether for now or for stage 1.
-Lewis