https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95348
Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed    |Added
----------------------------------------------------------------------------
         Status    |ASSIGNED   |WAITING

--- Comment #7 from Martin Liška <marxin at gcc dot gnu.org> ---
Ok, I spent some time thinking about your workload and I would recommend the
following steps:

1) You should not generate profile data for each process into a different
folder, but rather merge it into one. A GCC PGO bootstrap contains ~500 .gcda
files while the process is executed ~2000x. Note that .gcda file merging
happens per file and the file is locked, so the window that can delay parallel
process execution should be reasonably small.

2) I would like to know how long one process runs and what portion of that
time is spent merging (and dumping) the profile.

3) You may consider shrinking the training run; 10,000 executions seems like a
massive training run to me.

4) The GCDA file format is not ideal and can be shrunk simply and rapidly by
e.g. gzip. For GCC PGO, it shrinks 10x.

Please provide as much information about the workload as possible so that we
can find a feasible solution.
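As a sketch of the first recommendation: GCC-instrumented binaries honor the
GCOV_PREFIX and GCOV_PREFIX_STRIP environment variables, and -fprofile-dir can
bake a profile directory in at compile time, so all processes can be pointed at
one shared directory and let libgcov's per-file locking do the merging. The
paths below are placeholders, not part of the original report:

```shell
# Sketch, assuming the workload was built with -fprofile-generate.
# /var/tmp/pgo-profile is a placeholder path.

# Option 1: fix the profile directory at compile time:
#   gcc -fprofile-generate -fprofile-dir=/var/tmp/pgo-profile ...

# Option 2: redirect at run time, without recompiling:
export GCOV_PREFIX=/var/tmp/pgo-profile  # prepended to the recorded .gcda path
export GCOV_PREFIX_STRIP=3               # strip 3 leading components of that path

# Every process now dumps into the same .gcda files; libgcov locks each file
# and merges counters, so no per-process folders are needed.
echo "profiles will land under $GCOV_PREFIX"
```

With this setup the per-process overhead is just the short lock-and-merge
window mentioned above, rather than a separate profile tree per process.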
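Point 4 can be illustrated with a small experiment. The file below is
synthetic (mostly zero counters, which is roughly what cold .gcda data looks
like), so the exact ratio is only indicative, not the 10x figure measured for
GCC PGO:

```shell
# Build a synthetic counter-heavy file; /tmp/fake.gcda is a placeholder name.
dd if=/dev/zero of=/tmp/fake.gcda bs=1024 count=512 2>/dev/null
gzip -kf /tmp/fake.gcda           # -k keeps the original for comparison
orig=$(wc -c < /tmp/fake.gcda)
comp=$(wc -c < /tmp/fake.gcda.gz)
echo "original=$orig bytes, gzipped=$comp bytes"
```

Compressing merged profiles this way is mainly useful for archiving or
shipping training data; the instrumented processes themselves still need the
uncompressed files.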