Hi Aron,

On 2025-01-10 06:57, Aron Xu wrote:
> When I was performing a rebuild for an upcoming transition, I noticed that rocsolver took a long time to build because, for most of the build, it runs at most 16 parallel jobs. It would be great if the build parallelism could be improved, but I have not researched the cause; it might be build-system related and not easy to change.
The slowest translation units are those containing the specialized kernels for small matrix sizes. These kernels are templated on an integer parameter, N, representing the size of the matrix, and they are explicitly instantiated largely to enable loop unrolling.
The compiler builds these templated functions for all combinations of N = 0...64; data type = single, double, complex, double complex; and gpu_arch = gfx803, gfx900, gfx906, gfx908, gfx90a, gfx1010, gfx1030, gfx1100, gfx1101, gfx1102. Incidentally, these kernels account for ~95% of the on-disk size of librocsolver.so.
If I recall correctly, the parallelism in this process is limited to one job per data type: separate translation units were created for the different data types specifically to increase the available parallelism. As an upstream developer of rocSOLVER, I see a few ways this could be improved:
1. Support for a HIP equivalent to CUDA_SEPARABLE_COMPILATION within CMake [1]. This would enable the build system to manage the invocation of the compiler for each GPU architecture. As it stands now, when building for multiple GPU architectures, clang invokes itself multiple times in serial and then invokes the bundler to combine the resulting artifacts. If this were managed by the build system, you could see a 10x increase in parallelism.
It should be noted that the AMD fork of clang has a flag called -parallel-jobs that allows clang to invoke itself in parallel when building for multiple architectures. Unfortunately, this is a flawed solution. The clang job count is multiplicative with the make job count and this can result in resource exhaustion in the parts of the build with the greatest make-managed parallelism. As such, you're forced to set -parallel-jobs to a relatively low value, which needlessly limits parallelism during the parts of the build with the least make-managed parallelism.
2. If the small matrix size functions in rocsolver could be rewritten to depend on kernels that operate on blocks of fixed sizes, then the kernels could be instantiated for something like N=1,2,4,8,16,32, and any size N=1...64 could be handled by a combination of those blocks. Unfortunately, previous attempts to do this failed because they introduced unacceptable performance regressions.
3. The specialized small matrix kernels could be split out of librocsolver.so and into separate code objects. The rocsolver library could then manage the build of those code objects itself within its CMake, which would allow for parallel compilation by GPU architecture. This option has a side benefit: librocsolver.so itself would be ~95% smaller, as it would contain only the generic kernels, with the size-specialized kernels moved to separate files loaded at runtime (if available).
4. The Debian build could ask CMake to generate Ninja build files rather than Make build files. If built with Ninja, the librocsolver, rocsolver-test and rocsolver-bench sources would be compiled in parallel despite the latter two depending on the former. This would reduce the number of parallelism bottlenecks, but it may result in the librocsolver library being linked while other sources are still compiling, which would increase the peak memory required for the build. There is, however, a patch that could be used as a workaround [2].
5. Once LLVM's generic targets and SPIR-V targets are supported by the HIP Runtime, we could adopt them to reduce the number of GPU targets we need to build for. This doesn't actually increase parallelism, but it would at least reduce the build time.
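For comparison, the CUDA-side feature that point 1 asks for a HIP analogue of is enabled in CMake like this (a minimal sketch; `mylib` and `kernels.cu` are placeholder names). With it, device code is compiled per translation unit and combined in a separate device-link step, all scheduled by the build system rather than by the compiler driver:

```cmake
# CUDA's existing mechanism: the build system, not the compiler driver,
# owns the per-object compile steps and the separate device-link step,
# so it can run them in parallel alongside the rest of the build.
add_library(mylib STATIC kernels.cu)
set_target_properties(mylib PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
```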
Sincerely,
Cory Bloor

[1]: https://gitlab.kitware.com/cmake/cmake/-/issues/23210
[2]: https://github.com/ROCm/rocSOLVER/pull/652