Artem-B wrote:

Ooh... I think I know exactly what may be causing this.

On machines where NVIDIA GPUs are used for compute only (e.g. a headless server), 
the NVIDIA kernel driver is not always loaded by default and may not have driver 
persistence enabled. The driver gets loaded when the GPU is accessed, and then 
released and unloaded when no GPU users remain. A parallel compilation with 
`--offload-arch=native` is a worst-case stress test for the driver init/deinit 
machinery, as each GPU probe is short-lived and the probing is done repeatedly.

Adding a timeout here would help, sort of, but it would be much better if we 
could either detect that GPU probing takes too long (and is likely causing the 
driver to load/unload), or cache the probing results so we do not have to run 
the same detection over and over again. This is a point in favor of pushing the 
detection out of clang and into the build system, which would be a better place 
to do it.

For the GPU detection itself, we may be able to work around the issue by 
leaving the detection app running for the duration of the compilation, which 
would prevent the driver from unloading, but that's a rather gross hack.

https://github.com/llvm/llvm-project/pull/94751
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
