Hi all,
We are using PETSc 3.20 in our code and successfully running several solvers on NVIDIA GPUs with OpenMPI libraries that are not GPU-aware (so I need to add the flag -use_gpu_aware_mpi 0). But now, when using a GPU-aware OpenMPI library (OpenMPI 4.0.5 or 4.1.5 from NVHPC), some parallel calculations fail with KSP_DIVERGED_ITS or KSP_DIVERGED_DTOL in several configurations.

It may run well on a small test case (the matrix is symmetric) with:

   -ksp_type cg -pc_type gamg -pc_gamg_type classical

But with a larger number of devices, for instance more than 4 or 8, it may suddenly fail. If I switch to another solver (BiCGStab), it may converge:

   -ksp_type bcgs -pc_type gamg -pc_gamg_type classical

The most sensitive cases, where it diverges, are the following:

   -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg
   -ksp_type cg -pc_type gamg -pc_gamg_type classical

And the bcgs workaround does not work every time...

It seems to work without problems with aggregation (on at least 128 GPUs in my simulation):

   -ksp_type cg -pc_type gamg -pc_gamg_type agg

So I guess something odd is happening in my code during the PETSc solve with GPU-aware MPI, since all of the previous configurations work with non-GPU-aware MPI.

Here is the -ksp_view output from one failure with the first configuration:

KSP Object: () 8 MPI processes
  type: cg
  maximum iterations=10000, nonzero initial guess
  tolerances: relative=0., absolute=0.0001, divergence=10000.
  left preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: () 8 MPI processes
  type: hypre
    HYPRE BoomerAMG preconditioning
      Cycle type V
      Maximum number of levels 25
      Maximum number of iterations PER hypre call 1
      Convergence tolerance PER hypre call 0.
      Threshold for strong coupling 0.7
      Interpolation truncation factor 0.
      Interpolation: max elements per row 0
      Number of levels of aggressive coarsening 0
      Number of paths for aggressive coarsening 1
      Maximum row sums 0.9
      Sweeps down         1
      Sweeps up           1
      Sweeps on coarse    1
      Relax down          l1scaled-Jacobi
      Relax up            l1scaled-Jacobi
      Relax on coarse     Gaussian-elimination
      Relax weight  (all)      1.
      Outer relax weight (all) 1.
      Maximum size of coarsest grid 9
      Minimum size of coarsest grid 1
      Not using CF-relaxation
      Not using more complex smoothers.
      Measure type        local
      Coarsen type        PMIS
      Interpolation type  ext+i
      SpGEMM type         cusparse
  linear system matrix = precond matrix:
  Mat Object: () 8 MPI processes
    type: mpiaijcusparse
    rows=64000, cols=64000
    total: nonzeros=311040, allocated nonzeros=311040
    total number of mallocs used during MatSetValues calls=0
      not using I-node (on process 0) routines

I have not yet succeeded in creating a reproducer with the ex.c examples (see the sketch after my signature for the kind of command I have been trying)...

Have you seen this kind of behaviour before? Should I update my PETSc version?

Thanks for any advice,

Pierre LEDAC
Commissariat à l’énergie atomique et aux énergies alternatives
Centre de SACLAY
DES/ISAS/DM2S/SGLS/LCAN
Bâtiment 451 – point courrier n°43
F-91191 Gif-sur-Yvette
+33 1 69 08 04 03
+33 6 83 42 05 79
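P.S. For completeness: as far as I know, the usual way to confirm that an OpenMPI build (here the NVHPC one) is actually CUDA-aware is the query below; I mention it only because our failures appear specific to the GPU-aware builds.

   # should print mca:mpi:base:param:mpi_built_with_cuda_support:value:true for a CUDA-aware build
   ompi_info --parsable --all | grep mpi_built_with_cuda_support:value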
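P.P.S. Here is a sketch of the kind of reproducer command I have been trying, based on the stock src/ksp/ksp/tutorials/ex45.c 3D Laplacian; the grid sizes and process count below are placeholders rather than my actual setup, so please take it only as an illustration of the options used.

   # run a stock PETSc example on GPU with the failing solver configuration
   mpirun -np 8 ./ex45 -da_grid_x 40 -da_grid_y 40 -da_grid_z 40 \
     -dm_mat_type aijcusparse -dm_vec_type cuda \
     -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg \
     -ksp_monitor_true_residual -ksp_converged_reason

So far this kind of run has not diverged for me the way the real application does.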