This communication is all in PCApply. What -pc_type are you using? From the MatSOR entries in the log it looks like -pc_type sor; that is not implemented on the GPU, so every preconditioner application copies data back to the CPU. You can use -pc_type jacobi instead.
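For example, run with something along these lines (the executable name and process count here are only placeholders; keep whatever you already use):

  mpirun -gpu -n 2 ./your_app -snes_type ksponly -pc_type jacobi \
      -dm_vec_type cuda -dm_mat_type aijcusparse -log_view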
On Thu, Sep 24, 2020 at 11:08 AM Zhang, Chonglin <zhang...@rpi.edu> wrote:

> Dear PETSc Users,
>
> I have some questions regarding proper GPU usage. I would like to know the proper way to:
> (1) solve a linear equation in SNES using the GPU in PETSc: what syntax/arguments should I be using?
> (2) avoid/reduce the "CpuToGpu count" and "GpuToCpu count" data transfers shown in the PETSc log file when using CUDA-aware MPI.
>
> Details of what I am doing now and my observations are below.
>
> System and compilers used:
> (1) RPI's AiMOS computer (node-wise, it is the same as Summit);
> (2) GCC 7.4.0 and Spectrum-MPI 10.3.
>
> I am doing the following to solve the linear Poisson equation with the SNES interface, under DMPlex:
> (1) using DMPlex to set up the unstructured mesh;
> (2) using the DM to create the vector and matrix;
> (3) using the SNES interface to solve the linear Poisson equation, with "-snes_type ksponly";
> (4) using "-dm_vec_type cuda" and "-dm_mat_type aijcusparse" for GPU vectors and matrices, as suggested on this webpage: https://www.mcs.anl.gov/petsc/features/gpus.html
> (5) using "use_gpu_aware_mpi" with PETSc, and "mpirun -gpu" to enable GPU-Direct (similar to "srun --smpiargs=-gpu" on Summit): https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct; https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
> (6) using "-options_left" to check that all the arguments are accepted and used by PETSc;
> (7) after problem setup, running "SNESSolve()" multiple times to solve the linear problem and inspecting the log produced with "-log_view".
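For reference, the DMPlex + SNES setup you describe can be driven almost entirely from the options database. A minimal sketch is below; the FE discretization and the residual/Jacobian callbacks are omitted, so this is only illustrative and not meant to match your actual code:

#include <petsc.h>

int main(int argc, char **argv)
{
  DM             dm;
  SNES           snes;
  Vec            u;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  /* Unstructured mesh: the mesh options and -dm_vec_type cuda /
     -dm_mat_type aijcusparse are all picked up from the command line */
  ierr = DMCreate(PETSC_COMM_WORLD, &dm);CHKERRQ(ierr);
  ierr = DMSetType(dm, DMPLEX);CHKERRQ(ierr);
  ierr = DMSetFromOptions(dm);CHKERRQ(ierr);
  /* ... FE discretization and residual/Jacobian callbacks would be set up here ... */
  ierr = DMCreateGlobalVector(dm, &u);CHKERRQ(ierr);   /* inherits the CUDA vector type */
  ierr = SNESCreate(PETSC_COMM_WORLD, &snes);CHKERRQ(ierr);
  ierr = SNESSetDM(snes, dm);CHKERRQ(ierr);
  ierr = SNESSetFromOptions(snes);CHKERRQ(ierr);       /* -snes_type ksponly, -pc_type ..., etc. */
  ierr = SNESSolve(snes, NULL, u);CHKERRQ(ierr);
  ierr = VecDestroy(&u);CHKERRQ(ierr);
  ierr = SNESDestroy(&snes);CHKERRQ(ierr);
  ierr = DMDestroy(&dm);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

The two SetFromOptions() calls are what pick up -dm_vec_type cuda, -dm_mat_type aijcusparse, -snes_type ksponly, and the -pc_type choice at run time.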
> I noticed that if I run "SNESSolve()" 500 times, instead of 50 times, the "CpuToGpu count" and/or "GpuToCpu count" increased roughly 10 times for some of the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, MatCUSPARSCopyTo. See below for a truncated log corresponding to running SNESSolve() 500 times:
>
> Event             Count     Time (sec)     Flop                                  --- Global ---    --- Stage ----   Total    GPU    - CpuToGpu -    - GpuToCpu -  GPU
>                     Max Ratio  Max     Ratio  Max      Ratio Mess    AvgLen  Reduct %T %F %M %L %R  %T %F %M %L %R  Mflop/s Mflop/s  Count  Size     Count  Size   %F
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided       510 1.0 4.9205e-03 1.1 0.00e+00 0.0 3.5e+01 4.0e+00 1.0e+03  0  0  0  0  0    0  0 21  0  0      0       0      0 0.00e+00      0 0.00e+00   0
> BuildTwoSidedF      501 1.0 1.0199e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+03  0  0  0  0  0    0  0  0  0  0      0       0      0 0.00e+00      0 0.00e+00   0
> SNESSolve           500 1.0 3.2570e+02 1.0 1.18e+10 1.0 0.0e+00 0.0e+00 8.7e+05 100 100 0  0 100 100 100 0  0 100  144     202  31947 7.82e+02  63363 1.44e+03  82
> SNESSetUp             1 1.0 6.0082e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0    0  0  0  0  0      0       0      0 0.00e+00      0 0.00e+00   0
> SNESFunctionEval    500 1.0 3.9826e+01 1.0 3.60e+08 1.0 0.0e+00 0.0e+00 5.0e+02 12  3  0  0  0   12  3  0  0  0     36      13      0 0.00e+00   1000 2.48e+01   0
> SNESJacobianEval    500 1.0 4.8200e+01 1.0 5.97e+08 1.0 0.0e+00 0.0e+00 2.0e+03 15  5  0  0  0   15  5  0  0  0     50       0   1000 7.77e+01    500 1.24e+01   0
> DMPlexResidualFE    500 1.0 3.6923e+01 1.1 3.56e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10  3  0  0  0   10  3  0  0  0     39       0      0 0.00e+00    500 1.24e+01   0
> DMPlexJacobianFE    500 1.0 4.6013e+01 1.0 5.95e+08 1.0 0.0e+00 0.0e+00 2.0e+03 14  5  0  0  0   14  5  0  0  0     52       0   1000 7.77e+01      0 0.00e+00   0
> MatSOR            30947 1.0 3.1254e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 10  0  0  0    1 10  0  0  0   1542       0      0 0.00e+00  61863 1.41e+03   0
> MatAssemblyBegin    511 1.0 5.3428e+00 256.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+03  1  0  0  0  0    1  0  0  0  0      0       0      0 0.00e+00      0 0.00e+00   0
> MatAssemblyEnd      511 1.0 4.3440e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.1e+01  0  0  0  0  0    0  0  0  0  0      0       0   1002 7.80e+01      0 0.00e+00   0
> MatCUSPARSCopyTo   1002 1.0 3.6557e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0    0  0  0  0  0      0       0   1002 7.80e+01      0 0.00e+00   0
> VecMDot           29930 1.0 3.7843e+01 1.0 2.62e+09 1.0 0.0e+00 0.0e+00 6.0e+04 12 22  0  0  7   12 22  0  0  7    277    3236  29930 6.81e+02      0 0.00e+00 100
> VecNorm           31447 1.0 2.1164e+01 1.4 1.79e+08 1.0 0.0e+00 0.0e+00 6.3e+04  5  2  0  0  7    5  2  0  0  7     34      55   1017 2.31e+01      0 0.00e+00 100
> VecNormalize      30947 1.0 2.3957e+01 1.1 2.65e+08 1.0 0.0e+00 0.0e+00 6.2e+04  7  2  0  0  7    7  2  0  0  7     44      51   1017 2.31e+01      0 0.00e+00 100
> VecCUDACopyTo     30947 1.0 7.8866e+00 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0    2  0  0  0  0      0       0  30947 7.04e+02      0 0.00e+00   0
> VecCUDACopyFrom   63363 1.0 1.0873e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0    0  0  0  0  0      0       0      0 0.00e+00  63363 1.44e+03   0
> KSPSetUp            500 1.0 2.2737e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00  0  0  0  0  0    0  0  0  0  0      0       0      0 0.00e+00      0 0.00e+00   0
> KSPSolve            500 1.0 2.3687e+02 1.0 1.08e+10 1.0 0.0e+00 0.0e+00 8.6e+05 72 92  0  0 99   73 92  0  0 99    182     202  30947 7.04e+02  61863 1.41e+03  89
> KSPGMRESOrthog    29930 1.0 1.8920e+02 1.0 7.87e+09 1.0 0.0e+00 0.0e+00 6.4e+05 58 67  0  0 74   58 67  0  0 74    166     209  29930 6.81e+02      0 0.00e+00 100
> PCApply           30947 1.0 3.1555e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 10  0  0  0    1 10  0  0  0   1527       0      0 0.00e+00  61863 1.41e+03   0
>
> Thanks!
> Chonglin