On Sep 24, 2020, at 1:11 PM, Barry Smith <bsm...@petsc.dev> wrote:



On Sep 24, 2020, at 11:48 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:

Thanks Mark and Barry!

A quick try of "-pc_type jacobi" did reduce the "CpuToGpu" and "GpuToCpu" counts, although "-pc_type gamg" (whose counts did not decrease) solves the problem faster. That may not mean much, since the problem size is too small; also, the function "DMPlexCreateFromCellListParallelPetsc()" is slow for large problem sizes, which prevents me from running larger problems (a separate issue).

Would this "CpuToGpu" and "GpuToCpu" data transfer contribute a significant amount of time for a realistically sized problem, say, for example, a linear problem with ~1-2 million DOFs?

   It depends on how often the copies are done.

   With GAMG, once the preconditioner is built the entire linear solve can run on the GPU, and Mark has some good speedups of the linear solve using GAMG on the GPU instead of the CPU on Summit.

   The speedup of the entire simulation will depend on the relative cost of the finite element matrix assembly vs the linear solver time, and Amdahl's law kicks in: for example, if the finite element assembly takes 50 percent of the time, then even if the linear solve takes zero time one can only get a speedup of two, which is not much.
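
   In Amdahl's-law terms (just restating that bound with explicit notation): if a fraction f of the runtime is in the linear solve and that part is sped up by a factor s, the overall speedup is roughly

      S = 1 / ((1 - f) + f/s)

   so with f = 0.5 the best possible result, even as s goes to infinity, is S = 2.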


Thanks for the detailed explanation, Barry!

Mark: could you share the results of GAMG on GPU vs CPU on Summit, or point me to where I could see them? (The actual code showing how you did this would be even better, as a learning opportunity for me.) Thanks!


Also, is there any plan to have the SNES and DMPlex code run on GPU?

  Basically, the finite element computation for the nonlinear function and its Jacobian needs to run on the GPU; this is a big project that we've barely begun thinking about. If this is something you are interested in, it would be fantastic if you could take a look at it.

I see. I will think about this, discuss internally and get back to you if I can!

Thanks!
Chonglin


  Barry




Thanks!
Chonglin

On Sep 24, 2020, at 12:17 PM, Barry Smith <bsm...@petsc.dev> wrote:


   MatSOR() runs on the CPU; this causes a copy to the CPU for each application of MatSOR() and then a copy back to the GPU for the next step.

   You can try, for example, -pc_type jacobi; better yet, use PCGAMG if it is amenable to your problem.
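
   For example, a command line along these lines (the executable name and rank count are only placeholders) keeps the Vec/Mat on the GPU and selects the preconditioner at run time:

      mpirun -n 4 ./your_app -dm_vec_type cuda -dm_mat_type aijcusparse \
          -snes_type ksponly -ksp_type gmres -pc_type gamg -log_view

   Substituting -pc_type jacobi lets you compare the CpuToGpu/GpuToCpu counts in the -log_view output.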

   Also, the problem is way too small for a GPU.

  There will be copies between the GPU and CPU for each SNES iteration, since the DMPlex code does not run on GPUs.

   Barry



On Sep 24, 2020, at 10:08 AM, Zhang, Chonglin <zhang...@rpi.edu> wrote:

Dear PETSc Users,

I have some questions regarding proper GPU usage. I would like to know the proper way to:
(1) solve a linear equation with SNES, using the GPU in PETSc: what syntax/arguments should I be using?
(2) avoid/reduce the "CpuToGpu count" and "GpuToCpu count" data transfers shown in the PETSc log file, when using CUDA-aware MPI.


Details of what I am doing now and my observations are below:

System and compilers used:
(1) RPI's AiMOS computer (node-wise, it is the same as Summit);
(2) using GCC 7.4.0 and Spectrum-MPI 10.3.

I am doing the following to solve the linear Poisson equation with the SNES interface, under DMPlex (a minimal sketch of this setup follows the list):
(1) using DMPlex to set up the unstructured mesh;
(2) using the DM to create the vector and matrix;
(3) using the SNES interface to solve the linear Poisson equation, with "-snes_type ksponly";
(4) using "-dm_vec_type cuda" and "-dm_mat_type aijcusparse" to use GPU vectors and matrices, as suggested on this webpage: https://www.mcs.anl.gov/petsc/features/gpus.html
(5) using "-use_gpu_aware_mpi" with PETSc, and using `mpirun -gpu` to enable GPU-Direct (similar to srun --smpiargs="-gpu" on Summit): https://secure.cci.rpi.edu/wiki/Slurm/#gpu-direct; https://www.olcf.ornl.gov/wp-content/uploads/2018/11/multi-gpu-workshop.pdf
(6) using "-options_left" to check and make sure all the arguments are accepted and used by PETSc;
(7) after problem setup, running "SNESSolve()" multiple times to solve the linear problem and observing the log produced with "-log_view".
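
For reference, here is a minimal sketch of that setup in C (untested and only an outline: the mesh construction, PetscFE discretization, and residual/Jacobian callbacks from my actual code are elided as comments):

   #include <petscsnes.h>
   #include <petscdmplex.h>

   int main(int argc, char **argv)
   {
     DM             dm;
     SNES           snes;
     Vec            u;
     PetscErrorCode ierr;

     ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
     /* In the real code the mesh comes from DMPlexCreateFromCellListParallelPetsc();
        the generic DMCreate/DMSetType calls below only stand in for that step. */
     ierr = DMCreate(PETSC_COMM_WORLD, &dm);CHKERRQ(ierr);
     ierr = DMSetType(dm, DMPLEX);CHKERRQ(ierr);
     ierr = DMSetFromOptions(dm);CHKERRQ(ierr);      /* honors -dm_vec_type cuda, -dm_mat_type aijcusparse */
     /* ... attach the PetscFE discretization and the residual/Jacobian callbacks here ... */
     ierr = DMCreateGlobalVector(dm, &u);CHKERRQ(ierr);

     ierr = SNESCreate(PETSC_COMM_WORLD, &snes);CHKERRQ(ierr);
     ierr = SNESSetDM(snes, dm);CHKERRQ(ierr);
     ierr = SNESSetFromOptions(snes);CHKERRQ(ierr);  /* honors -snes_type ksponly, -ksp_type ..., -pc_type ... */
     ierr = SNESSolve(snes, NULL, u);CHKERRQ(ierr);  /* called repeatedly in the timing runs below */

     ierr = VecDestroy(&u);CHKERRQ(ierr);
     ierr = SNESDestroy(&snes);CHKERRQ(ierr);
     ierr = DMDestroy(&dm);CHKERRQ(ierr);
     ierr = PetscFinalize();
     return ierr;
   }

With everything taken from the options database, the choices in (3)-(6) above are passed on the mpirun command line together with "-log_view".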

I noticed that if I run "SNESSolve()" 500 times instead of 50 times, the "CpuToGpu count" and/or "GpuToCpu count" increased by roughly a factor of 10 for some of the operations: SNESSolve, MatSOR, VecMDot, VecCUDACopyTo, VecCUDACopyFrom, MatCUSPARSCopyTo. See below for a truncated log corresponding to running SNESSolve() 500 times:


Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided        510 1.0 4.9205e-03 1.1 0.00e+00 0.0 3.5e+01 4.0e+00 1.0e+03  0  0  0  0  0   0  0 21  0  0     0       0      0 0.00e+00    0 0.00e+00  0
BuildTwoSidedF       501 1.0 1.0199e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+03  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
SNESSolve            500 1.0 3.2570e+02 1.0 1.18e+10 1.0 0.0e+00 0.0e+00 8.7e+05100100  0  0100 100100  0  0100   144     202   31947 7.82e+02 63363 1.44e+03 82
SNESSetUp              1 1.0 6.0082e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
SNESFunctionEval     500 1.0 3.9826e+01 1.0 3.60e+08 1.0 0.0e+00 0.0e+00 5.0e+02 12  3  0  0  0  12  3  0  0  0    36      13      0 0.00e+00 1000 2.48e+01  0
SNESJacobianEval     500 1.0 4.8200e+01 1.0 5.97e+08 1.0 0.0e+00 0.0e+00 2.0e+03 15  5  0  0  0  15  5  0  0  0    50       0   1000 7.77e+01  500 1.24e+01  0
DMPlexResidualFE     500 1.0 3.6923e+01 1.1 3.56e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10  3  0  0  0  10  3  0  0  0    39       0      0 0.00e+00  500 1.24e+01  0
DMPlexJacobianFE     500 1.0 4.6013e+01 1.0 5.95e+08 1.0 0.0e+00 0.0e+00 2.0e+03 14  5  0  0  0  14  5  0  0  0    52       0   1000 7.77e+01    0 0.00e+00  0
MatSOR             30947 1.0 3.1254e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 10  0  0  0   1 10  0  0  0  1542       0      0 0.00e+00 61863 1.41e+03  0
MatAssemblyBegin     511 1.0 5.3428e+00256.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+03  1  0  0  0  0   1  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyEnd       511 1.0 4.3440e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.1e+01  0  0  0  0  0   0  0  0  0  0     0       0   1002 7.80e+01    0 0.00e+00  0
MatCUSPARSCopyTo    1002 1.0 3.6557e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0   1002 7.80e+01    0 0.00e+00  0
VecMDot            29930 1.0 3.7843e+01 1.0 2.62e+09 1.0 0.0e+00 0.0e+00 6.0e+04 12 22  0  0  7  12 22  0  0  7   277    3236   29930 6.81e+02    0 0.00e+00 100
VecNorm            31447 1.0 2.1164e+01 1.4 1.79e+08 1.0 0.0e+00 0.0e+00 6.3e+04  5  2  0  0  7   5  2  0  0  7    34      55   1017 2.31e+01    0 0.00e+00 100
VecNormalize       30947 1.0 2.3957e+01 1.1 2.65e+08 1.0 0.0e+00 0.0e+00 6.2e+04  7  2  0  0  7   7  2  0  0  7    44      51   1017 2.31e+01    0 0.00e+00 100
VecCUDACopyTo      30947 1.0 7.8866e+00 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0       0   30947 7.04e+02    0 0.00e+00  0
VecCUDACopyFrom    63363 1.0 1.0873e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00 63363 1.44e+03  0
KSPSetUp             500 1.0 2.2737e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
KSPSolve             500 1.0 2.3687e+02 1.0 1.08e+10 1.0 0.0e+00 0.0e+00 8.6e+05 72 92  0  0 99  73 92  0  0 99   182     202   30947 7.04e+02 61863 1.41e+03 89
KSPGMRESOrthog     29930 1.0 1.8920e+02 1.0 7.87e+09 1.0 0.0e+00 0.0e+00 6.4e+05 58 67  0  0 74  58 67  0  0 74   166     209   29930 6.81e+02    0 0.00e+00 100
PCApply            30947 1.0 3.1555e+00 1.1 1.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 10  0  0  0   1 10  0  0  0  1527       0      0 0.00e+00 61863 1.41e+03  0


Thanks!
Chonglin



