I thought these were the Laplacian (Poisson and Ampere's law). Anyway, the coarse grids are very messed up, or at least the eigen estimates are. A bad QR solver, used in GAMG's coarse grid construction, could do that. I've never seen that happen before, but it would explain this.
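A quick way to sanity-check those estimates outside of GAMG is to compute extreme singular values on the operator directly. A minimal sketch, assuming a level matrix A and work vectors b, x are already assembled (these names are illustrative, not from the thread):

    /* Cross-check GAMG's cached spectra with a standalone GMRES estimate. */
    KSP            ksp;
    PetscReal      emax, emin;
    PetscErrorCode ierr;

    ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
    ierr = KSPSetType(ksp, KSPGMRES);CHKERRQ(ierr);               /* robust for nonsymmetric A */
    ierr = KSPSetComputeSingularValues(ksp, PETSC_TRUE);CHKERRQ(ierr);
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
    ierr = KSPComputeExtremeSingularValues(ksp, &emax, &emin);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "emax %g  emin %g\n", (double)emax, (double)emin);CHKERRQ(ierr);
    ierr = KSPDestroy(&ksp);CHKERRQ(ierr);

If these disagree badly with the "cache spectra" values in the -info log below, the estimator (or the coarse operator it runs on) is the problem.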
On Tue, Aug 18, 2020 at 3:18 AM nicola varini <nicola.var...@gmail.com> wrote:

> Dear Mark, the matrices are not symmetric and not positive definite. I did try to add:
>   -ampere_mg_levels_esteig_ksp_monitor_singular_value
>   -ampere_mg_levels_esteig_ksp_max_it 50
>   -ampere_mg_levels_esteig_ksp_type gmres
> but it still fails to converge.
> For the time being it seems that hypre on CPU is the safest choice, although it is surely worth experimenting with Stefano's branch.
>
> Thanks,
>
> Nicola
>
> Il giorno lun 17 ago 2020 alle ore 18:10 Mark Adams <mfad...@lbl.gov> ha scritto:
>
>> The eigen estimates are either very bad or the coarse grids have a problem. Everything looks fine other than these bad estimates that are > 2.
>>
>> * Are these matrices not symmetric? Maybe from BCs. That is not usually a problem, just checking.
>>
>> * Are these stretched grids? If not you might try:
>>     -ampere_pc_gamg_square_graph 10
>>
>> * GMRES is not a good estimator when you have SPD matrices, but it is robust. You might try:
>>     -ampere_mg_levels_esteig_ksp_monitor_singular_value
>>     -ampere_mg_levels_esteig_ksp_max_it 50
>>     -ampere_mg_levels_esteig_ksp_type gmres
>>
>> * And why are you using -ampere_ksp_type dgmres? If your problems are SPD then CG is great.
>>
>> Mark
>>
>> On Mon, Aug 17, 2020 at 10:33 AM nicola varini <nicola.var...@gmail.com> wrote:
>>
>>> Hi Mark, this is the output of grep GAMG after I ran with -info:
>>> =======
>>> [0] PCSetUp_GAMG(): level 0) N=582736, n data rows=1, n data cols=1, nnz/row (ave)=9, np=12
>>> [0] PCGAMGFilterGraph(): 97.9676% nnz after filtering, with threshold 0., 8.95768 nnz ave. (N=582736)
>>> [0] PCGAMGCoarsen_AGG(): Square Graph on level 1 of 1 to square
>>> [0] PCGAMGProlongator_AGG(): New grid 38934 nodes
>>> [0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=2.101683e+00 min=4.341777e-03 PC=jacobi
>>> [0] PCGAMGOptProlongator_AGG(): Smooth P0: level 0, cache spectra 0.00434178 2.10168
>>> [0] PCSetUp_GAMG(): 1) N=38934, n data cols=1, nnz/row (ave)=18, 12 active pes
>>> [0] PCGAMGFilterGraph(): 97.024% nnz after filtering, with threshold 0., 17.9774 nnz ave. (N=38934)
>>> [0] PCGAMGProlongator_AGG(): New grid 4459 nodes
>>> [0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=4.521607e+01 min=5.854294e-01 PC=jacobi
>>> [0] PCGAMGOptProlongator_AGG(): Smooth P0: level 1, cache spectra 0.585429 45.2161
>>> [0] PCSetUp_GAMG(): 2) N=4459, n data cols=1, nnz/row (ave)=29, 12 active pes
>>> [0] PCGAMGFilterGraph(): 99.6422% nnz after filtering, with threshold 0., 27.5481 nnz ave. (N=4459)
>>> [0] PCGAMGProlongator_AGG(): New grid 345 nodes
>>> [0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=1.394069e+01 min=1.086973e-01 PC=jacobi
>>> [0] PCGAMGOptProlongator_AGG(): Smooth P0: level 2, cache spectra 0.108697 13.9407
>>> [0] PCGAMGCreateLevel_GAMG(): Number of equations (loc) 29 with simple aggregation
>>> [0] PCSetUp_GAMG(): 3) N=345, n data cols=1, nnz/row (ave)=31, 6 active pes
>>> [0] PCGAMGFilterGraph(): 99.6292% nnz after filtering, with threshold 0., 26.9667 nnz ave. (N=345)
>>> [0] PCGAMGProlongator_AGG(): New grid 26 nodes
>>> [0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=1.463593e+02 min=1.469384e-01 PC=jacobi
>>> [0] PCGAMGOptProlongator_AGG(): Smooth P0: level 3, cache spectra 0.146938 146.359
>>> [0] PCGAMGCreateLevel_GAMG(): Number of equations (loc) 5 with simple aggregation
>>> [0] PCSetUp_GAMG(): 4) N=26, n data cols=1, nnz/row (ave)=16, 1 active pes
>>> [0] PCSetUp_GAMG(): 5 levels, grid complexity = 1.16304
>>> PCGAMGGraph_AGG 4 1.0 8.4114e-02 1.0 1.02e+06 1.0 3.8e+02 1.3e+03 4.0e+01 0 0 0 0 0 0 0 0 0 0 145
>>> PCGAMGCoarse_AGG 4 1.0 3.2107e-01 1.0 9.43e+06 1.0 7.3e+02 1.1e+04 3.5e+01 0 0 0 0 0 0 0 0 0 0 351
>>> PCGAMGProl_AGG 4 1.0 2.8825e-02 1.0 0.00e+00 0.0 3.5e+02 2.8e+03 6.4e+01 0 0 0 0 0 0 0 0 0 0 0
>>> PCGAMGPOpt_AGG 4 1.0 1.1570e-01 1.0 2.61e+07 1.0 1.2e+03 2.6e+03 1.6e+02 0 0 0 0 1 0 0 0 0 1 2692
>>> GAMG: createProl 4 1.0 5.5680e-01 1.0 3.64e+07 1.0 2.7e+03 4.6e+03 3.0e+02 0 0 0 0 1 0 0 0 0 1 784
>>> GAMG: partLevel 4 1.0 1.1628e-01 1.0 5.90e+06 1.0 1.1e+03 3.0e+03 1.6e+02 0 0 0 0 1 0 0 0 0 1 604
>>> ======
>>> Nicola
>>>
>>> Il giorno lun 17 ago 2020 alle ore 15:40 Mark Adams <mfad...@lbl.gov> ha scritto:
>>>
>>>> On Mon, Aug 17, 2020 at 9:24 AM nicola varini <nicola.var...@gmail.com> wrote:
>>>>
>>>>> Hi Mark, I confirm that hypre with boomeramg is working fine and is pretty fast.
>>>>
>>>> Good. You can send me the -info (grep GAMG) output and I will try to see what is going on.
>>>>
>>>>> However, none of the GAMG options works. Has anyone ever succeeded in using hypre with PETSc on GPU?
>>>>
>>>> We have gotten hypre to run on GPUs, but it has been fragile. The performance has been marginal (due to the use of USM, apparently), but it is being worked on by the hypre team.
>>>>
>>>> The CUDA tools are changing fast, and I am guessing this is a different version than what we have tested, perhaps. Maybe someone else can help with this, but I know we use CUDA 10.2 and you are using CUDA tools 10.1.
>>>>
>>>> And you do want to use the most up-to-date PETSc.
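For reference, capturing that output is just a matter of filtering stdout/stderr; a sketch of the job line, with the launcher and the executable name assumed from the logs later in the thread:

    srun -n 12 ./gbs_daint_gpu_gnu -info 2>&1 | grep GAMG > gamg-info.log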
>>>>> I did manage to compile hypre on GPU, but I get the following error:
>>>>> =======
>>>>> CC gpuhypre/obj/vec/vec/impls/hypre/vhyp.o
>>>>> In file included from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:22,
>>>>>                  from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
>>>>>                  from /users/nvarini/hypre/include/_hypre_utilities.h:1129,
>>>>>                  from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
>>>>>                  from /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
>>>>>                  from /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
>>>>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:83:1: error: unknown type name 'namespace'
>>>>>  namespace thrust
>>>>>  ^~~~~~~~~
>>>>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:84:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
>>>>>  {
>>>>>  ^
>>>>> In file included from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/config.h:28,
>>>>>                  from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:23,
>>>>>                  from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
>>>>>                  from /users/nvarini/hypre/include/_hypre_utilities.h:1129,
>>>>>                  from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
>>>>>                  from /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
>>>>>                  from /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
>>>>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/cpp_compatibility.h:21:10: fatal error: cstddef: No such file or directory
>>>>>  #include <cstddef>
>>>>>           ^~~~~~~~~
>>>>> compilation terminated.
>>>>> =======
>>>>> Nicola
>>>>>
>>>>> Il giorno ven 14 ago 2020 alle ore 20:13 Mark Adams <mfad...@lbl.gov> ha scritto:
>>>>>
>>>>>> You can try hypre. If that fails, then there is a problem with your system.
>>>>>>
>>>>>> And you can run with -info and grep on GAMG and send the output, and I can see if I see anything funny.
>>>>>>
>>>>>> If this is just a Laplacian with a stable discretization and not crazy material parameters, then stretched grids are about the only thing that can hurt the solver.
>>>>>>
>>>>>> Do both of your solves fail in a similar way?
>>>>>>
>>>>>> On the CPU you can try this with large subdomains, preferably (in serial ideally):
>>>>>>   -ampere_mg_levels_ksp_type richardson
>>>>>>   -ampere_mg_levels_pc_type sor
>>>>>>
>>>>>> And check that there are no unused options with -options_left. GAMG can fail with bad eigen estimates, but these parameters look fine.
>>>>>>
>>>>>> On Fri, Aug 14, 2020 at 5:01 AM nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>
>>>>>>> Dear Barry, yes it gives the same problems.
>>>>>>>
>>>>>>> Il giorno gio 13 ago 2020 alle ore 23:22 Barry Smith <bsm...@petsc.dev> ha scritto:
>>>>>>>
>>>>>>>> Does the same thing work (with GAMG) if you run the same problem on the same machine, with the same number of MPI ranks, but make a new PETSC_ARCH that does NOT use the GPUs?
>>>>>>>>
>>>>>>>> Ideally one gets almost identical convergence with CPUs or GPUs (same problem, same machine), but a bug or numerical change "might" affect this.
>>>>>>>>
>>>>>>>> Barry
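A lighter-weight variant of Barry's test is to override the GPU types from the options database without rebuilding; a sketch, using the prefixes from the petscrc later in the thread:

    -ampere_dm_mat_type aij
    -ampere_dm_vec_type standard
    -poisson_dm_mat_type aij
    -poisson_dm_vec_type standard

A separate non-CUDA PETSC_ARCH, as Barry suggests, is still the cleaner experiment, since it also rules out build-level issues.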
>>>>>>>> On Aug 13, 2020, at 10:28 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Dear Barry, you are right. The Cray argument checking is incorrect. It does work with --download-fblaslapack.
>>>>>>>> However, it does fail to converge. Is there anything obviously wrong with my petscrc? Anything else am I missing?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Il giorno gio 13 ago 2020 alle ore 03:17 Barry Smith <bsm...@petsc.dev> ha scritto:
>>>>>>>>
>>>>>>>>> The QR is always done on the CPU; we don't have generic calls to blas/lapack go to the GPU currently.
>>>>>>>>>
>>>>>>>>> The error message is:
>>>>>>>>>
>>>>>>>>>   On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>>>>>>>>>
>>>>>>>>> Argument 7 is &LWORK, which is defined by
>>>>>>>>>
>>>>>>>>>   PetscBLASInt LWORK=N*bs;
>>>>>>>>>
>>>>>>>>> and N=nSAvec is the column block size of new P.
>>>>>>>>>
>>>>>>>>> Presumably this is a huge run with many processes, so using the debugger is not practical?
>>>>>>>>>
>>>>>>>>> We need to see what these variables are: N, bs, nSAvec. Perhaps nSAvec is zero, which could easily upset LAPACK.
>>>>>>>>>
>>>>>>>>> The crudest thing would be to just put a print statement in the code before the LAPACK call, or, if it is called many times, add an error check that generates an error if any of these three values is 0 (or negative).
>>>>>>>>>
>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>>> It is not impossible that the Cray argument checking is incorrect and the value passed in is fine. You can check this by using --download-fblaslapack and seeing if the same or some other error comes up.
>>>>>>>>>
>>>>>>>>> On Aug 12, 2020, at 7:19 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>
>>>>>>>>> Can you reproduce this on the CPU? The QR factorization seems to be failing. That could be from bad data or a bad GPU QR.
>>>>>>>>>
>>>>>>>>> On Wed, Aug 12, 2020 at 4:19 AM nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Dear all, following the suggestions I resubmitted the simulation with the petscrc below. However, I get the following error:
>>>>>>>>>> ========
>>>>>>>>>> 7362 [592]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>>>>> 7363 [339]PETSC ERROR: Petsc has generated inconsistent data
>>>>>>>>>> 7364 [339]PETSC ERROR: xGEQRF error
>>>>>>>>>> 7365 [339]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>>>>>>>> 7366 [339]PETSC ERROR: Petsc Release Version 3.13.3, Jul 01, 2020
>>>>>>>>>> 7367 [339]PETSC ERROR: /users/nvarini/gbs_test_nicola/bin/gbs_daint_gpu_gnu on a named nid05083 by nvarini Wed Aug 12 10:06:15 2020
>>>>>>>>>> 7368 [339]PETSC ERROR: Configure options --with-cc=cc --with-fc=ftn --known-mpi-shared-libraries=1 --known-mpi-c-double-complex=1 --known-mpi-int64_t=1 --known-mpi-long-double=1 --with-batch=1 --known-64-bit-blas-indices=0 --LIBS=-lstdc++ --with-cxxlib-autodetect=0 --with-scalapack=1 --with-cxx=CC --with-debugging=0 --with-hypre-dir=/opt/cray/pe/tpsl/19.06.1/GNU/8.2/haswell --prefix=/scratch/snx3000/nvarini/petsc3.13.3-gpu --with-cuda=1 --with-cuda-c=nvcc --with-cxxlib-autodetect=0 --COPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include --with-cxx=CC --CXXOPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>>>>>>>>>> 7369 [592]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>>>>> 7370 [592]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>>>>> 7371 [592]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>>>>>>> 7372 [592]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>> 7373 [592]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>> 7374 [316]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>>>>> 7375 [339]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>>>>> 7376 [339]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>>>>> 7377 [339]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>>>>> 7378 [339]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>>>>>>> 7379 [339]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>> 7380 [592]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>> 7381 [339]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>> 7382 [339]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>> 7383 On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>>>>>>>>>> 7384 [160]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>>>>> ========
>>>>>>>>>>
>>>>>>>>>> I did try other pc_gamg_type values, but they fail as well.
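A minimal sketch of the guard Barry describes, to be placed just before the LAPACK QR call in formProl0() in agg.c; the variable names N, bs, and nSAvec are the ones he quotes, and the error macro is the PETSc 3.13 style:

    /* Per Barry's suggestion: fail loudly if any size feeding LWORK = N*bs is bad. */
    if (N <= 0 || bs <= 0 || nSAvec <= 0) {
      SETERRQ3(PETSC_COMM_SELF, PETSC_ERR_PLIB,
               "Bad sizes before xGEQRF: N=%D bs=%D nSAvec=%D",
               (PetscInt)N, (PetscInt)bs, (PetscInt)nSAvec);
    }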
>>>>>>>>>> #PETSc Option Table entries:
>>>>>>>>>> -ampere_dm_mat_type aijcusparse
>>>>>>>>>> -ampere_dm_vec_type cuda
>>>>>>>>>> -ampere_ksp_atol 1e-15
>>>>>>>>>> -ampere_ksp_initial_guess_nonzero yes
>>>>>>>>>> -ampere_ksp_reuse_preconditioner yes
>>>>>>>>>> -ampere_ksp_rtol 1e-7
>>>>>>>>>> -ampere_ksp_type dgmres
>>>>>>>>>> -ampere_mg_levels_esteig_ksp_max_it 10
>>>>>>>>>> -ampere_mg_levels_esteig_ksp_type cg
>>>>>>>>>> -ampere_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>>>>>>> -ampere_mg_levels_ksp_type chebyshev
>>>>>>>>>> -ampere_mg_levels_pc_type jacobi
>>>>>>>>>> -ampere_pc_gamg_agg_nsmooths 1
>>>>>>>>>> -ampere_pc_gamg_coarse_eq_limit 10
>>>>>>>>>> -ampere_pc_gamg_reuse_interpolation true
>>>>>>>>>> -ampere_pc_gamg_square_graph 1
>>>>>>>>>> -ampere_pc_gamg_threshold 0.05
>>>>>>>>>> -ampere_pc_gamg_threshold_scale .0
>>>>>>>>>> -ampere_pc_gamg_type agg
>>>>>>>>>> -ampere_pc_type gamg
>>>>>>>>>> -dm_mat_type aijcusparse
>>>>>>>>>> -dm_vec_type cuda
>>>>>>>>>> -log_view
>>>>>>>>>> -poisson_dm_mat_type aijcusparse
>>>>>>>>>> -poisson_dm_vec_type cuda
>>>>>>>>>> -poisson_ksp_atol 1e-15
>>>>>>>>>> -poisson_ksp_initial_guess_nonzero yes
>>>>>>>>>> -poisson_ksp_reuse_preconditioner yes
>>>>>>>>>> -poisson_ksp_rtol 1e-7
>>>>>>>>>> -poisson_ksp_type dgmres
>>>>>>>>>> -poisson_log_view
>>>>>>>>>> -poisson_mg_levels_esteig_ksp_max_it 10
>>>>>>>>>> -poisson_mg_levels_esteig_ksp_type cg
>>>>>>>>>> -poisson_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>>>>>>> -poisson_mg_levels_ksp_max_it 1
>>>>>>>>>> -poisson_mg_levels_ksp_type chebyshev
>>>>>>>>>> -poisson_mg_levels_pc_type jacobi
>>>>>>>>>> -poisson_pc_gamg_agg_nsmooths 1
>>>>>>>>>> -poisson_pc_gamg_coarse_eq_limit 10
>>>>>>>>>> -poisson_pc_gamg_reuse_interpolation true
>>>>>>>>>> -poisson_pc_gamg_square_graph 1
>>>>>>>>>> -poisson_pc_gamg_threshold 0.05
>>>>>>>>>> -poisson_pc_gamg_threshold_scale .0
>>>>>>>>>> -poisson_pc_gamg_type agg
>>>>>>>>>> -poisson_pc_type gamg
>>>>>>>>>> -use_mat_nearnullspace true
>>>>>>>>>> #End of PETSc Option Table entries
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Nicola
>>>>>>>>>>
>>>>>>>>>> Il giorno mar 4 ago 2020 alle ore 17:57 Mark Adams <mfad...@lbl.gov> ha scritto:
>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 4, 2020 at 6:35 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Nicola,
>>>>>>>>>>>>
>>>>>>>>>>>> You are actually not using the GPU properly, since you use HYPRE preconditioning, which is CPU only. One of your solvers is actually slower on "GPU".
>>>>>>>>>>>> For a full AMG GPU solve, you can use PCGAMG, with Chebyshev smoothers and with Jacobi preconditioning. Mark can help you out with the specific command line options. When it works properly, everything related to PC application is offloaded to the GPU, and you should expect to get the well-known and branded 10x (maybe more) speedup one is expecting from GPUs during KSPSolve.
>>>>>>>>>>>
>>>>>>>>>>> The speedup depends on the machine, but on SUMMIT, using enough CPUs to saturate the memory bus vs all 6 GPUs, the speedup is a function of the problem subdomain size. I saw 10x at about 100K equations/process.
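As an aside on how the option table above is consumed: the -poisson_ and -ampere_ prefixes bind to two independent solver objects, roughly as in this sketch (error checking omitted):

    /* Two prefixed KSPs pick up the -poisson_* and -ampere_* options respectively. */
    KSP kspPoisson, kspAmpere;
    KSPCreate(PETSC_COMM_WORLD, &kspPoisson);
    KSPSetOptionsPrefix(kspPoisson, "poisson_");
    KSPSetFromOptions(kspPoisson);  /* reads -poisson_pc_type gamg, -poisson_ksp_type, ... */
    KSPCreate(PETSC_COMM_WORLD, &kspAmpere);
    KSPSetOptionsPrefix(kspAmpere, "ampere_");
    KSPSetFromOptions(kspAmpere);   /* reads -ampere_pc_type gamg, -ampere_ksp_type, ... */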
>>>>>>>>>>>> Doing what you want to do is one of the last optimization steps of an already optimized code before entering production. Yours is not even optimized for proper GPU usage yet.
>>>>>>>>>>>> Also, any specific reason why you are using dgmres and fgmres?
>>>>>>>>>>>>
>>>>>>>>>>>> PETSc has not been designed with multi-threading in mind. You can achieve "overlap" of the two solves by splitting the communicator. But then you need communications to let the two solutions talk to each other.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Stefano
>>>>>>>>>>>>
>>>>>>>>>>>> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Dear all, thanks for your replies. The reason I asked whether it is possible to overlap Poisson and Ampere is that they take roughly the same amount of time. Please find attached the profiling logs for CPU only and GPU only.
>>>>>>>>>>>> Of course it is possible to split the MPI communicator and run each solver on a different subcommunicator; however, this would involve more communication.
>>>>>>>>>>>> Has anyone ever tried to run 2 solvers with hyperthreading?
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> Il giorno dom 2 ago 2020 alle ore 14:09 Mark Adams <mfad...@lbl.gov> ha scritto:
>>>>>>>>>>>>
>>>>>>>>>>>>> I suspect that the Poisson and Ampere's law solves are not coupled. You might be able to duplicate the communicator and use two threads. You would want to configure PETSc with threadsafety and threads, and I think it could/should work, but this mode is never used by anyone.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That said, I would not recommend doing this unless you feel like playing in computer science, as opposed to doing application science. In the best case scenario you get a speedup of 2x. That is a strict upper bound, and you will never come close to it. Your hardware has some balance of CPU to GPU processing rate, and your application has a balance of volume of work between your two solves; those ratios have to match, 1:1, to get close to the 2x speedup. To be concrete, from what little I can guess about your application, let's assume that the cost of each of these two solves is about the same (e.g., Laplacians on your domain; the best case scenario). But GPU machines these days are configured to have roughly 1-10% of their capacity in the CPUs, which gives you an upper bound of about 10% speedup. That is noise. Upshot: unless you configure your hardware to match this problem, and the two solves have the same cost, you will not see anything close to 2x speedup. Your time is better spent elsewhere.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mark
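To put a number on Mark's bound: if the CPUs supply a fraction f of the node's total throughput, overlapping one solve there while the GPU runs the other gives at best

    speedup = 1 / (1 - f)

(assuming perfect overlap and equal-cost solves); with f = 0.1 that is about 1.11x, i.e. the roughly 10% he quotes.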
>>>>>>>>>>>>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <j...@jedbrown.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can use MPI and split the communicator so n-1 ranks create a DMDA for one part of your system and the other rank drives the GPU in the other part. They can all be part of the same coupled system on the full communicator, but PETSc doesn't currently support some ranks having their Vec arrays on GPU and others on host, so you'd be paying host-device transfer costs on each iteration (and that might swamp any performance benefit you would have gotten).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In any case, be sure to think about the execution time of each part. Load balancing with matching time-to-solution for each part can be really hard.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Barry Smith <bsm...@petsc.dev> writes:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Nicola,
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > This is not really viable or practical at this time with PETSc. It is not impossible, but it requires careful coding with threads; another possibility is to use one half of the virtual GPUs for each solve, which is also not trivial. I would recommend first seeing what kind of performance you can get on the GPU for each type of solve and revisiting this idea in the future.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Barry
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Hello, I would like to know if it is possible to overlap CPU and GPU with DMDA.
>>>>>>>>>>>>>> >> I have a machine where each node has 1 P100 + 1 Haswell.
>>>>>>>>>>>>>> >> I have to solve the Poisson and Ampere equations at each time step.
>>>>>>>>>>>>>> >> I'm using a 2D DMDA for each of them. Would it be possible to compute the Poisson and Ampere equations at the same time, one on the CPU and the other on the GPU?
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> <out_gpu> <out_nogpu>
>>>>>>>>> <out_miniapp_f_poisson>
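A minimal sketch of the communicator split Jed and Stefano describe; the choice of giving the last rank to the GPU solve, and the color values, are illustrative assumptions:

    /* Split PETSC_COMM_WORLD so n-1 ranks handle one solve (CPU) and the
       remaining rank drives the GPU for the other; error checking omitted. */
    #include <petscsys.h>

    int main(int argc, char **argv)
    {
      MPI_Comm    subcomm;
      PetscMPIInt rank, size;
      int         color;

      PetscInitialize(&argc, &argv, NULL, NULL);
      MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
      MPI_Comm_size(PETSC_COMM_WORLD, &size);
      color = (rank == size - 1) ? 1 : 0;  /* 1 = GPU (Ampere) rank, 0 = CPU (Poisson) ranks */
      MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &subcomm);
      /* Create each DMDA/KSP on subcomm instead of PETSC_COMM_WORLD, then
         exchange coupling data across PETSC_COMM_WORLD between solves. */
      MPI_Comm_free(&subcomm);
      PetscFinalize();
      return 0;
    }

As Jed notes, the hard part is not the split itself but load-balancing the two parts and paying for the coupling communication.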