Hi Mark, I can confirm that hypre with BoomerAMG is working fine and is pretty fast. However, none of the GAMG options work. Has anyone ever succeeded in using hypre with PETSc on GPU? I did manage to compile hypre on GPU, but I get the following error:

=======
CC gpuhypre/obj/vec/vec/impls/hypre/vhyp.o
In file included from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:22,
                 from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
                 from /users/nvarini/hypre/include/_hypre_utilities.h:1129,
                 from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
                 from /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
                 from /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
/opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:83:1: error: unknown type name 'namespace'
 namespace thrust
 ^~~~~~~~~
/opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:84:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
 {
 ^
In file included from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/config.h:28,
                 from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:23,
                 from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
                 from /users/nvarini/hypre/include/_hypre_utilities.h:1129,
                 from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
                 from /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
                 from /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
/opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/cpp_compatibility.h:21:10: fatal error: cstddef: No such file or directory
 #include <cstddef>
          ^~~~~~~~~
compilation terminated.
=======

Nicola
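The include chain in the log explains the failure: with a CUDA-enabled hypre, _hypre_utilities.h pulls in Thrust, which is a C++-only header library ("namespace", <cstddef>), while PETSc's vhyp.c is compiled as C. A minimal, hypothetical reproduction (file name and include path are illustrative):

/* repro.c -- hypothetical minimal reproduction: a C translation unit
 * including a C++-only Thrust header, as _hypre_utilities.h does when
 * hypre is built with CUDA.
 *
 *   cc repro.c -I/opt/nvidia/cudatoolkit10/<version>/include
 *
 * fails with the same "unknown type name 'namespace'" diagnostic as
 * above; compiling the identical file as C++ succeeds. */
#include <thrust/version.h>   /* declares `namespace thrust { ... }` */

int main(void) { return 0; }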
On Fri, Aug 14, 2020 at 8:13 PM Mark Adams <mfad...@lbl.gov> wrote:

> You can try Hypre. If that fails, then there is a problem with your system.
>
> And you can run with -info and grep on GAMG and send the output, and I can see if I see anything funny.
>
> If this is just a Laplacian with a stable discretization and no crazy material parameters, then stretched grids are about the only thing that can hurt the solver.
>
> Do both of your solves fail in a similar way?
>
> On the CPU you can try this with large subdomains, ideally in serial:
>
> -ampere_mg_levels_ksp_type richardson
> -ampere_mg_levels_pc_type sor
>
> And check that there are no unused options with -options_left. GAMG can fail with bad eigenvalue estimates, but these parameters look fine.
>
> On Fri, Aug 14, 2020 at 5:01 AM nicola varini <nicola.var...@gmail.com> wrote:
>
>> Dear Barry, yes it gives the same problems.
>>
>> On Thu, Aug 13, 2020 at 11:22 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>> Does the same thing work (with GAMG) if you run the same problem on the same machine with the same number of MPI ranks, but make a new PETSC_ARCH that does NOT use the GPUs?
>>>
>>> Barry
>>>
>>> Ideally one gets almost identical convergence with CPUs or GPUs (same problem, same machine), but a bug or a numerical change "might" affect this.
>>>
>>> On Aug 13, 2020, at 10:28 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>>
>>> Dear Barry, you are right: the Cray argument checking is incorrect. It does work with --download-fblaslapack.
>>> However, it does fail to converge. Is there anything obviously wrong with my petscrc? Anything else I am missing?
>>>
>>> Thanks
>>>
>>> On Thu, Aug 13, 2020 at 3:17 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>> The QR is always done on the CPU; we don't currently have generic calls to BLAS/LAPACK go to the GPU.
>>>>
>>>> The error message is:
>>>>
>>>> On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>>>>
>>>> Argument 7 is &LWORK, which is defined by
>>>>
>>>> PetscBLASInt LWORK=N*bs;
>>>>
>>>> and N=nSAvec is the column block size of the new P.
>>>>
>>>> Presumably this is a huge run with many processes, so using the debugger is not practical?
>>>>
>>>> We need to see what these variables are: N, bs, nSAvec. Perhaps nSAvec is zero, which could easily upset LAPACK.
>>>>
>>>> The crudest thing would be to just put a print statement in the code before the LAPACK call, or, if it is called many times, add an error check that generates an error if any of these three values is 0 (or negative).
>>>>
>>>> Barry
>>>>
>>>> It is not impossible that the Cray argument checking is incorrect and the value passed in is fine. You can check this by using --download-fblaslapack and seeing whether the same or some other error comes up.
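For reference, the guard Barry suggests is a one-liner. A minimal sketch, assuming N, bs, and nSAvec are in scope in formProl0() (src/ksp/pc/impls/gamg/agg.c) right before the xGEQRF call, as described above:

/* Hypothetical guard before the LAPACKgeqrf_() call in formProl0():
 * fail loudly instead of handing LAPACK a zero or negative size. */
if (N <= 0 || bs <= 0 || nSAvec <= 0) {
  SETERRQ3(PETSC_COMM_SELF, PETSC_ERR_PLIB,
           "Bad QR sizes: N=%D bs=%D nSAvec=%D",
           (PetscInt)N, (PetscInt)bs, (PetscInt)nSAvec);
}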
>>>> On Aug 12, 2020, at 7:19 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>
>>>> Can you reproduce this on the CPU? The QR factorization seems to be failing. That could be from bad data or a bad GPU QR.
>>>>
>>>> On Wed, Aug 12, 2020 at 4:19 AM nicola varini <nicola.var...@gmail.com> wrote:
>>>>
>>>>> Dear all, following the suggestions I resubmitted the simulation with the petscrc below. However, I do get the following error:
>>>>>
>>>>> ========
>>>>> 7362 [592]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>> 7363 [339]PETSC ERROR: Petsc has generated inconsistent data
>>>>> 7364 [339]PETSC ERROR: xGEQRF error
>>>>> 7365 [339]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>>> 7366 [339]PETSC ERROR: Petsc Release Version 3.13.3, Jul 01, 2020
>>>>> 7367 [339]PETSC ERROR: /users/nvarini/gbs_test_nicola/bin/gbs_daint_gpu_gnu on a named nid05083 by nvarini Wed Aug 12 10:06:15 2020
>>>>> 7368 [339]PETSC ERROR: Configure options --with-cc=cc --with-fc=ftn --known-mpi-shared-libraries=1 --known-mpi-c-double-complex=1 --known-mpi-int64_t=1 --known-mpi-long-double=1 --with-batch=1 --known-64-bit-blas-indices=0 --LIBS=-lstdc++ --with-cxxlib-autodetect=0 --with-scalapack=1 --with-cxx=CC --with-debugging=0 --with-hypre-dir=/opt/cray/pe/tpsl/19.06.1/GNU/8.2/haswell --prefix=/scratch/snx3000/nvarini/petsc3.13.3-gpu --with-cuda=1 --with-cuda-c=nvcc --with-cxxlib-autodetect=0 --COPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include --with-cxx=CC --CXXOPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>>>>> 7369 [592]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>> 7370 [592]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>> 7371 [592]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>> 7372 [592]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7373 [592]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7374 [316]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>> 7375 [339]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>> 7376 [339]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>> 7377 [339]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>> 7378 [339]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>> 7379 [339]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7380 [592]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7381 [339]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7382 [339]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7383 On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>>>>> 7384 [160]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>> ========
>>>>>
>>>>> I did try other pc_gamg_type values, but they fail as well.
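This trace is consistent with Barry's reading: the reference LAPACK signature is dgeqrf(M, N, A, LDA, TAU, WORK, LWORK, INFO), so parameter 7 is LWORK, which must be at least max(1,N) unless it is -1 (a workspace query). A hypothetical stand-alone program that trips the same check with LWORK = 0, which is what LWORK = N*bs would evaluate to if nSAvec were 0:

/* lworkcheck.c -- hypothetical repro; build with: cc lworkcheck.c -llapack */
#include <stdio.h>

/* Reference LAPACK signature: parameter 7 is LWORK. */
extern void dgeqrf_(int *m, int *n, double *a, int *lda, double *tau,
                    double *work, int *lwork, int *info);

int main(void)
{
  int    m = 4, n = 4, lda = 4, lwork = 0, info = 0;
  double a[16] = {0}, tau[4], work[1];

  dgeqrf_(&m, &n, a, &lda, tau, work, &lwork, &info);
  /* The reference XERBLA prints "parameter number 7 had an illegal
   * value" and stops; implementations that return instead set info = -7. */
  printf("info = %d\n", info);
  return 0;
}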
>>>>>
>>>>> #PETSc Option Table entries:
>>>>> -ampere_dm_mat_type aijcusparse
>>>>> -ampere_dm_vec_type cuda
>>>>> -ampere_ksp_atol 1e-15
>>>>> -ampere_ksp_initial_guess_nonzero yes
>>>>> -ampere_ksp_reuse_preconditioner yes
>>>>> -ampere_ksp_rtol 1e-7
>>>>> -ampere_ksp_type dgmres
>>>>> -ampere_mg_levels_esteig_ksp_max_it 10
>>>>> -ampere_mg_levels_esteig_ksp_type cg
>>>>> -ampere_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>> -ampere_mg_levels_ksp_type chebyshev
>>>>> -ampere_mg_levels_pc_type jacobi
>>>>> -ampere_pc_gamg_agg_nsmooths 1
>>>>> -ampere_pc_gamg_coarse_eq_limit 10
>>>>> -ampere_pc_gamg_reuse_interpolation true
>>>>> -ampere_pc_gamg_square_graph 1
>>>>> -ampere_pc_gamg_threshold 0.05
>>>>> -ampere_pc_gamg_threshold_scale .0
>>>>> -ampere_pc_gamg_type agg
>>>>> -ampere_pc_type gamg
>>>>> -dm_mat_type aijcusparse
>>>>> -dm_vec_type cuda
>>>>> -log_view
>>>>> -poisson_dm_mat_type aijcusparse
>>>>> -poisson_dm_vec_type cuda
>>>>> -poisson_ksp_atol 1e-15
>>>>> -poisson_ksp_initial_guess_nonzero yes
>>>>> -poisson_ksp_reuse_preconditioner yes
>>>>> -poisson_ksp_rtol 1e-7
>>>>> -poisson_ksp_type dgmres
>>>>> -poisson_log_view
>>>>> -poisson_mg_levels_esteig_ksp_max_it 10
>>>>> -poisson_mg_levels_esteig_ksp_type cg
>>>>> -poisson_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>> -poisson_mg_levels_ksp_max_it 1
>>>>> -poisson_mg_levels_ksp_type chebyshev
>>>>> -poisson_mg_levels_pc_type jacobi
>>>>> -poisson_pc_gamg_agg_nsmooths 1
>>>>> -poisson_pc_gamg_coarse_eq_limit 10
>>>>> -poisson_pc_gamg_reuse_interpolation true
>>>>> -poisson_pc_gamg_square_graph 1
>>>>> -poisson_pc_gamg_threshold 0.05
>>>>> -poisson_pc_gamg_threshold_scale .0
>>>>> -poisson_pc_gamg_type agg
>>>>> -poisson_pc_type gamg
>>>>> -use_mat_nearnullspace true
>>>>> #End of PETSc Option Table entries
>>>>>
>>>>> Regards,
>>>>>
>>>>> Nicola
>>>>>
>>>>> On Tue, Aug 4, 2020 at 5:57 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>
>>>>>> On Tue, Aug 4, 2020 at 6:35 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>
>>>>>>> Nicola,
>>>>>>>
>>>>>>> You are actually not using the GPU properly, since you use HYPRE preconditioning, which is CPU only. One of your solvers is actually slower on "GPU".
>>>>>>> For a full AMG GPU setup, you can use PCGAMG with Chebyshev smoothers and Jacobi preconditioning. Mark can help you out with the specific command line options. When it works properly, everything related to the PC application is offloaded to the GPU, and you should expect the well-known and branded 10x (maybe more) speedup one expects from GPUs during KSPSolve.
>>>>>>>
>>>>>> The speedup depends on the machine, but on SUMMIT, comparing enough CPUs to saturate the memory bus against all 6 GPUs, the speedup is a function of problem subdomain size. I saw 10x at about 100K equations per process.
>>>>>>
>>>>>>> Doing what you want to do is one of the last optimization steps of an already optimized code before entering production. Yours is not even optimized for proper GPU usage yet.
>>>>>>> Also, any specific reason why you are using dgmres and fgmres?
>>>>>>>
>>>>>>> PETSc has not been designed with multi-threading in mind. You can achieve "overlap" of the two solves by splitting the communicator, but then you need communication to let the two solutions talk to each other.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Stefano
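To make Stefano's recipe concrete, here is a minimal sketch of the GPU-resident pieces of the option table above, set from source with PetscOptionsSetValue() instead of a petscrc. The function name is illustrative, the options are unprefixed (add the -poisson_/-ampere_ prefixes used above as needed), and a CUDA-enabled PETSc build is assumed:

#include <petscsys.h>

/* Sketch: CUDA vectors/matrices plus a GAMG setup whose smoothers
 * (Chebyshev + point Jacobi) run entirely on the GPU. */
static PetscErrorCode SetGpuGamgOptions(void)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscOptionsSetValue(NULL, "-dm_mat_type", "aijcusparse");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-dm_vec_type", "cuda");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-pc_type", "gamg");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-pc_gamg_type", "agg");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-mg_levels_ksp_type", "chebyshev");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-mg_levels_pc_type", "jacobi");CHKERRQ(ierr);
  PetscFunctionReturn(0);
}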
>>>>>>>
>>>>>>> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>>
>>>>>>> Dear all, thanks for your replies. The reason I asked whether it is possible to overlap Poisson and Ampere is that they take roughly the same amount of time. Please find attached the profiling logs for CPU only and GPU only.
>>>>>>> Of course it is possible to split the MPI communicator and run each solver on a different subcommunicator; however, this would involve more communication.
>>>>>>> Has anyone ever tried to run two solvers with hyperthreading?
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Sun, Aug 2, 2020 at 2:09 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>
>>>>>>>> I suspect that the Poisson and Ampere's law solves are not coupled. You might be able to duplicate the communicator and use two threads. You would want to configure PETSc with thread safety and threads, and I think it could/should work, but this mode is never used by anyone.
>>>>>>>>
>>>>>>>> That said, I would not recommend doing this unless you feel like playing in computer science, as opposed to doing application science. In the best-case scenario you get a speedup of 2x. That is a strict upper bound, and you will never come close to it. Your hardware has some balance of CPU to GPU processing rate, and your application has some balance of the volume of work between your two solves. The two ratios have to match, and be 1:1, to get close to a 2x speedup. To be concrete, from what little I can guess about your application, let's assume that the cost of each of these two solves is about the same (e.g., Laplacians on your domain, the best-case scenario). But GPU machines these days are configured to have roughly 1-10% of their capacity in the CPUs, which gives you an upper bound of about 10% speedup. That is noise. Upshot: unless you configure your hardware to match this problem, and the two solves have the same cost, you will not see anything close to a 2x speedup. Your time is better spent elsewhere.
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <j...@jedbrown.org> wrote:
>>>>>>>>
>>>>>>>>> You can use MPI and split the communicator so that n-1 ranks create a DMDA for one part of your system and the other rank drives the GPU in the other part. They can all be part of the same coupled system on the full communicator, but PETSc doesn't currently support some ranks having their Vec arrays on the GPU and others on the host, so you'd be paying host-device transfer costs on each iteration (and that might swamp any performance benefit you would have gotten).
>>>>>>>>>
>>>>>>>>> In any case, be sure to think about the execution time of each part. Load balancing with matching time-to-solution for each part can be really hard.
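A minimal sketch of the split Jed describes, with the n-1/1 division and all names illustrative, and error checking elided:

#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscMPIInt rank, size;
  MPI_Comm    subcomm;
  int         color;

  PetscInitialize(&argc, &argv, NULL, NULL);
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
  MPI_Comm_size(PETSC_COMM_WORLD, &size);
  /* Ranks 0..size-2 solve one subsystem on the CPU; the last rank
   * drives the GPU for the other. */
  color = (rank == size - 1) ? 1 : 0;
  MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &subcomm);
  /* Build a DMDA and KSP on subcomm for the corresponding solve.
   * Per Jed's caveat, coupling the two parts still needs communication
   * on PETSC_COMM_WORLD, and mixed host/device Vec storage across ranks
   * is not supported, so expect host-device transfers each iteration. */
  MPI_Comm_free(&subcomm);
  PetscFinalize();
  return 0;
}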
>>>>>>>>>
>>>>>>>>> Barry Smith <bsm...@petsc.dev> writes:
>>>>>>>>>
>>>>>>>>> > Nicola,
>>>>>>>>> >
>>>>>>>>> > This is really not viable or practical at this time with PETSc. It is not impossible, but it requires careful coding with threads; another possibility is to use one half of the virtual GPUs for each solve, but this is also not trivial. I would recommend first seeing what kind of performance you can get on the GPU for each type of solve, and revisiting this idea in the future.
>>>>>>>>> >
>>>>>>>>> > Barry
>>>>>>>>> >
>>>>>>>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Hello, I would like to know whether it is possible to overlap CPU and GPU with DMDA.
>>>>>>>>> >> I have a machine where each node has 1 P100 + 1 Haswell.
>>>>>>>>> >> I have to solve the Poisson and Ampere equations at each time step.
>>>>>>>>> >> I'm using a 2D DMDA for each of them. Would it be possible to compute the Poisson and Ampere equations at the same time, one on the CPU and the other on the GPU?
>>>>>>>>> >>
>>>>>>>>> >> Thanks
>>>>>>>
>>>>>>> <out_gpu> <out_nogpu>
>>>
>>> <out_miniapp_f_poisson>
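For context on the original question, a hypothetical sketch of the setup it describes: one 2D DMDA per equation on the same communicator. Grid sizes, stencil, and names are placeholders:

#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM             dmPoisson, dmAmpere;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  /* One DMDA per equation; -dm_mat_type/-dm_vec_type (as in the option
   * table earlier in the thread) decide whether each lives on CPU or GPU. */
  ierr = DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR, 128, 128, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, &dmPoisson);CHKERRQ(ierr);
  ierr = DMSetFromOptions(dmPoisson);CHKERRQ(ierr);
  ierr = DMSetUp(dmPoisson);CHKERRQ(ierr);
  ierr = DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR, 128, 128, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, &dmAmpere);CHKERRQ(ierr);
  ierr = DMSetFromOptions(dmAmpere);CHKERRQ(ierr);
  ierr = DMSetUp(dmAmpere);CHKERRQ(ierr);
  /* The two solves then run back to back; overlapping them on CPU and
   * GPU is the open question of this thread. */
  ierr = DMDestroy(&dmAmpere);CHKERRQ(ierr);
  ierr = DMDestroy(&dmPoisson);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}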