Does the same thing work (with GAMG) if you run the same problem on the 
same machine with the same number of MPI ranks, but with a new PETSC_ARCH that 
does NOT use the GPUs?

   Barry

   Ideally one gets almost identical convergence on CPUs and GPUs (same 
problem, same machine), but a bug or a numerical difference "might" affect this.
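
   For concreteness, a rough sketch of what that test could look like, reusing 
the configure line from the log below but dropping the CUDA pieces (the 
PETSC_ARCH name here is arbitrary):

      ./configure PETSC_ARCH=arch-cpu-only --with-cuda=0 <same other options>

   and at run time replacing the GPU types with their host equivalents, e.g. 
-dm_mat_type aij and -dm_vec_type standard (likewise for the poisson_ and 
ampere_ prefixed variants), keeping everything else identical.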

> On Aug 13, 2020, at 10:28 AM, nicola varini <nicola.var...@gmail.com> wrote:
> 
> Dear Barry, you are right: the Cray argument checking is incorrect. It does 
> work with --download-fblaslapack.
> However, it fails to converge. Is there anything obviously wrong with my 
> petscrc? Anything else I am missing?
> 
> Thanks
> 
> On Thu, Aug 13, 2020 at 03:17 Barry Smith <bsm...@petsc.dev> wrote:
> 
>    The QR is always done on the CPU; we do not currently route generic 
> BLAS/LAPACK calls to the GPU.
> 
>    The error message is:
> 
>    On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
> 
>    Argument 7 is &LWORK, which is defined by
> 
>    PetscBLASInt   LWORK=N*bs;
> 
>    and
> 
>    N=nSAvec is the column block size of new P.
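> 
>    For reference, the LAPACK documentation gives the routine as
> 
>       DGEQRF( M, N, A, LDA, TAU, WORK, LWORK, INFO )
> 
>    so parameter 7 is LWORK, and LWORK must be >= max(1,N) (or -1 for a 
>    workspace query); LWORK = 0 would therefore be flagged as illegal.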
> 
>    Presumably this is a huge run with many processes, so using the debugger 
> is not practical?
> 
>    We need to see what these variables are
> 
>     N, bs, nSAvec  
> 
>     Perhaps nSAvec is zero, which could easily upset LAPACK. 
> 
>     The crudest thing would be to put a print statement in the code before 
> the LAPACK call, or, if it is called many times, add an error check that 
>     generates an error if any of these three values is 0 (or negative).
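> 
>     A minimal sketch of such a check, assuming it is placed in formProl0() in 
>     agg.c right before the LAPACKgeqrf_() call (the casts are only there in 
>     case the local variable types differ):
> 
>       if (N <= 0 || bs <= 0 || nSAvec <= 0) {
>         SETERRQ3(PETSC_COMM_SELF,PETSC_ERR_PLIB,
>                  "Bad sizes before xGEQRF: N=%D bs=%D nSAvec=%D",
>                  (PetscInt)N,(PetscInt)bs,(PetscInt)nSAvec);
>       }
>       /* or, for a quick look on a big run, just print and continue */
>       PetscPrintf(PETSC_COMM_SELF,"[formProl0] N=%D bs=%D nSAvec=%D LWORK=%D\n",
>                   (PetscInt)N,(PetscInt)bs,(PetscInt)nSAvec,(PetscInt)LWORK);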
> 
>    Barry
> 
> 
>     It is not impossible that the Cray argument checking is incorrect and the 
> value passed in is fine. You can check this by using --download-fblaslapack 
> and seeing if the same or some other error comes up.
> 
> 
>> On Aug 12, 2020, at 7:19 PM, Mark Adams <mfad...@lbl.gov> wrote:
>> 
>> Can you reproduce this on the CPU? 
>> The QR factorization seems to be failing. That could be from bad data or a 
>> bad GPU QR.
>> 
>> On Wed, Aug 12, 2020 at 4:19 AM nicola varini <nicola.var...@gmail.com> wrote:
>> Dear all, following the suggestions I resubmitted the simulation with the 
>> petscrc below.
>> However, I get the following error:
>> ========
>>  7362 [592]PETSC ERROR: #1 formProl0() line 748 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>   7363 [339]PETSC ERROR: Petsc has generated inconsistent data
>>   7364 [339]PETSC ERROR: xGEQRF error
>>   7365 [339]PETSC ERROR: See 
>> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>   7366 [339]PETSC ERROR: Petsc Release Version 3.13.3, Jul 01, 2020
>>   7367 [339]PETSC ERROR: 
>> /users/nvarini/gbs_test_nicola/bin/gbs_daint_gpu_gnu on a  named nid05083 by 
>> nvarini Wed Aug 12 10:06:15 2020
>>   7368 [339]PETSC ERROR: Configure options --with-cc=cc --with-fc=ftn 
>> --known-mpi-shared-libraries=1 --known-mpi-c-double-complex=1 
>> --known-mpi-int64_t=1 --known-mpi-long-double=1 --with-batch=1 
>> --known-64-bit-blas-indices=0 --LIBS=-lstdc++ --with-cxxlib-autodetect=0 
>> --with-scalapack=1 --with-cxx=CC --with-debugging=0 
>> --with-hypre-dir=/opt/cray/pe/tpsl/19.06.1/GNU/8.2/haswell 
>> --prefix=/scratch/snx3000/nvarini/petsc3.13.3-gpu --with-cuda=1 
>> --with-cuda-c=nvcc --with-cxxlib-autodetect=0 
>> --COPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include 
>> --with-cxx=CC 
>> --CXXOPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>>   7369 [592]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>   7370 [592]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>   7371 [592]PETSC ERROR: #4 PCSetUp() line 898 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>   7372 [592]PETSC ERROR: #5 KSPSetUp() line 376 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>   7373 [592]PETSC ERROR: #6 KSPSolve_Private() line 633 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>   7374 [316]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>   7375 [339]PETSC ERROR: #1 formProl0() line 748 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>   7376 [339]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>   7377 [339]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>   7378 [339]PETSC ERROR: #4 PCSetUp() line 898 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>   7379 [339]PETSC ERROR: #5 KSPSetUp() line 376 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>   7380 [592]PETSC ERROR: #7 KSPSolve() line 853 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>   7381 [339]PETSC ERROR: #6 KSPSolve_Private() line 633 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>   7382 [339]PETSC ERROR: #7 KSPSolve() line 853 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>   7383 On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info 
>> = -7)
>>   7384 [160]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in 
>> /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>> ========
>> 
>> I did try other pc_gamg_type values, but they fail as well.
>> 
>> 
>> #PETSc Option Table entries:
>> -ampere_dm_mat_type aijcusparse
>> -ampere_dm_vec_type cuda
>> -ampere_ksp_atol 1e-15
>> -ampere_ksp_initial_guess_nonzero yes
>> -ampere_ksp_reuse_preconditioner yes
>> -ampere_ksp_rtol 1e-7
>> -ampere_ksp_type dgmres
>> -ampere_mg_levels_esteig_ksp_max_it 10
>> -ampere_mg_levels_esteig_ksp_type cg
>> -ampere_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>> -ampere_mg_levels_ksp_type chebyshev
>> -ampere_mg_levels_pc_type jacobi
>> -ampere_pc_gamg_agg_nsmooths 1
>> -ampere_pc_gamg_coarse_eq_limit 10
>> -ampere_pc_gamg_reuse_interpolation true
>> -ampere_pc_gamg_square_graph 1
>> -ampere_pc_gamg_threshold 0.05
>> -ampere_pc_gamg_threshold_scale .0
>> -ampere_pc_gamg_type agg
>> -ampere_pc_type gamg
>> -dm_mat_type aijcusparse
>> -dm_vec_type cuda
>> -log_view
>> -poisson_dm_mat_type aijcusparse
>> -poisson_dm_vec_type cuda
>> -poisson_ksp_atol 1e-15
>> -poisson_ksp_initial_guess_nonzero yes
>> -poisson_ksp_reuse_preconditioner yes
>> -poisson_ksp_rtol 1e-7
>> -poisson_ksp_type dgmres
>> -poisson_log_view
>> -poisson_mg_levels_esteig_ksp_max_it 10
>> -poisson_mg_levels_esteig_ksp_type cg
>> -poisson_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>> -poisson_mg_levels_ksp_max_it 1
>> -poisson_mg_levels_ksp_type chebyshev
>> -poisson_mg_levels_pc_type jacobi
>> -poisson_pc_gamg_agg_nsmooths 1
>> -poisson_pc_gamg_coarse_eq_limit 10
>> -poisson_pc_gamg_reuse_interpolation true
>> -poisson_pc_gamg_square_graph 1
>> -poisson_pc_gamg_threshold 0.05
>> -poisson_pc_gamg_threshold_scale .0
>> -poisson_pc_gamg_type agg
>> -poisson_pc_type gamg
>> -use_mat_nearnullspace true
>> #End of PETSc Option Table entries
>> 
>> Regards,
>> 
>> Nicola
>> 
>> On Tue, Aug 4, 2020 at 17:57 Mark Adams <mfad...@lbl.gov> wrote:
>> 
>> 
>> On Tue, Aug 4, 2020 at 6:35 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>> Nicola,
>> 
>> You are actually not using the GPU properly, since you use HYPRE 
>> preconditioning, which is CPU only. One of your solvers is actually slower 
>> on “GPU”.
>> For a fully GPU AMG, you can use PCGAMG with Chebyshev smoothers and Jacobi 
>> preconditioning on the levels. Mark can help you out with the specific 
>> command line options.
>> When it works properly, everything related to the PC application is offloaded 
>> to the GPU, and you should expect the well-known and branded 10x (maybe more) 
>> speedup one expects from GPUs during KSPSolve.
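>> 
>> A minimal sketch of such an option set, taking the prefix and values from the 
>> options already posted in this thread (illustrative, not a tuned recommendation):
>> 
>>   -poisson_pc_type gamg
>>   -poisson_pc_gamg_type agg
>>   -poisson_pc_gamg_agg_nsmooths 1
>>   -poisson_mg_levels_ksp_type chebyshev
>>   -poisson_mg_levels_pc_type jacobi
>>   -poisson_dm_mat_type aijcusparse
>>   -poisson_dm_vec_type cuda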
>> 
>> 
>> The speedup depends on the machine, but on SUMMIT, comparing enough CPU cores 
>> to saturate the memory bus against all 6 GPUs, the speedup is a function of 
>> the per-process (subdomain) problem size. I saw 10x at about 100K equations 
>> per process.
>>  
>> Doing what you want to do is one of the last optimization steps of an 
>> already optimized code before entering production. Yours is not even 
>> optimized for proper GPU usage yet.
>> Also, is there any specific reason why you are using dgmres and fgmres?
>> 
>> PETSc has not been designed with multi-threading in mind. You can achieve 
>> “overlap” of the two solves by splitting the communicator. But then you need 
>> communications to let the two solutions talk to each other.
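>> 
>> A minimal sketch of that split, assuming it runs after PetscInitialize() and 
>> that each half of the ranks owns one of the two solves (illustrative only):
>> 
>>   PetscMPIInt rank, size;
>>   MPI_Comm    subcomm;
>>   MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
>>   MPI_Comm_size(PETSC_COMM_WORLD, &size);
>>   /* color 0 -> ranks doing the Poisson solve, color 1 -> the Ampere solve */
>>   MPI_Comm_split(PETSC_COMM_WORLD, (rank < size/2) ? 0 : 1, rank, &subcomm);
>>   /* build the DM and KSP for the local solve on subcomm rather than
>>      PETSC_COMM_WORLD; exchanging the two solutions afterwards still requires
>>      explicit communication on the full communicator */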
>> 
>> Thanks
>> Stefano
>> 
>> 
>>> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.var...@gmail.com> wrote:
>>> 
>>> Dear all, thanks for your replies. The reason I asked whether it is possible 
>>> to overlap Poisson and Ampere is that they take roughly the same amount of 
>>> time. Please find attached the profiling logs for CPU only and GPU only.
>>> Of course it is possible to split the MPI communicator and run each solver 
>>> on a different subcommunicator; however, this would involve more communication.
>>> Has anyone ever tried to run 2 solvers with hyperthreading?
>>> Thanks
>>> 
>>> 
>>> On Sun, Aug 2, 2020 at 14:09 Mark Adams <mfad...@lbl.gov> wrote:
>>> I suspect that the Poisson and Ampere's law solves are not coupled. You 
>>> might be able to duplicate the communicator and use two threads. You would 
>>> want to configure PETSc with threadsafety and threads, and I think it 
>>> could/should work, but this mode is never used by anyone.
>>> 
>>> That said, I would not recommend doing this unless you feel like playing in 
>>> computer science, as opposed to doing application science. In the best-case 
>>> scenario you get a speedup of 2x. That is a strict upper bound, and you will 
>>> never come close to it. Your hardware has some balance of CPU to GPU 
>>> processing rate, and your application has some balance of the volume of work 
>>> between your two solves. Those two ratios have to match, and the work split 
>>> has to be 1:1, to get close to a 2x speedup. To be concrete, from what little 
>>> I can guess about your application, let's assume that the cost of each of 
>>> these two solves is about the same (e.g., Laplacians on your domain, the 
>>> best-case scenario). But GPU machines these days are configured with roughly 
>>> 1-10% of their capacity in the CPUs, which gives you an upper bound of about 
>>> a 10% speedup. That is noise. Upshot: unless you configure your hardware to 
>>> match this problem, and the two solves have the same cost, you will not see 
>>> anywhere close to a 2x speedup. Your time is better spent elsewhere.
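>>> 
>>> To make that bound explicit, a back-of-the-envelope sketch (assuming the CPU 
>>> side delivers a fraction c of the GPU's throughput, with c around 0.01-0.1 
>>> on current GPU machines): giving the CPU a fraction c/(1+c) of the total 
>>> work lets both sides finish together, so the run time drops from T to 
>>> T/(1+c), at most a (1+c)x speedup, i.e. about 10% for c = 0.1. Putting one 
>>> full solve (half the work) on the CPU instead gives max(T/2, T/(2c)) = 
>>> T/(2c), which for c = 0.1 is 5x slower than doing both solves on the GPU.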
>>> 
>>> Mark
>>> 
>>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <j...@jedbrown.org> wrote:
>>> You can use MPI and split the communicator so n-1 ranks create a DMDA for 
>>> one part of your system and the other rank drives the GPU in the other 
>>> part.  They can all be part of the same coupled system on the full 
>>> communicator, but PETSc doesn't currently support some ranks having their 
>>> Vec arrays on GPU and others on host, so you'd be paying host-device 
>>> transfer costs on each iteration (and that might swamp any performance 
>>> benefit you would have gotten).
>>> 
>>> In any case, be sure to think about the execution time of each part.  Load 
>>> balancing with matching time-to-solution for each part can be really hard.
>>> 
>>> 
>>> Barry Smith <bsm...@petsc.dev> writes:
>>> 
>>> >   Nicola,
>>> >
>>> >     This is not really viable or practical at this time with PETSc. It is 
>>> > not impossible, but it requires careful coding with threads; another 
>>> > possibility is to use one half of the virtual GPUs for each solve, which 
>>> > is also not trivial. I would recommend first seeing what kind of 
>>> > performance you can get on the GPU for each type of solve and revisiting 
>>> > this idea in the future.
>>> >
>>> >    Barry
>>> >
>>> >
>>> >
>>> >
>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>> >> 
>>> >> Hello, I would like to know if it is possible to overlap CPU and GPU 
>>> >> with DMDA.
>>> >> I have a machine where each node has 1 P100 + 1 Haswell.
>>> >> I have to solve the Poisson and Ampere equations at each time step.
>>> >> I'm using a 2D DMDA for each of them. Would it be possible to compute the 
>>> >> Poisson and Ampere equations at the same time, one on the CPU and the 
>>> >> other on the GPU?
>>> >> 
>>> >> Thanks
>>> <out_gpu><out_nogpu>
>> 
> 
> <out_miniapp_f_poisson>
