Does the same thing work (with GAMG) if you run the same problem on the same machine with the same number of MPI ranks, but with a new PETSC_ARCH that does NOT use the GPUs?

   Barry

   Ideally one gets almost identical convergence with CPUs or GPUs (same problem, same machine), but a bug or a numerical change "might" affect this.
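For example, a CPU-only comparison build could be configured along these lines; this is a sketch, the arch name is illustrative, and the remaining options should mirror the GPU build quoted further below minus the CUDA flags:

    ./configure PETSC_ARCH=arch-cpu-nogpu --with-cuda=0 <same compilers and options as the GPU build>

Then rerun the identical job with PETSC_ARCH=arch-cpu-nogpu and without the -dm_mat_type aijcusparse / -dm_vec_type cuda entries in the option table.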
> On Aug 13, 2020, at 10:28 AM, nicola varini <nicola.var...@gmail.com> wrote:
>
> Dear Barry, you are right. The Cray argument checking is incorrect. It does work with --download-fblaslapack.
> However, it does fail to converge. Is there anything obviously wrong with my petscrc?
> Anything else I am missing?
>
> Thanks
>
> On Thu, Aug 13, 2020 at 03:17 Barry Smith <bsm...@petsc.dev> wrote:
>
>    The QR is always done on the CPU; we don't have generic calls to blas/lapack go to the GPU currently.
>
>    The error message is:
>
>    On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>
>    Argument 7 is &LWORK, which is defined by
>
>    PetscBLASInt LWORK=N*bs;
>
>    and N=nSAvec is the column block size of the new P.
>
>    Presumably this is a huge run with many processes, so using the debugger is not practical?
>
>    We need to see what these variables are: N, bs, nSAvec. Perhaps nSAvec is zero, which could easily upset LAPACK.
>
>    The crudest thing would be to just put a print statement in the code before the LAPACK call or, since they are called many times, add an error check that generates an error if any of these three values is 0 (or negative).
>
>    Barry
>
>    It is not impossible that the Cray argument checking is incorrect and the value passed in is fine. You can check this by using --download-fblaslapack and seeing if the same or some other error comes up.
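For reference, the kind of guard Barry describes might look like the following, placed in formProl0() (src/ksp/pc/impls/gamg/agg.c) just before the xGEQRF call. This is a sketch built from the variable names quoted above, not the exact PETSc source:

    /* Sketch: fail with a readable message, instead of an opaque LAPACK
       "illegal value" error, if any size feeding the QR is nonpositive. */
    if (N <= 0 || bs <= 0 || nSAvec <= 0) {
      SETERRQ3(PETSC_COMM_SELF, PETSC_ERR_PLIB,
               "Bad sizes before xGEQRF: N=%D bs=%D nSAvec=%D",
               (PetscInt)N, (PetscInt)bs, (PetscInt)nSAvec);
    }

SETERRQ3 and the %D PetscInt format specifier are the PETSc 3.13-era error idiom; a plain printf before the call would serve the same diagnostic purpose.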
>
>> On Aug 12, 2020, at 7:19 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> Can you reproduce this on the CPU?
>> The QR factorization seems to be failing. That could be from bad data or a bad GPU QR.
>>
>> On Wed, Aug 12, 2020 at 4:19 AM nicola varini <nicola.var...@gmail.com> wrote:
>>
>> Dear all, following the suggestions I resubmitted the simulation with the petscrc below.
>> However, I get the following error:
>> ========
>> [592]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>> [339]PETSC ERROR: Petsc has generated inconsistent data
>> [339]PETSC ERROR: xGEQRF error
>> [339]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>> [339]PETSC ERROR: Petsc Release Version 3.13.3, Jul 01, 2020
>> [339]PETSC ERROR: /users/nvarini/gbs_test_nicola/bin/gbs_daint_gpu_gnu on a named nid05083 by nvarini Wed Aug 12 10:06:15 2020
>> [339]PETSC ERROR: Configure options --with-cc=cc --with-fc=ftn --known-mpi-shared-libraries=1 --known-mpi-c-double-complex=1 --known-mpi-int64_t=1 --known-mpi-long-double=1 --with-batch=1 --known-64-bit-blas-indices=0 --LIBS=-lstdc++ --with-cxxlib-autodetect=0 --with-scalapack=1 --with-cxx=CC --with-debugging=0 --with-hypre-dir=/opt/cray/pe/tpsl/19.06.1/GNU/8.2/haswell --prefix=/scratch/snx3000/nvarini/petsc3.13.3-gpu --with-cuda=1 --with-cuda-c=nvcc --with-cxxlib-autodetect=0 --COPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include --with-cxx=CC --CXXOPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>> [592]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>> [592]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>> [592]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>> [592]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>> [592]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>> [316]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>> [339]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>> [339]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>> [339]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>> [339]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>> [339]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>> [592]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>> [339]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>> [339]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>> On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>> [160]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>> ========
>>
>> I did try other pc_gamg_type values, but they fail as well.
>>
>> #PETSc Option Table entries:
>> -ampere_dm_mat_type aijcusparse
>> -ampere_dm_vec_type cuda
>> -ampere_ksp_atol 1e-15
>> -ampere_ksp_initial_guess_nonzero yes
>> -ampere_ksp_reuse_preconditioner yes
>> -ampere_ksp_rtol 1e-7
>> -ampere_ksp_type dgmres
>> -ampere_mg_levels_esteig_ksp_max_it 10
>> -ampere_mg_levels_esteig_ksp_type cg
>> -ampere_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>> -ampere_mg_levels_ksp_type chebyshev
>> -ampere_mg_levels_pc_type jacobi
>> -ampere_pc_gamg_agg_nsmooths 1
>> -ampere_pc_gamg_coarse_eq_limit 10
>> -ampere_pc_gamg_reuse_interpolation true
>> -ampere_pc_gamg_square_graph 1
>> -ampere_pc_gamg_threshold 0.05
>> -ampere_pc_gamg_threshold_scale .0
>> -ampere_pc_gamg_type agg
>> -ampere_pc_type gamg
>> -dm_mat_type aijcusparse
>> -dm_vec_type cuda
>> -log_view
>> -poisson_dm_mat_type aijcusparse
>> -poisson_dm_vec_type cuda
>> -poisson_ksp_atol 1e-15
>> -poisson_ksp_initial_guess_nonzero yes
>> -poisson_ksp_reuse_preconditioner yes
>> -poisson_ksp_rtol 1e-7
>> -poisson_ksp_type dgmres
>> -poisson_log_view
>> -poisson_mg_levels_esteig_ksp_max_it 10
>> -poisson_mg_levels_esteig_ksp_type cg
>> -poisson_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>> -poisson_mg_levels_ksp_max_it 1
>> -poisson_mg_levels_ksp_type chebyshev
>> -poisson_mg_levels_pc_type jacobi
>> -poisson_pc_gamg_agg_nsmooths 1
>> -poisson_pc_gamg_coarse_eq_limit 10
>> -poisson_pc_gamg_reuse_interpolation true
>> -poisson_pc_gamg_square_graph 1
>> -poisson_pc_gamg_threshold 0.05
>> -poisson_pc_gamg_threshold_scale .0
>> -poisson_pc_gamg_type agg
>> -poisson_pc_type gamg
>> -use_mat_nearnullspace true
>> #End of PETSc Option Table entries
>>
>> Regards,
>>
>> Nicola
>>
>> On Tue, Aug 4, 2020 at 17:57 Mark Adams <mfad...@lbl.gov> wrote:
>>
>> On Tue, Aug 4, 2020 at 6:35 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>
>> Nicola,
>>
>> You are actually not using the GPU properly, since you use HYPRE preconditioning, which is CPU only. One of your solvers is actually slower on “GPU”.
>> For full AMG on the GPU, you can use PCGAMG with Chebyshev smoothers and Jacobi preconditioning. Mark can help you out with the specific command-line options.
>> When it works properly, everything related to the PC application is offloaded to the GPU, and during KSPSolve you should expect the well-known and branded 10x (maybe more) speedup one expects from GPUs.
>>
>> The speedup depends on the machine, but on SUMMIT, comparing enough CPUs to saturate the memory bus against all 6 GPUs, the speedup is a function of the problem subdomain size. I saw 10x at about 100K equations/process.
>>
>> Doing what you want to do is one of the last optimization steps of an already optimized code before entering production. Yours is not even optimized for proper GPU usage yet.
>> Also, any specific reason why you are using dgmres and fgmres?
>>
>> PETSc has not been designed with multi-threading in mind. You can achieve "overlap" of the two solves by splitting the communicator, but then you need communications to let the two solutions talk to each other.
>>
>> Thanks
>> Stefano
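Distilled from Nicola's option table above, the core of the all-GPU GAMG configuration Stefano describes is roughly the following, shown here without the poisson_/ampere_ prefixes; treat it as a starting sketch rather than a tuned setup:

    -dm_mat_type aijcusparse -dm_vec_type cuda
    -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1
    -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi

The aijcusparse/cuda types put the matrix and vector operations on the GPU, and the Chebyshev/Jacobi smoother avoids any CPU-only preconditioner such as HYPRE on the multigrid levels.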
>>
>>> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.var...@gmail.com> wrote:
>>>
>>> Dear all, thanks for your replies. The reason I asked whether it is possible to overlap the Poisson and Ampere solves is that they roughly take the same amount of time. Please find attached the profiling logs for the CPU-only and GPU-only runs.
>>> Of course it is possible to split the MPI communicator and run each solver on a different subcommunicator, but this would involve more communication.
>>> Has anyone ever tried to run two solvers with hyperthreading?
>>> Thanks
>>>
>>> On Sun, Aug 2, 2020 at 14:09 Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>> I suspect that the Poisson and Ampere's law solves are not coupled. You might be able to duplicate the communicator and use two threads. You would want to configure PETSc with threadsafety and threads, and I think it could/should work, but this mode is never used by anyone.
>>>
>>> That said, I would not recommend doing this unless you feel like playing at computer science, as opposed to doing application science. In the best-case scenario you get a 2x speedup; that is a strict upper bound, and you will never come close to it. Your hardware has some balance of CPU to GPU processing rate, and your application has some balance of work between the two solves; to get close to the 2x speedup, those two ratios have to match, 1:1. To be concrete, from what little I can guess about your application, let's assume the cost of the two solves is about the same (e.g., Laplacians on your domain, the best-case scenario). But GPU machines these days are configured with only roughly 1-10% of their processing capacity in the CPUs, which gives you an upper bound of about 10% speedup. That is noise. The upshot: unless you configure your hardware to match this problem, and the two solves have the same cost, you will not see anything close to a 2x speedup. Your time is better spent elsewhere.
>>>
>>> Mark
>>>
>>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <j...@jedbrown.org> wrote:
>>>
>>> You can use MPI and split the communicator so n-1 ranks create a DMDA for one part of your system and the other rank drives the GPU in the other part. They can all be part of the same coupled system on the full communicator, but PETSc doesn't currently support some ranks having their Vec arrays on GPU and others on host, so you'd be paying host-device transfer costs on each iteration (and that might swamp any performance benefit you would have gotten).
>>>
>>> In any case, be sure to think about the execution time of each part. Load balancing with matching time-to-solution for each part can be really hard.
>>>
>>> Barry Smith <bsm...@petsc.dev> writes:
>>>
>>> > Nicola,
>>> >
>>> > This is not really viable or practical at this time with PETSc. It is not impossible, but it requires careful coding with threads; another possibility is to use one half of the virtual GPUs for each solve, but this is also not trivial. I would recommend first seeing what kind of performance you can get on the GPU for each type of solve and revisiting this idea in the future.
>>> >
>>> > Barry
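A minimal sketch of the communicator splitting that Jed and Stefano describe, with one rank driving the GPU solve and the rest sharing the CPU solve; the rank layout is a hypothetical choice, and error checking is omitted for brevity:

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      PetscMPIInt rank, size;
      MPI_Comm    subcomm;
      KSP         ksp;

      PetscInitialize(&argc, &argv, NULL, NULL);
      MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
      MPI_Comm_size(PETSC_COMM_WORLD, &size);
      /* color 1: the single rank that drives the GPU solve;
         color 0: the remaining ranks that share the CPU solve */
      PetscMPIInt color = (rank == size - 1) ? 1 : 0;
      MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &subcomm);
      /* each group builds its own solver on its own subcommunicator */
      KSPCreate(subcomm, &ksp);
      /* ... set operators and options per group, call KSPSolve(), then
         couple the two solutions with explicit MPI messages on
         PETSC_COMM_WORLD, paying the host-device transfer cost ... */
      KSPDestroy(&ksp);
      MPI_Comm_free(&subcomm);
      PetscFinalize();
      return 0;
    }

As Jed notes, the two subproblems then have to exchange data explicitly, and load balancing the two groups so they finish at the same time is the hard part.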
>>> >
>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>> >>
>>> >> Hello, I would like to know if it is possible to overlap CPU and GPU with DMDA.
>>> >> I have a machine where each node has 1 P100 + 1 Haswell.
>>> >> I have to solve the Poisson and Ampere equations at each time step, and I am using a 2D DMDA for each of them. Would it be possible to compute the Poisson and Ampere equations at the same time, one on the CPU and the other on the GPU?
>>> >>
>>> >> Thanks

[Attachments: out_gpu, out_nogpu, out_miniapp_f_poisson]