Dear Barry, yes it gives the same problems.

On Thu, 13 Aug 2020 at 23:22, Barry Smith <bsm...@petsc.dev> wrote:

>    Does the same thing work (with GAMG) if you run the same problem on the same machine with the same number of MPI ranks, but with a new PETSC_ARCH that does NOT use the GPUs?
>
>    Barry
>
>    Ideally one gets almost identical convergence with CPUs or GPUs (same problem, same machine), but a bug or a numerical change "might" affect this.
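A quicker, rougher check than rebuilding, assuming the solvers take their Mat/Vec types from the options database as the option table later in this thread suggests, is to keep the GPU build and switch the types back to the host defaults in the petscrc:

  -dm_mat_type aij
  -dm_vec_type standard
  -poisson_dm_mat_type aij
  -poisson_dm_vec_type standard
  -ampere_dm_mat_type aij
  -ampere_dm_vec_type standard

If convergence is still bad with the aij/standard types, the GPU back end is probably not the culprit; Barry's separate non-GPU PETSC_ARCH remains the cleaner comparison, since it also takes the GPU libraries out of the build entirely.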
> On Aug 13, 2020, at 10:28 AM, nicola varini <nicola.var...@gmail.com> wrote:
>
> Dear Barry, you are right. The Cray argument checking is incorrect. It does work with --download-fblaslapack.
> However, it does fail to converge. Is there anything obviously wrong with my petscrc? Anything else I am missing?
>
> Thanks
>
> On Thu, 13 Aug 2020 at 03:17, Barry Smith <bsm...@petsc.dev> wrote:
>
>>    The QR is always done on the CPU; we don't have generic calls to BLAS/LAPACK go to the GPU currently.
>>
>>    The error message is:
>>
>>      On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>>
>>    Argument 7 is &LWORK, which is defined by
>>
>>      PetscBLASInt LWORK=N*bs;
>>
>>    where N=nSAvec is the column block size of the new P.
>>
>>    Presumably this is a huge run with many processes, so using the debugger is not practical?
>>
>>    We need to see what these variables are:
>>
>>      N, bs, nSAvec
>>
>>    Perhaps nSAvec is zero, which could easily upset LAPACK.
>>
>>    The crudest thing would be to just put a print statement in the code before the LAPACK call or, if it is called many times, to add an error check that generates an error if any of these three values is 0 (or negative).
>>
>>    Barry
>>
>>    It is not impossible that the Cray argument checking is incorrect and the value passed in is fine. You can check this by using --download-fblaslapack and seeing if the same or some other error comes up.
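For reference, a minimal sketch of the guard Barry describes, placed just before the xGEQRF call in formProl0() (src/ksp/pc/impls/gamg/agg.c); the variable names follow his description, and the message text and casts are illustrative rather than the actual PETSc source:

  if (N <= 0 || bs <= 0 || nSAvec <= 0) {
    /* sketch only: fail early, with the offending sizes printed,
       instead of letting LAPACK reject LWORK = N*bs */
    SETERRQ3(PETSC_COMM_SELF, PETSC_ERR_PLIB,
             "Bad sizes before xGEQRF: N=%D bs=%D nSAvec=%D",
             (PetscInt)N, (PetscInt)bs, (PetscInt)nSAvec);
  }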
>> On Aug 12, 2020, at 7:19 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> Can you reproduce this on the CPU?
>> The QR factorization seems to be failing. That could be from bad data or a bad GPU QR.
>>
>> On Wed, Aug 12, 2020 at 4:19 AM nicola varini <nicola.var...@gmail.com> wrote:
>>
>>> Dear all, following the suggestions I resubmitted the simulation with the petscrc below.
>>> However, I get the following error:
>>> ========
>>> 7362 [592]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>> 7363 [339]PETSC ERROR: Petsc has generated inconsistent data
>>> 7364 [339]PETSC ERROR: xGEQRF error
>>> 7365 [339]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>> 7366 [339]PETSC ERROR: Petsc Release Version 3.13.3, Jul 01, 2020
>>> 7367 [339]PETSC ERROR: /users/nvarini/gbs_test_nicola/bin/gbs_daint_gpu_gnu on a named nid05083 by nvarini Wed Aug 12 10:06:15 2020
>>> 7368 [339]PETSC ERROR: Configure options --with-cc=cc --with-fc=ftn --known-mpi-shared-libraries=1 --known-mpi-c-double-complex=1 --known-mpi-int64_t=1 --known-mpi-long-double=1 --with-batch=1 --known-64-bit-blas-indices=0 --LIBS=-lstdc++ --with-cxxlib-autodetect=0 --with-scalapack=1 --with-cxx=CC --with-debugging=0 --with-hypre-dir=/opt/cray/pe/tpsl/19.06.1/GNU/8.2/haswell --prefix=/scratch/snx3000/nvarini/petsc3.13.3-gpu --with-cuda=1 --with-cuda-c=nvcc --with-cxxlib-autodetect=0 --COPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include --with-cxx=CC --CXXOPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>>> 7369 [592]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>> 7370 [592]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>> 7371 [592]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>> 7372 [592]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>> 7373 [592]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>> 7374 [316]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>> 7375 [339]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>> 7376 [339]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>> 7377 [339]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>> 7378 [339]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>> 7379 [339]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>> 7380 [592]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>> 7381 [339]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>> 7382 [339]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>> 7383 On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>>> 7384 [160]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>> ========
>>>
>>> I did try other pc_gamg_type values, but they fail as well.
>>>
>>> #PETSc Option Table entries:
>>> -ampere_dm_mat_type aijcusparse
>>> -ampere_dm_vec_type cuda
>>> -ampere_ksp_atol 1e-15
>>> -ampere_ksp_initial_guess_nonzero yes
>>> -ampere_ksp_reuse_preconditioner yes
>>> -ampere_ksp_rtol 1e-7
>>> -ampere_ksp_type dgmres
>>> -ampere_mg_levels_esteig_ksp_max_it 10
>>> -ampere_mg_levels_esteig_ksp_type cg
>>> -ampere_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>> -ampere_mg_levels_ksp_type chebyshev
>>> -ampere_mg_levels_pc_type jacobi
>>> -ampere_pc_gamg_agg_nsmooths 1
>>> -ampere_pc_gamg_coarse_eq_limit 10
>>> -ampere_pc_gamg_reuse_interpolation true
>>> -ampere_pc_gamg_square_graph 1
>>> -ampere_pc_gamg_threshold 0.05
>>> -ampere_pc_gamg_threshold_scale .0
>>> -ampere_pc_gamg_type agg
>>> -ampere_pc_type gamg
>>> -dm_mat_type aijcusparse
>>> -dm_vec_type cuda
>>> -log_view
>>> -poisson_dm_mat_type aijcusparse
>>> -poisson_dm_vec_type cuda
>>> -poisson_ksp_atol 1e-15
>>> -poisson_ksp_initial_guess_nonzero yes
>>> -poisson_ksp_reuse_preconditioner yes
>>> -poisson_ksp_rtol 1e-7
>>> -poisson_ksp_type dgmres
>>> -poisson_log_view
>>> -poisson_mg_levels_esteig_ksp_max_it 10
>>> -poisson_mg_levels_esteig_ksp_type cg
>>> -poisson_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>> -poisson_mg_levels_ksp_max_it 1
>>> -poisson_mg_levels_ksp_type chebyshev
>>> -poisson_mg_levels_pc_type jacobi
>>> -poisson_pc_gamg_agg_nsmooths 1
>>> -poisson_pc_gamg_coarse_eq_limit 10
>>> -poisson_pc_gamg_reuse_interpolation true
>>> -poisson_pc_gamg_square_graph 1
>>> -poisson_pc_gamg_threshold 0.05
>>> -poisson_pc_gamg_threshold_scale .0
>>> -poisson_pc_gamg_type agg
>>> -poisson_pc_type gamg
>>> -use_mat_nearnullspace true
>>> #End of PETSc Option Table entries
>>>
>>> Regards,
>>>
>>> Nicola
>>>
>>> On Tue, 4 Aug 2020 at 17:57, Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>>> On Tue, Aug 4, 2020 at 6:35 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>
>>>>> Nicola,
>>>>>
>>>>> You are actually not using the GPU properly, since you use HYPRE preconditioning, which is CPU only. One of your solvers is actually slower on "GPU".
>>>>> For a full AMG GPU, you can use PCGAMG with Chebyshev smoothers and Jacobi preconditioning. Mark can help you out with the specific command line options.
>>>>> When it works properly, everything related to the PC application is offloaded to the GPU, and you should expect to get the well-known and branded 10x (maybe more) speedup one expects from GPUs during KSPSolve.
>>>>
>>>> The speedup depends on the machine, but on SUMMIT, using enough CPUs to saturate the memory bus vs. all 6 GPUs, the speedup is a function of the problem subdomain size. I saw 10x at about 100K equations/process.
>>>>
>>>>> Doing what you want to do is one of the last optimization steps of an already optimized code before entering production. Yours is not even optimized for proper GPU usage yet.
>>>>> Also, any specific reason why you are using dgmres and fgmres?
>>>>>
>>>>> PETSc has not been designed with multi-threading in mind. You can achieve "overlap" of the two solves by splitting the communicator. But then you need communications to let the two solutions talk to each other.
>>>>>
>>>>> Thanks
>>>>> Stefano
>>>>>
>>>>> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>
>>>>> Dear all, thanks for your replies.
>>>>> The reason why I asked whether it is possible to overlap Poisson and Ampere is that they take roughly the same amount of time. Please find attached the profiling logs for CPU only and GPU only.
>>>>> Of course it is possible to split the MPI communicator and run each solver on a different subcommunicator, but this would involve more communication.
>>>>> Has anyone ever tried to run two solvers with hyperthreading?
>>>>> Thanks
>>>>>
>>>>> On Sun, 2 Aug 2020 at 14:09, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>
>>>>>> I suspect that the Poisson and Ampere's law solves are not coupled. You might be able to duplicate the communicator and use two threads. You would want to configure PETSc with threadsafety and threads, and I think it could/should work, but this mode is never used by anyone.
>>>>>>
>>>>>> That said, I would not recommend doing this unless you feel like playing in computer science, as opposed to doing application science. In the best-case scenario you get a speedup of 2x. That is a strict upper bound, but you will never come close to it. Your hardware has some balance of CPU to GPU processing rate. Your application has a balance of volume of work for your two solves. They have to be the same to get close to a 2x speedup, and those ratios have to be 1:1. To be concrete, from what little I can guess about your application, let's assume that the cost of each of these two solves is about the same (e.g., Laplacians on your domain, and the best-case scenario). But GPU machines these days are configured with roughly 1-10% of their capacity in the CPUs, which gives you an upper bound of about a 10% speedup. That is noise. Upshot: unless you configure your hardware to match this problem, and the two solves have the same cost, you will not see close to a 2x speedup. Your time is better spent elsewhere.
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <j...@jedbrown.org> wrote:
>>>>>>
>>>>>>> You can use MPI and split the communicator so n-1 ranks create a DMDA for one part of your system and the other rank drives the GPU in the other part. They can all be part of the same coupled system on the full communicator, but PETSc doesn't currently support some ranks having their Vec arrays on the GPU and others on the host, so you'd be paying host-device transfer costs on each iteration (and that might swamp any performance benefit you would have gotten).
>>>>>>>
>>>>>>> In any case, be sure to think about the execution time of each part. Load balancing with matching time-to-solution for each part can be really hard.
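For what it's worth, a bare-bones sketch of the split Jed describes, where the last rank drives the GPU part and the remaining ranks stay on the host; the placeholder comments stand in for the application's own DMDA/KSP setup:

#include <petsc.h>

int main(int argc, char **argv)
{
  MPI_Comm       subcomm;
  PetscMPIInt    rank, size, color;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size);CHKERRQ(ierr);

  /* ranks 0..size-2 form the host group; the last rank drives the GPU */
  color = (rank == size - 1) ? 1 : 0;
  ierr  = MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &subcomm);CHKERRQ(ierr);

  if (color == 0) {
    /* create the DMDA/KSP for the host solve on subcomm */
  } else {
    /* create the DMDA/KSP for the GPU solve on subcomm (aijcusparse/cuda types) */
  }
  /* data exchanged between the two parts still travels over PETSC_COMM_WORLD,
     which is where Jed's host-device transfer caveat applies */

  ierr = MPI_Comm_free(&subcomm);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}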
>>>>>>> Barry Smith <bsm...@petsc.dev> writes:
>>>>>>>
>>>>>>> > Nicola,
>>>>>>> >
>>>>>>> >   This is not really viable or practical at this time with PETSc. It is not impossible, but it requires careful coding with threads; another possibility is to use one half of the virtual GPUs for each solve, which is also not trivial. I would recommend first seeing what kind of performance you can get on the GPU for each type of solve, and revisiting this idea in the future.
>>>>>>> >
>>>>>>> >   Barry
>>>>>>> >
>>>>>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> Hello, I would like to know if it is possible to overlap CPU and GPU with DMDA.
>>>>>>> >> I have a machine where each node has 1 P100 + 1 Haswell.
>>>>>>> >> I have to solve the Poisson and Ampere equations at each time step.
>>>>>>> >> I'm using a 2D DMDA for each of them. Would it be possible to compute the Poisson and Ampere equations at the same time? One on the CPU and the other on the GPU?
>>>>>>> >>
>>>>>>> >> Thanks

>>>>>> <out_gpu><out_nogpu>
>> <out_miniapp_f_poisson>
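As a closing pointer on Mark's duplicate-the-communicator idea earlier in the thread: the thread-safe build he mentions corresponds to PETSc's --with-threadsafety configure option. One plausible, untested combination, added on top of the configure options already used above, would be:

  ./configure --with-threadsafety --with-openmp <existing options>

Whether the extra complexity pays off is exactly the cost/benefit question Mark and Barry raise above.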