Hi Mark, I can confirm that hypre with BoomerAMG is working fine and is pretty fast. However, none of the GAMG options work. Has anyone ever succeeded in using hypre with PETSc on GPU? I did manage to compile hypre on GPU, but I get the following error:

=======
CC gpuhypre/obj/vec/vec/impls/hypre/vhyp.o
In file included from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:22,
                 from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
                 from /users/nvarini/hypre/include/_hypre_utilities.h:1129,
                 from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
                 from /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
                 from /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
/opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:83:1: error: unknown type name 'namespace'
 namespace thrust
 ^~~~~~~~~
/opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:84:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
 {
 ^
In file included from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/config.h:28,
                 from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:23,
                 from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
                 from /users/nvarini/hypre/include/_hypre_utilities.h:1129,
                 from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
                 from /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
                 from /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
/opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/cpp_compatibility.h:21:10: fatal error: cstddef: No such file or directory
 #include <cstddef>
          ^~~~~~~~~
compilation terminated.
=======

Nicola
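The include chain in the log explains the failure: with a CUDA-enabled hypre, _hypre_utilities.h pulls in Thrust, which is a C++-only header library ("namespace", <cstddef>), while PETSc's vhyp.c is compiled as C. A minimal, hypothetical reproduction (file name and include path are illustrative):

/* repro.c -- hypothetical minimal reproduction: a C translation unit
 * including a C++-only Thrust header, as _hypre_utilities.h does when
 * hypre is built with CUDA.
 *
 *   cc repro.c -I/opt/nvidia/cudatoolkit10/<version>/include
 *
 * fails with the same "unknown type name 'namespace'" diagnostic as
 * above; compiling the identical file as C++ succeeds. */
#include <thrust/version.h>   /* declares `namespace thrust { ... }` */

int main(void) { return 0; }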
On Fri, Aug 14, 2020 at 8:13 PM Mark Adams <mfad...@lbl.gov> wrote:

> You can try Hypre. If that fails, then there is a problem with your system.
>
> And you can run with -info and grep on GAMG and send the output, and I can see if I see anything funny.
>
> If this is just a Laplacian with a stable discretization and no crazy material parameters, then stretched grids are about the only thing that can hurt the solver.
>
> Do both of your solves fail in a similar way?
>
> On the CPU you can try this with large subdomains, ideally in serial:
>
> -ampere_mg_levels_ksp_type richardson
> -ampere_mg_levels_pc_type sor
>
> And check that there are no unused options with -options_left. GAMG can fail with bad eigenvalue estimates, but these parameters look fine.
>
> On Fri, Aug 14, 2020 at 5:01 AM nicola varini <nicola.var...@gmail.com> wrote:
>
>> Dear Barry, yes it gives the same problems.
>>
>> On Thu, Aug 13, 2020 at 11:22 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>> Does the same thing work (with GAMG) if you run the same problem on the same machine with the same number of MPI ranks, but make a new PETSC_ARCH that does NOT use the GPUs?
>>>
>>> Barry
>>>
>>> Ideally one gets almost identical convergence with CPUs or GPUs (same problem, same machine), but a bug or a numerical change "might" affect this.
>>>
>>> On Aug 13, 2020, at 10:28 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>>
>>> Dear Barry, you are right: the Cray argument checking is incorrect. It does work with --download-fblaslapack.
>>> However, it does fail to converge. Is there anything obviously wrong with my petscrc? Anything else I am missing?
>>>
>>> Thanks
>>>
>>> On Thu, Aug 13, 2020 at 3:17 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>> The QR is always done on the CPU; we don't currently have generic calls to BLAS/LAPACK go to the GPU.
>>>>
>>>> The error message is:
>>>>
>>>> On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>>>>
>>>> Argument 7 is &LWORK, which is defined by
>>>>
>>>> PetscBLASInt LWORK=N*bs;
>>>>
>>>> and N=nSAvec is the column block size of the new P.
>>>>
>>>> Presumably this is a huge run with many processes, so using the debugger is not practical?
>>>>
>>>> We need to see what these variables are: N, bs, nSAvec. Perhaps nSAvec is zero, which could easily upset LAPACK.
>>>>
>>>> The crudest thing would be to just put a print statement in the code before the LAPACK call, or, if it is called many times, add an error check that generates an error if any of these three values is 0 (or negative).
>>>>
>>>> Barry
>>>>
>>>> It is not impossible that the Cray argument checking is incorrect and the value passed in is fine. You can check this by using --download-fblaslapack and seeing whether the same or some other error comes up.
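For reference, the guard Barry suggests is a one-liner. A minimal sketch, assuming N, bs, and nSAvec are in scope in formProl0() (src/ksp/pc/impls/gamg/agg.c) right before the xGEQRF call, as described above:

/* Hypothetical guard before the LAPACKgeqrf_() call in formProl0():
 * fail loudly instead of handing LAPACK a zero or negative size. */
if (N <= 0 || bs <= 0 || nSAvec <= 0) {
  SETERRQ3(PETSC_COMM_SELF, PETSC_ERR_PLIB,
           "Bad QR sizes: N=%D bs=%D nSAvec=%D",
           (PetscInt)N, (PetscInt)bs, (PetscInt)nSAvec);
}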
>>>> On Aug 12, 2020, at 7:19 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>
>>>> Can you reproduce this on the CPU? The QR factorization seems to be failing. That could be from bad data or a bad GPU QR.
>>>>
>>>> On Wed, Aug 12, 2020 at 4:19 AM nicola varini <nicola.var...@gmail.com> wrote:
>>>>
>>>>> Dear all, following the suggestions I resubmitted the simulation with the petscrc below. However, I do get the following error:
>>>>>
>>>>> ========
>>>>> 7362 [592]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>> 7363 [339]PETSC ERROR: Petsc has generated inconsistent data
>>>>> 7364 [339]PETSC ERROR: xGEQRF error
>>>>> 7365 [339]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>>> 7366 [339]PETSC ERROR: Petsc Release Version 3.13.3, Jul 01, 2020
>>>>> 7367 [339]PETSC ERROR: /users/nvarini/gbs_test_nicola/bin/gbs_daint_gpu_gnu on a named nid05083 by nvarini Wed Aug 12 10:06:15 2020
>>>>> 7368 [339]PETSC ERROR: Configure options --with-cc=cc --with-fc=ftn --known-mpi-shared-libraries=1 --known-mpi-c-double-complex=1 --known-mpi-int64_t=1 --known-mpi-long-double=1 --with-batch=1 --known-64-bit-blas-indices=0 --LIBS=-lstdc++ --with-cxxlib-autodetect=0 --with-scalapack=1 --with-cxx=CC --with-debugging=0 --with-hypre-dir=/opt/cray/pe/tpsl/19.06.1/GNU/8.2/haswell --prefix=/scratch/snx3000/nvarini/petsc3.13.3-gpu --with-cuda=1 --with-cuda-c=nvcc --with-cxxlib-autodetect=0 --COPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include --with-cxx=CC --CXXOPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>>>>> 7369 [592]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>> 7370 [592]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>> 7371 [592]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>> 7372 [592]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7373 [592]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7374 [316]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>> 7375 [339]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>> 7376 [339]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>> 7377 [339]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>> 7378 [339]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>> 7379 [339]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7380 [592]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7381 [339]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7382 [339]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>> 7383 On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>>>>> 7384 [160]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>> ========
>>>>>
>>>>> I did try other pc_gamg_type values, but they fail as well.
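This trace is consistent with Barry's reading: the reference LAPACK signature is dgeqrf(M, N, A, LDA, TAU, WORK, LWORK, INFO), so parameter 7 is LWORK, which must be at least max(1,N) unless it is -1 (a workspace query). A hypothetical stand-alone program that trips the same check with LWORK = 0, which is what LWORK = N*bs would evaluate to if nSAvec were 0:

/* lworkcheck.c -- hypothetical repro; build with: cc lworkcheck.c -llapack */
#include <stdio.h>

/* Reference LAPACK signature: parameter 7 is LWORK. */
extern void dgeqrf_(int *m, int *n, double *a, int *lda, double *tau,
                    double *work, int *lwork, int *info);

int main(void)
{
  int    m = 4, n = 4, lda = 4, lwork = 0, info = 0;
  double a[16] = {0}, tau[4], work[1];

  dgeqrf_(&m, &n, a, &lda, tau, work, &lwork, &info);
  /* The reference XERBLA prints "parameter number 7 had an illegal
   * value" and stops; implementations that return instead set info = -7. */
  printf("info = %d\n", info);
  return 0;
}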
>>>>>
>>>>> #PETSc Option Table entries:
>>>>> -ampere_dm_mat_type aijcusparse
>>>>> -ampere_dm_vec_type cuda
>>>>> -ampere_ksp_atol 1e-15
>>>>> -ampere_ksp_initial_guess_nonzero yes
>>>>> -ampere_ksp_reuse_preconditioner yes
>>>>> -ampere_ksp_rtol 1e-7
>>>>> -ampere_ksp_type dgmres
>>>>> -ampere_mg_levels_esteig_ksp_max_it 10
>>>>> -ampere_mg_levels_esteig_ksp_type cg
>>>>> -ampere_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>> -ampere_mg_levels_ksp_type chebyshev
>>>>> -ampere_mg_levels_pc_type jacobi
>>>>> -ampere_pc_gamg_agg_nsmooths 1
>>>>> -ampere_pc_gamg_coarse_eq_limit 10
>>>>> -ampere_pc_gamg_reuse_interpolation true
>>>>> -ampere_pc_gamg_square_graph 1
>>>>> -ampere_pc_gamg_threshold 0.05
>>>>> -ampere_pc_gamg_threshold_scale .0
>>>>> -ampere_pc_gamg_type agg
>>>>> -ampere_pc_type gamg
>>>>> -dm_mat_type aijcusparse
>>>>> -dm_vec_type cuda
>>>>> -log_view
>>>>> -poisson_dm_mat_type aijcusparse
>>>>> -poisson_dm_vec_type cuda
>>>>> -poisson_ksp_atol 1e-15
>>>>> -poisson_ksp_initial_guess_nonzero yes
>>>>> -poisson_ksp_reuse_preconditioner yes
>>>>> -poisson_ksp_rtol 1e-7
>>>>> -poisson_ksp_type dgmres
>>>>> -poisson_log_view
>>>>> -poisson_mg_levels_esteig_ksp_max_it 10
>>>>> -poisson_mg_levels_esteig_ksp_type cg
>>>>> -poisson_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>> -poisson_mg_levels_ksp_max_it 1
>>>>> -poisson_mg_levels_ksp_type chebyshev
>>>>> -poisson_mg_levels_pc_type jacobi
>>>>> -poisson_pc_gamg_agg_nsmooths 1
>>>>> -poisson_pc_gamg_coarse_eq_limit 10
>>>>> -poisson_pc_gamg_reuse_interpolation true
>>>>> -poisson_pc_gamg_square_graph 1
>>>>> -poisson_pc_gamg_threshold 0.05
>>>>> -poisson_pc_gamg_threshold_scale .0
>>>>> -poisson_pc_gamg_type agg
>>>>> -poisson_pc_type gamg
>>>>> -use_mat_nearnullspace true
>>>>> #End of PETSc Option Table entries
>>>>>
>>>>> Regards,
>>>>>
>>>>> Nicola
>>>>>
>>>>> On Tue, Aug 4, 2020 at 5:57 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>
>>>>>> On Tue, Aug 4, 2020 at 6:35 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>
>>>>>>> Nicola,
>>>>>>>
>>>>>>> You are actually not using the GPU properly, since you use HYPRE preconditioning, which is CPU only. One of your solvers is actually slower on "GPU".
>>>>>>> For a full AMG GPU setup, you can use PCGAMG with Chebyshev smoothers and Jacobi preconditioning. Mark can help you out with the specific command line options. When it works properly, everything related to the PC application is offloaded to the GPU, and you should expect the well-known and branded 10x (maybe more) speedup one expects from GPUs during KSPSolve.
>>>>>>>
>>>>>> The speedup depends on the machine, but on SUMMIT, comparing enough CPUs to saturate the memory bus against all 6 GPUs, the speedup is a function of problem subdomain size. I saw 10x at about 100K equations per process.
>>>>>>
>>>>>>> Doing what you want to do is one of the last optimization steps of an already optimized code before entering production. Yours is not even optimized for proper GPU usage yet.
>>>>>>> Also, any specific reason why you are using dgmres and fgmres?
>>>>>>>
>>>>>>> PETSc has not been designed with multi-threading in mind. You can achieve "overlap" of the two solves by splitting the communicator, but then you need communication to let the two solutions talk to each other.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Stefano
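To make Stefano's recipe concrete, here is a minimal sketch of the GPU-resident pieces of the option table above, set from source with PetscOptionsSetValue() instead of a petscrc. The function name is illustrative, the options are unprefixed (add the -poisson_/-ampere_ prefixes used above as needed), and a CUDA-enabled PETSc build is assumed:

#include <petscsys.h>

/* Sketch: CUDA vectors/matrices plus a GAMG setup whose smoothers
 * (Chebyshev + point Jacobi) run entirely on the GPU. */
static PetscErrorCode SetGpuGamgOptions(void)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscOptionsSetValue(NULL, "-dm_mat_type", "aijcusparse");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-dm_vec_type", "cuda");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-pc_type", "gamg");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-pc_gamg_type", "agg");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-mg_levels_ksp_type", "chebyshev");CHKERRQ(ierr);
  ierr = PetscOptionsSetValue(NULL, "-mg_levels_pc_type", "jacobi");CHKERRQ(ierr);
  PetscFunctionReturn(0);
}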
>>>>>>>
>>>>>>> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>>
>>>>>>> Dear all, thanks for your replies. The reason I asked whether it is possible to overlap Poisson and Ampere is that they take roughly the same amount of time. Please find attached the profiling logs for CPU only and GPU only.
>>>>>>> Of course it is possible to split the MPI communicator and run each solver on a different subcommunicator; however, this would involve more communication.
>>>>>>> Has anyone ever tried to run two solvers with hyperthreading?
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Sun, Aug 2, 2020 at 2:09 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>
>>>>>>>> I suspect that the Poisson and Ampere's law solves are not coupled. You might be able to duplicate the communicator and use two threads. You would want to configure PETSc with thread safety and threads, and I think it could/should work, but this mode is never used by anyone.
>>>>>>>>
>>>>>>>> That said, I would not recommend doing this unless you feel like playing in computer science, as opposed to doing application science. In the best-case scenario you get a speedup of 2x. That is a strict upper bound, and you will never come close to it. Your hardware has some balance of CPU to GPU processing rate, and your application has some balance of the volume of work between your two solves. The two ratios have to match, and be 1:1, to get close to a 2x speedup. To be concrete, from what little I can guess about your application, let's assume that the cost of each of these two solves is about the same (e.g., Laplacians on your domain, the best-case scenario). But GPU machines these days are configured to have roughly 1-10% of their capacity in the CPUs, which gives you an upper bound of about 10% speedup. That is noise. Upshot: unless you configure your hardware to match this problem, and the two solves have the same cost, you will not see anything close to a 2x speedup. Your time is better spent elsewhere.
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <j...@jedbrown.org> wrote:
>>>>>>>>
>>>>>>>>> You can use MPI and split the communicator so that n-1 ranks create a DMDA for one part of your system and the other rank drives the GPU in the other part. They can all be part of the same coupled system on the full communicator, but PETSc doesn't currently support some ranks having their Vec arrays on the GPU and others on the host, so you'd be paying host-device transfer costs on each iteration (and that might swamp any performance benefit you would have gotten).
>>>>>>>>>
>>>>>>>>> In any case, be sure to think about the execution time of each part. Load balancing with matching time-to-solution for each part can be really hard.
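A minimal sketch of the split Jed describes, with the n-1/1 division and all names illustrative, and error checking elided:

#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscMPIInt rank, size;
  MPI_Comm    subcomm;
  int         color;

  PetscInitialize(&argc, &argv, NULL, NULL);
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
  MPI_Comm_size(PETSC_COMM_WORLD, &size);
  /* Ranks 0..size-2 solve one subsystem on the CPU; the last rank
   * drives the GPU for the other. */
  color = (rank == size - 1) ? 1 : 0;
  MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &subcomm);
  /* Build a DMDA and KSP on subcomm for the corresponding solve.
   * Per Jed's caveat, coupling the two parts still needs communication
   * on PETSC_COMM_WORLD, and mixed host/device Vec storage across ranks
   * is not supported, so expect host-device transfers each iteration. */
  MPI_Comm_free(&subcomm);
  PetscFinalize();
  return 0;
}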
>>>>>>>>>
>>>>>>>>> Barry Smith <bsm...@petsc.dev> writes:
>>>>>>>>>
>>>>>>>>> > Nicola,
>>>>>>>>> >
>>>>>>>>> > This is really not viable or practical at this time with PETSc. It is not impossible, but it requires careful coding with threads; another possibility is to use one half of the virtual GPUs for each solve, but this is also not trivial. I would recommend first seeing what kind of performance you can get on the GPU for each type of solve, and revisiting this idea in the future.
>>>>>>>>> >
>>>>>>>>> > Barry
>>>>>>>>> >
>>>>>>>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Hello, I would like to know whether it is possible to overlap CPU and GPU with DMDA.
>>>>>>>>> >> I have a machine where each node has 1 P100 + 1 Haswell.
>>>>>>>>> >> I have to solve the Poisson and Ampere equations at each time step.
>>>>>>>>> >> I'm using a 2D DMDA for each of them. Would it be possible to compute the Poisson and Ampere equations at the same time, one on the CPU and the other on the GPU?
>>>>>>>>> >>
>>>>>>>>> >> Thanks
>>>>>>>
>>>>>>> <out_gpu> <out_nogpu>
>>>
>>> <out_miniapp_f_poisson>
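For context on the original question, a hypothetical sketch of the setup it describes: one 2D DMDA per equation on the same communicator. Grid sizes, stencil, and names are placeholders:

#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM             dmPoisson, dmAmpere;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  /* One DMDA per equation; -dm_mat_type/-dm_vec_type (as in the option
   * table earlier in the thread) decide whether each lives on CPU or GPU. */
  ierr = DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR, 128, 128, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, &dmPoisson);CHKERRQ(ierr);
  ierr = DMSetFromOptions(dmPoisson);CHKERRQ(ierr);
  ierr = DMSetUp(dmPoisson);CHKERRQ(ierr);
  ierr = DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR, 128, 128, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, &dmAmpere);CHKERRQ(ierr);
  ierr = DMSetFromOptions(dmAmpere);CHKERRQ(ierr);
  ierr = DMSetUp(dmAmpere);CHKERRQ(ierr);
  /* The two solves then run back to back; overlapping them on CPU and
   * GPU is the open question of this thread. */
  ierr = DMDestroy(&dmAmpere);CHKERRQ(ierr);
  ierr = DMDestroy(&dmPoisson);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}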