Hi Mark, this is the output of grep GAMG after I used -info:
=======
[0] PCSetUp_GAMG(): level 0) N=582736, n data rows=1, n data cols=1, nnz/row (ave)=9, np=12
[0] PCGAMGFilterGraph(): 97.9676% nnz after filtering, with threshold 0., 8.95768 nnz ave. (N=582736)
[0] PCGAMGCoarsen_AGG(): Square Graph on level 1 of 1 to square
[0] PCGAMGProlongator_AGG(): New grid 38934 nodes
[0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=2.101683e+00 min=4.341777e-03 PC=jacobi
[0] PCGAMGOptProlongator_AGG(): Smooth P0: level 0, cache spectra 0.00434178 2.10168
[0] PCSetUp_GAMG(): 1) N=38934, n data cols=1, nnz/row (ave)=18, 12 active pes
[0] PCGAMGFilterGraph(): 97.024% nnz after filtering, with threshold 0., 17.9774 nnz ave. (N=38934)
[0] PCGAMGProlongator_AGG(): New grid 4459 nodes
[0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=4.521607e+01 min=5.854294e-01 PC=jacobi
[0] PCGAMGOptProlongator_AGG(): Smooth P0: level 1, cache spectra 0.585429 45.2161
[0] PCSetUp_GAMG(): 2) N=4459, n data cols=1, nnz/row (ave)=29, 12 active pes
[0] PCGAMGFilterGraph(): 99.6422% nnz after filtering, with threshold 0., 27.5481 nnz ave. (N=4459)
[0] PCGAMGProlongator_AGG(): New grid 345 nodes
[0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=1.394069e+01 min=1.086973e-01 PC=jacobi
[0] PCGAMGOptProlongator_AGG(): Smooth P0: level 2, cache spectra 0.108697 13.9407
[0] PCGAMGCreateLevel_GAMG(): Number of equations (loc) 29 with simple aggregation
[0] PCSetUp_GAMG(): 3) N=345, n data cols=1, nnz/row (ave)=31, 6 active pes
[0] PCGAMGFilterGraph(): 99.6292% nnz after filtering, with threshold 0., 26.9667 nnz ave. (N=345)
[0] PCGAMGProlongator_AGG(): New grid 26 nodes
[0] PCGAMGOptProlongator_AGG(): Smooth P0: max eigen=1.463593e+02 min=1.469384e-01 PC=jacobi
[0] PCGAMGOptProlongator_AGG(): Smooth P0: level 3, cache spectra 0.146938 146.359
[0] PCGAMGCreateLevel_GAMG(): Number of equations (loc) 5 with simple aggregation
[0] PCSetUp_GAMG(): 4) N=26, n data cols=1, nnz/row (ave)=16, 1 active pes
[0] PCSetUp_GAMG(): 5 levels, grid complexity = 1.16304
PCGAMGGraph_AGG        4 1.0 8.4114e-02 1.0 1.02e+06 1.0 3.8e+02 1.3e+03 4.0e+01  0 0 0 0 0   0 0 0 0 0   145
PCGAMGCoarse_AGG       4 1.0 3.2107e-01 1.0 9.43e+06 1.0 7.3e+02 1.1e+04 3.5e+01  0 0 0 0 0   0 0 0 0 0   351
PCGAMGProl_AGG         4 1.0 2.8825e-02 1.0 0.00e+00 0.0 3.5e+02 2.8e+03 6.4e+01  0 0 0 0 0   0 0 0 0 0     0
PCGAMGPOpt_AGG         4 1.0 1.1570e-01 1.0 2.61e+07 1.0 1.2e+03 2.6e+03 1.6e+02  0 0 0 0 1   0 0 0 0 1  2692
GAMG: createProl       4 1.0 5.5680e-01 1.0 3.64e+07 1.0 2.7e+03 4.6e+03 3.0e+02  0 0 0 0 1   0 0 0 0 1   784
GAMG: partLevel        4 1.0 1.1628e-01 1.0 5.90e+06 1.0 1.1e+03 3.0e+03 1.6e+02  0 0 0 0 1   0 0 0 0 1   604
======
Nicola
On Mon, 17 Aug 2020 at 15:40, Mark Adams <mfad...@lbl.gov> wrote:

> On Mon, Aug 17, 2020 at 9:24 AM nicola varini <nicola.var...@gmail.com> wrote:
>
>> Hi Mark, I can confirm that hypre with boomeramg is working fine and is
>> pretty fast.
>
> Good. You can send me the -info (grep GAMG) output and I'll try to see what
> is going on.
>
>> However, none of the GAMG options works.
>> Did anyone ever succeed in using hypre with petsc on gpu?
>
> We have gotten Hypre to run on GPUs but it has been fragile. The
> performance has been marginal (due to the use of USM, apparently), but it
> is being worked on by the hypre team.
>
> The cuda tools are changing fast and I am guessing this is a different
> version than what we have tested, perhaps. Maybe someone else can help
> with this, but I know we use cuda 10.2 and you are using cuda tools 10.1.
>
> And you do want to use the most up-to-date PETSc.
>
>> I did manage to compile hypre on gpu but I do get the following error:
>> =======
>>   CC gpuhypre/obj/vec/vec/impls/hypre/vhyp.o
>> In file included from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:22,
>>                  from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
>>                  from /users/nvarini/hypre/include/_hypre_utilities.h:1129,
>>                  from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
>>                  from /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
>>                  from /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:83:1: error: unknown type name 'namespace'
>>  namespace thrust
>>  ^~~~~~~~~
>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/version.h:84:1: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token
>>  {
>>  ^
>> In file included from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/config.h:28,
>>                  from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config.h:23,
>>                  from /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/execution_policy.h:23,
>>                  from /users/nvarini/hypre/include/_hypre_utilities.h:1129,
>>                  from /users/nvarini/hypre/include/_hypre_IJ_mv.h:14,
>>                  from /scratch/snx3000/nvarini/petsc-3.13.3/include/../src/vec/vec/impls/hypre/vhyp.h:6,
>>                  from /scratch/snx3000/nvarini/petsc-3.13.3/src/vec/vec/impls/hypre/vhyp.c:7:
>> /opt/nvidia/cudatoolkit10/10.1.105_3.27-7.0.1.1_4.1__ga311ce7/include/thrust/detail/config/cpp_compatibility.h:21:10: fatal error: cstddef: No such file or directory
>>  #include <cstddef>
>>           ^~~~~~~~~
>> compilation terminated.
>> =======
>> Nicola
>>
>> On Fri, 14 Aug 2020 at 20:13, Mark Adams <mfad...@lbl.gov> wrote:
>>
>>> You can try Hypre. If that fails then there is a problem with your
>>> system.
>>>
>>> And you can run with -info and grep on GAMG and send the output, and I
>>> can see if I see anything funny.
>>>
>>> If this is just a Laplacian with a stable discretization and not crazy
>>> material parameters, then stretched grids are about the only thing that
>>> can hurt the solver.
>>>
>>> Do both of your solves fail in a similar way?
>>>
>>> On the CPU you can try this with large subdomains, preferably (ideally
>>> in serial):
>>> -ampere_mg_levels_ksp_type richardson
>>> -ampere_mg_levels_pc_type sor
>>>
>>> And check that there are no unused options with -options_left. GAMG can
>>> fail with bad eigen estimates, but these parameters look fine.
>>>
>>> On Fri, Aug 14, 2020 at 5:01 AM nicola varini <nicola.var...@gmail.com> wrote:
>>>
>>>> Dear Barry, yes, it gives the same problems.
>>>>
>>>> On Thu, 13 Aug 2020 at 23:22, Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>>>   Does the same thing work (with GAMG) if you run the same problem
>>>>> on the same machine with the same number of MPI ranks, but make a new
>>>>> PETSC_ARCH that does NOT use the GPUs?
>>>>>
>>>>>   Barry
>>>>>
>>>>>   Ideally one gets almost identical convergence with CPUs or GPUs
>>>>> (same problem, same machine), but a bug or numerical change "might"
>>>>> affect this.
>>>>>
>>>>> On Aug 13, 2020, at 10:28 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>
>>>>> Dear Barry, you are right. The Cray argument checking is incorrect. It
>>>>> does work with --download-fblaslapack.
>>>>> However, it does fail to converge. Is there anything obviously wrong
>>>>> with my petscrc?
>>>>> Anything else am I missing?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Thu, 13 Aug 2020 at 03:17, Barry Smith <bsm...@petsc.dev> wrote:
>>>>>
>>>>>>   The QR is always done on the CPU; we don't have generic calls to
>>>>>> blas/lapack go to the GPU currently.
>>>>>>
>>>>>>   The error message is:
>>>>>>
>>>>>>   On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>>>>>>
>>>>>>   Argument 7 is &LWORK, which is defined by
>>>>>>
>>>>>>   PetscBLASInt LWORK=N*bs;
>>>>>>
>>>>>>   and
>>>>>>
>>>>>>   N=nSAvec is the column block size of the new P.
>>>>>>
>>>>>>   Presumably this is a huge run with many processes, so using the
>>>>>> debugger is not practical?
>>>>>>
>>>>>>   We need to see what these variables are:
>>>>>>
>>>>>>   N, bs, nSAvec
>>>>>>
>>>>>>   Perhaps nSAvec is zero, which could easily upset LAPACK.
>>>>>>
>>>>>>   The crudest thing would be to just put a print statement in the code
>>>>>> before the LAPACK call, or, if it is called many times, add an error
>>>>>> check that generates an error if any of these three values are 0 (or
>>>>>> negative).
>>>>>>
>>>>>>   Barry
>>>>>>
>>>>>>   It is not impossible that the Cray argument checking is incorrect
>>>>>> and the value passed in is fine. You can check this by using
>>>>>> --download-fblaslapack and seeing if the same or some other error comes up.
>>>>>>
>>>>>> On Aug 12, 2020, at 7:19 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>
>>>>>> Can you reproduce this on the CPU?
>>>>>> The QR factorization seems to be failing. That could be from bad data
>>>>>> or a bad GPU QR.
>>>>>>
>>>>>> On Wed, Aug 12, 2020 at 4:19 AM nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>
>>>>>>> Dear all, following the suggestions I did resubmit the simulation
>>>>>>> with the petscrc below.
>>>>>>> However I do get the following error:
>>>>>>> ========
>>>>>>> 7362 [592]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>> 7363 [339]PETSC ERROR: Petsc has generated inconsistent data
>>>>>>> 7364 [339]PETSC ERROR: xGEQRF error
>>>>>>> 7365 [339]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>>>>> 7366 [339]PETSC ERROR: Petsc Release Version 3.13.3, Jul 01, 2020
>>>>>>> 7367 [339]PETSC ERROR: /users/nvarini/gbs_test_nicola/bin/gbs_daint_gpu_gnu on a named nid05083 by nvarini Wed Aug 12 10:06:15 2020
>>>>>>> 7368 [339]PETSC ERROR: Configure options --with-cc=cc --with-fc=ftn --known-mpi-shared-libraries=1 --known-mpi-c-double-complex=1 --known-mpi-int64_t=1 --known-mpi-long-double=1 --with-batch=1 --known-64-bit-blas-indices=0 --LIBS=-lstdc++ --with-cxxlib-autodetect=0 --with-scalapack=1 --with-cxx=CC --with-debugging=0 --with-hypre-dir=/opt/cray/pe/tpsl/19.06.1/GNU/8.2/haswell --prefix=/scratch/snx3000/nvarini/petsc3.13.3-gpu --with-cuda=1 --with-cuda-c=nvcc --with-cxxlib-autodetect=0 --COPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include --with-cxx=CC --CXXOPTFLAGS=-I/opt/cray/pe/mpt/7.7.10/gni/mpich-intel/16.0/include
>>>>>>> 7369 [592]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>> 7370 [592]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>> 7371 [592]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>>>> 7372 [592]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>> 7373 [592]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>> 7374 [316]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>> 7375 [339]PETSC ERROR: #1 formProl0() line 748 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>> 7376 [339]PETSC ERROR: #2 PCGAMGProlongator_AGG() line 1063 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/agg.c
>>>>>>> 7377 [339]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>> 7378 [339]PETSC ERROR: #4 PCSetUp() line 898 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/interface/precon.c
>>>>>>> 7379 [339]PETSC ERROR: #5 KSPSetUp() line 376 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>> 7380 [592]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>> 7381 [339]PETSC ERROR: #6 KSPSolve_Private() line 633 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>> 7382 [339]PETSC ERROR: #7 KSPSolve() line 853 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/ksp/interface/itfunc.c
>>>>>>> 7383 On entry to __cray_mgm_dgeqrf, parameter 7 had an illegal value (info = -7)
>>>>>>> 7384 [160]PETSC ERROR: #3 PCSetUp_GAMG() line 548 in /scratch/snx3000/nvarini/petsc-3.13.3/src/ksp/pc/impls/gamg/gamg.c
>>>>>>> ========
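A minimal sketch of the size check Barry suggests earlier in the thread: a guard placed in formProl0() just before the LAPACK geqrf call, so a zero or negative dimension is reported by PETSc instead of surfacing as the "parameter 7 had an illegal value" message above. The names N, bs, and nSAvec are taken from the discussion; this is not the actual agg.c code.

#include <petscsys.h>

/* Sketch only (PETSc 3.13-era API): abort with a readable PETSc error if
   any of the sizes that feed LWORK = N*bs is zero or negative, instead of
   letting xGEQRF reject its workspace argument. */
static PetscErrorCode CheckProl0QRSizes(PetscInt N, PetscInt bs, PetscInt nSAvec)
{
  PetscFunctionBegin;
  if (N <= 0 || bs <= 0 || nSAvec <= 0) {
    SETERRQ3(PETSC_COMM_SELF, PETSC_ERR_PLIB,
             "Bad sizes before xGEQRF: N=%D bs=%D nSAvec=%D", N, bs, nSAvec);
  }
  PetscFunctionReturn(0);
}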
>>>>>>>
>>>>>>> I did try other pc_gamg_types but they fail as well.
>>>>>>>
>>>>>>> #PETSc Option Table entries:
>>>>>>> -ampere_dm_mat_type aijcusparse
>>>>>>> -ampere_dm_vec_type cuda
>>>>>>> -ampere_ksp_atol 1e-15
>>>>>>> -ampere_ksp_initial_guess_nonzero yes
>>>>>>> -ampere_ksp_reuse_preconditioner yes
>>>>>>> -ampere_ksp_rtol 1e-7
>>>>>>> -ampere_ksp_type dgmres
>>>>>>> -ampere_mg_levels_esteig_ksp_max_it 10
>>>>>>> -ampere_mg_levels_esteig_ksp_type cg
>>>>>>> -ampere_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>>>> -ampere_mg_levels_ksp_type chebyshev
>>>>>>> -ampere_mg_levels_pc_type jacobi
>>>>>>> -ampere_pc_gamg_agg_nsmooths 1
>>>>>>> -ampere_pc_gamg_coarse_eq_limit 10
>>>>>>> -ampere_pc_gamg_reuse_interpolation true
>>>>>>> -ampere_pc_gamg_square_graph 1
>>>>>>> -ampere_pc_gamg_threshold 0.05
>>>>>>> -ampere_pc_gamg_threshold_scale .0
>>>>>>> -ampere_pc_gamg_type agg
>>>>>>> -ampere_pc_type gamg
>>>>>>> -dm_mat_type aijcusparse
>>>>>>> -dm_vec_type cuda
>>>>>>> -log_view
>>>>>>> -poisson_dm_mat_type aijcusparse
>>>>>>> -poisson_dm_vec_type cuda
>>>>>>> -poisson_ksp_atol 1e-15
>>>>>>> -poisson_ksp_initial_guess_nonzero yes
>>>>>>> -poisson_ksp_reuse_preconditioner yes
>>>>>>> -poisson_ksp_rtol 1e-7
>>>>>>> -poisson_ksp_type dgmres
>>>>>>> -poisson_log_view
>>>>>>> -poisson_mg_levels_esteig_ksp_max_it 10
>>>>>>> -poisson_mg_levels_esteig_ksp_type cg
>>>>>>> -poisson_mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
>>>>>>> -poisson_mg_levels_ksp_max_it 1
>>>>>>> -poisson_mg_levels_ksp_type chebyshev
>>>>>>> -poisson_mg_levels_pc_type jacobi
>>>>>>> -poisson_pc_gamg_agg_nsmooths 1
>>>>>>> -poisson_pc_gamg_coarse_eq_limit 10
>>>>>>> -poisson_pc_gamg_reuse_interpolation true
>>>>>>> -poisson_pc_gamg_square_graph 1
>>>>>>> -poisson_pc_gamg_threshold 0.05
>>>>>>> -poisson_pc_gamg_threshold_scale .0
>>>>>>> -poisson_pc_gamg_type agg
>>>>>>> -poisson_pc_type gamg
>>>>>>> -use_mat_nearnullspace true
>>>>>>> #End of PETSc Option Table entries
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Nicola
>>>>>>>
>>>>>>> On Tue, 4 Aug 2020 at 17:57, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>
>>>>>>>> On Tue, Aug 4, 2020 at 6:35 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Nicola,
>>>>>>>>>
>>>>>>>>> You are actually not using the GPU properly, since you use HYPRE
>>>>>>>>> preconditioning, which is CPU only. One of your solvers is actually
>>>>>>>>> slower on "GPU".
>>>>>>>>> For a full AMG GPU setup, you can use PCGAMG, with Chebyshev smoothers
>>>>>>>>> and Jacobi preconditioning. Mark can help you out with the specific
>>>>>>>>> command line options.
>>>>>>>>> When it works properly, everything related to the PC application is
>>>>>>>>> offloaded to the GPU, and you should expect to get the well-known and
>>>>>>>>> branded 10x (maybe more) speedup one expects from GPUs during KSPSolve.
>>>>>>>>>
>>>>>>>> The speedup depends on the machine, but on SUMMIT, using enough
>>>>>>>> CPUs to saturate the memory bus vs all 6 GPUs, the speedup is a
>>>>>>>> function of problem subdomain size. I saw 10x at about 100K
>>>>>>>> equations/process.
>>>>>>>>
>>>>>>>>> Doing what you want to do is one of the last optimization steps of
>>>>>>>>> an already optimized code before entering production. Yours is not
>>>>>>>>> even optimized for proper GPU usage yet.
>>>>>>>>> Also, any specific reason why you are using dgmres and fgmres?
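A side note on the option table above: -use_mat_nearnullspace appears to be an application-level flag rather than a core PETSc option, and presumably it turns on something like the following. This is a minimal sketch assuming a scalar (single-field) operator A whose near kernel is the constant vector; for vector problems the rigid-body or per-field modes would be passed explicitly instead of the has_cnst shortcut.

#include <petscmat.h>

/* Sketch only: attach the constant vector as the near-nullspace so GAMG
   can use it when building the (smoothed-aggregation) prolongator. */
static PetscErrorCode AttachConstantNearNullSpace(Mat A)
{
  MatNullSpace   nullsp;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatNullSpaceCreate(PetscObjectComm((PetscObject)A), PETSC_TRUE, 0, NULL, &nullsp);CHKERRQ(ierr);
  ierr = MatSetNearNullSpace(A, nullsp);CHKERRQ(ierr);
  ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}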
>>>>>>>>>
>>>>>>>>> PETSc has not been designed with multi-threading in mind. You can
>>>>>>>>> achieve "overlap" of the two solves by splitting the communicator. But
>>>>>>>>> then you need communications to let the two solutions talk to each
>>>>>>>>> other.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Dear all, thanks for your replies. The reason why I've asked if it
>>>>>>>>> is possible to overlap poisson and ampere is because they roughly
>>>>>>>>> take the same amount of time. Please find attached the
>>>>>>>>> profiling logs for CPU only and GPU only.
>>>>>>>>> Of course it is possible to split the MPI communicator and run
>>>>>>>>> each solver on a different subcommunicator; however, this would
>>>>>>>>> involve more communication.
>>>>>>>>> Did anyone ever try to run 2 solvers with hyperthreading?
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Sun, 2 Aug 2020 at 14:09, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> I suspect that the Poisson and Ampere's law solves are not
>>>>>>>>>> coupled. You might be able to duplicate the communicator and use two
>>>>>>>>>> threads. You would want to configure PETSc with threadsafety and
>>>>>>>>>> threads, and I think it could/should work, but this mode is never
>>>>>>>>>> used by anyone.
>>>>>>>>>>
>>>>>>>>>> That said, I would not recommend doing this unless you feel like
>>>>>>>>>> playing in computer science, as opposed to doing application science.
>>>>>>>>>> In the best case scenario you get a speedup of 2x. That is a strict
>>>>>>>>>> upper bound, but you will never come close to it. Your hardware has
>>>>>>>>>> some balance of CPU to GPU processing rate. Your application has a
>>>>>>>>>> balance of volume of work for your two solves. These have to match to
>>>>>>>>>> get close to the 2x speedup, and that ratio has to be 1:1. To be
>>>>>>>>>> concrete, from what little I can guess about your application, let's
>>>>>>>>>> assume that the cost of each of these two solves is about the same
>>>>>>>>>> (e.g., Laplacians on your domain, and the best case scenario). But GPU
>>>>>>>>>> machines are configured to have roughly 1-10% of their capacity in the
>>>>>>>>>> CPUs these days; that gives you an upper bound of about 10% speedup.
>>>>>>>>>> That is noise. Upshot: unless you configure your hardware to match
>>>>>>>>>> this problem, and the two solves have the same cost, you will not see
>>>>>>>>>> close to 2x speedup. Your time is better spent elsewhere.
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <j...@jedbrown.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> You can use MPI and split the communicator so n-1 ranks create a
>>>>>>>>>>> DMDA for one part of your system and the other rank drives the GPU in
>>>>>>>>>>> the other part. They can all be part of the same coupled system on
>>>>>>>>>>> the full communicator, but PETSc doesn't currently support some ranks
>>>>>>>>>>> having their Vec arrays on GPU and others on host, so you'd be paying
>>>>>>>>>>> host-device transfer costs on each iteration (and that might swamp
>>>>>>>>>>> any performance benefit you would have gotten).
>>>>>>>>>>>
>>>>>>>>>>> In any case, be sure to think about the execution time of each
>>>>>>>>>>> part. Load balancing with matching time-to-solution for each part can
>>>>>>>>>>> be really hard.
>>>>>>>>>>>
>>>>>>>>>>> Barry Smith <bsm...@petsc.dev> writes:
>>>>>>>>>>>
>>>>>>>>>>> > Nicola,
>>>>>>>>>>> >
>>>>>>>>>>> >   This is not really viable or practical at this time with
>>>>>>>>>>> PETSc. It is not impossible, but it requires careful coding with
>>>>>>>>>>> threads; another possibility is to use one half of the virtual GPUs
>>>>>>>>>>> for each solve, which is also not trivial. I would recommend first
>>>>>>>>>>> seeing what kind of performance you can get on the GPU for each type
>>>>>>>>>>> of solve and revisiting this idea in the future.
>>>>>>>>>>> >
>>>>>>>>>>> >   Barry
>>>>>>>>>>> >
>>>>>>>>>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> Hello, I would like to know if it is possible to overlap CPU
>>>>>>>>>>> and GPU with DMDA.
>>>>>>>>>>> >> I have a machine where each node has 1 P100 + 1 Haswell.
>>>>>>>>>>> >> I have to solve the Poisson and Ampere equations at each time step.
>>>>>>>>>>> >> I'm using a 2D DMDA for each of them. Would it be possible to
>>>>>>>>>>> compute the Poisson and Ampere equations at the same time? One on the
>>>>>>>>>>> CPU and the other on the GPU?
>>>>>>>>>>> >>
>>>>>>>>>>> >> Thanks
>>>>>>>>>>
>>>>>>>>>> <out_gpu><out_nogpu>
>>>>>>
>>>>>> <out_miniapp_f_poisson>
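A minimal sketch of the communicator-splitting approach Stefano and Jed describe above: one group of ranks owns the Poisson solve and the other owns the Ampere solve, each with its own KSP on its own subcommunicator. The split choice, option prefixes, and solver setup here are illustrative assumptions, not the application's actual code; coupling the two solutions (and any host/device traffic) still has to be handled by the application, as noted in the thread.

#include <petscksp.h>

int main(int argc, char **argv)
{
  MPI_Comm       subcomm;
  KSP            ksp;
  Mat            A;
  PetscMPIInt    rank, size, color;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size);CHKERRQ(ierr);

  /* Illustrative split following Jed's n-1 / 1 example: the last rank
     drives the GPU solve, all the others share the CPU solve. */
  color = (rank == size - 1) ? 1 : 0;
  ierr  = MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &subcomm);CHKERRQ(ierr);

  /* Each group assembles its own operator on its subcommunicator; the
     assembly (and choice of aijcusparse vs aij) is omitted here. */
  ierr = MatCreate(subcomm, &A);CHKERRQ(ierr);
  /* ... set sizes and type, assemble A ... */

  /* Prefixes match the option table earlier in the thread, so the
     -poisson_* / -ampere_* options apply to the right group. */
  ierr = KSPCreate(subcomm, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOptionsPrefix(ksp, color ? "ampere_" : "poisson_");CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  /* ... create b and x on subcomm, then KSPSolve(ksp, b, x) ... */

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = MPI_Comm_free(&subcomm);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}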