Re: [petsc-users] overlap cpu and gpu?

Barry Smith Tue, 04 Aug 2020 08:51:32 -0700

------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

KSPSolve              48 1.0 9.2492e+00 1.1 8.22e+08 1.2 5.9e+05 3.6e+03 1.3e+03 17 99 88 97 78  17 99 88 97 79 51484   1792674    446 1.73e+02  957 3.72e+02 100
KSPGMRESOrthog       306 1.1 2.2495e-01 1.5 3.86e+08 1.2 0.0e+00 0.0e+00 3.0e+02  0 46  0  0 18   0 46  0  0 18 973865   2562528     94 3.67e+01    0 0.00e+00 100
PCApply              354 1.1 5.7478e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 11  0  0  0  0  11  0  0  0  0     0       0      0 0.00e+00  675 2.62e+02  0

GPU %F is the percentage of the flops that ran on the GPU, not the percentage 
of the time. Since hypre does not count flops, the only flops counted are those 
in PETSc; the hypre flops, which take place on the CPU, are not counted.

Note that PCApply (this is hypre) takes 11 percent of the time, and KSPSolve 
(which is hypre plus GMRES) takes 17 percent of the time. So 11/17 of the solve 
time is not on the GPU. Note also the huge number of copies to and from the GPU 
above; this is because data has to be moved to the CPU for hypre and then back 
to the GPU for PETSc.
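
As a rough check of the 11/17 estimate, here is a minimal sketch that redoes 
the arithmetic with the numbers from the log excerpt above. The variable names 
are made up for illustration. The second ratio uses the Max times, which are 
maxima over ranks, so it is only approximate.

```python
# Values copied from the Main Stage rows of the -log_view excerpt.
ksp_solve_pct = 17          # %T for KSPSolve (hypre plus GMRES)
pc_apply_pct = 11           # %T for PCApply (hypre, on the CPU)
ksp_solve_sec = 9.2492e+00  # Max time for KSPSolve
pc_apply_sec = 5.7478e+00   # Max time for PCApply

# Share of the solve that is spent in the CPU-only preconditioner.
frac_from_pct = pc_apply_pct / ksp_solve_pct
frac_from_sec = pc_apply_sec / ksp_solve_sec

print(f"CPU share of KSPSolve from %T columns: {frac_from_pct:.2f}")  # 0.65
print(f"CPU share of KSPSolve from Max times:  {frac_from_sec:.2f}")  # 0.62
```

Either way, roughly two thirds of the solve is spent outside the GPU.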

   Barry


> On Aug 4, 2020, at 5:46 AM, nicola varini <nicola.var...@gmail.com> wrote:
> 
> Thanks for your reply, Stefano. I know that HYPRE is not ported to the GPU, 
> but the solver is running on the GPU, takes ~9 s, and shows 100% GPU 
> utilization. 
> 
> On Tue, 4 Aug 2020 at 12:35, Stefano Zampini 
> <stefano.zamp...@gmail.com <mailto:stefano.zamp...@gmail.com>> wrote:
> Nicola,
> 
> You are actually not using the GPU properly, since you use HYPRE 
> preconditioning, which is CPU only. One of your solvers is actually slower on 
> “GPU”.
> For AMG running fully on the GPU, you can use PCGAMG with Chebyshev smoothers 
> and Jacobi preconditioning. Mark can help you out with the specific command 
> line options.
> When it works properly, everything related to PC application is offloaded to 
> the GPU, and you should see the well-known 10x (maybe more) speedup that one 
> expects from GPUs during KSPSolve.
> 
> Doing what you want to do is one of the last optimization steps of an already 
> optimized code before entering production. Yours is not even optimized for 
> proper GPU usage yet.
> Also, any specific reason why you are using dgmres and fgmres?
> 
> PETSc has not been designed with multi-threading in mind. You can achieve 
> “overlap” of the two solves by splitting the communicator. But then you need 
> communications to let the two solutions talk to each other.
> 
> Thanks
> Stefano
> 
> 
>> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.var...@gmail.com 
>> <mailto:nicola.var...@gmail.com>> wrote:
>> 
>> Dear all, thanks for your replies. The reason I asked whether it is 
>> possible to overlap Poisson and Ampere is that they take roughly
>> the same amount of time. Please find attached the profiling logs 
>> for CPU only and GPU only.
>> Of course it is possible to split the MPI communicator and run each solver 
>> on a different subcommunicator; however, this would involve more communication.
>> Has anyone ever tried to run 2 solvers with hyperthreading? 
>> Thanks
>> 
>> 
>> On Sun, 2 Aug 2020 at 14:09, Mark Adams <mfad...@lbl.gov 
>> <mailto:mfad...@lbl.gov>> wrote:
>> I suspect that the Poisson and Ampere's law solves are not coupled. You might 
>> be able to duplicate the communicator and use two threads. You would want to 
>> configure PETSc with threadsafety and threads and I think it could/should 
>> work, but this mode is never used by anyone.
>> 
>> That said, I would not recommend doing this unless you feel like playing in 
>> computer science, as opposed to doing application science. In the best case 
>> scenario you get a speedup of 2x. That is a strict upper bound, but you will 
>> never come close to it. Your hardware has some balance of CPU to GPU 
>> processing rate. Your application has a balance of volume of work for your 
>> two solves. These two ratios have to be the same to get close to 2x speedup, 
>> and both have to be 1:1. To be concrete, from what little I can guess about 
>> your application, let's assume that the cost of each of these two solves is 
>> about the same (e.g., Laplacians on your domain, and the best case scenario). 
>> But GPU machines are configured to have roughly 1-10% of capacity in the 
>> GPUs these days, and that gives you an upper bound of about 10% speedup. That 
>> is noise. Upshot: unless you configure your hardware to match this problem, 
>> and the two solves have the same cost, you will not see close to 2x speedup. 
>> Your time is better spent elsewhere.
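
The 2x bound can be sketched with a few lines of arithmetic. This is a minimal 
model, not a measurement: it assumes the two solves are fully independent and 
ignores communication and contention, and the function name is hypothetical. 
Reading the 1:10 case as the hardware balance described above is also an 
assumption.

```python
def overlap_speedup(t_a: float, t_b: float) -> float:
    """Ideal speedup from running two independent solves concurrently
    instead of back to back, with each solve on its own device."""
    return (t_a + t_b) / max(t_a, t_b)

# Equal solve times: the best case, exactly 2x.
print(overlap_speedup(1.0, 1.0))   # 2.0

# One solve takes 10x as long as the other: about 10% gain.
print(overlap_speedup(1.0, 10.0))  # 1.1
```

The speedup reaches 2x only when both solves take the same time on their 
respective devices, and it falls off quickly as the times diverge.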
>> 
>> Mark
>> 
>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <j...@jedbrown.org 
>> <mailto:j...@jedbrown.org>> wrote:
>> You can use MPI and split the communicator so n-1 ranks create a DMDA for 
>> one part of your system and the other rank drives the GPU in the other part. 
>>  They can all be part of the same coupled system on the full communicator, 
>> but PETSc doesn't currently support some ranks having their Vec arrays on 
>> GPU and others on host, so you'd be paying host-device transfer costs on 
>> each iteration (and that might swamp any performance benefit you would have 
>> gotten).
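
A minimal sketch of the rank assignment for such a split, in plain Python so 
it runs without MPI. The function name and the choice of the last rank as the 
GPU driver are hypothetical; the returned value is what one would pass as the 
color argument of MPI_Comm_split.

```python
def solver_color(rank: int, size: int) -> int:
    """Return 1 for the single rank that drives the GPU part and 0 for
    the n-1 ranks that share the other part. Picking the last rank is
    an arbitrary convention chosen for this illustration."""
    if size < 2:
        raise ValueError("need at least two ranks to split the work")
    return 1 if rank == size - 1 else 0

# With mpi4py (assumed installed) the color would be used roughly as:
#   from mpi4py import MPI
#   comm = MPI.COMM_WORLD
#   rank, size = comm.Get_rank(), comm.Get_size()
#   sub = comm.Split(color=solver_color(rank, size), key=rank)
colors = [solver_color(r, 4) for r in range(4)]
print(colors)  # [0, 0, 0, 1]
```

Each subcommunicator would then create its own solver objects, and the two 
parts would still need to exchange data over the full communicator.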
>> 
>> In any case, be sure to think about the execution time of each part.  Load 
>> balancing with matching time-to-solution for each part can be really hard.
>> 
>> 
>> Barry Smith <bsm...@petsc.dev <mailto:bsm...@petsc.dev>> writes:
>> 
>> >   Nicola,
>> >
>> >     This is not really viable or practical at this time with PETSc. It is 
>> > not impossible but requires careful coding with threads. Another 
>> > possibility is to use one half of the virtual GPUs for each solve; this is 
>> > also not trivial. I would recommend first seeing what kind of performance 
>> > you can get on the GPU for each type of solve and revisit this idea in the 
>> > future.
>> >
>> >    Barry
>> >
>> >
>> >
>> >
>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <nicola.var...@gmail.com 
>> >> <mailto:nicola.var...@gmail.com>> wrote:
>> >> 
>> >> Hello, I would like to know if it is possible to overlap CPU and GPU with 
>> >> DMDA.
>> >> I have a machine where each node has 1 P100 + 1 Haswell.
>> >> I have to solve the Poisson and Ampere equations at each time step.
>> >> I'm using a 2D DMDA for each of them. Would it be possible to compute the 
>> >> Poisson and Ampere equations at the same time, one on the CPU and the 
>> >> other on the GPU?
>> >> 
>> >> Thanks
>> <out_gpu><out_nogpu>
>
