Thanks for your reply, Stefano. I know that HYPRE is not ported to the GPU, but the solver is running on the GPU: it takes ~9s and shows 100% GPU utilization.
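One way to see where those 9 seconds and that GPU activity actually go is to give each solve its own logging stage. Below is a minimal sketch, assuming the application already has two KSPs; the names kspPoisson/kspAmpere and the helper itself are hypothetical, not taken from Nicola's code.

#include <petscksp.h>

PetscErrorCode SolveBothWithStages(KSP kspPoisson, Vec bPoisson, Vec xPoisson,
                                   KSP kspAmpere,  Vec bAmpere,  Vec xAmpere)
{
  PetscErrorCode ierr;
  PetscLogStage  stagePoisson, stageAmpere;

  PetscFunctionBeginUser;
  /* In a real code, register the stages once during setup, not every time step. */
  ierr = PetscLogStageRegister("PoissonSolve", &stagePoisson);CHKERRQ(ierr);
  ierr = PetscLogStageRegister("AmpereSolve",  &stageAmpere);CHKERRQ(ierr);

  /* Everything logged between Push and Pop is attributed to that stage. */
  ierr = PetscLogStagePush(stagePoisson);CHKERRQ(ierr);
  ierr = KSPSolve(kspPoisson, bPoisson, xPoisson);CHKERRQ(ierr);
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = PetscLogStagePush(stageAmpere);CHKERRQ(ierr);
  ierr = KSPSolve(kspAmpere, bAmpere, xAmpere);CHKERRQ(ierr);
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Running with -log_view then reports KSPSolve, PCApply, MatMult, etc. separately for each stage, which makes it easier to see how much of the time is spent in the hypre PCApply on the host for each of the two solves.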
On Tue, Aug 4, 2020 at 12:35 PM Stefano Zampini <stefano.zamp...@gmail.com> wrote:

> Nicola,
>
> You are actually not using the GPU properly, since you use HYPRE preconditioning, which is CPU only. One of your solvers is actually slower on “GPU”.
> For AMG fully on the GPU, you can use PCGAMG with Chebyshev smoothers and Jacobi preconditioning. Mark can help you out with the specific command-line options. When it works properly, everything related to the PC application is offloaded to the GPU, and you should expect the well-known and much-advertised 10x (maybe more) speedup one expects from GPUs during KSPSolve.
>
> Doing what you want to do is one of the last optimization steps of an already optimized code before entering production. Yours is not even optimized for proper GPU usage yet.
> Also, any specific reason why you are using dgmres and fgmres?
>
> PETSc has not been designed with multi-threading in mind. You can achieve “overlap” of the two solves by splitting the communicator, but then you need communications to let the two solutions talk to each other.
>
> Thanks
> Stefano
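For readers following along, here is a minimal, hypothetical sketch of the setup Stefano describes (it is not code from this thread and assumes a CUDA-enabled PETSc build): PCGAMG with Chebyshev smoothers and point Jacobi, with the operator stored as MATAIJCUSPARSE so the preconditioner application stays on the device. In a DMDA-based code like Nicola's, the analogous runtime options would be -dm_mat_type aijcusparse -dm_vec_type cuda together with the -pc_type gamg and -mg_levels_* options shown in the header comment.

/* gamg_gpu_sketch.c -- hypothetical minimal example of the setup Stefano
 * suggests: PCGAMG with Chebyshev/Jacobi smoothing and the operator stored
 * on the GPU.  A 1D Laplacian stands in for the real Poisson operator; the
 * point is the solver configuration, not the discretization.
 *
 * Run with, e.g.:
 *   mpiexec -n 2 ./gamg_gpu_sketch -ksp_monitor -ksp_view -log_view \
 *     -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi
 */
#include <petscksp.h>

int main(int argc, char **argv)
{
  PetscErrorCode ierr;
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PC             pc;
  PetscInt       i, rstart, rend, n = 100000;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetType(A, MATAIJCUSPARSE);CHKERRQ(ierr); /* matrix (and MatMult) on the GPU */
  ierr = MatSetUp(A);CHKERRQ(ierr);

  /* assemble a tridiagonal (-1, 2, -1) stand-in operator */
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    if (i > 0)     {ierr = MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
    if (i < n - 1) {ierr = MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr); /* vectors inherit the CUDA type */
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCGAMG);CHKERRQ(ierr);    /* PETSc's native AMG */
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);   /* allow command-line customization, e.g. -mg_levels_* */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}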
> On Aug 4, 2020, at 12:04 PM, nicola varini <nicola.var...@gmail.com> wrote:
>
> Dear all, thanks for your replies. The reason I asked whether it is possible to overlap Poisson and Ampere is that they take roughly the same amount of time. Please find attached the profiling logs for CPU only and GPU only.
> Of course it is possible to split the MPI communicator and run each solver on a different subcommunicator, but this would involve more communication. Has anyone ever tried to run two solvers with hyperthreading?
> Thanks
>
> On Sun, Aug 2, 2020 at 2:09 PM Mark Adams <mfad...@lbl.gov> wrote:
>
>> I suspect that the Poisson and Ampere's law solves are not coupled. You might be able to duplicate the communicator and use two threads. You would want to configure PETSc with threadsafety and threads, and I think it could/should work, but this mode is never used by anyone.
>>
>> That said, I would not recommend doing this unless you feel like playing in computer science, as opposed to doing application science. In the best-case scenario you get a speedup of 2x. That is a strict upper bound, but you will never come close to it. Your hardware has some balance of CPU to GPU processing rate, and your application has some balance of work between your two solves. Those two ratios have to match, and be 1:1, to get close to a 2x speedup. To be concrete, from what little I can guess about your application, let's assume the cost of each of the two solves is about the same (e.g., Laplacians on your domain, the best-case scenario). But GPU machines these days are configured with roughly 1-10% of their capacity in the CPUs, which gives you an upper bound of about a 10% speedup. That is noise. Upshot: unless you configure your hardware to match this problem, and the two solves have the same cost, you will not see close to a 2x speedup. Your time is better spent elsewhere.
>>
>> Mark
>>
>> On Sat, Aug 1, 2020 at 3:24 PM Jed Brown <j...@jedbrown.org> wrote:
>>
>>> You can use MPI and split the communicator so n-1 ranks create a DMDA for one part of your system and the other rank drives the GPU in the other part. They can all be part of the same coupled system on the full communicator, but PETSc doesn't currently support some ranks having their Vec arrays on GPU and others on host, so you'd be paying host-device transfer costs on each iteration (and that might swamp any performance benefit you would have gotten).
>>>
>>> In any case, be sure to think about the execution time of each part. Load balancing with matching time-to-solution for each part can be really hard.
>>>
>>> Barry Smith <bsm...@petsc.dev> writes:
>>>
>>> > Nicola,
>>> >
>>> > This is not really viable or practical at this time with PETSc. It is not impossible, but it requires careful coding with threads; another possibility is to use one half of the virtual GPUs for each solve, which is also not trivial. I would recommend first seeing what kind of performance you can get on the GPU for each type of solve and revisiting this idea in the future.
>>> >
>>> > Barry
>>> >
>>> >> On Jul 31, 2020, at 9:23 AM, nicola varini <nicola.var...@gmail.com> wrote:
>>> >>
>>> >> Hello, I would like to know if it is possible to overlap CPU and GPU with DMDA.
>>> >> I have a machine where each node has one P100 and one Haswell.
>>> >> I have to solve the Poisson and Ampere equations at each time step.
>>> >> I'm using a 2D DMDA for each of them. Would it be possible to compute the Poisson and Ampere equations at the same time, one on the CPU and the other on the GPU?
>>> >>
>>> >> Thanks
>>> >> <out_gpu> <out_nogpu>
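To make the communicator splitting discussed above concrete, here is a minimal, hypothetical sketch (not from the thread) of the approach Jed and Mark describe: MPI_Comm_split puts one rank in charge of the GPU solve and the remaining ranks in charge of the CPU solve, so the two KSPSolves can proceed concurrently.

/* comm_split_sketch.c -- hypothetical sketch of the communicator split:
 * one rank drives the GPU solve, the remaining ranks share the CPU solve.
 * Any coupling between the two fields must afterwards be communicated
 * explicitly between the two groups (left here as a placeholder). */
#include <petscksp.h>

int main(int argc, char **argv)
{
  PetscErrorCode ierr;
  PetscMPIInt    rank;
  MPI_Comm       subcomm;
  int            color;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);

  /* color 0: the single rank that owns the GPU; color 1: all other ranks.
     On a multi-node run one would instead pick one rank per node. */
  color = (rank == 0) ? 0 : 1;
  ierr = MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &subcomm);CHKERRQ(ierr);

  if (color == 0) {
    /* create the Ampere DMDA, Mat, and KSP on 'subcomm' with CUDA
       Vec/Mat types, then call KSPSolve() here */
  } else {
    /* create the Poisson DMDA, Mat, and KSP on 'subcomm' with host
       types, then call KSPSolve() here */
  }

  /* ...exchange whatever coupled data the two solves need, e.g. with
     MPI point-to-point calls on PETSC_COMM_WORLD... */

  ierr = MPI_Comm_free(&subcomm);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

As Mark and Barry point out, whether this is worth the added communication depends entirely on whether the two solves take comparable time on their respective resources.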