You might take a look at https://publications.anl.gov/anlpubs/2020/04/159190.pdf; the introduction has a short discussion of some of the "gotchas" when using multi-core CPUs connected to multiple GPUs. It focuses on an IBM POWER/NVIDIA GPU system, but the same abstract issues will arise on any similar system. In short:

1) How many cores should share a GPU? Generally one, but there may be exceptions. (A minimal device-selection sketch follows this list.)

2) Should the special (pinned/page-locked) host memory that can be copied to/from GPUs faster be turned on?

3) Is it worth doing anything with the "extra" cores that are not accessing a GPU? Probably not, but there may be exceptions.

4) How does one best communicate between nodes with MPI: can one go directly from GPU to GPU and skip the CPU memory?
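To make (1) concrete, here is a minimal sketch of the usual rank-to-GPU assignment pattern; it is my own illustration, not code from PETSc or TPLS, and it assumes CUDA plus an MPI-3 implementation. The round-robin mapping is an assumption; on many clusters the scheduler already sets CUDA_VISIBLE_DEVICES per rank, in which case this is unnecessary.

    /* gpu_binding_sketch.c -- hypothetical example, not part of TPLS or PETSc.
       Map each MPI rank to a GPU using its node-local rank. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int      rank, local_rank, ndev;
      MPI_Comm node_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      /* Group the ranks that share a node, then number them 0..k-1 */
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &node_comm);
      MPI_Comm_rank(node_comm, &local_rank);

      cudaGetDeviceCount(&ndev);
      /* With 4 GPUs and 4 ranks per node each rank gets its own GPU;
         with more ranks per node they share the GPUs round-robin. */
      cudaSetDevice(local_rank % ndev);
      printf("rank %d (node-local %d) -> GPU %d of %d\n",
             rank, local_rank, local_rank % ndev, ndev);

      MPI_Comm_free(&node_comm);
      MPI_Finalize();
      return 0;
    }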
  Barry

> On Jun 9, 2020, at 7:51 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
> On Tue, Jun 9, 2020 at 7:11 PM GIBB Gordon <g.g...@epcc.ed.ac.uk> wrote:
>> Hi,
>>
>> First of all, my apologies if this is not the appropriate list to send these questions to.
>>
>> I'm one of the developers of TPLS (https://sourceforge.net/projects/tpls/), a Fortran code that uses PETSc, parallelised using DM vectors. It uses a mix of our own solvers and PETSc's Krylov solvers. At present it has been run on up to 25,000 MPI processes, although larger problem sizes should be able to scale beyond that.
>>
>> Aware that more and more HPC machines now have one or more GPUs per node, and that upcoming machines approaching or achieving exascale will be heterogeneous in nature, we are investigating whether it is worth using GPUs with TPLS and, if so, how best to do this.
>>
>> I see that in principle all we'd need to do is set some flags as described at https://www.mcs.anl.gov/petsc/features/gpus.html to offload work onto the GPU; however, I have some questions about doing this in practice.
>>
>> The GPU machine I have access to has nodes with two 20-core CPUs and 4 NVIDIA GPUs (so 10 cores per GPU). We could use CUDA or OpenCL, and may well explore both. With TPLS being an MPI application, we would wish to use many processes (and nodes), not just a single process. How would we best split this problem up?
>>
>> Would we have 1 MPI process per GPU (so 4 per node), and then implement our own solvers either to also work on the GPU, or use OpenMP to make use of the 10 cores per GPU? If so, how would we specify to PETSc which GPU each process should use?
>>
>> Would we instead just have 40 (or perhaps slightly fewer) MPI processes all sharing the GPUs? Surely this would be inefficient; would PETSc distribute the work across all 4 GPUs, or would every process end up using a single GPU?
>
> See https://docs.olcf.ornl.gov/systems/summit_user_guide.html#volta-multi-process-service. In some cases we did see better performance with multiple MPI ranks per GPU than with 1 rank per GPU. The optimal configuration depends on the code. Think of two extremes: one code where all the work is done on the GPU, and another where it is all on the CPU. You probably need only 1 MPI rank per node for the former, but the full set of ranks for the latter.
>
>> Would the Krylov solvers be blocking whilst the GPUs are running the solvers, or would the host code be able to continue and carry out other calculations whilst waiting for the GPU code to finish? We may need to modify our algorithm to allow for this, but it would make sense to introduce some concurrency so that the CPUs aren't idling whilst waiting for the GPUs to complete their work.
>
> We use asynchronous kernel launches and split-phase communication (VecScatterBegin/End). As long as there is no dependency, you can overlap computations on the CPU and GPU, or computations with communications.
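As an illustration of the split-phase pattern Junchao describes, here is a minimal sketch of my own (the vector sizes and the scatter itself are placeholders, not anything from TPLS): start the scatter, do independent local work while the messages may be in flight, and only then complete it.

    /* overlap_sketch.c -- hypothetical sketch of VecScatterBegin/End overlap */
    #include <petscvec.h>

    int main(int argc, char **argv)
    {
      Vec            x, y;
      IS             ix;
      VecScatter     scat;
      PetscInt       n = 8, first;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
      ierr = VecCreateMPI(PETSC_COMM_WORLD, n, PETSC_DECIDE, &x);CHKERRQ(ierr);
      ierr = VecSet(x, 1.0);CHKERRQ(ierr);
      ierr = VecGetOwnershipRange(x, &first, NULL);CHKERRQ(ierr);

      /* Scatter this rank's block of x into a local sequential vector y */
      ierr = ISCreateStride(PETSC_COMM_SELF, n, first, 1, &ix);CHKERRQ(ierr);
      ierr = VecCreateSeq(PETSC_COMM_SELF, n, &y);CHKERRQ(ierr);
      ierr = VecScatterCreate(x, ix, y, NULL, &scat);CHKERRQ(ierr);

      ierr = VecScatterBegin(scat, x, y, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
      /* ... independent CPU (or GPU) work goes here, overlapping with the
         communication started above ... */
      ierr = VecScatterEnd(scat, x, y, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);

      ierr = VecScatterDestroy(&scat);CHKERRQ(ierr);
      ierr = ISDestroy(&ix);CHKERRQ(ierr);
      ierr = VecDestroy(&y);CHKERRQ(ierr);
      ierr = VecDestroy(&x);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }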
>> Finally, I'm trying to get the OpenCL PETSc to work on my laptop (a MacBook Pro with a discrete AMD Radeon R9 M370X GPU). This is mostly because our GPU cluster is out of action until at least late June and I want to get a head start on experimenting with GPUs and TPLS. When I try to run TPLS with the ViennaCL PETSc, it reports that my GPU is unable to support double precision. I confirmed that my discrete GPU does support this, but my integrated GPU (Intel Iris) does not. I suspect that ViennaCL is using my integrated GPU instead of my discrete one (it is listed as GPU 0 by OpenCL, with the AMD card as GPU 1). Is there any way of getting PETSc to report which OpenCL device is in use, or to select which device to use? I saw there was some discussion about this on the mailing list archives, but I couldn't find any conclusion.
>
> No experience. Karl Rupp (cc'ed) might know.
>
>> Thanks in advance for your help,
>>
>> Regards,
>>
>> Gordon
>>
>> -----------------------------------------------
>> Dr Gordon P S Gibb
>> EPCC, The University of Edinburgh
>> Tel: +44 131 651 3459
>>
>> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
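Independent of whatever PETSc or ViennaCL offer for device selection, the device numbering and double-precision support can at least be confirmed with a small standalone program using only the standard OpenCL API. This is a generic diagnostic of my own, not a PETSc or ViennaCL feature:

    /* list_cl_devices.c -- hypothetical standalone diagnostic: lists OpenCL
       devices and whether each advertises double-precision support.
       Build on macOS with: cc list_cl_devices.c -framework OpenCL */
    #include <stdio.h>
    #ifdef __APPLE__
    #include <OpenCL/cl.h>
    #else
    #include <CL/cl.h>
    #endif

    int main(void)
    {
      cl_platform_id plats[8];
      cl_uint        nplat = 0;

      clGetPlatformIDs(8, plats, &nplat);
      for (cl_uint p = 0; p < nplat; p++) {
        cl_device_id devs[8];
        cl_uint      ndev = 0;

        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 8, devs, &ndev);
        for (cl_uint d = 0; d < ndev; d++) {
          char                name[256];
          cl_device_fp_config fp64 = 0;

          clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
          /* An all-zero fp config means no double-precision support */
          clGetDeviceInfo(devs[d], CL_DEVICE_DOUBLE_FP_CONFIG,
                          sizeof(fp64), &fp64, NULL);
          printf("platform %u, device %u: %s (double: %s)\n",
                 p, d, name, fp64 ? "yes" : "no");
        }
      }
      return 0;
    }

Which device ViennaCL actually picks is a separate question, best left for Karl, but this at least verifies what OpenCL reports for each device on the laptop.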