You might take a look at
https://publications.anl.gov/anlpubs/2020/04/159190.pdf; the introduction
has a short discussion of some of the "gotchas" when using multi-core CPUs
connected to multiple GPUs. It focuses on an IBM Power/NVIDIA GPU system,
but the same abstract issues will arise on any similar system.

1) How many cores should share a GPU?   Generally one, but there may be
exceptions (see the sketch after this list).

2) Should the special (pinned, i.e. page-locked) host memory that can be
copied to/from GPUs faster be turned on?

3) Is it worth doing anything with the "extra" cores that are not accessing
a GPU?   Probably not, but there may be exceptions.

4) How should one communicate between nodes with MPI? Can one go directly
from GPU to GPU and skip the CPU memory? (See the note after the sketch
below.)
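
For 1), here is a minimal sketch of binding each MPI rank to one GPU by
node-local rank, using only standard MPI-3 and CUDA runtime calls. This is
illustrative, not TPLS or PETSc code, and PETSc can also assign devices to
ranks itself:

  #include <mpi.h>
  #include <cuda_runtime.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
    MPI_Comm local;
    int      lrank, ndev;

    MPI_Init(&argc, &argv);
    /* communicator of the ranks that share this node's memory */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local);
    MPI_Comm_rank(local, &lrank);
    cudaGetDeviceCount(&ndev);
    /* with 4 ranks/node and 4 GPUs this gives one rank per GPU */
    if (ndev > 0) cudaSetDevice(lrank % ndev);
    printf("node-local rank %d -> GPU %d of %d\n", lrank,
           ndev > 0 ? lrank % ndev : -1, ndev);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
  }

For 4), going directly from GPU buffer to GPU buffer requires a GPU-aware
(e.g. CUDA-aware) MPI build; recent PETSc versions have a
-use_gpu_aware_mpi option to take advantage of this when it is available.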

  Barry




> On Jun 9, 2020, at 7:51 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
> 
> 
> 
> On Tue, Jun 9, 2020 at 7:11 PM GIBB Gordon <g.g...@epcc.ed.ac.uk> wrote:
>> Hi,
>> 
>> First of all, my apologies if this is not the appropriate list to send
>> these questions to.
>> 
>> I’m one of the developers of TPLS (https://sourceforge.net/projects/tpls/),
>> a Fortran code that uses PETSc, parallelised using DM vectors. It uses a
>> mix of our own solvers and PETSc’s Krylov solvers. At present it has been
>> run on up to 25,000 MPI processes, although larger problem sizes should
>> be able to scale beyond that.
>> 
>> With the awareness that more and more HPC machines now have one or more
>> GPUs per node, and that upcoming machines that approach/achieve Exascale
>> will be heterogeneous in nature, we are investigating whether it is worth
>> using GPUs with TPLS, and if so, how best to do this.
>> 
>> I see that in principle all we’d need to do is set some flags as
>> described at https://www.mcs.anl.gov/petsc/features/gpus.html to offload
>> work onto the GPU; however, I have some questions about doing this in
>> practice:
>> 
>> The GPU machine I have access to has nodes with two 20-core CPUs and 4
>> NVIDIA GPUs (so 10 cores per GPU). We could use CUDA or OpenCL, and may
>> well explore both of them. With TPLS being an MPI application, we would
>> wish to use many processes (and nodes), not just a single process. How
>> would we best split this problem up?
>> 
>> Would we have 1 MPI process per GPU (so 4 per node), and then implement
>> our own solvers either to also work on the GPU, or use OpenMP to make use
>> of the 10 cores per GPU? If so, how would we specify to PETSc which GPU
>> each process would use?
>> 
>> Would we instead just have 40 (or perhaps slightly fewer) MPI processes
>> all sharing the GPUs? Surely this would be inefficient; and would PETSc
>> distribute the work across all 4 GPUs, or would every process end up
>> using a single GPU?
> See
> https://docs.olcf.ornl.gov/systems/summit_user_guide.html#volta-multi-process-service.
> In some cases we did see better performance with multiple MPI ranks/GPU
> than with 1 rank/GPU. The optimal configuration depends on the code. Think
> of two extremes: one code with work done all on the GPU and the other all
> on the CPU. You probably only need 1 MPI rank/node for the former, but
> full ranks for the latter.
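> 
> For concreteness, a hedged example of what the flags from gpus.html look
> like for a DM-based code like TPLS (the executable name and rank count
> here are just illustrative; check the option names against your PETSc
> version):
> 
>   mpiexec -n 4 ./tpls -dm_vec_type cuda -dm_mat_type aijcusparse
> 
> A reasonable experiment is to start with one rank per GPU (4 ranks/node
> here) and then try more ranks per GPU under MPS, comparing timings.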
>  
> 
>> Would the Krylov solvers be blocking whilst the GPUs are in use running
>> the solvers, or would the host code be able to continue and carry out
>> other calculations whilst waiting for the GPU code to finish? We may need
>> to modify our algorithm to allow for this, but it would make sense to
>> introduce some concurrency so that the CPUs aren’t idling whilst waiting
>> for the GPUs to complete their work.
> We use asynchronous kernel launches and split-phase communication
> (VecScatterBegin/End). As long as there is no dependency, you can overlap
> computation on the CPU and GPU, or computation with communication.
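> 
> As a minimal sketch of the split-phase pattern (illustrative PETSc C, not
> TPLS code; error checking omitted for brevity):
> 
>   #include <petscvec.h>
> 
>   int main(int argc, char **argv)
>   {
>     Vec        x, y;
>     IS         ix, iy;
>     VecScatter sct;
>     PetscInt   n = 8, lo, hi;
> 
>     PetscInitialize(&argc, &argv, NULL, NULL);
>     VecCreateMPI(PETSC_COMM_WORLD, n, PETSC_DECIDE, &x);
>     VecDuplicate(x, &y);
>     VecSet(x, 1.0);
>     /* identity scatter, just to show the begin/end split */
>     VecGetOwnershipRange(x, &lo, &hi);
>     ISCreateStride(PETSC_COMM_WORLD, hi - lo, lo, 1, &ix);
>     ISCreateStride(PETSC_COMM_WORLD, hi - lo, lo, 1, &iy);
>     VecScatterCreate(x, ix, y, iy, &sct);
> 
>     VecScatterBegin(sct, x, y, INSERT_VALUES, SCATTER_FORWARD);
>     /* ... work that does not depend on y overlaps the communication ... */
>     VecScatterEnd(sct, x, y, INSERT_VALUES, SCATTER_FORWARD);
> 
>     VecScatterDestroy(&sct);
>     ISDestroy(&ix); ISDestroy(&iy);
>     VecDestroy(&x);  VecDestroy(&y);
>     PetscFinalize();
>     return 0;
>   }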
>  
> 
>> Finally, I’m trying to get the OpenCL PETSc to work on my laptop (MacBook
>> Pro with a discrete AMD Radeon R9 M370X GPU). This is mostly because our
>> GPU cluster is out of action until at least late June and I want to get a
>> head start on experimenting with GPUs and TPLS. When I try to run TPLS
>> with the ViennaCL PETSc it reports that my GPU is unable to support
>> double precision. I confirmed that my discrete GPU does support this;
>> however, my integrated GPU (Intel Iris) does not. I suspect that ViennaCL
>> is using my integrated GPU instead of my discrete one (it is listed as
>> GPU 0 by OpenCL, with the AMD card as GPU 1). Is there any way of getting
>> PETSc to report which OpenCL device is in use, or to select which device
>> to use? I saw there was some discussion about this on the mailing list
>> archives but I couldn’t find any conclusion.
> No experience. Karl Rupp (cc'ed) might know.
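> 
> Independent of PETSc/ViennaCL, one way to check which devices OpenCL
> exposes, and whether each advertises double precision, is to enumerate
> them directly with the standard OpenCL API; a minimal sketch (on macOS
> the header is <OpenCL/opencl.h>):
> 
>   #include <stdio.h>
>   #include <string.h>
>   #include <CL/cl.h>   /* on macOS: #include <OpenCL/opencl.h> */
> 
>   int main(void)
>   {
>     cl_platform_id plat[8];
>     cl_uint        np, nd, i, j;
> 
>     clGetPlatformIDs(8, plat, &np);
>     for (i = 0; i < np; i++) {
>       cl_device_id dev[8];
>       clGetDeviceIDs(plat[i], CL_DEVICE_TYPE_ALL, 8, dev, &nd);
>       for (j = 0; j < nd; j++) {
>         char name[256], ext[8192];
>         clGetDeviceInfo(dev[j], CL_DEVICE_NAME, sizeof(name), name, NULL);
>         clGetDeviceInfo(dev[j], CL_DEVICE_EXTENSIONS, sizeof(ext), ext,
>                         NULL);
>         /* cl_khr_fp64 in the extension string means fp64 is supported */
>         printf("platform %u, device %u: %s, fp64: %s\n", (unsigned)i,
>                (unsigned)j, name,
>                strstr(ext, "cl_khr_fp64") ? "yes" : "no");
>       }
>     }
>     return 0;
>   }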
>  
> 
>> Thanks in advance for your help,
>> 
>> Regards,
>> 
>> Gordon
>> 
>> -----------------------------------------------
>> Dr Gordon P S Gibb
>> EPCC, The University of Edinburgh
>> Tel: +44 131 651 3459
>> 
>> The University of Edinburgh is a charitable body, registered in Scotland,
>> with registration number SC005336.
