Hi,

First of all, my apologies if this is not the appropriate list to send these 
questions to.

I’m one of the developers of TPLS (https://sourceforge.net/projects/tpls/), a 
Fortran code that uses PETSc, parallelised using DM vectors. It uses a mix of 
our own solvers, and PETSc’s Krylov solvers. At present it has been run on up 
to 25,000 MPI processes, although larger problem sizes should be able to scale 
beyond that.

With the awareness that more and more HPC machines now have one or more GPUs 
per node, and that upcoming machines that approach/achieve Exascale will be 
heterogeneous in nature, we are investigating whether it is worth using GPUs 
with TPLS, and if so, how to best do this.

I see that in principle all we’d need to do is set some flags as described 
at https://www.mcs.anl.gov/petsc/features/gpus.html to offload work onto the 
GPU. However, I have some questions about doing this in practice:

The GPU machine I have access to has nodes with two 20 core CPUs and 4 NVIDIA 
GPUs (so 10 cores per GPU). We could use CUDA or OpenCL, and may well explore 
both of them. With TPLS being an MPI application, we would wish to use many 
processes (and nodes), not just a single process. How would we best split this 
problem up?

Would we have 1 MPI process per GPU (so 4 per node), and then implement our own 
solvers either to also work on the GPU, or use OpenMP to make use of the 10 
cores per GPU? If so, how would we specify to PETSc which GPU each process 
would use?

Would we instead just have 40 (or perhaps slightly fewer) MPI processes all 
sharing the GPUs? Surely this would be inefficient, and would PETSc distribute 
the work across all 4 GPUs, or would every process end up using a single GPU?
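
To make the two layouts above concrete, here is a minimal sketch of a 
round-robin mapping from node-local MPI rank to GPU index. It is only an 
illustration of the question, not a description of what PETSc does 
internally; the function names gpu_for_local_rank and ranks_per_gpu are 
hypothetical, and the node-local rank is assumed to be obtained elsewhere 
(for example from a shared-memory sub-communicator).

```python
def gpu_for_local_rank(local_rank: int, gpus_per_node: int) -> int:
    """Hypothetical helper: node-local rank r is assigned GPU r mod gpus_per_node."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    return local_rank % gpus_per_node


def ranks_per_gpu(ranks_per_node: int, gpus_per_node: int) -> list:
    """Count how many node-local ranks end up sharing each GPU."""
    counts = [0] * gpus_per_node
    for rank in range(ranks_per_node):
        counts[gpu_for_local_rank(rank, gpus_per_node)] += 1
    return counts


if __name__ == "__main__":
    # One process per GPU on a 4-GPU node: each GPU serves one rank.
    print(ranks_per_gpu(4, 4))
    # One process per core on a 40-core node: ten ranks share each GPU.
    print(ranks_per_gpu(40, 4))
```

Under this assumed mapping, the first layout gives each GPU a single rank, 
while the second has ten ranks contending for each GPU.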

Would the Krylov solvers be blocking whilst the GPUs are in use running the 
solvers, or would the host code be able to continue and carry out other 
calculations whilst waiting for the GPU code to finish? We may need to modify 
our algorithm to allow for this, but it would make sense to introduce some 
concurrency so that the CPUs aren’t idling whilst waiting for the GPUs to 
complete their work.

Finally, I’m trying to get the OpenCL PETSc to work on my laptop (MacBook Pro 
with discrete AMD Radeon R9 M370X GPU). This is mostly because our GPU cluster 
is out of action until at least late June and I want to get a head start on 
experimenting with GPUs and TPLS. When I try to run TPLS with the ViennaCL 
PETSc it reports that my GPU is unable to support double precision. I confirmed 
that my discrete GPU does support this, but my integrated GPU (Intel Iris) 
does not. I suspect that ViennaCL is using my integrated GPU instead of my 
discrete one (it is listed as GPU 0 by OpenCL, with the AMD card as GPU 1). Is 
there any way of getting PETSc to report which OpenCL device is in use, or to 
select which device to use? I saw there was some discussion about this on the 
mailing list archives but I couldn’t find any conclusion.
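
For reference, the selection behaviour being asked for can be sketched as 
follows. OpenCL devices advertise double-precision support through the 
cl_khr_fp64 extension in their extension string, so the idea is to pick the 
first device that lists it. This is a sketch only: the function name 
first_fp64_device is hypothetical, the device list and extension strings 
below are made-up stand-ins for what an OpenCL query would return, and 
nothing here is a PETSc or ViennaCL API.

```python
def first_fp64_device(devices):
    """Return the index of the first device advertising cl_khr_fp64, else None.

    `devices` is a list of (name, extension_string) pairs, as an OpenCL
    device query might report them (hypothetical input format).
    """
    for index, (_name, extensions) in enumerate(devices):
        if "cl_khr_fp64" in extensions.split():
            return index
    return None


if __name__ == "__main__":
    # Illustrative data only: these extension strings are assumptions,
    # not the real output of either device.
    devices = [
        ("Intel Iris", "cl_khr_global_int32_base_atomics"),
        ("AMD Radeon R9 M370X", "cl_khr_global_int32_base_atomics cl_khr_fp64"),
    ]
    print(first_fp64_device(devices))
```

With the illustrative list above, the integrated GPU at index 0 is skipped 
and the discrete card at index 1 is chosen.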

Thanks in advance for your help,

Regards,

Gordon

-----------------------------------------------
Dr Gordon P S Gibb
EPCC, The University of Edinburgh
Tel: +44 131 651 3459

The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.
