On Tue, Jun 9, 2020 at 7:11 PM GIBB Gordon <g.g...@epcc.ed.ac.uk> wrote:
> Hi,
>
> First of all, my apologies if this is not the appropriate list to send these questions to.
>
> I’m one of the developers of TPLS (https://sourceforge.net/projects/tpls/), a Fortran code that uses PETSc, parallelised using DM vectors. It uses a mix of our own solvers and PETSc’s Krylov solvers. At present it has been run on up to 25,000 MPI processes, although larger problem sizes should be able to scale beyond that.
>
> With the awareness that more and more HPC machines now have one or more GPUs per node, and that upcoming machines that approach/achieve Exascale will be heterogeneous in nature, we are investigating whether it is worth using GPUs with TPLS, and if so, how best to do this.
>
> I see that in principle all we’d need to do is set some flags as described at https://www.mcs.anl.gov/petsc/features/gpus.html to offload work onto the GPU, however I have some questions about doing this in practice:
>
> The GPU machine I have access to has nodes with two 20-core CPUs and 4 NVIDIA GPUs (so 10 cores per GPU). We could use CUDA or OpenCL, and may well explore both of them. With TPLS being an MPI application, we would wish to use many processes (and nodes), not just a single process. How would we best split this problem up?
>
> Would we have 1 MPI process per GPU (so 4 per node), and then implement our own solvers either to also work on the GPU, or use OpenMP to make use of the 10 cores per GPU? If so, how would we specify to PETSc which GPU each process would use?
>
> Would we instead just have 40 (or perhaps slightly fewer) MPI processes all sharing the GPUs? Surely this would be inefficient, and would PETSc distribute the work across all 4 GPUs, or would every process end up using a single GPU?

See https://docs.olcf.ornl.gov/systems/summit_user_guide.html#volta-multi-process-service. In some cases we did see better performance with multiple MPI ranks per GPU than with one rank per GPU. The optimal configuration depends on the code. Consider two extremes: one code does all of its work on the GPU, the other does all of it on the CPU. You probably only need one MPI rank per node for the former, but fully populated nodes for the latter.

> Would the Krylov solvers be blocking whilst the GPUs are in use running the solvers, or would the host code be able to continue and carry out other calculations whilst waiting for the GPU code to finish? We may need to modify our algorithm to allow for this, but it would make sense to introduce some concurrency so that the CPUs aren’t idling whilst waiting for the GPUs to complete their work.

We use asynchronous kernel launch and split-phase communication (VecScatterBegin/End). As long as there is no dependency, you can overlap computation on the CPU and GPU, or computation with communication.
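For example, a typical split-phase pattern looks like the sketch below (plain C against the PETSc API, not code from TPLS; the scatter context, the vectors and the "independent work" are placeholders for illustration):

  #include <petscvec.h>

  /* Sketch: hide the cost of a ghost-value exchange behind work that
     does not depend on it.  All names here are illustrative. */
  PetscErrorCode GhostUpdateWithOverlap(VecScatter scat, Vec xglobal, Vec xlocal, Vec work)
  {
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    /* Post the exchange; this returns before the data has arrived. */
    ierr = VecScatterBegin(scat, xglobal, xlocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);

    /* Anything with no dependency on the incoming ghost values can run here
       and overlap with the communication (placeholder: scale another vector). */
    ierr = VecScale(work, 0.5);CHKERRQ(ierr);

    /* Complete the exchange; only after this is xlocal safe to read. */
    ierr = VecScatterEnd(scat, xglobal, xlocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

The same Begin/End idea applies to VecAssemblyBegin/End and MatAssemblyBegin/End, and the GPU kernels launched by the vector and matrix operations are asynchronous with respect to the host.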
> Finally, I’m trying to get the OpenCL PETSc to work on my laptop (MacBook Pro with a discrete AMD Radeon R9 M370X GPU). This is mostly because our GPU cluster is out of action until at least late June and I want to get a head start on experimenting with GPUs and TPLS. When I try to run TPLS with the ViennaCL PETSc it reports that my GPU is unable to support double precision. I confirmed that my discrete GPU does support this; however, my integrated GPU (Intel Iris) does not. I suspect that ViennaCL is using my integrated GPU instead of my discrete one (it is listed as GPU 0 by OpenCL, with the AMD card as GPU 1). Is there any way of getting PETSc to report which OpenCL device is in use, or to select which device to use? I saw there was some discussion about this on the mailing list archives but I couldn’t find any conclusion.

No experience. Karl Rupp (cc'ed) might know.

> Thanks in advance for your help,
>
> Regards,
>
> Gordon
>
> -----------------------------------------------
> Dr Gordon P S Gibb
> EPCC, The University of Edinburgh
> Tel: +44 131 651 3459
>
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.