Dinar Temirbulatov <dtemirbula...@gmail.com> wrote:
> Another interesting use-case for OpenACC and OpenMP is mixing both
> standards' annotations on the same loop:
>
>   // Compute matrix multiplication.
>   #pragma omp parallel for default(none) shared(A, B, C, size)
>   #pragma acc kernels pcopyin(A[0:size][0:size], B[0:size][0:size]) \
>                       pcopyout(C[0:size][0:size])
>   for (int i = 0; i < size; ++i) {
>     for (int j = 0; j < size; ++j) {
>       float tmp = 0.;
>       for (int k = 0; k < size; ++k) {
>         tmp += A[i][k] * B[k][j];
>       }
>       C[i][j] = tmp;
>     }
>   }
>
> This means that the OpenACC pragmas have to be parsed before the OpenMP
> pass (in case both standards are enabled), i.e. before the OpenMP pass
> changes the annotated GIMPLE statements irrecoverably.  In my view this
> use-case could be handled, for example, in the following way: we could
> add a temporary variable, say "expand_gimple_with_openmp", and rewrite
> the example above into something like this just before the OpenMP pass:
>
>   if (expand_gimple_with_openmp) {
>     #pragma omp parallel for default(none) shared(A, B, C, size)
>     for (int i = 0; i < size; ++i) {
>       for (int j = 0; j < size; ++j) {
>         float tmp = 0.;
>         for (int k = 0; k < size; ++k) {
>           tmp += A[i][k] * B[k][j];
>         }
>         C[i][j] = tmp;
>       }
>     }
>   } else {
>     #pragma acc kernels pcopyin(A[0:size][0:size], B[0:size][0:size]) \
>                         pcopyout(C[0:size][0:size])
>     for (int i = 0; i < size; ++i) {
>       for (int j = 0; j < size; ++j) {
>         float tmp = 0.;
>         for (int k = 0; k < size; ++k) {
>           tmp += A[i][k] * B[k][j];
>         }
>         C[i][j] = tmp;
>       }
>     }
>   }
>
> Later, in the Graphite pass, we could recognize that the statement is a
> SCoP, produce a kernel for it, and then assume that the
> expand_gimple_with_openmp heuristic is false, so the OpenMP version of
> the loop could be eliminated, or vice versa.  But we have to make sure
> that the optimization passes do not change our OpenACC GIMPLE in a way
> that makes it no longer parallelizable.
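(For comparison, the pure OpenMP 4.0 spelling of the same loop would be
roughly the sketch below; the map clauses here only stand in for
OpenACC's pcopyin/pcopyout, whose present_or_copy semantics are not
identical, and the clause details follow the 4.0 draft:

  #pragma omp target map(to: A[0:size][0:size], B[0:size][0:size]) \
                     map(from: C[0:size][0:size])
  #pragma omp parallel for
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }

Both annotation sets describe the same offloaded loop nest.)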
No, the point is that we want a middle-end annotation that covers both at
the same time.  Otherwise factoring in other standards will quickly get
unmanageable.

Richard.

> thanks, Dinar.
>
> On Fri, May 10, 2013 at 2:06 PM, Tobias Burnus <bur...@net-b.de> wrote:
>
>> Jakub Jelinek wrote:
>> [Fallback generation of CPU code]
>>
>>> If one uses the OpenMP 4.0 accelerator pragmas, then that is the
>>> required behavior; if the code is for whatever reason not possible
>>> to run on the accelerator, it should be executed on the host [...]
>>
>> (I haven't checked, but is this a compile-time or run-time
>> requirement?)
>>
>>> Otherwise, the OpenMP runtime as well as the pragmas have a way to
>>> choose which accelerator you want to run something on, as a device
>>> id (integer), so the OpenMP runtime library should maintain the list
>>> of supported accelerators (say, if you have two Intel MIC cards and
>>> two AMD GPGPU devices), and probably we'll need a compiler switch to
>>> say for which kinds of accelerators we want to generate code, plus
>>> the runtime could have dlopened plugins for each of the accelerator
>>> kinds.
>>
>> At least two OpenACC implementations I know of fail hard when the GPU
>> is not available (nonexistent, or the /dev/... node does not have the
>> right permissions).  And three of them fail at compile time with an
>> error message if an expression within a device section is not
>> possible (e.g. calling some non-device, non-inlinable function).
>>
>> While it is convenient to have a CPU fallback, it would be nice to
>> know whether some code actually uses the accelerator, both at compile
>> time and at run time.  Otherwise, one thinks the GPU is used without
>> realizing that it isn't, because, e.g., the device permissions are
>> wrong, or one forgot to declare a certain function as a target
>> function.
>>
>> Besides having a flag which tells the compiler for which accelerator
>> the code should be generated, additional flags should also be
>> handled, e.g. for different versions of the accelerator.  For
>> instance, one accelerator model of a series might support
>> double-precision variables while another might not.  I assume that
>> falling back to the CPU if the accelerator doesn't support a certain
>> feature won't work, and one will get an error in this case.
>>
>> Is there actually a need to handle multiple accelerators
>> simultaneously?  My impression is that both OpenACC and OpenMP 4
>> assume that there is only one kind of accelerator available besides
>> the host.  If I missed some fine print, or something else requires
>> that there be multiple different accelerators, it will get more
>> complicated, especially for those code sections where the user didn't
>> explicitly specify which one should be used.
>>
>> Finally, one should think about debugging.  It is not really clear
>> (to me) how to handle this best, but as the compiler generates quite
>> some additional code (e.g. for copying the data around), and as
>> printf debugging doesn't work on GPUs, it is not that easy.  I wonder
>> whether there should be an optional library like libgomp_debug which
>> adds additional sanity checks (e.g. related to copying data to/from
>> the GPU) and which allows printing diagnostic output when one sets an
>> environment variable.
>>
>> Tobias
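(On the run-time half of Tobias' "did my code actually hit the
accelerator?" question: with the OpenMP 4.0 API an application can at
least query the runtime itself.  A minimal sketch, assuming the 4.0
routines land as specified in the draft:

  #include <omp.h>
  #include <stdio.h>

  int main (void)
  {
    /* 0 means every target region will fall back to the host.  */
    printf ("offload devices seen by the runtime: %d\n",
            omp_get_num_devices ());

    int ran_on_host = 1;
  #pragma omp target map(from: ran_on_host)
    {
      /* Nonzero iff this region ended up executing on the host,
         e.g. because no device was usable.  */
      ran_on_host = omp_is_initial_device ();
    }
    printf ("target region ran on the %s\n",
            ran_on_host ? "host" : "device");
    return 0;
  }

The compile-time half, as Tobias notes, still needs a diagnostic or a
libgomp_debug-style aid.)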