On 11/06/2015 11:12 AM, Bernd Schmidt wrote:
> On 11/05/2015 10:51 PM, Martin Jambor wrote:
>> Individual changes are described in slightly more detail in their
>> respective messages.  If you are interested in how the HSAIL
>> generation works in general, I encourage you to have a look at my
>> Cauldron slides or presentation, only very few things have changed as
>> far as the general principles are concerned.  Let me just quickly stress
>> here that we do acceleration within a single compiler, as opposed to
>> LTO-ways of all the other accelerator teams.
>
> Realistically we're probably not going to reject this work, but I still want
> to ask whether the approach was acked by the community before you started.
> I'm really not exactly thrilled about having two different classes of
> backends in the compiler, and two different ways of handling offloading.
>
>> I also acknowledge that we should add HSA-specific tests to the GCC
>> testsuite but we are only now looking at how to do that and will
>> welcome any guidance in this regard.
>
> Yeah, I was looking for any kind of new test, because...
>
>> the class of OpenMP loops we can handle well is small,
>
> I'd appreciate more information on what this means. Any examples or
> performance numbers?

Hello.

As mentioned by Martin Jambor, this was explained during his talk at the
Cauldron this year.  It can be easily shown on the following simple case:

  #pragma omp target teams
  #pragma omp distribute parallel for private(j)
     for (j=0; j<N; j++)
        c[j] = a[j];

This is a simple vector copy, which is going to be transformed to:

  _4 = omp_data.i_1(D).D.5301            // iteration space
  _5 = __builtin_omp_get_num_threads ();
  _6 = __builtin_omp_get_thread_num ();
  _7 = calculate_chunk_start (_4, _5, _6); // pseudocode
  _8 = calculate_chunk_end (_4, _5, _6);   // pseudocode

  for(i = _7; i < _8; i++)
    dest[i] = src[i];

Such a kernel is dispatched with the default grid size (in our case 64), so
that every work item handles a chunk of size N/64.

On the other hand, gridification is going to transform the loop to:

  _7 = __builtin_omp_get_thread_num ();
  dest[_7] = src[_7];

and the kernel is offloaded like this:

  HSA debug: GOMP_OFFLOAD_run called with grid size 10000000 and group size 0

The performance differences are roughly an order of magnitude and can be
seen on slides 27-30 in [1].

Martin

[1] https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=get&target=mjambor-hsa-slides.pdf

>
>
> Bernd
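
To make the contrast between the two lowerings above concrete, here is a
minimal host-side sketch in plain C.  It is not the code GCC emits: the
work items are simulated by serial loops, and the helper names
(chunk_start, chunk_end, run_chunked, run_gridified) as well as the
ceiling-division chunking scheme are assumptions made only for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunking helpers (assumption: static schedule with
   ceiling division; the real runtime computation may differ).  */
size_t chunk_start(size_t n, size_t nthreads, size_t tid)
{
    size_t chunk = (n + nthreads - 1) / nthreads;
    size_t start = tid * chunk;
    return start < n ? start : n;
}

size_t chunk_end(size_t n, size_t nthreads, size_t tid)
{
    size_t chunk = (n + nthreads - 1) / nthreads;
    size_t end = (tid + 1) * chunk;
    return end < n ? end : n;
}

/* Non-gridified body: one work item copies a whole chunk in a loop.  */
void copy_chunked_item(int *dest, const int *src,
                       size_t n, size_t nthreads, size_t tid)
{
    size_t end = chunk_end(n, nthreads, tid);
    for (size_t i = chunk_start(n, nthreads, tid); i < end; i++)
        dest[i] = src[i];
}

/* Gridified body: one work item copies exactly one element, no loop.  */
void copy_gridified_item(int *dest, const int *src, size_t tid)
{
    dest[tid] = src[tid];
}

/* Serial stand-in for a dispatch with a fixed grid size (e.g. 64).  */
void run_chunked(int *dest, const int *src, size_t n, size_t grid)
{
    assert(grid > 0);
    for (size_t tid = 0; tid < grid; tid++)
        copy_chunked_item(dest, src, n, grid, tid);
}

/* Serial stand-in for a dispatch whose grid size equals N.  */
void run_gridified(int *dest, const int *src, size_t n)
{
    for (size_t tid = 0; tid < n; tid++)
        copy_gridified_item(dest, src, tid);
}
```

Both variants produce the same result; the difference is only in how the
iteration space is mapped to work items, i.e. run_chunked (c, a, N, 64)
versus run_gridified (c, a, N).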