On 11/06/2015 11:12 AM, Bernd Schmidt wrote:
> On 11/05/2015 10:51 PM, Martin Jambor wrote:
>> Individual changes are described in slightly more detail in their
>> respective messages.  If you are interested in how the HSAIL
>> generation works in general, I encourage you to have a look at my
>> Cauldron slides or presentation, only very few things have changed as
>> far as the general principles are concerned.  Let me just quickly stress
>> here that we do acceleration within a single compiler, as opposed to
>> LTO-ways of all the other accelerator teams.
> 
> Realistically we're probably not going to reject this work, but I still want 
> to ask whether the approach was acked by the community before you started. 
> I'm really not exactly thrilled about having two different classes of 
> backends in the compiler, and two different ways of handling offloading.
> 
>> I also acknowledge that we should add HSA-specific tests to the GCC
>> testsuite but we are only now looking at how to do that and will
>> welcome any guidance in this regard.
> 
> Yeah, I was looking for any kind of new test, because...
> 
>> the class of OpenMP loops we can handle well is small,
> 
> I'd appreciate more information on what this means. Any examples or 
> performance numbers?

Hello.

As Martin Jambor mentioned, this was explained in his talk at the
Cauldron this year.
It is easy to illustrate with the following simple case:

#pragma omp target teams
#pragma omp distribute parallel for private(j)
   for (j=0; j<N; j++)
      c[j] = a[j];

This is a simple vector copy, which is going to be transformed to:

_4 = omp_data.i_1(D).D.5301 (iteration space)
_5 = __builtin_omp_get_num_threads ();
_6 = __builtin_omp_get_thread_num ();
_7 = calculate_chunk_start (_4, _5, _6); // pseudocode
_8 = calculate_chunk_end (_4, _5, _6); // pseudocode

for(i = _7; i < _8; i++)
  dest[i] = src[i];

Such a kernel is dispatched with the default grid size (64 in our case), so
that every work item handles a chunk of size N/64.
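
To make the chunked form concrete, here is a minimal C sketch of it. The
helpers calculate_chunk_start and calculate_chunk_end are only hypothetical
stand-ins for the pseudocode calls above (the real computation is generated
by the OpenMP expansion and may differ), and the grid is simulated with a
plain sequential loop rather than an actual HSA dispatch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: size of one contiguous chunk when [0, n) is split
   among num_threads work items (ceiling division). */
static size_t chunk_size(size_t n, size_t num_threads)
{
    assert(num_threads > 0);
    return (n + num_threads - 1) / num_threads;
}

/* Hypothetical stand-in for the pseudocode calculate_chunk_start. */
static size_t calculate_chunk_start(size_t n, size_t num_threads,
                                    size_t thread_num)
{
    size_t start = thread_num * chunk_size(n, num_threads);
    return start < n ? start : n;   /* clamp to the iteration space */
}

/* Hypothetical stand-in for the pseudocode calculate_chunk_end. */
static size_t calculate_chunk_end(size_t n, size_t num_threads,
                                  size_t thread_num)
{
    size_t end = (thread_num + 1) * chunk_size(n, num_threads);
    return end < n ? end : n;       /* clamp to the iteration space */
}

/* Body executed by one work item: it loops over its own chunk only. */
static void copy_kernel_chunked(int *dest, const int *src, size_t n,
                                size_t num_threads, size_t thread_num)
{
    size_t begin = calculate_chunk_start(n, num_threads, thread_num);
    size_t end = calculate_chunk_end(n, num_threads, thread_num);

    for (size_t i = begin; i < end; i++)
        dest[i] = src[i];
}

/* Sequential simulation of dispatching the kernel over grid_size items. */
static void run_chunked(int *dest, const int *src, size_t n,
                        size_t grid_size)
{
    for (size_t t = 0; t < grid_size; t++)
        copy_kernel_chunked(dest, src, n, grid_size, t);
}
```

Calling run_chunked (dest, src, N, 64) mimics the default grid size of 64
mentioned above: each simulated work item still runs a loop of roughly N/64
iterations.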

Gridification, on the other hand, is going to transform it to:

_7 = __builtin_omp_get_thread_num ();
dest[_7] = src[_7];

and the kernel is offloaded like this:
HSA debug: GOMP_OFFLOAD_run called with grid size 10000000 and group size 0
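
For comparison, a minimal C sketch of the gridified form follows. The
function names are hypothetical and the grid is again simulated with a
sequential loop instead of a real HSA dispatch; the point is only that the
kernel body has no loop and the grid size equals the iteration count:

```c
#include <assert.h>
#include <stddef.h>

/* Body executed by one work item: exactly one element, no loop.
   thread_num plays the role of _7 in the dump above. */
static void copy_kernel_gridified(int *dest, const int *src,
                                  size_t thread_num)
{
    assert(dest != NULL && src != NULL);
    dest[thread_num] = src[thread_num];
}

/* Sequential simulation of a dispatch whose grid size is n.
   Returns the number of simulated work items. */
static size_t run_gridified(int *dest, const int *src, size_t n)
{
    size_t launched = 0;

    for (size_t t = 0; t < n; t++) {
        copy_kernel_gridified(dest, src, t);
        launched++;
    }
    return launched;
}
```

With n = 10000000 this corresponds to the grid size reported in the debug
line above: one work item per element instead of 64 work items each
running a loop.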

The performance difference is on the order of a magnitude and can be seen on
slides 27-30 in [1].

Martin

[1] 
https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=get&target=mjambor-hsa-slides.pdf

> 
> 
> Bernd
