https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121178

--- Comment #1 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
And something else:

Assume I want to create a struct holding a temporary matrix that lives only on
the GPU, with known extents, known strides, and data that is allocated only on
the device, not on the host.

The mapping directives require that I have the data allocated on the host, but
maybe just an acc_malloc for such a member field will do?


That was suggested recently by Tobias Burnus and is also suggested by Matt
Colgrove from NVIDIA...

Here is what nvc++ says about the following snippet:


#include <openacc.h>

struct mytensor
{
    int *strides;
    int *extents;
    double *data;
};

int main()
{

    mytensor t2;
    int strides[2]={1,5};
    int extents[2]={4,5};
    // t2.data lives only on the device; acc_malloc returns a device pointer
    t2.data=(double*)acc_malloc(sizeof(double)*20);

    t2.strides=strides;
    t2.extents=extents;

// structured data regions; each applies to the construct that follows it
#pragma acc data copyin(t2)
#pragma acc data copyin(t2.strides[0:2])
#pragma acc data copyin(t2.extents[0:2])

#pragma acc parallel loop gang present(t2)
    for(int i=1; i<20; i++)
    {
        t2.data[i]=20;
    }

    acc_free(t2.data);
#pragma acc exit data delete(t2.extents[0:2])
#pragma acc exit data delete(t2.strides[0:2])
#pragma acc exit data delete(t2)
}


If I remove the array-section sizes in the exit data statements, gcc compiles
a working executable.
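
For reference, this is the variant that gcc accepts (the same frees as above,
only with the array-section sizes dropped from the delete clauses):

    acc_free(t2.data);
#pragma acc exit data delete(t2.extents)
#pragma acc exit data delete(t2.strides)
#pragma acc exit data delete(t2)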

In the code, I requested 5 copies to the device and 3 frees...

In the CUDA call stack from gcc I see 8 synchronous memcpyHtoD copies... and
two more after the kernel launch...




If I leave the sizes in the exit data statements and compile with nvc++, I
get:




main:
     41, Generating copyin(t2.extents[:2],t2.strides[:2],t2) [if not already present]
         Generating present(t2)
         Generating NVIDIA GPU code
         49, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
     49, Recognized memory set idiom
     58, Generating exit data delete(t2,t2.strides[:2],t2.extents[:2])


That line "Recognized memory set idiom" is not there when I use mappings for
every member field...

And it may mean that nvc++ apparently assumes there is some memory amiss...

No: I have allocated the t2.data member field manually and freed it manually,
and the other member fields were correctly mapped... nothing should be
amiss...
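
One way to double-check that (a small diagnostic sketch of mine, not part of
the reproducer) is acc_is_present() from openacc.h, which returns nonzero when
a host address range has a device counterpart. Placed right before the
parallel loop in the snippet above (with <stdio.h> added), it should print 1
for all three mappings:

    // hypothetical presence check, inserted before the parallel loop
    printf("t2 present:      %d\n", acc_is_present(&t2, sizeof(t2)));
    printf("strides present: %d\n", acc_is_present(t2.strides, 2*sizeof(int)));
    printf("extents present: %d\n", acc_is_present(t2.extents, 2*sizeof(int)));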

Interestingly, nvc++ makes just 7 asynchronous cuMemcpy calls, while gcc makes
8 ordinary non-async copies...


One can convert this snippet to OpenMP (in OpenMP, the delete statements
compile in gcc even with the sizes added):


#include <omp.h>
struct mytensor
{
    int *strides;
    int *extents;
    double *data;
};

int main()
{
    mytensor t;
    int strides[2]={1,2};
    int extents[2]={4,5};
    // t.data is allocated only on the device
    t.data=(double*)omp_target_alloc(sizeof(double)*20,omp_get_default_device());

    t.strides=strides;
    t.extents=extents;
    // map the struct and the member arrays; note that t.data already holds
    // a device pointer from omp_target_alloc at this point
    #pragma omp target enter data map(to:t)
    #pragma omp target enter data map(to:t.strides[0:2])
    #pragma omp target enter data map(to:t.extents[0:2])
    #pragma omp target enter data map(to:t.data[0:20])

    #pragma omp target teams distribute
    for(int i=1; i<20; i++)
    {
        t.data[i]=20;
    }


    omp_target_free(t.data,omp_get_default_device());

    #pragma omp target exit data map(delete:t.strides[0:2])
    #pragma omp target exit data map(delete:t.extents[0:2])
    #pragma omp target exit data map(delete:t)

}


Unfortunately, with gcc's code, I then see memory copies before the
corresponding allocations in the CUDA call trace from NVIDIA Nsight...


I think this is strange. 


In my opinion, it should always copy only after the memory was allocated...

If that means it first copies the struct to the device (before the data
pointer was set up), and only then sets the pointer (at worst only in the host
struct) to the memory allocated by omp_target_alloc, then the loop on the
device would access an uninitialized array pointer...
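
For what it is worth, the ordering question can be sidestepped entirely by not
mapping the members at all: allocate every member on the device, copy with
omp_target_memcpy() strictly after the allocation, and pass the struct by
value. A minimal sketch (my own variant, not the reproducer above):

#include <omp.h>

struct mytensor
{
    int *strides;
    int *extents;
    double *data;
};

int main()
{
    int dev = omp_get_default_device();
    int host = omp_get_initial_device();

    int strides[2]={1,2};
    int extents[2]={4,5};

    mytensor t;
    // allocate all members on the device first...
    t.strides=(int*)omp_target_alloc(sizeof(int)*2, dev);
    t.extents=(int*)omp_target_alloc(sizeof(int)*2, dev);
    t.data=(double*)omp_target_alloc(sizeof(double)*20, dev);

    // ...then copy, so no copy can ever precede its allocation
    omp_target_memcpy(t.strides, strides, sizeof(int)*2, 0, 0, dev, host);
    omp_target_memcpy(t.extents, extents, sizeof(int)*2, 0, 0, dev, host);

    // t now holds only device pointers, so a bitwise copy of the struct
    // (firstprivate) is usable on the device; no attach/detach is involved
    #pragma omp target teams distribute firstprivate(t)
    for(int i=0; i<20; i++)
    {
        t.data[i]=20;
    }

    omp_target_free(t.data, dev);
    omp_target_free(t.extents, dev);
    omp_target_free(t.strides, dev);
}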


Also, it uses synchronous copies, even though the struct members could all be
copied at the same time...
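
The target enter data directives do accept a nowait clause, so at least the
member copies could be overlapped. A sketch of what I mean, for the OpenMP
variant (whether the pointer attachments may legally be reordered this way is
a separate question, hence the struct itself is still mapped synchronously):

    #pragma omp target enter data map(to:t)
    #pragma omp target enter data map(to:t.strides[0:2]) nowait
    #pragma omp target enter data map(to:t.extents[0:2]) nowait
    #pragma omp taskwait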


Unfortunately, I have seen erroneous results from similar problems in a larger
real-world library that I am currently writing...

That is, gcc copies something to the device and then, through wrong code
generation, the loop does not recognize that the field is on the device and
simply accesses the member field that points to host data. Afterwards, the
code issues a copy down to the device from the field it had copied before the
loop, overwriting a computation erroneously done with host pointers, and
yielding very different results with different compilers in a simple matrix
multiplication...

That code also worked with structs. Could it be that this was due to problems
similar to those I now see in the CUDA call stack of this smaller test case?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120814
