Hi Jakub,

On 30.06.22 10:21, Jakub Jelinek wrote:
> So, what is the plan with reverse offload?

My idea was to just call omp_target_ext with
'device(omp_initial_device)'. This then automatically
works when called from a target region that runs on
omp_get_initial_device().

For the actual device part, this can be implemented
incrementally by supporting the reverse_offload for
a given device type.

For getting it to work when the code enclosing the ancestor:1
target region runs on an offloading device,
my idea is the following. Comments are welcome!


My idea was to do the same as is done for I/O
(which is supported for both nvptx and gcn). For GCN:

libgomp/plugin/plugin-gcn.c has:

struct kernargs {
  /* A pointer to struct output, below, for console output data.  */
  int64_t out_ptr;

  /* A pointer to struct heap, below.  */
  int64_t heap_ptr;

  /* A pointer to an ephemeral memory arena.
    Only needed for OpenMP.  */
  int64_t arena_ptr;

/* to be added: */
  /* A pointer to reverse-offload. */
  int64_t rev_ptr;

/* Now come the actual structs.*/
  /* Output data.  */
  struct output {
    int return_value;
    unsigned int next_output;
    struct printf_data {
...
};


This gets initialized on the host and then:

  while (hsa_fns.hsa_signal_wait_acquire_fn (s, HSA_SIGNAL_CONDITION_LT, 1,
                                             1000 * 1000,
                                             HSA_WAIT_STATE_BLOCKED) != 0)
    console_output (kernel, shadow->kernarg_address, false);

with:

  unsigned int from = __atomic_load_n (&kernargs->output_data.consumed,
                                       __ATOMIC_ACQUIRE);

The I/O itself is implemented in newlib,
https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=newlib/libc/sys/amdgcn/write.c

  register void **kernargs asm("s8");
  struct output *data = (struct output *)kernargs[2];

and then the data is filled.


For reverse offload, the idea is to fill it on the device side via
/libgomp/config/gcn/target.c's GOMP_target_ext for
device == GOMP_DEVICE_HOST_FALLBACK && fn != NULL as:

Try to obtain a lock (busy wait)
Put addr/kinds/sizes into the struct
Put the device's fn pointer in the struct
busy wait for completion ('while (fn != NULL) { }')
unlock
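
To make the device-side steps above concrete, here is a rough sketch.
The struct layout and all names (struct rev_offload, rev_offload_post,
rev_offload_finish) are made up for illustration and are not taken from
plugin-gcn.c. It uses C11 atomics so that it compiles on the host; the
real device code would presumably use the __atomic builtins instead.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical slot that kernargs->rev_ptr would point to.  */
struct rev_offload {
  atomic_int lock;       /* 0 = free, 1 = held by one device thread */
  uint64_t mapnum;       /* number of mapped items */
  uint64_t addrs;        /* device address of the addr array */
  uint64_t sizes;        /* device address of the sizes array */
  uint64_t kinds;        /* device address of the kinds array */
  _Atomic uint64_t fn;   /* device stub address; 0 = nothing pending */
};

/* Obtain the lock (busy wait), store the arguments, then publish fn.
   fn is written last, with release semantics, so that the host sees
   the arguments once it observes fn != 0.  */
void
rev_offload_post (struct rev_offload *r, uint64_t fn, uint64_t mapnum,
                  uint64_t addrs, uint64_t sizes, uint64_t kinds)
{
  while (atomic_exchange_explicit (&r->lock, 1, memory_order_acquire) != 0)
    ; /* busy wait for the lock */
  r->mapnum = mapnum;
  r->addrs = addrs;
  r->sizes = sizes;
  r->kinds = kinds;
  atomic_store_explicit (&r->fn, fn, memory_order_release);
}

/* Busy wait until the host has reset fn to 0, then unlock.  */
void
rev_offload_finish (struct rev_offload *r)
{
  while (atomic_load_explicit (&r->fn, memory_order_acquire) != 0)
    ; /* busy wait for completion */
  atomic_store_explicit (&r->lock, 0, memory_order_release);
}
```

GOMP_target_ext would then call rev_offload_post followed by
rev_offload_finish; the split into two functions is only there to keep
the two busy-wait loops apart.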


And on the host side:
If fn == NULL (= no data there) - return to the output/offload checking loop
Otherwise:
Call a new function in target.c and pass the args to it.
Once it has completed, set fn = NULL to indicate it has been processed.
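
As a sketch of that host-side check (again with made-up names: struct
rev_offload, rev_handler and rev_offload_poll are assumptions, and the
callback only stands in for the new function in target.c):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slot shared between device and host.  */
struct rev_offload {
  _Atomic uint64_t fn;   /* device stub address; 0 = nothing pending */
  uint64_t mapnum, addrs, sizes, kinds;
};

/* Stand-in for the new reverse-offload function in target.c.  */
typedef void (*rev_handler) (uint64_t fn, uint64_t mapnum, uint64_t addrs,
                             uint64_t sizes, uint64_t kinds);

/* Called from the polling loop.  Returns false if nothing was pending,
   true if one request has been handled and acknowledged.  */
bool
rev_offload_poll (struct rev_offload *r, rev_handler handle)
{
  uint64_t fn = atomic_load_explicit (&r->fn, memory_order_acquire);
  if (fn == 0)
    return false;   /* no data; go back to the checking loop */
  handle (fn, r->mapnum, r->addrs, r->sizes, r->kinds);
  /* Signal completion; the device is busy waiting on this.  */
  atomic_store_explicit (&r->fn, 0, memory_order_release);
  return true;
}

/* Demo handler that only records which stub was requested.  */
uint64_t demo_last_fn;

void
demo_handler (uint64_t fn, uint64_t mapnum, uint64_t addrs,
              uint64_t sizes, uint64_t kinds)
{
  (void) mapnum; (void) addrs; (void) sizes; (void) kinds;
  demo_last_fn = fn;
}
```

The poll could sit next to the console_output call in the existing
signal-wait loop, so that no additional host thread is needed.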

And in target.c's new reverse-offload-handling function:
- find generated-target function on the host,
  based on device stub function's pointer address
- Handle the mapping
- Call host function
- Handle the mapping
- return

Additionally:

If 'requires reverse_offload' is set, fill not only
the normal splay_tree for "host -> device" lookup but
also another one for the "device -> host" lookups.
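
A minimal sketch of such a "device -> host" lookup follows. To keep it
short it uses a sorted array with bsearch instead of libgomp's
splay_tree, and all names (struct rev_entry, rev_table_sort,
rev_table_lookup, the demo functions) are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Signature assumed here for the generated host-side target function.  */
typedef void (*host_fn_t) (void **hostaddrs);

/* One entry per target function: device stub address -> host function.  */
struct rev_entry {
  uint64_t dev_addr;
  host_fn_t host_fn;
};

static int
rev_entry_cmp (const void *a, const void *b)
{
  uint64_t x = ((const struct rev_entry *) a)->dev_addr;
  uint64_t y = ((const struct rev_entry *) b)->dev_addr;
  return (x > y) - (x < y);
}

/* Done once, after all entries of an image have been registered.  */
void
rev_table_sort (struct rev_entry *tab, size_t n)
{
  qsort (tab, n, sizeof *tab, rev_entry_cmp);
}

/* Returns the host function for a device stub address, or NULL.  */
host_fn_t
rev_table_lookup (const struct rev_entry *tab, size_t n, uint64_t dev_addr)
{
  struct rev_entry key = { dev_addr, NULL };
  const struct rev_entry *e
    = bsearch (&key, tab, n, sizeof *tab, rev_entry_cmp);
  return e ? e->host_fn : NULL;
}

/* Demo host functions standing in for generated target regions.  */
void demo_host_a (void **hostaddrs) { *(int *) hostaddrs[0] += 1; }
void demo_host_b (void **hostaddrs) { *(int *) hostaddrs[0] *= 2; }
```

The table would only be filled if 'requires reverse_offload' is set, so
programs that do not use the feature do not pay for it.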

Does this make sense?

Tobias

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955
