Hi Jakub!

Thanks for the review.

On Tue, 20 Oct 2015 12:02:45 +0200, Jakub Jelinek <ja...@redhat.com> wrote:
> On Mon, Oct 19, 2015 at 06:44:40PM +0200, Thomas Schwinge wrote:
> > > How's the following (complete patch instead of incremental patch; the
> > > driver changes are still the same as before)?  The changes are:
> > > 
> > >   * libgomp/target.c:gomp_target_init again loads all the plugins.
> > >   * libgomp/target.c:resolve_device and
> > >     libgomp/oacc-init.c:resolve_device verify that a default device
> > >     (OpenMP device-var ICV, and acc_device_default, respectively) is
> > >     actually enabled, or resort to host fallback if not.
> > >   * GOMP_set_offload_targets renamed to GOMP_enable_offload_targets; used
> > >     to enable devices specified by -foffload.  Can be called multiple
> > >     times (executable, any shared libraries); the set of enabled devices
> > >     is the union of all those ever requested.
> > >   * GOMP_offload_register (but not the new GOMP_offload_register_ver)
> > >     changed to enable all devices.  This is to maintain compatibility
> > >     with old executables and shared libraries built without the -foffload
> > >     constructor support.
> 
> Any reason not to pass the bitmask of the enabled targets to
> GOMP_offload_register_ver instead, to decrease the amount of ctors and
> the times you lock the various locks during initialization, or just enable
> automatically the devices you load data for during GOMP_offload_register_ver?
> I mean, GOMP_offload_register would enable for compatibility all devices,
> GOMP_offload_register_ver would enable the device it is registered for.
> For -foffload=disable on all shared libraries/binaries, naturally you would
> not register anything, thus would not enable any devices (only host fallback
> would work).

As explained a few times already: GOMP_offload_register_ver constructors
will only be generated if there actually are offloaded code regions, but
for example:

    #include <openacc.h>
    int main()
    {
      __builtin_printf("%d\n", acc_get_num_devices(acc_device_nvidia));
      return 0;
    }

... is a valid OpenACC program (untested), which doesn't contain any
offloaded code regions.  As a user I'd expect it to return different
answers if compiled with -foffload=nvptx-none in contrast to
-foffload=disable.  Actually, I can foresee exactly such code to be used
to probe for offloading being available, for example in testsuites.  And,
I guess we agree that under -foffload=disable we'd like the
compilation/runtime system to be configured in a way that no offloading
will happen?

Always creating (dummy) GOMP_offload_register_ver constructors has been
another suggestion that I had voiced much earlier in this thread (months
ago), but everyone (including me) taking part in the discussion agreed
that it'd cause even higher compile-time overhead.

> Or are you worried about the case where one shared library is compiled
> with say -foffload=intelmic,ptx but doesn't actually contain any
> #pragma omp target/#pragma omp declare target (or OpenACC similar
> #directives), but only contains #pragma omp target data and/or the device
> query/copying routines, then dlopens some other shared library that actually
> has the offloading device code?

That's another example, yes.

> That could be solved by adding the call you are talking about, but
> if we really should care about that unlikely case, it would be better to
> only arrange for it if really needed by the shared library (i.e. if it calls
> one of the OpenMP or OpenACC library routines that talk to the devices, or
> has #pragma omp target data or similar constructs;
> I'd strongly prefer not to have constructors in code that just got compiled
> with -fopenmp, even in configuration where some offloading is configured by
> default, when nothing in the code really cares about offloading.

So, how to resolve our different opinions?  I mean, for any serious
program code, there will be constructor calls into libgomp already; are
you expecting that adding one more really will cause any noticeable
overhead?

I agree that enabling devices for GOMP_offload_register_ver calls makes
sense.  (I indeed had considered this earlier, but it didn't lead to
solving the problem complete -- see above.)  Can we come up with a scheme
to do it this way, and only generate the GOMP_enable_offload_targets
constructor of no GOMP_offload_register_ver constructors have been
generated?  But I have no idea how to implement that in a non-convoluted
way.  (And, it sounds excessive to me in terms of implementation overhead
on our side, in contrast to execution overhead of one libgomp constructor
call.)

> > --- a/gcc/gcc.c
> > +++ b/gcc/gcc.c
> > @@ -401,6 +401,8 @@ static const char 
> > *compare_debug_auxbase_opt_spec_function (int, const char **);
> >  static const char *pass_through_libs_spec_func (int, const char **);
> >  static const char *replace_extension_spec_func (int, const char **);
> >  static const char *greater_than_spec_func (int, const char **);
> > +static const char *add_omp_infile_spec_func (int, const char **);
> > +
> >  static char *convert_white_space (char *);
> >  
> >  /* The Specs Language
> 
> I'd like to defer review of the driver bits, can Joseph or Bernd please have
> a look at those?

Joseph has already been working on this code, completing my earlier WIP
patch while I've been out of office, and has submitted it for trunk
inclusion, so I'm assuming these changes do have his blessing.

> > --- a/libgomp/libgomp-plugin.h
> > +++ b/libgomp/libgomp-plugin.h
> > @@ -48,7 +48,8 @@ enum offload_target_type
> >    OFFLOAD_TARGET_TYPE_HOST = 2,
> >    /* OFFLOAD_TARGET_TYPE_HOST_NONSHM = 3 removed.  */
> >    OFFLOAD_TARGET_TYPE_NVIDIA_PTX = 5,
> > -  OFFLOAD_TARGET_TYPE_INTEL_MIC = 6
> > +  OFFLOAD_TARGET_TYPE_INTEL_MIC = 6,
> > +  OFFLOAD_TARGET_TYPE_HWM
> 
> What is HWM?  Is that OFFLOAD_TARGET_TYPE_LAST what you mean?

Nathan has used this term before (libgomp/openacc.h:acc_device_t), and he
told me this means "High Water Mark".  I have no strong opinion on the
name to use, just want to mention that "*_LAST" sounds to me like that
one still is part of the accepted set, whereas in this case it'd be the
first enumerator outside of the accepted ones.  (And I guess, we agree
that "OFFLOAD_TARGET_TYPE_INTEL_LAST = 6" followed by
"OFFLOAD_TARGET_TYPE_INTEL_MIC = OFFLOAD_TARGET_TYPE_INTEL_LAST" is
ugly?)

> > --- a/libgomp/target.c
> > +++ b/libgomp/target.c
> > @@ -72,6 +72,9 @@ static int num_offload_images;
> >  /* Array of descriptors for all available devices.  */
> >  static struct gomp_device_descr *devices;
> >  
> > +/* Set of enabled devices.  */
> > +static bool devices_enabled[OFFLOAD_TARGET_TYPE_HWM];
> 
> I must say I don't like the locking for this.

Are you worried about the performance issues of a very short locking
cycle that in the majority of all cases should happen without blocking,
in comparison to performance issues related to host/device memory
transfers or kernel launches that will follow after the call to
gomp_offload_target_enabled_p?  I don't really think that is reasonable
to worry about.

> If all you ever change on this is that you change it from 0 to 1,
> then supposedly just storing it with __atomic_store, perhaps with
> rel semantics, and reading it as __atomic_load, with acquire semantics,
> would be good enough?  And perhaps change it into int array,
> so that it is actually atomic even on the old Alphas (if there are any
> around).

If you're really worried about this, I can look into that, but to me that
sounds like unwarranted code complexity/premature optimization...


Grüße
 Thomas

Attachment: signature.asc
Description: PGP signature

Reply via email to