On 02/28/2014 03:37 AM, Thomas Schwinge wrote:
> The process cannot recover from this, trying to continue despite the
> error.  (It is of course questionable what exactly to do in this case, as
> libgomp's internal state may now be corrupt.)  So far, such errors may
> have been rare (aside from real bugs, only/primarily dynamic resource
> exhaustion?), but in the advent of libgomp using external modules (plugin
> interface, for acceleration devices) I expect them to become more
> frequent.

I could see that, yes.  However...

> Does it make sense to add the option for the user to install a handler
> for this?  Three quick ideas, all untested: generally use abort in
> libgomp, which can then be caught with a SIGABRT handler that the user
> has set up (difficult to communicate information from libgomp to the
> handler); adding a weak function GOMP_error that the user can provide and
> that libgomp will call in presence of an error; or provide some GOMP_init
> function for registering an error handler.  The actual interface might be
> something like: an enum to indicate the class (severity?) of the error, a
> const char* for an error message that libgomp has generated (possibly
> forwarded from a plugin), and a boolean return value to tell libgomp to
> either continue or terminate the process.  Then, also libgomp's internal
> initialization could be made more explicit, so that it can (be)
> reinitialize(d) after an error occurred.  It makes sense that the default
> remains to terminate the process in presence of an error.

I've never been keen on weak symbol interposition.  That works for an
application which is the sole user of libgomp, but does not work for libraries
using libgomp and worse for multiple such libraries.

I'd be ok with some kind of registration interface, like

  old = omp_set_error_handler (new);

so that a library can set and restore the handler around its own omp usage.
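
A minimal sketch of that set-and-restore pattern follows.  Nothing here exists
in libgomp: omp_set_error_handler, the handler type, and the mylib_* names are
all hypothetical, and a single global pointer stands in for whatever state the
runtime would really keep (a real version would also need to be thread-safe).

```c
#include <stddef.h>

/* Hypothetical handler type; the real signature is still an open question. */
typedef int (*omp_error_handler_t) (const char *msg);

/* Stand-in for the runtime's state.  NULL means "use the default". */
static omp_error_handler_t current_handler = NULL;

/* Hypothetical registration call: install NEW_HANDLER, return the old one. */
omp_error_handler_t
omp_set_error_handler (omp_error_handler_t new_handler)
{
  omp_error_handler_t old = current_handler;
  current_handler = new_handler;
  return old;
}

/* A library's private handler (hypothetical). */
static int
mylib_handler (const char *msg)
{
  (void) msg;
  return 1;
}

/* Library entry point: swap the handler in around its own OpenMP usage,
   then put back whatever the caller had installed.  */
void
mylib_compute (void)
{
  omp_error_handler_t old = omp_set_error_handler (mylib_handler);
  /* ... the library's parallel regions would go here ... */
  omp_set_error_handler (old);
}
```

Because each library restores the previous value on the way out, several
libraries can nest without knowing about each other, which is what weak symbol
interposition cannot offer.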

As for the interface of the handler itself...  I dunno.  I'd hate to make it
too complicated to actually use.  Perhaps severity could be restricted to a
single bit meaning "cannot possibly continue" or "transient error".  Maybe with
the error string, since who knows what kind of weirdness is going on in the
plugin.
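
To make the shape concrete, here is a rough sketch of a handler that takes a
one-bit severity plus a message, and of how a runtime might dispatch to it.
Every name below (gomp_error_severity, gomp_error_handler_t, report_error,
log_and_continue) is invented for illustration; none of this is an existing
libgomp interface.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical one-bit severity. */
enum gomp_error_severity
{
  GOMP_ERROR_TRANSIENT,  /* the runtime could carry on */
  GOMP_ERROR_FATAL       /* cannot possibly continue */
};

/* Hypothetical handler type: return true to ask the runtime to continue. */
typedef bool (*gomp_error_handler_t) (enum gomp_error_severity severity,
                                      const char *msg);

static gomp_error_handler_t handler;  /* NULL selects the default */

/* How a runtime might dispatch an error.  Returns only if execution
   continues.  The handler is always called so it can see the message, but
   its answer is ignored for fatal errors; with no handler installed the
   process terminates, as it does today.  */
static void
report_error (enum gomp_error_severity severity, const char *msg)
{
  bool keep_going = handler != NULL && handler (severity, msg);
  if (severity == GOMP_ERROR_FATAL || !keep_going)
    {
      fprintf (stderr, "libgomp: %s\n", msg);
      exit (EXIT_FAILURE);
    }
}

/* Example user handler: log the message, continue only if transient. */
static bool
log_and_continue (enum gomp_error_severity severity, const char *msg)
{
  fprintf (stderr, "warning: %s\n", msg);
  return severity == GOMP_ERROR_TRANSIENT;
}
```

With two arguments and a boolean result, the handler stays small enough to
write in a few lines, which is the point of keeping severity to one bit.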

A significant question is what thread this error would be reported on, and the
state of the other threads in the team when it happens.  This is the primary
reason that OMP does not allow EH to propagate from inside the parallel region
to outside -- the exception would have nowhere to go if it were raised on any
thread other than the primary thread.

In order to be usable at all, one would have to arrange for the error to be
reported on a thread that could do something about it.  Does it do any good to
report the error on thread 3, if thread 3 can't communicate with the user to
actually report the error?
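
One possible arrangement, shown here at user level rather than inside libgomp,
is for worker threads to only record the error and for the thread that started
the region to report it after the region ends.  This is a sketch under that
assumption; process_item and run_all are hypothetical names, and the code
behaves the same (serially) if built without -fopenmp.

```c
#include <stdio.h>

#define N_ITEMS 8

/* Stand-in for per-item work; item 3 fails.  Returns NULL on success. */
static const char *
process_item (int i)
{
  return i == 3 ? "item 3 failed" : NULL;
}

/* Runs the loop; returns the first error that was recorded, or NULL. */
const char *
run_all (void)
{
  const char *first_error = NULL;  /* shared by all threads */

#pragma omp parallel for shared(first_error)
  for (int i = 0; i < N_ITEMS; i++)
    {
      const char *err = process_item (i);
      if (err != NULL)
        {
          /* Workers only record the error; they do not report it. */
#pragma omp critical
          if (first_error == NULL)
            first_error = err;
        }
    }

  /* Back on the thread that started the region, after the implicit
     barrier, where talking to the user is possible.  */
  if (first_error != NULL)
    fprintf (stderr, "error: %s\n", first_error);
  return first_error;
}
```

This does not answer what state the rest of the team is in when the error
occurs; it only moves the report to a thread that can act on it.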

I suppose there are a few types of "continuable" errors that might want some
sort of user interaction.  E.g. if the error is from a plugin, before we've
committed to performing any computation with the plugin, we could well be in a
state where we could fall back to the host for execution.  But we might still
want to let the user know that the plugin failed and that this is why the run
is going to be slow this time.

> I have not yet researched what other OpenACC or OpenMP implementations
> are doing, or other compiler support/runtime libraries.

That would be good to know.


r~

