https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88981
--- Comment #2 from Tom de Vries <vries at gcc dot gnu.org> ---
One thing worth noting here: with #pragma acc wait added, the program (compiled with -O0) takes ~10 seconds to finish on my quadro 1200m. Without the #pragma acc wait, it still takes 10 seconds.

Inspecting with a debugger to see where it is waiting (since there is no wait responsible for this) shows that we hang in either cuMemFree or cuCtxDestroy.

I can't find documentation of this hanging behaviour, so it may be specific to the driver version, card, or architecture.
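
For illustration only, the with/without-wait comparison described above might be sketched as follows. This is not the test case from this report: the function name run_async_region, the array size N and the trivial loop body are all hypothetical, and a kernel this small will not show the ~10 second delay. It only shows where the wait sits relative to the async region. Built without -fopenacc the pragmas are ignored and the loop runs on the host.

```c
#include <stdio.h>

#define N 1024

/* Hypothetical sketch, not the reproducer from the bug report.
   Runs one async region and, if use_wait is nonzero, waits for it.
   Returns the last element when it is safe to read it, else -1.  */
static int run_async_region(int use_wait)
{
    static int a[N];

    /* Launch asynchronously; the host thread continues immediately.  */
#pragma acc parallel loop async copyout(a[0:N])
    for (int i = 0; i < N; i++)
        a[i] = i;

    if (use_wait) {
        /* Block until all outstanding async work has completed.  */
#pragma acc wait
        return a[N - 1];
    }

    /* No wait: the result may not have been copied back yet, so do
       not read a[] here.  Any remaining work is left for the runtime
       to deal with at teardown.  */
    return -1;
}

/* Small helper showing how the two variants would be called.  */
static void report(void)
{
    printf("with wait:    %d\n", run_async_region(1));
    printf("without wait: %d\n", run_async_region(0));
}
```

Timing each variant separately (for example with time(1) on two builds) is what the comparison above amounts to; the sketch deliberately avoids reading the array in the no-wait case, since nothing guarantees the copy-out has finished.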