Bug#1107746: gcc-14-offload-nvptx: offloading to gpu disabled if #pragma omp requires unified_shared_memory is used (it worked before on the same hardware)

Giacomo Mulas Fri, 13 Jun 2025 10:10:17 -0700

Package: gcc-14-offload-nvptx
Version: 14.2.0-19
Severity: normal

Dear Maintainer,


I found that with the current Debian gcc-14-offload-nvptx and
gcc-13-offload-nvptx packages, code that requires unified_shared_memory
and uses OpenMP to offload to the gpu is never run on the gpu.  The offload
code is compiled, but it is never executed on the device.  If I run the
program with OMP_TARGET_OFFLOAD=MANDATORY, it fails with the error:

libgomp: OMP_TARGET_OFFLOAD is set to MANDATORY, but only the host device is available

which is incorrect.  If I comment out the line

#pragma omp requires unified_shared_memory

then the code _does_ run on the gpu.

What is weird is that exactly the same code, on exactly the same hardware,
used to work in the past.  I cannot pinpoint which update of which package
caused this issue.  I do know that the same code used to work on my laptop,
and that it still offloads to the gpu on another machine running an old
CentOS release with a hand-compiled gcc 13 configured for offloading to
nvptx-none.  Did anything change in the required system setup that I
missed?  What puzzles me is that it used to work on exactly the same
laptop, running sid, some time ago (about a year?).  Since the code _does_
compile and _does_ run without errors unless OMP_TARGET_OFFLOAD=MANDATORY
is set explicitly, the change may have occurred a while ago without my
noticing until now.

Here is a simple hello-world program that shows the issue:

#include <stdio.h>
#include <math.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#define NX 100000000

#pragma omp requires unified_shared_memory

int main(void)
{
  double vecA[NX],vecB[NX],vecC[NX];
  double r=0.2;

/* Initialization of vectors */
#pragma omp target teams distribute parallel for simd
  for (long i = 0; i < NX; i++) {
     vecA[i] = pow(r, i);
     vecB[i] = 1.0;
  }

/* element-wise product of two vectors (summed below to give the dot product) */
#pragma omp target teams distribute parallel for simd
  for (long i = 0; i < NX; i++) {
     vecC[i] = vecA[i] * vecB[i];
  }

  double sum = 0.0;
  /* calculate the sum */
  #pragma omp target teams distribute parallel for simd reduction(+:sum)
  for (long i = 0; i < NX; i++) {
    sum += vecC[i];
  }
  printf("The sum is: %8.6f \n", sum);
  return 0;
}

which I can compile with e.g.

gcc -O3 -fopenmp -foffload=nvptx-none -foffload-options="-O3 -fopt-info -lm" -o test test.c -lm

which compiles flawlessly and, due to the -fopt-info option given to the
offload compiler (and only to the offload compiler), gives out the following
info:

test.c:25:14: optimized: loop unrolled 7 times
test.c:32:9: optimized: loop unrolled 7 times
test.c:19:14: optimized: loop unrolled 3 times
test.c:30:11: optimized: basic block part vectorized using 16 byte vectors
test.c:23:9: optimized: basic block part vectorized using 16 byte vectors
test.c:16:9: optimized: basic block part vectorized using 16 byte vectors

If I run it (after allowing for a large stack with ulimit -s unlimited,
since the three arrays take about 2.4 GB of stack) with just

./test

I get:

The sum is: 1.250000

If I run it with

OMP_TARGET_OFFLOAD=MANDATORY ./test

and the #pragma omp requires unified_shared_memory line is active, I get

libgomp: OMP_TARGET_OFFLOAD is set to MANDATORY, but only the host device is available

whereas if I comment out #pragma omp requires unified_shared_memory,
recompile, and rerun with

OMP_TARGET_OFFLOAD=MANDATORY ./test

it does run on the gpu and produces the expected result.  I verified
with nvidia-smi that it is indeed running on the gpu.

Thanks in advance for any information on how to get it to run on the gpu
again when
#pragma omp requires unified_shared_memory
is enabled.

Best regards,
Giacomo Mulas

-- System Information:
Debian Release: 13.0
  APT prefers unstable
  APT policy: (401, 'unstable'), (10, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 6.12.32-amd64 (SMP w/12 CPU threads; PREEMPT)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_WARN, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=it_IT.UTF-8, LC_CTYPE=it_IT.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages gcc-14-offload-nvptx depends on:
ii  gcc-14                 14.2.0-19
ii  gcc-14-base            14.2.0-19
ii  libc6                  2.41-8
ii  libc6-dev              2.41-8
ii  libgmp10               2:6.3.0+dfsg-3
ii  libgomp-plugin-nvptx1  14.2.0-19
ii  libmpc3                1.3.1-1+b3
ii  libmpfr6               4.2.2-1
ii  libzstd1               1.5.7+dfsg-1
ii  nvptx-tools            0.20240810-3
ii  zlib1g                 1:1.3.dfsg+really1.3.1-1+b1

gcc-14-offload-nvptx recommends no packages.

gcc-14-offload-nvptx suggests no packages.

-- no debconf information
