https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85584
Bug ID: 85584
Summary: [og7, nvptx] make generic and per-worker broadcast buffers overlap
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vries at gcc dot gnu.org
Target Milestone: ---

Consider this Fortran testcase, compiled at -O2 with
-foffload=-mlong-vector-in-workers:
...
module param
  integer, parameter :: N = 32
end module param

program main
  use param
  integer :: i, j
  integer :: a(N)

  do i = 1, N
     a(i) = i
  end do

  !$acc parallel copy (a) vector_length (128)
  !$acc loop worker
  do i = 1, N
     !$acc loop vector
     do j = 1, N
        a(j) = a(j) - a(j)
     end do
  end do
  !$acc end parallel

  do i = 1, N
     if (a(i) .ne. 0) call abort
  end do
end program main
...

In the ptx, we generate a broadcast buffer:
...
.shared .align 8 .u8 __oacc_bcast[504];
...
which consists of 9 partitions of 56 bytes each: 1 generic partition and 8
per-worker partitions.

The generic partition is addressed using __oacc_bcast. The per-worker
partitions are addressed using %r109, calculated here:
...
{
  .reg .u32 %tidy;
  .reg .u64 %t_bcast;
  .reg .u64 %y64;
  mov.u32 %tidy,%tid.y;
  cvt.u64.u32 %y64,%tidy;
  add.u64 %y64,%y64,1;
  cvta.shared.u64 %t_bcast,__oacc_bcast;
  mad.lo.u64 %r109,%y64,56,%t_bcast;
}
...

The generic partition broadcasting is guarded with bar.sync 0. The per-worker
partition broadcasting is guarded with bar.sync %r110,128, where %r110 is
calculated here:
...
{
  .reg .u32 %tidy;
  mov.u32 %tidy,%tid.y;
  add.u32 %r110,%tidy,1;
}
...

In principle, it should be possible to make the generic partition overlap with
the per-worker partitions, which would reduce the amount of shared memory used.
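
To make the potential saving concrete, here is a minimal C sketch of the offset arithmetic. The "current" functions mirror the add.u64/mad.lo.u64 sequence above, using the 56-byte partition size and 8 workers from this testcase. The "overlapped" functions are only an assumption about what an overlapping layout might look like (the generic partition sharing storage with worker 0's partition); the function names are hypothetical and not taken from the nvptx backend, and the sketch does not address whether the two kinds of broadcast can safely share storage.

```c
#include <assert.h>
#include <stddef.h>

/* Values taken from the report: 56-byte partitions, 8 workers. */
enum { PARTITION_SIZE = 56, NUM_WORKERS = 8 };

/* Current layout: partition 0 is the generic one, so worker tid_y
   uses partition tid_y + 1 (the "add.u64 %y64,%y64,1" step). */
static size_t current_worker_offset(unsigned tid_y)
{
  return ((size_t) tid_y + 1) * PARTITION_SIZE;
}

static size_t current_total_size(void)
{
  return (size_t) (NUM_WORKERS + 1) * PARTITION_SIZE;
}

/* Hypothetical overlapped layout (an assumption, not the backend's
   code): the generic partition aliases worker 0's partition, so the
   "+ 1" disappears and one partition is saved. */
static size_t overlapped_worker_offset(unsigned tid_y)
{
  return (size_t) tid_y * PARTITION_SIZE;
}

static size_t overlapped_total_size(void)
{
  return (size_t) NUM_WORKERS * PARTITION_SIZE;
}
```

Under these assumptions the buffer would shrink from 504 to 448 bytes, i.e. by one 56-byte partition.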