https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85584

            Bug ID: 85584
           Summary: [og7, nvptx] make generic and per-worker broadcast
                    buffers overlap
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vries at gcc dot gnu.org
  Target Milestone: ---

Consider this Fortran testcase, compiled at -O2 with
-foffload=-mlong-vector-in-workers:
...
module param
  integer, parameter :: N = 32
end module param

program main
  use param
  integer :: i, j
  integer :: a(N)

  do i = 1, N
    a(i) = i
  end do

  !$acc parallel copy (a) vector_length (128)
  !$acc loop worker
    do i = 1, N
       !$acc loop vector 
       do j = 1, N
          a(j) = a(j) - a(j)
       end do
    end do
  !$acc end parallel

  do i = 1, N
    if (a(i) .ne. 0) call abort
  end do

end program main
...

In the ptx, we generate a broadcast buffer:
...
.shared .align 8 .u8 __oacc_bcast[504];
...
which consists of 9 partitions of 56 bytes each (9 * 56 = 504): 1 generic
partition and 8 per-worker partitions.

The generic partition is addressed using __oacc_bcast. The per-worker
partitions are addressed using %r109, which is calculated here:
...
{
    .reg .u32 %tidy;
    .reg .u64 %t_bcast;
    .reg .u64 %y64;
    mov.u32 %tidy,%tid.y;
    cvt.u64.u32 %y64,%tidy;
    add.u64 %y64,%y64,1;
    cvta.shared.u64 %t_bcast,__oacc_bcast;
    mad.lo.u64 %r109,%y64,56,%t_bcast;
}
...
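
As a rough illustration, the address computation above can be restated in C.
This is only a sketch: the names PARTITION_SIZE, NUM_WORKERS, BCAST_SIZE and
worker_partition_offset are hypothetical and do not come from the GCC sources;
the constants 56 and 8 are the ones observed in this testcase's ptx.

```c
#include <assert.h>
#include <stdint.h>

enum {
  PARTITION_SIZE = 56,  /* bytes per partition, as seen in the ptx */
  NUM_WORKERS = 8,      /* number of per-worker partitions */
  /* 1 generic partition followed by NUM_WORKERS per-worker partitions.  */
  BCAST_SIZE = (1 + NUM_WORKERS) * PARTITION_SIZE
};

/* Byte offset of worker TID_Y's partition relative to __oacc_bcast.
   Mirrors:  add.u64 %y64,%y64,1;  mad.lo.u64 %r109,%y64,56,%t_bcast;
   Partition 0 is the generic one, so worker 0 starts at offset 56.  */
static uint64_t
worker_partition_offset (uint32_t tid_y)
{
  assert (tid_y < NUM_WORKERS);
  return ((uint64_t) tid_y + 1) * PARTITION_SIZE;
}
```

With these values, worker 0 starts at byte 56 and worker 7 at byte 448, so the
last partition ends exactly at byte 504.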

Broadcasting via the generic partition is guarded with bar.sync 0.
Broadcasting via a per-worker partition is guarded with bar.sync %r110,128,
where %r110 is calculated here:
...
  {
    .reg .u32 %tidy;
    mov.u32 %tidy,%tid.y;
    add.u32 %r110,%tidy,1;
  }
...
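
The barrier numbering can be sketched in C in the same way. The names
GENERIC_BARRIER, MAX_BARRIER and worker_barrier_id are hypothetical; the only
facts used are that the generic broadcast uses barrier 0, that each worker uses
barrier %tid.y + 1, and that ptx provides barrier numbers 0 through 15.

```c
#include <assert.h>
#include <stdint.h>

enum {
  GENERIC_BARRIER = 0,  /* bar.sync 0: generic partition broadcast */
  MAX_BARRIER = 15      /* ptx barrier numbers range from 0 to 15 */
};

/* Barrier number used by worker TID_Y for its own partition.
   Mirrors:  add.u32 %r110,%tidy,1;
   Barrier 0 stays reserved for the generic broadcast.  */
static uint32_t
worker_barrier_id (uint32_t tid_y)
{
  uint32_t id = tid_y + 1;
  assert (id <= MAX_BARRIER);
  return id;
}
```

For the 8 workers in this testcase that gives barriers 1 through 8. The second
operand of bar.sync %r110,128 is the number of participating threads, which
here presumably corresponds to vector_length (128).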

In principle, it should be possible to make the generic partition overlap with
the per-worker partitions, which would reduce shared memory usage.
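
One possible overlapping layout, given purely as a sketch and not as the
intended implementation, lets the generic partition share storage with the
first per-worker partition. All names below are hypothetical, and the layout
assumes that a generic broadcast and a per-worker broadcast never use the
buffer at the same time, which would need to be verified.

```c
#include <assert.h>
#include <stdint.h>

enum {
  PARTITION_SIZE = 56,  /* bytes per partition, as seen in the ptx */
  NUM_WORKERS = 8,      /* number of per-worker partitions */
  CURRENT_SIZE = (1 + NUM_WORKERS) * PARTITION_SIZE,  /* separate generic */
  OVERLAPPED_SIZE = NUM_WORKERS * PARTITION_SIZE      /* generic overlaps */
};

/* Current layout: the generic partition sits at offset 0, so the
   per-worker partitions start one partition further in.  */
static uint64_t
current_offset (uint32_t tid_y)
{
  assert (tid_y < NUM_WORKERS);
  return ((uint64_t) tid_y + 1) * PARTITION_SIZE;
}

/* Hypothetical overlapped layout: the generic partition reuses the
   storage of worker 0, so the per-worker partitions start at offset 0.  */
static uint64_t
overlapped_offset (uint32_t tid_y)
{
  assert (tid_y < NUM_WORKERS);
  return (uint64_t) tid_y * PARTITION_SIZE;
}
```

Under these assumptions the buffer for this testcase would shrink from 504 to
448 bytes, a saving of one 56-byte partition.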
