On 03/30/2018 07:45 AM, Tom de Vries wrote:
> On 03/30/2018 03:07 AM, Tom de Vries wrote:
>> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>>> As a follow-up patch will show, the nvptx BE falls back to using
>>> vector_length = 32 when a vector loop is nested inside a worker loop.
>>
>> I disabled the fallback, and analyzed the vred2d-128.c illegal memory
>> access execution failure.
>>
>> I minimized that down to this ptx:
>> ...
>> .shared .align 8 .u8 __oacc_bcast[176];
>>
>> {
>>   {
>>     .reg .u32 %x;
>>     mov.u32 %x,%tid.x;
>>     setp.ne.u32 %r86,%x,0;
>>   }
>>
>>   {
>>     .reg .u32 %tidy;
>>     .reg .u64 %t_bcast;
>>     .reg .u64 %y64;
>>     mov.u32 %tidy,%tid.y;
>>     cvt.u64.u32 %y64,%tidy;
>>     add.u64 %y64,%y64,1;
>>     cvta.shared.u64 %t_bcast,__oacc_bcast;
>>     mad.lo.u64 %r66,%y64,88,%t_bcast;
>>   }
>>
>>   @ %r86 bra $L28;
>>   st.u32 [%r66+80],0;
>> $L28:
>>   ret;
>> }
>> ...
>>
>> The ptx is launched with 2 workers and vector_length 128.
>>
>> So, 2 workers mean %tid.y has values 0 and 1.
>> Then %y64 has values 1 and 2.
>> Then %r66 has values __oacc_bcast + (1 * 88) and __oacc_bcast + (2 * 88).
>> Then the st.u32 accesses __oacc_bcast + (1 * 88) + 80 and __oacc_bcast
>> + (2 * 88) + 80.
>>
>> So we're accessing memory at offset 256, while __oacc_bcast is only
>> 176 bytes big.
>>
>> I formulated the following assert, which AFAIU detects this situation
>> in the compiler:
>> ...
>> @@ -1125,6 +1125,8 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
>>    fprintf (file, "\t}\n");
>>  }
>>
>> +static int nvptx_mach_max_workers ();
>> +
>>  /* Emit code to initialize OpenACC worker broadcast and synchronization
>>     registers.  */
>>
>> @@ -1148,6 +1150,7 @@ nvptx_init_oacc_workers (FILE *file)
>>             "// vector broadcast offset\n",
>>             REGNO (cfun->machine->bcast_partition),
>>             oacc_bcast_partition);
>> +      gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 1)
>> +                  <= oacc_bcast_size);
>>      }
>>    if (cfun->machine->sync_bar)
>>      fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
>>
>> The assert is not triggered when the fallback is used.
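
Just to sanity check the arithmetic above outside the compiler, here's a
small standalone sketch (the variable names are mine, taken from the
minimized ptx, not from the nvptx BE):
...
#include <assert.h>
#include <stdio.h>

int
main (void)
{
  const int bcast_size = 176;  /* .shared .u8 __oacc_bcast[176]  */
  const int partition = 88;    /* stride used by the mad.lo.u64  */
  const int workers = 2;       /* launch dimension for %tid.y  */

  for (int tidy = 0; tidy < workers; tidy++)
    {
      /* %r66 = __oacc_bcast + (%tid.y + 1) * 88, then st.u32 [%r66+80].  */
      int offset = (tidy + 1) * partition + 80;
      printf ("tid.y = %d stores 4 bytes at offset %d of %d\n",
              tidy, offset, bcast_size);
      assert (offset + 4 <= bcast_size);  /* Fails for tid.y == 1.  */
    }

  return 0;
}
...
The second iteration trips the assert at offset 256, matching the illegal
access described above.
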
>
> I've tracked the problem down to:
> ...
>> -      if (oacc_bcast_size < data.offset)
>> -        oacc_bcast_size = data.offset;
>> +      if (oacc_bcast_partition < data.offset)
>> +        {
>> +          int psize = data.offset;
>> +          psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
>> +          oacc_bcast_partition = psize;
>> +          oacc_bcast_size = psize * (nvptx_mach_max_workers () + 1);
>> +        }
>
> ...
>
> We hit this if clause for the first compiled function, which has
> num_workers(1).
>
> This sets oacc_bcast_partition and oacc_bcast_size as required for that
> function.
>
> Then we hit this if clause for a second compiled function, which has
> num_workers(2).
>
> We need oacc_bcast_size to be updated, but the condition
> 'oacc_bcast_partition < data.offset' is false, so the update doesn't
> happen.
>
> I managed to fix this by making the code unconditional, and using MAX to
> update oacc_bcast_partition and oacc_bcast_size.
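
IIUC that would look something along these lines (just my sketch of the
unconditional/MAX variant you describe, not necessarily the exact change):
...
      int psize = data.offset;
      psize = (psize + oacc_bcast_align - 1) & ~(oacc_bcast_align - 1);
      oacc_bcast_partition = MAX (oacc_bcast_partition, psize);
      oacc_bcast_size = MAX (oacc_bcast_size,
                             oacc_bcast_partition
                             * (nvptx_mach_max_workers () + 1));
...
That way the second function, with num_workers(2), still grows
oacc_bcast_size even though its data.offset doesn't exceed the partition
size established by the first function.
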
It looks like that's fallout from this patch
<https://gcc.gnu.org/ml/gcc-patches/2018-03/msg01212.html>. I should
have checked that patch with the vector length fallback disabled.
Cesar