https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89445
Bug ID: 89445
Summary: [8 regression] _mm512_maskz_loadu_pd "forgets" to use the mask
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: thiago at kde dot org
Target Milestone: ---

Created attachment 45793
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45793&action=edit
example showing segmentation fault

In the following code:

void daxpy(size_t n, double a, double const* __restrict x, double* __restrict y)
{
    const __m512d v_a = _mm512_broadcastsd_pd(_mm_set_sd(a));
    const __mmask16 final = (1U << (n % 8u)) - 1;
    __mmask16 mask = 65535u;
    for (size_t i = 0; i < n * sizeof(double); i += 8 * sizeof(double)) {
        if (i + 8 * sizeof(double) > n * sizeof(double))
            mask = final;
        __m512d v_x = _mm512_maskz_loadu_pd(mask, (char const *)x + i);
        __m512d v_y = _mm512_maskz_loadu_pd(mask, (char const *)y + i);
        __m512d tmp = _mm512_fmadd_pd(v_x, v_a, v_y);
        _mm512_mask_storeu_pd((char *)y + i, mask, tmp);
    }
}

When compiled with GCC 8, the loop looks like:

.L5:
        cmpq    %rax, %r10
        cmovb   %r9d, %r8d
        movzbl  %r8b, %ecx
        kmovd   %ecx, %k1
        leaq    (%rdx,%rax), %rcx
        vmovapd (%rsi,%rax), %zmm1{%k1}{z}
        vmovapd (%rcx), %zmm2{%k1}{z}
        vfmadd132pd     %zmm0, %zmm2, %zmm1
        vmovupd %zmm1, (%rcx){%k1}
        addq    $64, %rax
        cmpq    %rdi, %rax
        jb      .L5

Whereas GCC trunk (as of r269073) generates:

.L5:
        vmovapd (%rsi,%rax), %zmm1
        cmpq    %rax, %r9
        vfmadd213pd     (%rdx,%rax), %zmm0, %zmm1
        cmovb   %r8d, %ecx
        kmovb   %ecx, %k1
        vmovupd %zmm1, (%rdx,%rax){%k1}
        addq    $64, %rax
        cmpq    %rdi, %rax
        jb      .L5

Godbolt link: https://gcc.godbolt.org/z/2ys7ZO

Since neither memory load is masked, the resulting registers can contain garbage and trigger FP exceptions. They can also cause segmentation faults if portions of the source lie in unmapped pages. The attached example forces the operation onto a page boundary where half of the 64 bytes addressed by the second load are unmapped; when run, the example crashes.