https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110456
Bug ID: 110456
Summary: vectorization with loop masking prone to STLF issues
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---
#include <stdlib.h>

void __attribute__((noipa))
test (double * __restrict a, double *b, int n, int m)
{
  for (int j = 0; j < m; ++j)
    for (int i = 0; i < n; ++i)
      a[i + j*n] = a[i + j*n /* + 512 */] + b[i + j*n];
}

double a[1024];
double b[1024];

int main(int argc, char **argv)
{
  int m = atoi (argv[1]);
  for (long i = 0; i < 1000000000; ++i)
    test (a + 4, b + 4, 4, m);
  return 0;
}
This shows that when we apply loop masking with --param vect-partial-vector-usage,
the masked stores generally prohibit store-to-load forwarding, especially when
there is only a partial overlap with a following load, as when traversing a
multi-dimensional array as above. The testcase runs noticeably slower than
when the loads are offset (uncomment the /* + 512 */).
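A possible way to reproduce (the -march value is an assumption; any target
with AVX-512 masked stores should show the effect):

```shell
# Build the testcase with loop masking enabled for the main loop.
gcc -O3 -march=znver4 --param vect-partial-vector-usage=2 test.c -o test-masked
time ./test-masked 64

# Rebuild with the loads offset (uncomment the /* + 512 */ in the source)
# and compare; the offset variant avoids the partial-overlap STLF stalls.
```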
The situation is difficult to avoid in general, but there may be simple
heuristics worth implementing, such as avoiding loop masking when a loop
contains a read-modify-write operation to the same memory location (with or
without an immediately visible outer loop). For unknown dependences that are
disambiguated at runtime, a sufficient distance between any read and write
operation could be ensured as well.
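The read-modify-write heuristic could look something like the sketch below.
The data-reference representation and all names here are illustrative, not
GCC's actual internals; the point is only the shape of the check: if a store
in the loop targets the same location as a load, prefer an unmasked main loop
with an epilogue over loop masking, since the masked store will block
store-to-load forwarding for the next iteration's partially overlapping load.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical data-reference record (illustrative only).  */
struct data_ref
{
  const void *base;   /* base object of the access */
  long offset;        /* constant offset within the object, if known */
  bool is_read;
};

/* Return true if some store in the loop writes a location that a load in
   the same loop reads, i.e. the loop performs a read-modify-write.  */
static bool
has_rmw_to_same_location (const struct data_ref *drs, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    if (!drs[i].is_read)
      for (size_t j = 0; j < n; ++j)
        if (drs[j].is_read
            && drs[j].base == drs[i].base
            && drs[j].offset == drs[i].offset)
          return true;
  return false;
}
```

For the testcase above, the load and store of a[i + j*n] would trigger the
check, while the /* + 512 */ variant (load and store at different offsets)
would not.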