After cunrolling the inner loop, the remaining loop in the testcase has a single 32-bit access and a group of 64-bit accesses. We first try to vectorise at 128 bits (VF 4), but decide not to for cost reasons. We then try with 64 bits (VF 2) instead. This means that the group of 64-bit accesses uses a single-element vector, which is deliberately supported as of r251538. We then try to create "permutes" for these single-element vectors and fall foul of:

      for (i = 0; i < 6; i++)
        sel[i] += exact_div (nelt, 2);

in vect_grouped_store_supported, since nelt==1.

Maybe we shouldn't even be trying to vectorise statements in the
single-element case, and instead just copy the scalar statement
for each member of the group.  But until then, this patch treats
non-strided grouped accesses as VMAT_CONTIGUOUS if no permutation
is necessary.

Tested on aarch64-linux-gnu, x86_64-linux-gnu and powerpc64le-linux-gnu.
OK to install?

Richard


2018-01-09  Richard Sandiford  <richard.sandif...@linaro.org>

gcc/
        PR tree-optimization/83753
        * tree-vect-stmts.c (get_group_load_store_type): Use VMAT_CONTIGUOUS
        for non-strided grouped accesses if the number of elements is 1.

gcc/testsuite/
        PR tree-optimization/83753
        * gcc.dg/torture/pr83753.c: New test.

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c       2018-01-09 15:46:34.439449019 +0000
+++ gcc/tree-vect-stmts.c       2018-01-09 18:15:53.481983778 +0000
@@ -1849,10 +1849,16 @@ get_group_load_store_type (gimple *stmt,
          && (can_overrun_p || !would_overrun_p)
          && compare_step_with_zero (stmt) > 0)
        {
-         /* First try using LOAD/STORE_LANES.  */
-         if (vls_type == VLS_LOAD
-             ? vect_load_lanes_supported (vectype, group_size)
-             : vect_store_lanes_supported (vectype, group_size))
+         /* First cope with the degenerate case of a single-element
+            vector.  */
+         if (known_eq (TYPE_VECTOR_SUBPARTS (vectype), 1U))
+           *memory_access_type = VMAT_CONTIGUOUS;
+
+         /* Otherwise try using LOAD/STORE_LANES.  */
+         if (*memory_access_type == VMAT_ELEMENTWISE
+             && (vls_type == VLS_LOAD
+                 ? vect_load_lanes_supported (vectype, group_size)
+                 : vect_store_lanes_supported (vectype, group_size)))
            {
              *memory_access_type = VMAT_LOAD_STORE_LANES;
              overrun_p = would_overrun_p;
Index: gcc/testsuite/gcc.dg/torture/pr83753.c
===================================================================
--- /dev/null   2018-01-08 18:48:58.045015662 +0000
+++ gcc/testsuite/gcc.dg/torture/pr83753.c      2018-01-09 18:15:53.480983817 +0000
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-mcpu=xgene1" { target aarch64*-*-* } } */
+
+typedef struct {
+  int m1[10];
+  double m2[10][8];
+} blah;
+
+void
+foo (blah *info) {
+  int i, d;
+
+  for (d=0; d<10; d++) {
+    info->m1[d] = 0;
+    info->m2[d][0] = 1;
+    for (i=1; i<8; i++)
+      info->m2[d][i] = 2;
+  }
+}
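
For illustration only, and not part of the patch: a minimal standalone
sketch of why the selector update quoted at the top cannot work for a
single-element vector.  The names exact_div_checked and
build_shifted_selector are hypothetical, and the initial interleave-style
selector contents are an assumption made for the example rather than a
copy of the vectoriser code.  The point is just that nelt / 2 is not an
exact division when nelt == 1.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an exact division: succeed only if B
   divides A with no remainder, storing the result in *QUOTIENT.  */
static bool
exact_div_checked (unsigned a, unsigned b, unsigned *quotient)
{
  if (b == 0 || a % b != 0)
    return false;
  *quotient = a / b;
  return true;
}

/* Fill SEL with an assumed interleave-style selector for two
   NELT-element input vectors, then shift every index by NELT / 2,
   mirroring the loop quoted above.  Return false if NELT / 2 is
   not exact, which is the case that a single-element vector hits.  */
static bool
build_shifted_selector (unsigned nelt, unsigned sel[6])
{
  unsigned half;

  for (unsigned i = 0; i < 3; i++)
    {
      sel[i * 2] = i;            /* Element I of the first vector.  */
      sel[i * 2 + 1] = i + nelt; /* Element I of the second vector.  */
    }

  if (!exact_div_checked (nelt, 2, &half))
    return false;

  for (unsigned i = 0; i < 6; i++)
    sel[i] += half;
  return true;
}
```

With nelt == 4 the shift is 2 and the call succeeds; with nelt == 1 the
division is inexact and the function returns false, which is the
situation the patch sidesteps by choosing VMAT_CONTIGUOUS before any
permute is attempted.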