Hi,

From the code below, it looks like we always call “vect_permute_load_chain” to load non-unit strides whose size is a power of 2.

(---snip---)
  /* If reassociation width for vector type is 2 or greater target machine can
     execute 2 or more vector instructions in parallel.  Otherwise try to
     get chain for loads group using vect_shift_permute_load_chain.  */
  mode = TYPE_MODE (STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt)));
  if (targetm.sched.reassociation_width (VEC_PERM_EXPR, mode) > 1
      || exact_log2 (size) != -1
      || !vect_shift_permute_load_chain (dr_chain, size, stmt,
                                         gsi, &result_chain))
    vect_permute_load_chain (dr_chain, size, stmt, gsi, &result_chain);

static bool
vect_shift_permute_load_chain (vec<tree> dr_chain,
                               unsigned int length,
                               gimple *stmt,
                               gimple_stmt_iterator *gsi,
                               vec<tree> *result_chain)
{
  ......
  ......
  if (exact_log2 (length) != -1 && LOOP_VINFO_VECT_FACTOR (loop_vinfo) > 4)   ⇐ This is not used.
    {
      unsigned int j, log_length = exact_log2 (length);
      for (i = 0; i < nelt / 2; ++i)
        sel[i] = i * 2;
      for (i = 0; i < nelt / 2; ++i)
        sel[nelt / 2 + i] = i * 2 + 1;
(---snip------)

Is there any reason to do so?

I have not done any benchmarking, but I tried simple test cases for -mavx targets with sizes 2 and 4 and VF > 4 (short/char types). Using vect_shift_permute_load_chain looks better there.

Should we change it to something like this?

diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index d0e20da..b0f0a02 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -5733,9 +5733,9 @@ vect_transform_grouped_load (gimple *stmt, vec<tree> dr_chain, int size,
      get chain for loads group using vect_shift_permute_load_chain.  */
   mode = TYPE_MODE (STMT_VINFO_VECTYPE (vinfo_for_stmt (stmt)));
   if (targetm.sched.reassociation_width (VEC_PERM_EXPR, mode) > 1
-      || exact_log2 (size) != -1
-      || !vect_shift_permute_load_chain (dr_chain, size, stmt,
-                                         gsi, &result_chain))
+      || (!vect_shift_permute_load_chain (dr_chain, size, stmt,
+                                          gsi, &result_chain)
+          && exact_log2 (size) != -1))
     vect_permute_load_chain (dr_chain, size, stmt, gsi, &result_chain);
   vect_record_grouped_load_vectors (stmt, result_chain);
   result_chain.release ();

regards,
Venkat.
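
To make the difference between the two conditions concrete, here is a minimal standalone sketch of their short-circuit behaviour. It is only a model, not GCC code: is_pow2, Outcome, current_dispatch, proposed_dispatch and check_model are hypothetical names, and the result of vect_shift_permute_load_chain is passed in as the plain boolean shift_ok rather than computed.

```cpp
#include <cassert>

// Hypothetical stand-in for "exact_log2 (size) != -1".
bool is_pow2 (unsigned size)
{
  return size != 0 && (size & (size - 1)) == 0;
}

// What one evaluation of the condition ends up doing.
struct Outcome
{
  bool tried_shift;  // vect_shift_permute_load_chain was called
  bool ran_permute;  // vect_permute_load_chain was called
};

// Model of the current condition:
//   reassoc_width > 1 || pow2 (size) || !shift (...)
Outcome current_dispatch (bool reassoc_gt1, unsigned size, bool shift_ok)
{
  if (reassoc_gt1 || is_pow2 (size))
    return {false, true};       // short-circuit: the shift variant is never tried
  return {true, !shift_ok};     // fall back only when the shift variant fails
}

// Model of the proposed condition:
//   reassoc_width > 1 || (!shift (...) && pow2 (size))
Outcome proposed_dispatch (bool reassoc_gt1, unsigned size, bool shift_ok)
{
  if (reassoc_gt1)
    return {false, true};
  return {true, !shift_ok && is_pow2 (size)};
}

void check_model ()
{
  // Size 4, reassociation width 1: the current code never tries the shift variant.
  assert (!current_dispatch (false, 4, true).tried_shift);

  // Same inputs with the proposed condition: the shift variant is tried and,
  // when it succeeds, the generic permute is skipped.
  Outcome p = proposed_dispatch (false, 4, true);
  assert (p.tried_shift && !p.ran_permute);

  // Size 3 with the shift variant failing: the current code falls back to the
  // generic permute, the proposed condition as written does not.
  assert (current_dispatch (false, 3, false).ran_permute);
  assert (!proposed_dispatch (false, 3, false).ran_permute);
}
```

In this model the proposed condition runs neither routine to completion when the size is not a power of 2 and the shift variant returns false. Whether that case can actually occur depends on the parts of vect_shift_permute_load_chain that are snipped above, so it may be worth checking before changing the condition.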