Extend GCC's loop vectorizer to recognize loops with predicate-guarded stores into a buffer with an offset incremented under the same predicate. Enable the backend to emit AVX-512 VPCOMPRESS for the vectorized compressing-store path when that is profitable.
This RFC is in response to PR tree-optimization/91198 (GCC not generating AVX-512 compress/expand instructions in relevant cases): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91198 We will be sending an initial prototype patch for phase 1 soon to gcc-patches list. ------------------------------------------------------------------------------ Running example =============== The following kernel is used throughout this document to illustrate the proposed design. It is reduced from the SPEC CPU 2017 benchmark 527.cam4_r. int cam4_filter_pack_kernel_restrict (int N, double thresh, const double *restrict cwm, double *restrict coef, double *restrict fwaut, int *restrict ind) { int ncols = 0; for (int i = 0; i < N; ++i) { coef[i] = 0.0; fwaut[i] = 0.0; if (cwm[i] > thresh) { ind[ncols] = i + 1; ncols++; } } return ncols; } Assumption: Before the loop vectorizer pass, in GIMPLE IR, the loops of interest combine three features: (1) a predicate that controls whether the current iteration executes the store or not, (2) a loop-carried PHI that counts how many elements have been stored so far, and (3) a masked store whose address depends on that PHI. If-conversion pass has already run on the loop and it has created the masked store that uses that predicate and the value to be stored. Note: This document focuses mostly on simple loops like the above running example. It also assumes the relevant pointers or arrays do not alias (e.g. restrict), which simplifies runtime address and bound checks. More complex cases (like multiple number of predicated stores using same or different induction variables, epilogue vectorization, etc.) and addition of runtime aliasing checks can be covered in an extended design as a follow-up. ------------------------------------------------------------------------------ This RFC outlines a phased approach: Phase 1 detects and classifies the pattern early, Phase 2 represents it with explicit internal functions, and Phase 3 updates the predicated offset of the store and vectorizes the predicated store with compressing store using VPCOMPRESS instruction. ------------------------------------------------------------------------------ Phase 1 — Classification of the predicated PHI ============================================== Phase 1 detects the pattern early by requiring a header PHI for the running offset which is updated only when the predicate holds (termed as predicated-PHI from now on), and a masked store in the same loop that uses the predicate and uses the predicated-PHI as index to get address of the store. It also requires that when the predicate is true, the predicated-PHI must advance by exactly one element per scalar iteration. Phase 1 records that predicated-PHI along with the masked store on which it is paired with. Input - GIMPLE after if-conversion: the inner loop containing the PHI for ncols, the conditional update, and .MASK_STORE on ind, as in the following GIMPLE IR of the running example. Representative GIMPLE at vectorizer input: # ncols_49 = PHI <ncols_17(8), 0(17)> # i_51 = PHI <_58(8), 0(17)> _1 = (long unsigned int) i_51; _2 = _1 * 8; _11 = cwm_40(D) + _2; _12 = *_11; _58 = i_51 + 1; _56 = _12 > thresh_41(D); _13 = (long unsigned int) ncols_49; _14 = _13 * 4; _90 = _3 + _14; _15 = (int *) _90; .MASK_STORE (_15, 32B, _56, _58); _91 = (unsigned int) ncols_49; _92 = _91 + 1; ncols_44 = (int) _92; ncols_17 = _56 ? ncols_44 : ncols_49; Output Phase 1 performs the +1 legality check and records the predicated-PHI and pair it with the masked store: ncols_49 is marked as the predicated-PHI, and the loop’s vectorizer state records that PHI together with the original masked store .MASK_STORE (_15, 32B, _56, _58). Later phases use this recorded association as the anchor for internal-function rewrites and backend pattern selection. ------------------------------------------------------------------------------ Phase 2 — Internal functions and pattern recognition ==================================================== Phase 2 introduces two internal functions: one for the predicated-PHI update (if the predicate holds, add one, else leave it unchanged), and one for the compressing masked store, bundling the store's address, predicate mask, and value. They line up with the predicated-PHI and mask-store pair that Phase 1 already recorded. Input - Legality information recorded in Phase 1 (predicated-PHI, paired IFN_MASK_STORE). When that classification remains valid, Phase 2 goes ahead with introducing and wiring the internal functions. Output - Pattern recognition rewrites the classified predicated-PHI and original IFN_MASK_STORE onto two internal functions: IFN_PREDICATED_INDEX_UPDATE(ncols_49, _56) for the predicated-PHI update, and IFN_MASK_COMPRESS_STORE(_15, _56, _58) for the compressing masked store (ptr, mask, value). ------------------------------------------------------------------------------ Phase 3 — Vector semantics, aliasing, costing ============================================= This phase is further split into incremental milestones. Milestone M1 — Updating the predicated-PHI in vectorized IR ----------------------------------------------------------- Milestone M1 gives the predicated-PHI the right semantics in the vector loop: each vector step must advance the index by the sum of active predicate lanes. Input - Pattern IR from Phase 2: IFN_PREDICATED_INDEX_UPDATE(ncols_49, _56) Output - The predicated-PHI is updated from the widened predicate using one of two routes — (1) cast the comparison mask to a scalar value and apply popcount, or (2) use a horizontal lane reduction that yields the active-lane count. The exact choice can be decided later. Milestone M2 — Dependence and alias ----------------------------------- Milestone M2 integrates compress-store with dependence and alias analysis by treating it consistently with masked stores so that vectorization is legalized. Milestone M3 — Costing ---------------------- Milestone M3 costs the vector compress path against scalar and alternative vector schemes. Milestone M4 — Enable RTL to generate Vector compress store ----------------------------------------------------------- Milestone M4 makes necessary changes in RTL expansion phase, to detect the internal function IFN_MASK_COMPRESS_STORE and use respective RTL expander routies to generate VPCOMPRESS.
