https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88464
Bug ID: 88464
Summary: AVX-512 vectorization of masked scatter failing with "not suitable for scatter store"
Product: gcc
Version: 8.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: mokreutzer at gmail dot com
Target Milestone: ---

Hi,

I have the following simple loop which I want to compile for Skylake (AVX-512):

================================
#pragma GCC ivdep
for (int i = 0; i < n; ++i)
{
    if (b[off1[i]] < b[off2[i]])
        a[off1[i]] = b[off1[i]];
    else
        a[off2[i]] = b[off2[i]];
}
================================

Given AVX-512 masked scatter instructions and the absence of data conflicts ("ivdep"), vectorization should be possible along the lines of:

1. gather b[off1[i]] into zmm1
2. gather b[off2[i]] into zmm2
3. compare zmm1 and zmm2 with "<" and store result in mask1
4. compare zmm1 and zmm2 with ">=" and store result in mask2
5. scatter zmm1 to a[off1[i]] with mask1
6. scatter zmm2 to a[off2[i]] with mask2

However, GCC is not able to vectorize this loop (failing with "not vectorized: not suitable for scatter store"). I have tested this with the latest GCC trunk, but the issue also occurs with all previous versions. If you want to have a look, here's a Godbolt example: https://godbolt.org/z/Is7Zml

I understand that this loop is not a trivial case for vectorization and that AVX-512 hasn't been around for too long, so it's likely that it isn't fully supported yet. But still, I'm wondering:

1. Am I missing some flags or hints to GCC in order to vectorize this loop? (I can imagine something related to the cost model, etc.)
2. Or is GCC currently just not capable of vectorizing it?

If the answer is "2.":

3. Can we estimate the amount of work needed to support this?
4. Is there any plan for when this kind of pattern will be supported?
5. If it's realistic for a non-GCC developer to look into this, is there anything I can do to help?

Many thanks in advance,
Moritz
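
For illustration, the six-step scheme above can be modelled in plain C, with each block of 16 iterations standing in for one zmm register of 32-bit elements. This is only a sketch of the intended transformation, not what GCC emits and not intrinsics code. The name cond_scatter_model, the LANES constant, and the element types (float data, int offsets) are assumptions, since the report does not state them.

```c
#include <stdint.h>

#define LANES 16  /* assumption: 16 x 32-bit elements per zmm register */

/* Scalar model of the gather / compare / masked-scatter scheme.
 * Hypothetical helper; float data and int offsets are assumed types. */
void cond_scatter_model(float *a, const float *b,
                        const int *off1, const int *off2, int n)
{
    for (int base = 0; base < n; base += LANES) {
        /* The last block may be partial; only 'lanes' lanes are active. */
        int lanes = (n - base < LANES) ? (n - base) : LANES;
        float v1[LANES], v2[LANES];
        uint16_t mask1 = 0, mask2 = 0;

        /* Steps 1-2: gather b[off1[i]] and b[off2[i]]. */
        for (int l = 0; l < lanes; ++l) {
            v1[l] = b[off1[base + l]];
            v2[l] = b[off2[base + l]];
        }

        /* Steps 3-4: build the two masks. mask2 is the complement of
         * mask1 over the active lanes, which mirrors the 'else' branch. */
        for (int l = 0; l < lanes; ++l) {
            if (v1[l] < v2[l])
                mask1 |= (uint16_t)(1u << l);
            else
                mask2 |= (uint16_t)(1u << l);
        }

        /* Step 5: masked scatter of v1 to a[off1[i]]. */
        for (int l = 0; l < lanes; ++l)
            if ((mask1 >> l) & 1u)
                a[off1[base + l]] = v1[l];

        /* Step 6: masked scatter of v2 to a[off2[i]]. */
        for (int l = 0; l < lanes; ++l)
            if ((mask2 >> l) & 1u)
                a[off2[base + l]] = v2[l];
    }
}
```

One detail the model makes visible: if the data is floating-point, a literal ">=" compare in step 4 is not the exact complement of "<", because both are false when either operand is NaN, whereas the original else branch is taken in that case. Taking mask2 as "not less than" keeps the behaviour of the scalar loop.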