https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121218
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> Confirmed also with -fno-tree-loop-vectorize.  The vect_slp debug counter
> behaves a bit odd.
>
> I'll note the abort happens here:
>
> 7205      auto in = AllocateAligned<T>(kVectors * N);
> 7206      auto actual_aligned = AllocateAligned<T>((kVectors + 1) * N + 1);
> 7207      do {
> 7208        if (!(in && actual_aligned)) {
> 7209          __builtin_abort();
>
> (gdb) p in
> $1 = std::unique_ptr<long []> = {get() = 0x454880}
> (gdb) p actual_aligned
> $2 = std::unique_ptr<long []> = {get() = 0x454c80}
>
> I'm not sure what's that supposed to test?  Is that already a reduced
> testcase?
> Why would we expect one of the allocations to fail?

So this is actually a badly reported location; with -fno-tree-tail-merge the
abort happens here:

7070      Store(actual, d, actual_lanes.get());
7071      const auto info = hwy::detail::MakeTypeInfo<T>();
7072      const char *target_name = hwy::TargetName((1LL << 61));
7073      for (int i = 0; i < N; i++) {
7074        if (expected_lanes[i] != actual_lanes[i])
7075          __builtin_abort();
7076      }

So the interesting thing is that with -fdbg-cnt=vect_slp:1-6:12-18 all BB
vector locations point to

template <class D>
static inline __attribute__((always_inline)) __attribute__((unused)) void
StoreInterleaved4(VFromD<D> v0, VFromD<D> v1, VFromD<D> v2, VFromD<D> v3,
                  D d, TFromD<D> *__restrict__ unaligned) {
  for (size_t i = 0; i < MaxLanes(d); ++i) {
    *unaligned++ = v0.raw[i];
    *unaligned++ = v1.raw[i];
    *unaligned++ = v2.raw[i];
    *unaligned++ = v3.raw[i];
  }
}

Placing a memory barrier after the above loop fixes the issue, so I guess
the issue isn't vectorization itself but subsequent optimization of the
vector code together with the code that follows, which is

    StoreInterleaved4(in0, in1, in2, in3, d, actual);
    StoreU(Zero(d), d, actual + kVectors * N);
    Vec<D> out0, out1, out2, out3;
    LoadInterleaved4(d, actual, out0, out1, out2, out3);
    AssertVecEqual(

The problematical instantiation is uint64_t (smaller integer types work).
Likewise V2DI:

__attribute__((noinline)) void TestAllLoadStoreInterleaved4() {
  TestLoadStoreInterleaved4()(uint64_t(), Simd<unsigned long, 2ul, 0>());
}
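
For reference, the store / zero-store / load round trip and the barrier placement described above can be sketched standalone roughly as follows. This is not the reduced testcase and is not claimed to reproduce the miscompile. EmuVec, StoreInterleaved4Sketch, LoadInterleaved4Sketch and RoundTripMatches are hypothetical stand-ins for the Highway types and functions, and the empty asm with a "memory" clobber is only one assumed way to spell "a memory barrier" (a compiler-level barrier in GCC/Clang syntax).

```cpp
#include <cassert>   // for the assert shown in the usage note
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an emulated vector: N lanes of T in a plain array.
template <typename T, std::size_t N>
struct EmuVec {
  T raw[N];
};

// Same loop shape as the StoreInterleaved4 quoted above, followed by a
// compiler-only barrier (assumption: GCC/Clang extended asm).
template <typename T, std::size_t N>
static inline void StoreInterleaved4Sketch(EmuVec<T, N> v0, EmuVec<T, N> v1,
                                           EmuVec<T, N> v2, EmuVec<T, N> v3,
                                           T* __restrict__ unaligned) {
  for (std::size_t i = 0; i < N; ++i) {
    *unaligned++ = v0.raw[i];
    *unaligned++ = v1.raw[i];
    *unaligned++ = v2.raw[i];
    *unaligned++ = v3.raw[i];
  }
  // Keeps the compiler from moving or combining memory accesses across this
  // point; it emits no instruction.
  __asm__ __volatile__("" : : : "memory");
}

// Inverse of the store: de-interleave four vectors from memory.
template <typename T, std::size_t N>
static inline void LoadInterleaved4Sketch(const T* __restrict__ unaligned,
                                          EmuVec<T, N>& v0, EmuVec<T, N>& v1,
                                          EmuVec<T, N>& v2, EmuVec<T, N>& v3) {
  for (std::size_t i = 0; i < N; ++i) {
    v0.raw[i] = *unaligned++;
    v1.raw[i] = *unaligned++;
    v2.raw[i] = *unaligned++;
    v3.raw[i] = *unaligned++;
  }
}

// Store four vectors interleaved, zero the lanes just past them, load them
// back, and report whether every lane survived the round trip.
template <typename T, std::size_t N>
bool RoundTripMatches() {
  constexpr std::size_t kVectors = 4;
  EmuVec<T, N> in[kVectors];
  for (std::size_t k = 0; k < kVectors; ++k)
    for (std::size_t i = 0; i < N; ++i)
      in[k].raw[i] = static_cast<T>(100 * k + i + 1);

  T actual[(kVectors + 1) * N + 1] = {};
  StoreInterleaved4Sketch(in[0], in[1], in[2], in[3], actual);
  // Mirrors StoreU(Zero(d), d, actual + kVectors * N).
  for (std::size_t i = 0; i < N; ++i) actual[kVectors * N + i] = T(0);

  EmuVec<T, N> out[kVectors];
  LoadInterleaved4Sketch(actual, out[0], out[1], out[2], out[3]);

  for (std::size_t k = 0; k < kVectors; ++k)
    for (std::size_t i = 0; i < N; ++i)
      if (in[k].raw[i] != out[k].raw[i]) return false;
  return true;
}
```

The V2DI case corresponds to calling assert((RoundTripMatches<std::uint64_t, 2>())); removing the asm statement gives the variant without the barrier.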