https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121218

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> Confirmed also with -fno-tree-loop-vectorize.  The vect_slp debug counter
> behaves a bit odd.
> 
> I'll note the abort happens here:
> 
> 7205        auto in = AllocateAligned<T>(kVectors * N);
> 7206        auto actual_aligned = AllocateAligned<T>((kVectors + 1) * N + 1);
> 7207        do {
> 7208          if (!(in && actual_aligned)) {
> 7209            __builtin_abort();
> 
> (gdb) p in
> $1 = std::unique_ptr<long []> = {get() = 0x454880}
> (gdb) p actual_aligned
> $2 = std::unique_ptr<long []> = {get() = 0x454c80}
> 
> I'm not sure what's that supposed to test?  Is that already a reduced
> testcase?
> Why would we expect one of the allocations to fail?

So this is actually a badly reported location; with -fno-tree-tail-merge the
abort happens here:
7070      Store(actual, d, actual_lanes.get());
7071      const auto info = hwy::detail::MakeTypeInfo<T>();
7072      const char *target_name = hwy::TargetName((1LL << 61));
7073      for (int i = 0; i < N; i++) {
7074        if (expected_lanes[i] != actual_lanes[i])
7075          __builtin_abort();
7076      }

So the interesting thing is that with -fdbg-cnt=vect_slp:1-6:12-18 all BB
vectorization locations point to

template <class D>
static inline __attribute__((always_inline)) __attribute__((unused)) void
StoreInterleaved4(VFromD<D> v0, VFromD<D> v1, VFromD<D> v2, VFromD<D> v3, D d,
                  TFromD<D> *__restrict__ unaligned) {
  for (size_t i = 0; i < MaxLanes(d); ++i) {
    *unaligned++ = v0.raw[i];
    *unaligned++ = v1.raw[i];
    *unaligned++ = v2.raw[i];
    *unaligned++ = v3.raw[i];
  }
}

Placing a memory barrier after the above loop fixes the issue, so I guess
the issue isn't the vectorization itself but the subsequent optimization of
the vector code together with the code that follows, which is

    StoreInterleaved4(in0, in1, in2, in3, d, actual);
    StoreU(Zero(d), d, actual + kVectors * N);
    Vec<D> out0, out1, out2, out3;
    LoadInterleaved4(d, actual, out0, out1, out2, out3);
    AssertVecEqual(
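
For reference, the memory-barrier experiment mentioned above could look
roughly like the sketch below.  This is an assumption, not the exact change
that was tested: it uses a GNU-style empty asm with a "memory" clobber as a
compiler-level barrier, and Vec2 and StoreInterleaved4WithBarrier are
hypothetical stand-ins for the Highway types and function, fixed to two
uint64_t lanes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the emulated vector of two uint64_t lanes.
struct Vec2 {
  uint64_t raw[2];
};

// Same store pattern as StoreInterleaved4, followed by a compiler barrier.
static inline __attribute__((always_inline)) void
StoreInterleaved4WithBarrier(Vec2 v0, Vec2 v1, Vec2 v2, Vec2 v3,
                             uint64_t *__restrict__ unaligned) {
  for (size_t i = 0; i < 2; ++i) {
    *unaligned++ = v0.raw[i];
    *unaligned++ = v1.raw[i];
    *unaligned++ = v2.raw[i];
    *unaligned++ = v3.raw[i];
  }
  // Assumed form of the barrier: emits no instructions, but tells the
  // compiler that memory may be read or written here, so the stores above
  // cannot be combined with or moved past later memory accesses.
  __asm__ __volatile__("" ::: "memory");
}
```

With such a barrier in place the stores from the loop are kept separate from
the following StoreU/LoadInterleaved4, which is consistent with the problem
being in the optimization across that boundary.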

The problematic instantiation is uint64_t (smaller integer types work).
Likewise V2DI:

__attribute__((noinline)) void TestAllLoadStoreInterleaved4() {
  TestLoadStoreInterleaved4()(uint64_t(), Simd<unsigned long, 2ul, 0>());
}
