https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119108
--- Comment #10 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Matthew Malcomson from comment #9)
> (In reply to Tamar Christina from comment #8)
> > Ok, so having looked at this I'm not sure the compiler is at fault here.
> >
> > Similar to the SVN case, the snappy code is misaligning the loads
> > intentionally and loading 64 bits at a time from the 8-bit pointer:
> ...
> > So I think this is a case where the compiler can't do anything.  (I also
> > think that the C code uses UB similar to SVN: they misalign the byte
> > array by 4 bytes but load 8 bytes at a time.  They get lucky that the
> > vector code is never entered.)
> ...
> >
> > The code would benefit if they:
> >
> > 1. added restrict to the functions, as e.g. in `FindMatchLengthPlain`
> > the values are manually vectorized anyway, so aliasing must not be a
> > problem
> > 2. had a simple scalar loop variant that's left up to the vectorizer to
> > vectorize.  This would actually give them faster code and allow e.g.
> > SVE codegen.
>
> Thanks for looking into it Tamar!
>
> A few questions (some just because I want to make sure I understand --
> some more on topic ;-)
>
> Just to understand:
> - What SVN case are you referencing?

I'm on holiday till Friday, but have some time to kill on the train, so.. I
will make typos.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119016

> - How is this UB?  The UNALIGNED_LOAD64 seems to use `memcpy`, and they
> provide a relevant limit on the reads of 8 bytes at a time.

Yeah, but the pointer isn't aligned to the type they're loading: they
manually misalign it by 4 in the caller and then read a wider element.  C
requires the alignment to be that of the type, and they load a uint64_t
from a misaligned 8-bit pointer.

When we vectorize, since the load is uint64_t we peel based on this size.
It's assumed that by peeling 64-bit chunks at a time we can arrive at the
desired vector alignment (e.g. if it's V2DI we assume we can peel n 64-bit
elements to get to 128-bit alignment).  Since most AArch64 cores don't trap
on unaligned loads we just keep peeling and never enter the vector loop, as
alignment is never reached.  Hence you get the overhead of vectorization
but never the benefit.

> More relevant to the issue:
> - I tried adding `__restrict__` to `s1` and `s2` in
> `FindMatchLengthPlain` and replacing the function with a plain loop.  I
> saw a significant slowdown.  Is your point that this would allow the
> compiler to do something about the code even though it may not be better
> right now?

Yes, though the question is whether the loop actually entered the vector
code this time.

From memory I also recall that the loop requires a not-equals vector
comparison, but AArch64 doesn't have a cmpne for Adv. SIMD.  So this
generates a cmpeq + not, which adds to the critical path unexpectedly.  I
do have a patch to fix this ready for GCC 16, and this in particular is a
case where SVE would have done much better, but we need first-faults
support first.

But the question is: do we enter the vector code now?

> Or did you mean inline the loop or something.  (N.b. I didn't double-check
> the codegen of that function -- just ran the benchmark naively again -- so
> if there was any obvious adjustment in flags or the like I should have
> made, I didn't make it ;-)

Inlining should be less of an issue with restrict, since the alias checks
would be gone.  But I suspect your slowdown is that you still don't enter
the vector code, and so you stay in the scalar loop, which now does 1
element at a time instead of 4.
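For reference, this is roughly the plain scalar variant I have in mind --
just a sketch, the name and signature here are mine and not snappy's actual
FindMatchLengthPlain:

#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time compare loop with restrict-qualified pointers.  With the
   aliasing question out of the way, the vectorizer is free to pick the
   vector length itself (including SVE once first-faults support lands)
   instead of the hand-written 64-bit chunking.  */
size_t
find_match_length_plain (const uint8_t *__restrict__ s1,
                         const uint8_t *__restrict__ s2, size_t limit)
{
  size_t matched = 0;
  while (matched < limit && s1[matched] == s2[matched])
    matched++;
  return matched;
}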
Could you check with perf what it executes now?  FWIW, an easy patch to just
recover the regression would be to add #pragma GCC novector in front of the
loop.  But I do think that we should be able to do better than that
hand-vectorized loop if there are enough elements.
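Something like this -- again just a sketch with made-up names, not snappy's
code -- to show where the pragma goes, directly in front of the scalar loop:

#include <stddef.h>
#include <stdint.h>

/* Same loop shape as the sketch above, but the pragma tells GCC not to
   vectorize it, so no peeling prologue gets generated and the scalar loop
   runs without the extra overhead.  */
size_t
find_match_length_novector (const uint8_t *s1, const uint8_t *s2,
                            size_t limit)
{
  size_t matched = 0;
#pragma GCC novector
  while (matched < limit && s1[matched] == s2[matched])
    matched++;
  return matched;
}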