Re: [PATCH v3 5/6] util/bufferiszero: optimize SSE2 and AVX2 variants

2024-02-06 Thread Richard Henderson
On 2/7/24 06:48, Alexander Monakov wrote:
> Increase unroll factor in SIMD loops from 4x to 8x in order to move their bottlenecks from ALU port contention to load issue rate (two loads per cycle on popular x86 implementations).

Ah, that answers my question re 128 vs 256 byte minimum. So as far a

[PATCH v3 5/6] util/bufferiszero: optimize SSE2 and AVX2 variants

2024-02-06 Thread Alexander Monakov
Increase unroll factor in SIMD loops from 4x to 8x in order to move their bottlenecks from ALU port contention to load issue rate (two loads per cycle on popular x86 implementations). Avoid using out-of-bounds pointers in loop boundary conditions. Follow SSE2 implementation strategy in the AVX2 v
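
For readers without the patch at hand, the two ideas in the commit message above (eight loads per iteration, and loop bounds that never form an out-of-bounds pointer) might look roughly like the following. This is only a sketch under stated assumptions, not the code from the patch: the names `block128_is_zero` and `buffer_is_zero_sse2_sketch` are hypothetical, all loads are unaligned for simplicity, and the byte-wise fallback for short buffers is an illustrative choice.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the 128 bytes at p are all zero.
 * Eight independent 16-byte loads feed a small OR tree, so each
 * iteration issues many loads per ALU operation chain. */
static inline bool block128_is_zero(const unsigned char *p)
{
    const __m128i *v = (const __m128i *)p;
    __m128i t0 = _mm_or_si128(_mm_loadu_si128(v + 0), _mm_loadu_si128(v + 1));
    __m128i t1 = _mm_or_si128(_mm_loadu_si128(v + 2), _mm_loadu_si128(v + 3));
    __m128i t2 = _mm_or_si128(_mm_loadu_si128(v + 4), _mm_loadu_si128(v + 5));
    __m128i t3 = _mm_or_si128(_mm_loadu_si128(v + 6), _mm_loadu_si128(v + 7));
    __m128i t  = _mm_or_si128(_mm_or_si128(t0, t1), _mm_or_si128(t2, t3));

    /* All 16 byte lanes compare equal to zero iff the mask is 0xFFFF. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(t, _mm_setzero_si128())) == 0xFFFF;
}

/* Hypothetical sketch, not the function from the patch. */
static bool buffer_is_zero_sse2_sketch(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    if (len < 128) {
        /* Assumption for this sketch: short buffers take a scalar path. */
        for (size_t i = 0; i < len; i++) {
            if (p[i]) {
                return false;
            }
        }
        return true;
    }

    /* Start of the final 128-byte block; always inside the buffer. */
    const unsigned char *last = p + (len - 128);

    /* p < last implies p + 128 < buf + len, so neither the loads nor
     * the incremented pointer ever leave the buffer. */
    while (p < last) {
        if (!block128_is_zero(p)) {
            return false;
        }
        p += 128;
    }

    /* The final block may overlap bytes already checked; that is harmless. */
    return block128_is_zero(last);
}
```

Handling the tail with one overlapping block, rather than comparing against a one-past-the-end pointer rounded to the block size, is one way to keep every pointer the loop computes within the buffer.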