> On 30 Jul 2024, at 19:01, Andi Kleen wrote:
>
> External email: Use caution opening links or attachments
>
>
>> Is that from some kind of rigorous measurement under perf? As you
>> surely know, 0.6% wall-clock time can be from boost clock variation
>> or just run-to-run noise on x86.
>
>
> Am 30.07.2024 um 19:22 schrieb Alexander Monakov :
>
>
> On Tue, 30 Jul 2024, Andi Kleen wrote:
>>> I have looked at this code before. When AVX2 is available, so is SSSE3,
>>> and then a much more efficient approach is available: instead of comparing
>>> against \r \n \\ ? one-by-one, build
On Tue, 30 Jul 2024, Andi Kleen wrote:
> > I have looked at this code before. When AVX2 is available, so is SSSE3,
> > and then a much more efficient approach is available: instead of comparing
> > against \r \n \\ ? one-by-one, build a vector
> >
> > 0 1 2 3 4 5 6 7 8 9a bc
> Is that from some kind of rigorous measurement under perf? As you
> surely know, 0.6% wall-clock time can be from boost clock variation
> or just run-to-run noise on x86.
I compared it using hyperfine which does rigorous measurements yes.
It was well above the run-to-run variability.
I had some
On Tue, Jul 30, 2024 at 08:41:59AM -0700, Andi Kleen wrote:
> From: Andi Kleen
>
> AVX2 is widely available on x86 and it allows to do the scanner line
> check with 32 bytes at a time. The code is similar to the SSE2 code
> path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes.
>
Hi,
On Tue, 30 Jul 2024, Andi Kleen wrote:
> AVX2 is widely available on x86 and it allows to do the scanner line
> check with 32 bytes at a time. The code is similar to the SSE2 code
> path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes.
>
> Also adjust the code to allow inlini
Andrew Pinski writes:
>
> Using the builtin here seems wrong. Why not use the intrinsic
> _mm256_movemask_epi8 ?
I followed the rest of the vectorized code paths. The original reason was that
there was some incompatibility of the intrinsic header with the source
build. I don't know if it's still
On Tue, Jul 30, 2024 at 8:43 AM Andi Kleen wrote:
>
> From: Andi Kleen
>
> AVX2 is widely available on x86 and it allows to do the scanner line
> check with 32 bytes at a time. The code is similar to the SSE2 code
> path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes.
>
> Also ad
From: Andi Kleen
AVX2 is widely available on x86 and it allows to do the scanner line
check with 32 bytes at a time. The code is similar to the SSE2 code
path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes.
Also adjust the code to allow inlining when the compiler
is built for an