Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Kyrylo Tkachov
> On 30 Jul 2024, at 19:01, Andi Kleen wrote: > > >> Is that from some kind of rigorous measurement under perf? As you >> surely know, 0.6% wall-clock time can be from boost clock variation >> or just run-to-run noise on x86. > >

Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Richard Biener
> On 30.07.2024 at 19:22, Alexander Monakov wrote: > > On Tue, 30 Jul 2024, Andi Kleen wrote: >>> I have looked at this code before. When AVX2 is available, so is SSSE3, >>> and then a much more efficient approach is available: instead of comparing >>> against \r \n \\ ? one-by-one, build

Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Alexander Monakov
On Tue, 30 Jul 2024, Andi Kleen wrote: > > I have looked at this code before. When AVX2 is available, so is SSSE3, > > and then a much more efficient approach is available: instead of comparing > > against \r \n \\ ? one-by-one, build a vector > > > > 0 1 2 3 4 5 6 7 8 9a bc
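
For readers of the archive: the snippet above is cut off before the vector is spelled out, so what follows is only a rough sketch of the general idea, not Alexander's exact proposal. It relies on the fact that \n, \r, \\ and ? have four distinct low nibbles (a, d, c, f), so one SSSE3 byte shuffle can look up "the special character with this low nibble" for every input byte, and a single compare against the input finds the matches. The table contents, the function names (find_special_lut and friends), the unaligned loads, the scalar tail, and the reliance on GCC/Clang extensions are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define LUT_HAVE_X86 1
#endif

/* Scalar reference: index of the first '\n', '\r', '\\' or '?', or len. */
static size_t find_special_lut_scalar(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] == '\n' || s[i] == '\r' || s[i] == '\\' || s[i] == '?')
            return i;
    return len;
}

#ifdef LUT_HAVE_X86
/* Hypothetical SSSE3 variant: one shuffle plus one compare per 16 bytes. */
static __attribute__((target("ssse3")))
size_t find_special_lut_ssse3(const unsigned char *s, size_t len)
{
    /* Table indexed by the low nibble of each input byte.  Slot 0 holds 1
       (not 0) so that a 0x00 input byte does not match its own lookup. */
    const __m128i lut = _mm_setr_epi8(1, 0, 0, 0, 0, 0, 0, 0,
                                      0, 0, '\n', 0, '\\', '\r', 0, '?');
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        /* pshufb: selects lut[v & 15] per byte, or 0 if the byte's top bit
           is set, so bytes >= 0x80 can never compare equal below. */
        __m128i t = _mm_shuffle_epi8(lut, v);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(t, v));
        if (mask)
            return i + (size_t)__builtin_ctz(mask);
    }
    /* Fewer than 16 bytes left: finish with the scalar loop. */
    return i + find_special_lut_scalar(s + i, len - i);
}
#endif

/* Dispatch at run time; falls back to the scalar loop elsewhere. */
size_t find_special_lut(const unsigned char *s, size_t len)
{
    assert(s != NULL);
#ifdef LUT_HAVE_X86
    if (__builtin_cpu_supports("ssse3"))
        return find_special_lut_ssse3(s, len);
#endif
    return find_special_lut_scalar(s, len);
}
```

Compared with four separate compares and three ORs per chunk, this needs one shuffle and one compare, which is presumably the efficiency gain being referred to.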

Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Andi Kleen
> Is that from some kind of rigorous measurement under perf? As you > surely know, 0.6% wall-clock time can be from boost clock variation > or just run-to-run noise on x86. I compared it using hyperfine which does rigorous measurements yes. It was well above the run-to-run variability. I had some

Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Jakub Jelinek
On Tue, Jul 30, 2024 at 08:41:59AM -0700, Andi Kleen wrote: > From: Andi Kleen > > AVX2 is widely available on x86 and it allows to do the scanner line > check with 32 bytes at a time. The code is similar to the SSE2 code > path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes. >

Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Alexander Monakov
Hi, On Tue, 30 Jul 2024, Andi Kleen wrote: > AVX2 is widely available on x86 and it allows to do the scanner line > check with 32 bytes at a time. The code is similar to the SSE2 code > path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes. > > Also adjust the code to allow inlini

Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Andi Kleen
Andrew Pinski writes: > > Using the builtin here seems wrong. Why not use the intrinsic > _mm256_movemask_epi8 ? I followed the rest of the vectorized code paths. The original reason was that there was some incompatibility of the intrinsic header with the source build. I don't know if it's still

Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Andrew Pinski
On Tue, Jul 30, 2024 at 8:43 AM Andi Kleen wrote: > > From: Andi Kleen > > AVX2 is widely available on x86 and it allows to do the scanner line > check with 32 bytes at a time. The code is similar to the SSE2 code > path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes. > > Also ad

[PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Andi Kleen
From: Andi Kleen AVX2 is widely available on x86 and it allows to do the scanner line check with 32 bytes at a time. The code is similar to the SSE2 code path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes. Also adjust the code to allow inlining when the compiler is built for an
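
For readers of the archive: the patch text itself is not shown in this listing, so the following is only a minimal sketch of what "32 bytes at a time" can look like, not the code from the patch. It compares each 32-byte chunk against the four characters discussed in the thread (\n, \r, \\, ?) and uses the _mm256_movemask_epi8 intrinsic to turn the result into a bit mask. The function names (find_special and friends), the unaligned loads, the scalar tail, the run-time dispatch, and the reliance on GCC/Clang extensions are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SCAN_HAVE_X86 1
#endif

/* Scalar reference: index of the first '\n', '\r', '\\' or '?', or len. */
static size_t find_special_scalar(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] == '\n' || s[i] == '\r' || s[i] == '\\' || s[i] == '?')
            return i;
    return len;
}

#ifdef SCAN_HAVE_X86
/* Hypothetical AVX2 variant: examines 32 input bytes per iteration. */
static __attribute__((target("avx2")))
size_t find_special_avx2(const unsigned char *s, size_t len)
{
    const __m256i nl = _mm256_set1_epi8('\n');
    const __m256i cr = _mm256_set1_epi8('\r');
    const __m256i bs = _mm256_set1_epi8('\\');
    const __m256i qm = _mm256_set1_epi8('?');
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(s + i));
        /* Each compare yields 0xff in the lanes that match. */
        __m256i m = _mm256_or_si256(
            _mm256_or_si256(_mm256_cmpeq_epi8(v, nl), _mm256_cmpeq_epi8(v, cr)),
            _mm256_or_si256(_mm256_cmpeq_epi8(v, bs), _mm256_cmpeq_epi8(v, qm)));
        /* One bit per byte; the lowest set bit is the first match. */
        unsigned mask = (unsigned)_mm256_movemask_epi8(m);
        if (mask)
            return i + (size_t)__builtin_ctz(mask);
    }
    /* Fewer than 32 bytes left: finish with the scalar loop. */
    return i + find_special_scalar(s + i, len - i);
}
#endif

/* Dispatch at run time; falls back to the scalar loop elsewhere. */
size_t find_special(const unsigned char *s, size_t len)
{
    assert(s != NULL);
#ifdef SCAN_HAVE_X86
    if (__builtin_cpu_supports("avx2"))
        return find_special_avx2(s, len);
#endif
    return find_special_scalar(s, len);
}
```

A 16-byte SSE2 version would have the same shape, with __m128i and the _mm_* counterparts of each intrinsic and a 16-bit mask.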