On 08/11/16 09:46, James Greenhalgh wrote:
> On Mon, Nov 07, 2016 at 01:39:53PM +0000, Richard Earnshaw (lists) wrote:
>> This patch contains an implementation of search_line_fast for the CPP
>> lexer.  It's based in part on the AArch32 (ARM) code but incorporates
>> new instructions available in AArch64 (reduction add operations) plus
>> some tricks for reducing the realignment overheads.  We assume a page
>> size of 4k, but that's a safe assumption -- AArch64 systems can never
>> have a smaller page size than that: on systems with larger pages we
>> will go through the realignment code more often than strictly
>> necessary, but it's still likely to be in the noise (less than 0.5% of
>> the time).
>>
>> Bootstrapped on aarch64-none-linux-gnu.
> 
> Some very minor nits wrt. style for the Advanced SIMD intrinsics,
> otherwise OK from me.
> 
>>
>> +  const uint8x16_t xmask = (uint8x16_t) vdupq_n_u64 (0x8040201008040201ULL);
> 
> It is a pedantic point, but these casts are a GNU extension, the
> "portable" way to write this would be:
> 
>   vreinterpretq_u8_u64 (vdupq_n_u64 (0x8040201008040201ULL));
We've used GNU-style casts in the original code and never encountered
problems.  I personally find the reinterpret casts less readable.

> 
>> +
>> +#ifdef __AARCH64EB__
>> +  const int16x8_t shift = {8, 8, 8, 8, 0, 0, 0, 0};
> 
> This sort of vector initialisation is a bit scary for user programmers,
> as we shouldn't generally mix Neon intrinsics with the GNU extensions
> (for exactly the reason you have here, keeping BE and LE straight is
> extra effort).
> 
> This could be written portably as:
> 
>   vcombine_u16 (vdup_n_u16 (8), vdup_n_u16 (0));

Nice idea, but that's the wrong way around, and fixing it currently
generates *terrible* code.

> Or if you prefer to be explicit about the elements:
> 
>   int16_t buf[] = {8, 8, 8, 8, 0, 0, 0, 0};
>   int16x8_t shift = vld1q_s16 (buf);
> 
>> +#else
>> +  const int16x8_t shift = {0, 0, 0, 0, 8, 8, 8, 8};
>> +#endif
>> +
>> +  unsigned int found;
>> +  const uint8_t *p;
>> +  uint8x16_t data;
>> +  uint8x16_t t;
>> +  uint16x8_t m;
>> +  uint8x16_t u, v, w;
>> +
>> +  /* Align the source pointer.  */
>> +  p = (const uint8_t *)((uintptr_t)s & -16);
>> +
>> +  /* Assuming random string start positions, with a 4k page size we'll take
>> +     the slow path about 0.37% of the time.  */
>> +  if (__builtin_expect ((AARCH64_MIN_PAGE_SIZE
>> +                         - (((uintptr_t) s) & (AARCH64_MIN_PAGE_SIZE - 1)))
>> +                        < 16, 0))
>> +    {
>> +      /* Slow path: the string starts near a possible page boundary.  */
>> +      uint32_t misalign, mask;
>> +
>> +      misalign = (uintptr_t)s & 15;
>> +      mask = (-1u << misalign) & 0xffff;
>> +      data = vld1q_u8 (p);
>> +      t = vceqq_u8 (data, repl_nl);
>> +      u = vceqq_u8 (data, repl_cr);
>> +      v = vorrq_u8 (t, vceqq_u8 (data, repl_bs));
>> +      w = vorrq_u8 (u, vceqq_u8 (data, repl_qm));
>> +      t = vorrq_u8 (v, w);
> 
> Can you trust the compiler to perform the reassociation here, rather
> than doing it manually?  That would let you write this in the more
> natural form:
> 
>   t = vceqq_u8 (data, repl_nl);
>   t = vorrq_u8 (t, vceqq_u8 (data, repl_cr));
>   t = vorrq_u8 (t, vceqq_u8 (data, repl_bs));
>   t = vorrq_u8 (t, vceqq_u8 (data, repl_qm));

Maybe, but we have plenty of spare registers (this is target-specific
code, I know what's happening).  Either way, the reassoc code is
currently messing with this and serializing the VORRQ operations.

>> +      t = vandq_u8 (t, xmask);
>> +      m = vpaddlq_u8 (t);
>> +      m = vshlq_u16 (m, shift);
>> +      found = vaddvq_u16 (m);
>> +      found &= mask;
>> +      if (found)
>> +        return (const uchar*)p + __builtin_ctz (found);
>> +    }
>> +  else
>> +    {
>> +      data = vld1q_u8 ((const uint8_t *) s);
>> +      t = vceqq_u8 (data, repl_nl);
>> +      u = vceqq_u8 (data, repl_cr);
>> +      v = vorrq_u8 (t, vceqq_u8 (data, repl_bs));
>> +      w = vorrq_u8 (u, vceqq_u8 (data, repl_qm));
>> +      t = vorrq_u8 (v, w);
>> +      if (__builtin_expect (vpaddd_u64 ((uint64x2_t)t), 0))
>> +        goto done;
> 
> As above, this cast is a GNU extension:
> 
>   if (__builtin_expect (vpaddd_u64 (vreinterpretq_u64_u8 (t)), 0))
> 
>> +    }
>> +
>> +  do
>> +    {
>> +      p += 16;
>> +      data = vld1q_u8 (p);
>> +      t = vceqq_u8 (data, repl_nl);
>> +      u = vceqq_u8 (data, repl_cr);
>> +      v = vorrq_u8 (t, vceqq_u8 (data, repl_bs));
>> +      w = vorrq_u8 (u, vceqq_u8 (data, repl_qm));
>> +      t = vorrq_u8 (v, w);
>> +    } while (!vpaddd_u64 ((uint64x2_t)t));
> 
> Likewise here.
> 
> Thanks,
> James
> 
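
P.S.  For anyone following along who hasn't seen the trick before, here is
a rough standalone sketch of what the xmask / vpaddlq_u8 / vshlq_u16 /
vaddvq_u16 sequence is doing.  It is not the patch code: it is
little-endian only, scans for a single character rather than four, and the
helper name (newline_mask) is made up for the example.

  #include <arm_neon.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Return a 16-bit mask with bit i set when buf[i] == '\n'.  */
  static unsigned int
  newline_mask (const uint8_t *buf)
  {
    /* Byte i of xmask holds the bit value 1 << (i % 8).  */
    const uint8x16_t xmask
      = vreinterpretq_u8_u64 (vdupq_n_u64 (0x8040201008040201ULL));
    static const int16_t shift_tab[8] = {0, 0, 0, 0, 8, 8, 8, 8};
    const int16x8_t shift = vld1q_s16 (shift_tab);

    uint8x16_t data = vld1q_u8 (buf);
    uint8x16_t t = vceqq_u8 (data, vdupq_n_u8 ('\n')); /* 0xff per match.  */
    t = vandq_u8 (t, xmask);        /* Byte i -> bit (i % 8), or zero.  */
    uint16x8_t m = vpaddlq_u8 (t);  /* Pairwise add, widening to 16 bits.  */
    m = vshlq_u16 (m, shift);       /* Bits from bytes 8..15 move up by 8.  */
    return vaddvq_u16 (m);          /* Sum all lanes: one bit per byte.  */
  }

  int
  main (void)
  {
    uint8_t buf[16] = "abc\ndefgh\nklmno";
    unsigned int found = newline_mask (buf);
    /* Prints mask = 0x0208, first = 3.  */
    printf ("mask = %#06x, first = %d\n", found, __builtin_ctz (found));
    return 0;
  }

Because every byte of the AND result contributes a distinct bit, neither
the pairwise add nor the final horizontal add can carry, so the result is
an exact 16-bit presence mask and __builtin_ctz gives the offset of the
first match within the 16-byte chunk (compile with -O2 on an AArch64
target to try it).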