Hi Paul,

> > Candidates for optimization:
> >
> >   - The C locale handling
> >     https://sourceware.org/bugzilla/show_bug.cgi?id=19932
> >     https://sourceware.org/bugzilla/show_bug.cgi?id=29511
> >     It's now a clear POSIX violation. Would it make sense to get this fixed
> >     in glibc, so that gnulib's override can be dropped on future glibc
> >     versions?
>
> Absolutely. Does наб's patch in the latter bug report look good to you?

Just saw that наб is already at patch v16, on libc-alpha. I gave a bit of
feedback about it now. Since this will be fixed in glibc in the near future,
I will hold off on optimizing the invocation of hard_locale.

> Although mbiter is faster on ASCII than on non-ASCII, it's not
> well-optimized compared to mbcel. On Fedora x86-64 here is the kernel of
> an mbcel-based loop that merely scans ASCII and adds each byte's
> numerical value to a sum:
>
> .L28: addq    %rax, %rbp
>       movl    $1, %eax
>       addq    %rax, %rbx
>       cmpq    %r12, %rbx
>       jnb     .L19
>       movsbq  (%rbx), %rax
>       testb   %al, %al
>       jns     .L28
>
> where %rbp = sum, %rbx = pointer to next byte, and %r12 = pointer just
> past end of input.
>
> In contrast, with mbiter the kernel is:
>
> .L24: movq    136(%rsp), %rax
>       movq    112(%rsp), %r14
>       movq    $1, 144(%rsp)
>       movsbl  (%rbx), %edx
>       movb    $1, 152(%rsp)
>       leaq    1(%rax), %rbx
>       movl    %edx, 156(%rsp)
>       movq    %rbx, 136(%rsp)
>       addq    %rdx, %r15
>       movb    $0, 128(%rsp)
>       cmpq    %r14, %rbx
>       jnb     .L5
> .L14: movzbl  (%rbx), %ecx
>       movl    %ecx, %eax
>       shrb    $5, %al
>       andl    $7, %eax
>       movl    is_basic_table(,%rax,4), %eax
>       shrl    %cl, %eax
>       testb   $1, %al
>       jne     .L24
>
> where %r15 = sum, %rbx = pointer to next byte, %r14 = pointer just past
> end of input.

Excellent result!!! This means that mbiter can and should get the following
optimizations:
  - Optimize away is_basic_table; a simpler range check for [0x00..0x7F],
    like in mbcel, will speed this up.
  - The "movb $0, 128(%rsp)" instruction should already be gone, thanks to
    my patch "Optimize away the in_shift field" from yesterday.
  - There are 6 other instructions that read from or write to the struct on
    the stack. It seems that gcc does not optimize this as well as the
    struct-as-return-value situation.
I'll benchmark this again...

> > - Resetting an mbstate_t: Should we define a function
> >     void mbszero (mbstate_t *);
> >   that clears the relevant part of an mbstate_t (i.e. 24 bytes instead
> >   of 128 bytes on BSD systems)?
> >   Advantage: performance.
> >   Drawback: Yet another gnulib-invented, nonstandard API.
>
> It's likely worth it for mbcel on BSDish hosts. Quite possibly it's also
> worth it for mbiter and mbuiter. Not sure it's worth it everywhere.

Good, thanks for your opinion. I'll then add an 'mbszero' function and mark
it as recommended for use in loops.

> Here's a summary of the results I got on Fedora 38 x86-64 on an AMD
> Phenom II X4 910e processor dated 2010.
>
>      user CPU sec    speedup
>     mbiter    mbcel   factor   test
>      1.735    0.478    3.630   a - ASCII text, C locale
>      1.703    0.447    3.810   b - ASCII text, UTF-8 locale
>      3.852    1.514    2.544   c - French text, C locale
>      3.544    1.600    2.215   d - French text, ISO-8859-1 locale
>      3.651    1.662    2.197   e - French text, UTF-8 locale
>     26.787   15.115    1.772   f - Greek text, C locale
>     21.651   17.106    1.266   g - Greek text, ISO-8859-7 locale
>     22.565   17.633    1.280   h - Greek text, UTF-8 locale
>     10.011    8.051    1.243   i - Chinese text, UTF-8 locale
>      9.787    7.967    1.228   j - Chinese text, GB18030 locale
>
> With a better CPU (a Xeon W-1350 dated 2021) and a slightly-slower OS
> (Ubuntu 23.04) I got these numbers:
>
>      user CPU sec    speedup
>     mbiter    mbcel   factor   test
>      0.531    0.238    2.231   a - ASCII text, C locale
>      0.478    0.187    2.556   b - ASCII text, UTF-8 locale
>      1.262    0.510    2.475   c - French text, C locale
>      1.121    0.529    2.119   d - French text, ISO-8859-1 locale
>      1.080    0.571    1.891   e - French text, UTF-8 locale
>     10.349    5.876    1.761   f - Greek text, C locale
>      8.530    6.537    1.305   g - Greek text, ISO-8859-7 locale
>      8.407    6.506    1.292   h - Greek text, UTF-8 locale
>      3.427    2.578    1.329   i - Chinese text, UTF-8 locale
>      3.279    2.489    1.317   j - Chinese text, GB18030 locale

Impressive! I'll repeat these benchmarks after having optimized mbiter a
bit more.

Bruno