Hi Paul,

I replied:
> > Here's a summary of the results I got on Fedora 38 x86-64 on an AMD
> > Phenom II X4 910e processor dated 2010.
> >
> >     user CPU sec   speedup
> >     mbiter  mbcel   factor   test
> >      1.735  0.478    3.630   a - ASCII text, C locale
> >      1.703  0.447    3.810   b - ASCII text, UTF-8 locale
> >      3.852  1.514    2.544   c - French text, C locale
> >      3.544  1.600    2.215   d - French text, ISO-8859-1 locale
> >      3.651  1.662    2.197   e - French text, UTF-8 locale
> >     26.787 15.115    1.772   f - Greek text, C locale
> >     21.651 17.106    1.266   g - Greek text, ISO-8859-7 locale
> >     22.565 17.633    1.280   h - Greek text, UTF-8 locale
> >     10.011  8.051    1.243   i - Chinese text, UTF-8 locale
> >      9.787  7.967    1.228   j - Chinese text, GB18030 locale
>
> Impressive! I'll repeat these benchmarks, after having optimized mbiter
> a bit more.

Even after optimizing away the is_basic_table, I'm measuring essentially
the same timing differences as you did (on an AMD Ryzen 7):

       mbiter   mbcel  factor
  a     0.849   0.221   3.84
  b     0.994   0.221   4.50
  c     1.959   0.674
  d     1.598   0.726
  e     1.876   0.749
  f    14.284   5.813
  g     8.580   6.715
  h     9.080   6.686
  i     4.167   2.849
  j     4.356   3.062

Factor 4 for the ASCII text cases (a, b).

The inner loops still look the same (this is with gcc 13.1, -O2):

Inner loop with mbcel:

.L18:
        addq    %rax, %r14
        movl    $1, %eax
.L7:
        addq    %rax, %rbx
        cmpq    %r15, %rbx
        jnb     .L5
.L8:
        movsbq  (%rbx), %rax
        testb   %al, %al
        jns     .L18

Inner loop with mbiter:

.L24:
        movq    136(%rsp), %rax
        movq    112(%rsp), %r15
        movq    $1, 144(%rsp)
        movsbl  (%rbx), %ecx
        movb    $1, 152(%rsp)
        leaq    1(%rax), %rbx
        movl    %ecx, 156(%rsp)
.L7:
        movq    %rbx, 136(%rsp)
        addq    %rcx, %r14
        movb    $0, 128(%rsp)
        cmpq    %r15, %rbx
        jnb     .L5
.L14:
        cmpb    $0, (%rbx)
        jns     .L24

With an old gcc 4.2.4 I get these timings:

       mbiter   mbcel  factor
  a     0.988   0.700   1.41
  b     1.031   0.716   1.44
  c     2.754   1.435
  d     2.004   1.432
  e     2.017   1.487
  f    20.237   8.242
  g    13.971   9.701
  h    13.117   9.626
  i     5.553   4.068
  j     6.542   4.080

and this code:

Inner loop with mbcel:

.L23:
        movsbl  %al,%eax
        movl    $1, %ecx
        movl    %eax, -248(%ebp)
        movl    $0, -244(%ebp)
.L13:
        movl    -248(%ebp), %eax
        addl    %eax, -232(%ebp)
        movl    -244(%ebp), %edx
        adcl    %edx, -228(%ebp)
        addl    %ecx, %ebx
        cmpl    %edi, %ebx
        jae     .L9
.L10:
        movzbl  (%ebx), %eax
        testb   %al, %al
        jns     .L23

Inner loop with mbiter:

.L32:
        movl    -100(%ebp), %ecx
        movl    $1, -96(%ebp)
        movl    -116(%ebp), %esi
        movsbl  (%ecx),%eax
        movb    $1, -92(%ebp)
        movl    %eax, -88(%ebp)
.L13:
        xorl    %edx, %edx
        movl    %ecx, %ebx
        addl    %eax, -288(%ebp)
        adcl    %edx, -284(%ebp)
        addl    -96(%ebp), %ebx
        movb    $0, -104(%ebp)
        cmpl    %esi, %ebx
        movl    %ebx, -100(%ebp)
        jae     .L9
.L10:
        cmpb    $0, (%ebx)
        jns     .L32

So, here the performance difference was not so dramatic.

So, clearly, the struct-as-return-value approach nowadays generates more
efficient code than when mbiter was designed (in 2001-2005).

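To make the two loop shapes concrete, here is a minimal sketch in C. It is
not the gnulib code: cel_t, cel_scan, iter_t, iter_avail and iter_advance
are hypothetical names, and the "decoder" simply treats every byte as one
character, where a real scanner would decode multibyte sequences. The
first variant returns a small struct by value and keeps loop control in
local variables; the second keeps cursor, limit and current character in
one struct that is passed by reference. Whether a given compiler keeps the
second struct in registers depends on its version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical by-value result, in the spirit of mbcel: one decoded
   character and its length in bytes.  */
typedef struct { int ch; size_t len; } cel_t;

/* Simplification (assumption): every byte counts as one character.
   A real scanner would decode multibyte sequences here.  */
static cel_t
cel_scan (const char *p, const char *lim)
{
  assert (p < lim);
  return (cel_t) { .ch = (unsigned char) *p, .len = 1 };
}

/* By-value style: loop control lives in local variables.  */
long
sum_by_value (const char *s, size_t n)
{
  long sum = 0;
  const char *lim = s + n;
  for (const char *p = s; p < lim; )
    {
      cel_t c = cel_scan (p, lim);
      sum += c.ch;
      p += c.len;
    }
  return sum;
}

/* Hypothetical by-reference iterator, in the spirit of mbiter: cursor,
   limit and the current character all live in one struct.  */
typedef struct
{
  const char *cur;
  const char *limit;
  int in_cache;   /* nonzero if ch/len describe the character at cur */
  int ch;
  size_t len;
} iter_t;

/* May be called repeatedly; decodes at most once per position.  */
static int
iter_avail (iter_t *it)
{
  if (it->cur >= it->limit)
    return 0;
  if (!it->in_cache)
    {
      it->ch = (unsigned char) *it->cur;
      it->len = 1;
      it->in_cache = 1;
    }
  return 1;
}

/* Consumes the current character.  */
static void
iter_advance (iter_t *it)
{
  it->cur += it->len;
  it->in_cache = 0;
}

/* By-reference style: loop control lives in the iterator struct.  */
long
sum_by_reference (const char *s, size_t n)
{
  long sum = 0;
  iter_t it = { .cur = s, .limit = s + n, .in_cache = 0 };
  for (; iter_avail (&it); iter_advance (&it))
    sum += it.ch;
  return sum;
}
```

Both functions compute the same sum; comparing their -O2 assembly output
is one way to check how a particular compiler treats each shape.
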
I will thus create new variants of mbiter, mbuiter, called mbitern,
mbuitern ('n' for "new"), that allow better optimization by current GCC.

The main API differences between mbiter and mbcel are:

  - mbiter produces a result, by reference, that includes a
    'struct mbchar'; mbcel's result is by value.

  - In mbiter, the loop control is in the struct. In mbcel, it is in
    local variables.

  - In mbiter, mbi_avail can be called multiple times, and the multibyte
    character is only consumed by mbi_advance (atomically). In mbcel,
    the multibyte character is only available after mbcel_scan was
    invoked in the loop's body.
    This difference creates the possibility (programmer mistake) that
    the result of mbcel_scan gets used while it is not yet ready. But
    this is not a big problem, since gcc would warn about
    "uninitialized" variables when the programmer makes this mistake.
    (Another thing we could not rely on in 2001-2005.)

Bruno