Hi Paul,

I replied:
> > Here's a summary of the results I got on Fedora 38 x86-64 on an AMD 
> > Phenom II X4 910e processor dated 2010.
> > 
> >   user CPU sec   speedup
> >   mbiter  mbcel  factor  test
> >    1.735  0.478  3.630   a - ASCII text, C locale
> >    1.703  0.447  3.810   b - ASCII text, UTF-8 locale
> >    3.852  1.514  2.544   c - French text, C locale
> >    3.544  1.600  2.215   d - French text, ISO-8859-1 locale
> >    3.651  1.662  2.197   e - French text, UTF-8 locale
> >   26.787 15.115  1.772   f - Greek text, C locale
> >   21.651 17.106  1.266   g - Greek text, ISO-8859-7 locale
> >   22.565 17.633  1.280   h - Greek text, UTF-8 locale
> >   10.011  8.051  1.243   i - Chinese text, UTF-8 locale
> >    9.787  7.967  1.228   j - Chinese text, GB18030 locale
> 
> Impressive! I'll repeat these benchmarks, after having optimized mbiter
> a bit more.

Even after optimizing away the is_basic_table, I'm measuring essentially the
same timing differences as you did (on an AMD Ryzen 7):

    mbiter  mbcel  factor
a    0.849  0.221  3.84
b    0.994  0.221  4.50
c    1.959  0.674  2.91
d    1.598  0.726  2.20
e    1.876  0.749  2.50
f   14.284  5.813  2.46
g    8.580  6.715  1.28
h    9.080  6.686  1.36
i    4.167  2.849  1.46
j    4.356  3.062  1.42

That is a factor of about 4 for the ASCII text cases (a, b).

The inner loops still look the same (this is with gcc 13.1, -O2):

Inner loop with mbcel:

.L18:
        addq    %rax, %r14
        movl    $1, %eax
.L7:
        addq    %rax, %rbx
        cmpq    %r15, %rbx
        jnb     .L5
.L8:
        movsbq  (%rbx), %rax
        testb   %al, %al
        jns     .L18

Inner loop with mbiter:

.L24:
        movq    136(%rsp), %rax
        movq    112(%rsp), %r15
        movq    $1, 144(%rsp)
        movsbl  (%rbx), %ecx
        movb    $1, 152(%rsp)
        leaq    1(%rax), %rbx
        movl    %ecx, 156(%rsp)
.L7:
        movq    %rbx, 136(%rsp)
        addq    %rcx, %r14
        movb    $0, 128(%rsp)
        cmpq    %r15, %rbx
        jnb     .L5
.L14:
        cmpb    $0, (%rbx)
        jns     .L24

With an old gcc 4.2.4 I get these timings:

    mbiter  mbcel  factor
a    0.988  0.700  1.41
b    1.031  0.716  1.44
c    2.754  1.435  1.92
d    2.004  1.432  1.40
e    2.017  1.487  1.36
f   20.237  8.242  2.46
g   13.971  9.701  1.44
h   13.117  9.626  1.36
i    5.553  4.068  1.37
j    6.542  4.080  1.60

and this code:

Inner loop with mbcel:

.L23:
        movsbl  %al,%eax
        movl    $1, %ecx
        movl    %eax, -248(%ebp)
        movl    $0, -244(%ebp)
.L13:
        movl    -248(%ebp), %eax
        addl    %eax, -232(%ebp)
        movl    -244(%ebp), %edx
        adcl    %edx, -228(%ebp)
        addl    %ecx, %ebx
        cmpl    %edi, %ebx
        jae     .L9
.L10:
        movzbl  (%ebx), %eax
        testb   %al, %al
        jns     .L23

Inner loop with mbiter:

.L32:
        movl    -100(%ebp), %ecx
        movl    $1, -96(%ebp)
        movl    -116(%ebp), %esi
        movsbl  (%ecx),%eax
        movb    $1, -92(%ebp)
        movl    %eax, -88(%ebp)
.L13:
        xorl    %edx, %edx
        movl    %ecx, %ebx
        addl    %eax, -288(%ebp)
        adcl    %edx, -284(%ebp)
        addl    -96(%ebp), %ebx
        movb    $0, -104(%ebp)
        cmpl    %esi, %ebx
        movl    %ebx, -100(%ebp)
        jae     .L9
.L10:
        cmpb    $0, (%ebx)
        jns     .L32

So, with this old compiler the performance difference is not so dramatic:
a factor of about 1.4 for the ASCII text cases (a, b).


So, clearly, with current compilers the struct-as-return-value approach
generates more efficient code than it did when mbiter was designed (in
2001-2005).
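
To make "struct-as-return-value" concrete, here is a minimal sketch of that
style. It is not the actual mbcel code: the names scan_result, scan_char and
sum_by_value are hypothetical, and the scanner only handles single bytes.
It merely shows a small struct returned by value, with the loop control
kept in local variables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type, NOT the real mbcel_t: a small struct
   that is returned by value.  */
typedef struct
{
  int ch;       /* character value */
  size_t len;   /* number of bytes consumed, always >= 1 */
} scan_result;

/* Hypothetical scanner that handles only single bytes.  A real one
   would decode a multibyte sequence when the byte is >= 0x80.  */
static inline scan_result
scan_char (const char *p, const char *lim)
{
  assert (p < lim);
  scan_result r = { (unsigned char) *p, 1 };
  return r;
}

/* Sum the character values in [buf, buf+n).  The loop control (p, lim)
   lives in local variables, so after inlining the compiler is free to
   keep p, lim, r.ch and r.len in registers.  */
static long
sum_by_value (const char *buf, size_t n)
{
  long sum = 0;
  const char *lim = buf + n;
  for (const char *p = buf; p < lim; )
    {
      scan_result r = scan_char (p, lim);
      sum += r.ch;
      p += r.len;
    }
  return sum;
}
```

A call such as sum_by_value ("abc", 3) adds up the three byte values. Nothing
in this loop needs to have its address taken, which is presumably what lets
a current compiler produce a register-only inner loop like the mbcel one
shown earlier.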

I will thus create new variants of mbiter and mbuiter, called mbitern and
mbuitern ('n' for "new"), that allow better optimization by current GCC.

The main API differences between mbiter and mbcel are:
  - mbiter produces a result, by reference, that includes a 'struct mbchar'.
    mbcel's result is returned by value.
  - In mbiter, the loop control is in the struct.
    In mbcel, it is in local variables.
  - In mbiter, mbi_avail can be called multiple times, and the multibyte
    character is only consumed by mbi_advance (atomically).
    In mbcel, the multibyte character is only available after mbcel_scan
    was invoked in the loop's body.
    This difference makes a programmer mistake possible: using the result
    of mbcel_scan before it is ready. But this is not a big problem, since
    gcc would warn about "uninitialized" variables when the programmer makes
    this mistake. (Another thing we could not rely on in 2001-2005.)
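
The iterator side of these differences can be sketched as follows. This is
a simplified illustration, not the actual mbiter code: iter_t, iter_init,
iter_avail, iter_advance and sum_by_iterator are hypothetical names, and
only single bytes are handled. The loop control and the current character
live in one struct passed by reference, the "avail" step can be repeated
without consuming anything, and only the "advance" step consumes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical iterator, NOT the real mbi_iterator_t: the loop control
   and the current character both live in one struct that is passed
   by reference.  */
typedef struct
{
  const char *cur;   /* current position */
  const char *lim;   /* end of input */
  bool ready;        /* true once ch and len describe the byte at cur */
  int ch;            /* current character value */
  size_t len;        /* its length in bytes */
} iter_t;

static inline void
iter_init (iter_t *it, const char *buf, size_t n)
{
  it->cur = buf;
  it->lim = buf + n;
  it->ready = false;
}

/* May be called any number of times without consuming the character.  */
static inline bool
iter_avail (iter_t *it)
{
  if (it->cur >= it->lim)
    return false;
  if (!it->ready)
    {
      /* Single-byte only; a real iterator would decode multibyte
         sequences here.  */
      it->ch = (unsigned char) *it->cur;
      it->len = 1;
      it->ready = true;
    }
  return true;
}

/* Consumes the current character.  */
static inline void
iter_advance (iter_t *it)
{
  assert (it->ready);
  it->cur += it->len;
  it->ready = false;
}

/* Sum the character values in [buf, buf+n) using the iterator.  */
static long
sum_by_iterator (const char *buf, size_t n)
{
  long sum = 0;
  iter_t it;
  for (iter_init (&it, buf, n); iter_avail (&it); iter_advance (&it))
    sum += it.ch;
  return sum;
}
```

Every step here reads or writes fields of the struct. That is consistent
with the stores to the stack frame visible in the mbiter inner loops above,
although this sketch does not claim to reproduce them exactly.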

Bruno



