On Wed, Oct 02, 2024 at 08:15:38PM +0100, Jonathan Wakely wrote:
> On Wed, 2 Oct 2024 at 19:25, Jonathan Wakely <jwak...@redhat.com> wrote:
> >
> > On Wed, 2 Oct 2024 at 19:16, Jonathan Wakely <jwak...@redhat.com> wrote:
> > >
> > > On Wed, 2 Oct 2024 at 19:15, Dmitry Ilvokhin <d...@ilvokhin.com> wrote:
> > > >
> > > > Instead of looping over every byte of the tail, unroll loop manually
> > > > using switch statement, then compilers (at least GCC and Clang) will
> > > > generate a jump table [1], which is faster on a microbenchmark [2].
> > > >
> > > > [1]: https://godbolt.org/z/aE8Mq3j5G
> > > > [2]: https://quick-bench.com/q/ylYLW2R22AZKRvameYYtbYxag24
> > > >
> > > > libstdc++-v3/ChangeLog:
> > > >
> > > >         * libstdc++-v3/libsupc++/hash_bytes.cc (load_bytes): unroll
> > > >           loop using switch statement.
> > > >
> > > > Signed-off-by: Dmitry Ilvokhin <d...@ilvokhin.com>
> > > > ---
> > > >  libstdc++-v3/libsupc++/hash_bytes.cc | 27 +++++++++++++++++++++++----
> > > >  1 file changed, 23 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/libstdc++-v3/libsupc++/hash_bytes.cc 
> > > > b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > index 3665375096a..294a7323dd0 100644
> > > > --- a/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > +++ b/libstdc++-v3/libsupc++/hash_bytes.cc
> > > > @@ -50,10 +50,29 @@ namespace
> > > >    load_bytes(const char* p, int n)
> > > >    {
> > > >      std::size_t result = 0;
> > > > -    --n;
> > > > -    do
> > > > -      result = (result << 8) + static_cast<unsigned char>(p[n]);
> > > > -    while (--n >= 0);
> > >
> > > Don't we still need to loop, for the case where n >= 8? Otherwise we
> > > only hash the first 8 bytes.
> >
> > Ah, but it's only ever called with load_bytes(end, len & 0x7)
> 
> It seems to be slower for short strings, but a win overall:
> https://quick-bench.com/q/xhh5m1akZzwUAXRiYJ17z9FASc8
> This measures different lengths, and tries to ensure that the string
> contents aren't treated as constant.
>

I had a little bit time to play with this benchmark.

First off, there is no need to read first byte in the switch, because we
always know n > 0, this makes generated code look much better: GCC was
able to completely eliminate code for cases we don't actually have. See
difference in generated code for load_bytes_switch_baseline and
load_bytes_switch_load in [1].

Next, I suspected code alignment has a notable impact on the benchmark.
To confirm this, I added additional branch for case n == 1.  It doesn't
change generated code much, but adds two additional instructions: cmp
and je. Conveniently, this is enough to align benchmark to 64 byte
boundary. See benchmark code in [2]: sub (last instruction of the loop)
shifted from 0x407e08 (0x407e08 % 64 == 8) for LoadBytesSwitchLoad to
0x407f40 (0x407f40 % 64 == 0) for LoadBytesSwitchIf.

This makes LoadBytesSwitchIf faster than LoadBytesLoop (current trunk)
for every case. Honestly, I doubt it worth to optimize code alignment
for this particular case, it looks too fragile to me.

Version with load of the first byte out of the switch seems much more
robust, but it still slightly slower than a loop for case n = 1, see
benchmark in [3].

I am happy to submit modified version of the patch if you think
regression for n = 1 case is acceptable.

In anyway, now I have one less unexplained performance mystery on my
TODO list.

[1]: https://godbolt.org/z/vcWYs6o9x
[2]: https://quick-bench.com/q/QocQhela6YQzbsZWQukBRFix8ms
[3]: https://quick-bench.com/q/To93I-w6G1w5GwFz1Hi-XmPN57Y

> >
> >
> > >
> > > > +    switch(n & 7)
> > > > +      {
> > > > +      case 7:
> > > > +       result |= std::size_t(p[6]) << 48;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 6:
> > > > +       result |= std::size_t(p[5]) << 40;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 5:
> > > > +       result |= std::size_t(p[4]) << 32;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 4:
> > > > +       result |= std::size_t(p[3]) << 24;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 3:
> > > > +       result |= std::size_t(p[2]) << 16;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 2:
> > > > +       result |= std::size_t(p[1]) << 8;
> > > > +       [[gnu::fallthrough]];
> > > > +      case 1:
> > > > +       result |= std::size_t(p[0]);
> > > > +      };
> > > >      return result;
> > > >    }
> > > >
> > > > --
> > > > 2.43.5
> > > >
> 

Reply via email to