15 Regression] x86: std::vector::operator[] could be somewhat faster using BT instead of SHL

cvs-commit at gcc dot gnu.org via Gcc-bugs Thu, 23 Jan 2025 06:52:00 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80813


--- Comment #7 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jan Hubicka <[email protected]>:

https://gcc.gnu.org/g:2d55c0161562f96d2230cd132b494a5d06352a23

commit r15-7163-g2d55c0161562f96d2230cd132b494a5d06352a23
Author: Jan Hubicka <[email protected]>
Date:   Thu Jan 23 15:50:50 2025 +0100

    Optimize vector<bool>::operator[]

    the following testcase:

      bool f(const std::vector<bool>& v, std::size_t x) {
        return v[x];
      }

    is compiled as:

    f(std::vector<bool, std::allocator<bool> > const&, unsigned long):
            testq   %rsi, %rsi
            leaq    63(%rsi), %rax
            movq    (%rdi), %rdx
            cmovns  %rsi, %rax
            sarq    $6, %rax
            leaq    (%rdx,%rax,8), %rdx
            movq    %rsi, %rax
            sarq    $63, %rax
            shrq    $58, %rax
            addq    %rax, %rsi
            andl    $63, %esi
            subq    %rax, %rsi
            jns     .L2
            addq    $64, %rsi
            subq    $8, %rdx
    .L2:
            movl    $1, %eax
            shlx    %rsi, %rax, %rax
            andq    (%rdx), %rax
            setne   %al
            ret

    which is quite expensive for simple bit access in a bitmap.  The reason is
that
    the bit access is implemented using iterators
            return begin()[__n];
    Which in turn cares about situation where __n is negative yielding the
extra
    conditional.

        _GLIBCXX20_CONSTEXPR
        void
        _M_incr(ptrdiff_t __i)
        {
          _M_assume_normalized();
          difference_type __n = __i + _M_offset;
          _M_p += __n / int(_S_word_bit);
          __n = __n % int(_S_word_bit);
          if (__n < 0)
            {
              __n += int(_S_word_bit);
              --_M_p;
            }
          _M_offset = static_cast<unsigned int>(__n);
        }

    While we can use __builtin_unreachable to declare that __n is in range
    0...max_size () but I think it is better to implement it directly, since
    resulting code is shorter and much easier to optimize.

    We now porduce:
    .LFB1248:
            .cfi_startproc
            movq    (%rdi), %rax
            movq    %rsi, %rdx
            shrq    $6, %rdx
            andq    (%rax,%rdx,8), %rsi
            andl    $63, %esi
            setne   %al
            ret

    Testcase suggests
            movq    (%rdi), %rax
            movl    %esi, %ecx
            shrq    $5, %rsi        # does still need to be 64-bit
            movl    (%rax,%rsi,4), %eax
            btl     %ecx, %eax
            setb    %al
            retq
    Which is still one instruction shorter.

    libstdc++-v3/ChangeLog:

            PR target/80813
            * include/bits/stl_bvector.h (vector<bool, _Alloc>::operator []):
Do
            not use iterators.

    gcc/testsuite/ChangeLog:

            PR target/80813
            * g++.dg/tree-ssa/bvector-3.C: New test.

[Bug target/80813] [12/13/14/15 Regression] x86: std::vector::operator[] could be somewhat faster using BT instead of SHL

Reply via email to