On 5/15/24 12:49 AM, Christoph Müllner wrote:
GCC has a generic cmpmemsi expansion via the by-pieces framework,
which shows some room for target-specific optimizations.
E.g. for comparing two aligned memory blocks of 15 bytes
we get the following sequence:

my_mem_cmp_aligned_15:
         li      a4,0
         j       .L2
.L8:
         bgeu    a4,a7,.L7
.L2:
         add     a2,a0,a4
         add     a3,a1,a4
         lbu     a5,0(a2)
         lbu     a6,0(a3)
         addi    a4,a4,1
         li      a7,15    // missed hoisting
         subw    a5,a5,a6
         andi    a5,a5,0xff // useless
         beq     a5,zero,.L8
         lbu     a0,0(a2) // loading again!
         lbu     a5,0(a3) // loading again!
         subw    a0,a0,a5
         ret
.L7:
         li      a0,0
         ret

Diff first byte: 15 insns
Diff second byte: 25 insns
No diff: 25 insns

Possible improvements:
* unroll the loop and use load-with-displacement to avoid offset increments
* load and compare multiple (aligned) bytes at once
* use the bitmanip/strcmp result calculation (reverse words and
   synthesize (a2 >= a3) ? 1 : -1 in a branchless sequence)
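
As a rough illustration of the improvements listed above, here is a
minimal C sketch. It is not the code from the patch: the names load_le,
cmp_result and sketch_memcmp_15 are made up for this example, it assumes
a little-endian host and the GCC/Clang builtin __builtin_bswap64, and it
copies only 7 bytes for the second chunk to stay in bounds (an expander
working on aligned blocks can instead load 8 bytes and shift the extra
byte out).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load n (<= 8) bytes into a zero-padded word.
   Assumes a little-endian host, so p[0] lands in the low byte.  */
static uint64_t load_le(const unsigned char *p, size_t n)
{
    uint64_t v = 0;
    memcpy(&v, p, n);
    return v;
}

/* Branchless result for two words known to differ: reverse the bytes
   so the lowest-addressed byte becomes the most significant one, then
   synthesize (a >= b) ? 1 : -1 as -(a < b) | 1.  */
static int cmp_result(uint64_t a, uint64_t b)
{
    a = __builtin_bswap64(a);
    b = __builtin_bswap64(b);
    return (int)(-(int64_t)(a < b) | 1);
}

/* Compare 15 bytes as one 8-byte chunk plus one 7-byte chunk.  */
int sketch_memcmp_15(const void *pa, const void *pb)
{
    const unsigned char *a = pa, *b = pb;

    uint64_t x = load_le(a, 8), y = load_le(b, 8);
    if (x != y)
        return cmp_result(x, y);

    x = load_le(a + 8, 7);
    y = load_le(b + 8, 7);
    if (x != y)
        return cmp_result(x, y);

    return 0;
}
```

The byte reversal is what makes a single unsigned word comparison agree
with the byte-wise ordering that memcmp requires on a little-endian
target.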

When applying these improvements we get the following sequence:

my_mem_cmp_aligned_15:
         ld      a5,0(a0)
         ld      a4,0(a1)
         bne     a5,a4,.L2
         ld      a5,8(a0)
         ld      a4,8(a1)
         slli    a5,a5,8
         slli    a4,a4,8
         bne     a5,a4,.L2
         li      a0,0
.L3:
         sext.w  a0,a0
         ret
.L2:
         rev8    a5,a5
         rev8    a4,a4
         sltu    a5,a5,a4
         neg     a5,a5
         ori     a0,a5,1
         j       .L3

Diff first byte: 11 insns
Diff second byte: 16 insns
No diff: 11 insns

This patch implements these improvements.

The tests consist of an execution test (similar to
gcc/testsuite/gcc.dg/torture/inline-mem-cmp-1.c) and a few tests
that check the expansion conditions (known length and alignment).

Similar to the cpymemsi expansion, this patch does not introduce any
gating for the cmpmemsi expansion (beyond requiring a known length,
a known alignment and Zbb).
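
For illustration, a source function of roughly the following shape
would meet those conditions. This is a sketch, not a copy of the
testcases in the patch: the 8-byte alignment value is an assumption,
and the function only reuses the name from the assembly listing above.
Zbb would additionally have to be enabled on the command line, for
example with something like -march=rv64gc_zbb.

```c
#include <string.h>

/* Length is a compile-time constant (15) and both pointers are
   declared 8-byte aligned via __builtin_assume_aligned (a GCC/Clang
   builtin).  The alignment value is an assumption of this sketch.  */
int my_mem_cmp_aligned_15(const char *p1, const char *p2)
{
    const char *a = __builtin_assume_aligned(p1, 8);
    const char *b = __builtin_assume_aligned(p2, 8);
    return memcmp(a, b, 15);
}
```

Callers must really pass pointers with the promised alignment, since
the builtin is a promise to the compiler rather than a runtime check.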

Bootstrapped and SPEC CPU 2017 tested.

gcc/ChangeLog:

        * config/riscv/riscv-protos.h (riscv_expand_block_compare): New
        prototype.
        * config/riscv/riscv-string.cc (GEN_EMIT_HELPER2): New helper
        for zero_extendhi.
        (do_load_from_addr): Add support for HI and SI/64 modes.
        (do_load): Add helper for zero-extended loads.
        (emit_memcmp_scalar_load_and_compare): New helper to emit memcmp.
        (emit_memcmp_scalar_result_calculation): Likewise.
        (riscv_expand_block_compare_scalar): Likewise.
        (riscv_expand_block_compare): New RISC-V expander for memory compare.
        * config/riscv/riscv.md (cmpmemsi): New cmpmem expansion.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/cmpmemsi-1.c: New test.
        * gcc.target/riscv/cmpmemsi-2.c: New test.
        * gcc.target/riscv/cmpmemsi-3.c: New test.
        * gcc.target/riscv/cmpmemsi.c: New test.
[ ... ]
I fixed some of the nits from the linter (whitespace stuff) and pushed both patches of this series.

Jeff
