Here are my test results:
buildtype : debugoptimized
default_library : shared
-march=x86-64-v4 (Cascade Lake)
gcc 15.2.1
clang 21.1.6
GCC - BEFORE
Alignment   Block size  TSC cycles/block  TSC cycles/byte
Aligned             20              20.5             1.02
Unaligned           20              14.1             0.70
Aligned             21              15.8             0.75
Unaligned           21              15.8             0.75
Aligned           1500             148.2             0.10
Unaligned         1500             148.3             0.10
Aligned           1501             148.4             0.10
Unaligned         1501             148.2             0.10
GCC - AFTER
Alignment   Block size  TSC cycles/block  TSC cycles/byte
Aligned             20              20.8             1.04
Unaligned           20              15.6             0.78
Aligned             21              16.9             0.81
Unaligned           21              16.9             0.80
Aligned           1500             109.5             0.07
Unaligned         1500             111.6             0.07
Aligned           1501             111.1             0.07
Unaligned         1501             113.0             0.08
Aligned           9000             612.4             0.07
Unaligned         9000             612.6             0.07
Aligned           9001             581.5             0.06
Unaligned         9001             601.7             0.07
CLANG - BEFORE
Alignment   Block size  TSC cycles/block  TSC cycles/byte
Aligned             20              14.2             0.71
Unaligned           20               9.5             0.47
Aligned             21              11.7             0.56
Unaligned           21              11.8             0.56
Aligned           1500             610.7             0.41
Unaligned         1500             632.0             0.42
Aligned           1501             610.4             0.41
Unaligned         1501             627.6             0.42
CLANG - AFTER
Alignment   Block size  TSC cycles/block  TSC cycles/byte
Aligned             20              14.0             0.70
Unaligned           20               9.1             0.45
Aligned             21               9.7             0.46
Unaligned           21               9.6             0.46
Aligned           1500              77.9             0.05
Unaligned         1500              79.4             0.05
Aligned           1501              79.4             0.05
Unaligned         1501              80.4             0.05
Aligned           9000             447.8             0.05
Unaligned         9000             492.1             0.05
Aligned           9001             448.5             0.05
Unaligned         9001             492.6             0.05
Before your patch, clang is faster than GCC with small blocks (20 and 21 bytes),
and GCC is faster than clang with large blocks (1500 and 1501 bytes).
After your patch, clang is faster than GCC at every block size tested.
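
To put numbers on that comparison, here is a minimal sketch in C that derives cycles/byte and the before/after speedup from the cycles/block figures in the tables above. The helper names cycles_per_byte() and speedup() are made up for this illustration and are not part of DPDK or of the cksum_perf_autotest code.

```c
#include <stdio.h>

/* Hypothetical helper (not a DPDK API): cycles/byte is simply
 * cycles/block divided by the block size in bytes. */
static double cycles_per_byte(double cycles_per_block, unsigned int block_size)
{
	return cycles_per_block / (double)block_size;
}

/* Hypothetical helper (not a DPDK API): speedup factor of the patch,
 * i.e. cycles/block before divided by cycles/block after.
 * A value above 1.0 means the patched code is faster. */
static double speedup(double cycles_before, double cycles_after)
{
	return cycles_before / cycles_after;
}

/* Print one comparison line for a given compiler and block size. */
static void report(const char *compiler, unsigned int block_size,
		   double cycles_before, double cycles_after)
{
	printf("%-6s %5u bytes: %.2f -> %.2f cycles/byte, speedup %.2fx\n",
	       compiler, block_size,
	       cycles_per_byte(cycles_before, block_size),
	       cycles_per_byte(cycles_after, block_size),
	       speedup(cycles_before, cycles_after));
}
```

Fed with the aligned 1500-byte rows, report("gcc", 1500, 148.2, 109.5) gives a speedup of roughly 1.35x, and report("clang", 1500, 610.7, 77.9) roughly 7.8x, so the patch helps clang far more than GCC at that size.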
07/02/2026 02:29, Scott Mitchell:
> Thanks for testing! I included my build/host config, results on the
> main branch, and then with this patch applied below. What are your build
> flags/configuration (e.g., cpu_instruction_set, march, optimization
> level, etc.)? I wasn't able to get any Clang version (18, 19, 20) to
> vectorize on Godbolt https://godbolt.org/z/8149r7sq8, and I'm curious
> whether your config enables vectorization.
>
> #### build / host config
> User defined options
> b_lto : false
> buildtype : release
> c_args : -fno-omit-frame-pointer
> -DPACKET_QDISC_BYPASS=1 -DRTE_MEMCPY_AVX512=1
> cpu_instruction_set: cascadelake
> default_library : static
> max_lcores : 128
> optimization : 3
> $ clang --version
> clang version 18.1.8 (Red Hat, Inc. 18.1.8-3.el9)
> $ cat /etc/redhat-release
> Red Hat Enterprise Linux release 9.4 (Plow)
>
> #### main branch
> $ echo "cksum_perf_autotest" | /usr/local/bin/dpdk-test
> ### rte_raw_cksum() performance ###
> Alignment   Block size  TSC cycles/block  TSC cycles/byte
> Aligned             20              10.0             0.50
> Unaligned           20              10.1             0.50
> Aligned             21              11.1             0.53
> Unaligned           21              11.6             0.55
> Aligned            100              39.4             0.39
> Unaligned          100              67.3             0.67
> Aligned            101              43.3             0.43
> Unaligned          101              41.5             0.41
> Aligned           1500             728.2             0.49
> Unaligned         1500             805.8             0.54
> Aligned           1501             768.8             0.51
> Unaligned         1501             787.3             0.52
> Test OK
>
> #### with this patch
> $ echo "cksum_perf_autotest" | /usr/local/bin/dpdk-test
> ### rte_raw_cksum() performance ###
> Alignment   Block size  TSC cycles/block  TSC cycles/byte
> Aligned             20              12.6             0.63
> Unaligned           20              12.3             0.62
> Aligned             21              13.6             0.65
> Unaligned           21              13.6             0.65
> Aligned            100              22.7             0.23
> Unaligned          100              22.6             0.23
> Aligned            101              47.4             0.47
> Unaligned          101              23.9             0.24
> Aligned           1500              73.9             0.05
> Unaligned         1500              73.9             0.05
> Aligned           1501              95.7             0.06
> Unaligned         1501              73.9             0.05
> Aligned           9000             459.8             0.05
> Unaligned         9000             523.5             0.06
> Aligned           9001             536.7             0.06
> Unaligned         9001             507.5             0.06
> Aligned          65536            3158.4             0.05
> Unaligned        65536            3506.1             0.05
> Aligned          65537            3277.6             0.05
> Unaligned        65537            3697.6             0.06
> Test OK
>