[Bug c/99953] New: In AVX, SIMD support environment, strlen performance without optimization is 3 times faster than optimized strlen function.

2021-04-07 Thread novemberizing at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99953

          Bug ID: 99953
         Summary: In AVX, SIMD support environment, strlen performance
                  without optimization is 3 times faster than optimized
                  strlen function.
         Product: gcc
         Version: 9.3.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: c
        Assignee: unassigned at gcc dot gnu.org
        Reporter: novemberizing at gmail dot com
Target Milestone: ---

Created attachment 50519
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50519&action=edit
the preprocessed file

I measured strlen on a 65K-byte string, calling it 65536 times at each of O0,
O1, O2, and O3. As shown below, the optimized builds are slower than the
unoptimized one. In the unoptimized build, I confirmed that glibc@strlen_avx
is called.

$ gcc -Wall -Wextra -fno-strict-aliasing -fwrapv
-fno-aggressive-loop-optimizations  -fsanitize=undefined -save-temps strlen.c

$ ./a.out 
no optimize =>  0.07655
o1 optimize =>  0.62935
o2 optimize =>  0.22461
o3 optimize =>  0.23192

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
9.3.0-17ubuntu1~20.04' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-9
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686
--with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib
--with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-9-HskZEa/gcc-9-9.3.0/debian/tmp-nvptx/usr,hsa
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

--- Comment #1 from Hyun Sik Park  ---
Created attachment 50520
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50520&action=edit
simple test source

--- Comment #2 from Hyun Sik Park  ---
Test environment: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)/Acer Aspire
V3-372/Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz 4 Core

[Bug target/99953] In AVX, SIMD support environment, strlen performance without optimization is 3 times faster than optimized strlen function.

--- Comment #5 from Hyun Sik Park  ---
$ gcc -march=native strlen.c 
$ ./a.out 
no optimize =>  0.07860
o1 optimize =>  0.62609
o2 optimize =>  0.24775
o3 optimize =>  0.22288

Same result.

--- Comment #7 from Hyun Sik Park  ---
Thank you.

I tested it, and the results are below.

$ ./a.out
no optimize => 0.09640
o1 optimize => 0.09126
o2 optimize => 0.09422
o3 optimize => 0.09081

experiment_optimize_3
    17d5: 48 01 c7          add    %rax,%rdi
    17d8: e8 c3 f8 ff ff    callq  10a0 <strlen@plt>
    17dd: 48 8b 74 24 08    mov    0x8(%rsp),%rsi

experiment_optimize_2
    168d: 48 01 c7          add    %rax,%rdi
    1690: e8 0b fa ff ff    callq  10a0 <strlen@plt>
    1695: 48 8b 74 24 10    mov    0x10(%rsp),%rsi

experiment_optimize_1
    154c: e8 4f fb ff ff    callq  10a0 <strlen@plt>
    1551: 48 89 04 24       mov    %rax,(%rsp)

experiment_optimize_0
    1375: 48 89 c7          mov    %rax,%rdi
    1378: e8 23 fd ff ff    callq  10a0 <strlen@plt>
    137d: 48 89 45 a8       mov    %rax,-0x58(%rbp)

Thank you.