[Bug lto/102649] New: GCC 9.3.1 LTO bug -- incorrect function call, bad stack arguments pushed

2021-10-08 Thread davidhaufegcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102649

          Bug ID: 102649
         Summary: GCC 9.3.1 LTO bug -- incorrect function call, bad stack arguments pushed
         Product: gcc
         Version: 9.3.1
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: lto
        Assignee: unassigned at gcc dot gnu.org
        Reporter: davidhaufegcc at gmail dot com
              CC: marxin at gcc dot gnu.org
Target Milestone: ---

Hello,
We saw incorrect application behavior in a large binary built using LTO and
identified the issue by single-stepping through the assembly instructions.
We have a function with 21 parameters that is called from many call sites. At
the call site that is not working properly, the C++ caller passes a hard-coded
integer '0' for a parameter that is passed on the stack (i.e. not in a
register). Under LTO, GCC ends up generating two versions of the called
function: one that takes this integer parameter, and one that drops the
parameter entirely because it is a hard-coded 0.

The issue is that the caller still pushes the integer 0 onto the stack as a
function argument. The callee does not expect the caller to have done this, so
it reads its stack arguments at positions that are offset by this extra stack
argument.
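
The call pattern described above can be sketched roughly as follows. This is
only an illustration, not the affected code: combine, call_with_constant and
call_with_variable are hypothetical names, eight parameters stand in for the
real 21, and nothing here is claimed to reproduce the miscompile. It assumes
the x86-64 System V calling convention, where the first six integer arguments
are passed in registers and the rest on the stack.

```cpp
// Hypothetical stand-in for the 21-parameter function in the report.
// With the x86-64 System V convention, a1..a6 travel in registers and
// a7, a8 are passed on the stack.
__attribute__((noinline))
long combine(long a1, long a2, long a3, long a4, long a5, long a6,
             long a7, long a8) {
    // Distinct weights make a shifted or dropped argument show up
    // as a wrong result.
    return a1 + 2 * a2 + 3 * a3 + 4 * a4 + 5 * a5 + 6 * a6
         + 7 * a7 + 8 * a8;
}

// Call site that passes a literal 0 for the stack-passed parameter a7.
// An optimizer may create a specialized copy of combine() for this caller
// that no longer takes a7; whether it does so here is not guaranteed.
long call_with_constant(long x) {
    return combine(x, 1, 2, 3, 4, 5, 0, 6);
}

// Call site that passes a runtime value for a7, so the general version
// of combine() is still needed.
long call_with_variable(long x, long y) {
    return combine(x, 1, 2, 3, 4, 5, y, 6);
}
```

In a correct build both call sites agree whenever y is 0. If the caller and a
specialized callee disagreed about whether a7 is passed, a8 would be read from
the wrong stack slot and the two results would differ.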

This issue was complicated to track down because, some time later in our
codebase's history, unrelated classes/files in the same static library as the
caller were touched, and the bug has since stopped. Rolling back in Git, we can
reproduce the bug over about 10 check-ins of unrelated code; after that, an
unrelated change causes the bug to stop, and GCC generates the correct stack
argument passing for the optimized function.

Compile flag investigation:
All builds were done with -O3 -flto -fno-fat-lto-objects -ffast-math
-funroll-loops
Disabling LTO -- bug does not present itself
With LTO on, we decomposed -ffast-math into its individual flags. If we leave
all -ffast-math flags on but disable -freciprocal-math, the bug does not
present itself. The code in question doesn't have any division anywhere around
it.
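
For context, -freciprocal-math lets the compiler replace a division x / d with
the multiplication x * (1 / d). The sketch below writes both forms out by hand,
so it shows the numeric effect without needing the flag. count_mismatches is a
hypothetical helper, and the sketch assumes IEEE 754 double arithmetic. It
illustrates only what the flag is allowed to change, not why toggling it
affects the call described in this report.

```cpp
// Counts how many integers i in [1, limit] give a different double result
// for i / divisor than for i * (1 / divisor).
// Assumes IEEE 754 double arithmetic; volatile keeps the compiler from
// folding or rewriting the two expressions.
int count_mismatches(double divisor, int limit) {
    volatile double d = divisor;
    int mismatches = 0;
    for (int i = 1; i <= limit; ++i) {
        volatile double x = i;
        double exact = x / d;                   // true division
        double via_reciprocal = x * (1.0 / d);  // form allowed by the flag
        if (exact != via_reciprocal) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

With a divisor of 3.0 some inputs differ in the last bit (5.0 is one of them),
while a power-of-two divisor such as 2.0 gives identical results because its
reciprocal is exactly representable.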

We speculate that disabling -freciprocal-math, or the codebase changing in
general, made the bug disappear only because it changes the global state of
the compile. This makes us very nervous, as there is no way to anticipate this
bug going forward.

We are using the devtoolset-9 (GCC 9.3.1) centos7/rh7 package. Moving to the
devtoolset-10 (GCC 10.2.1) package "fixes" the issue with the same code and
build flags. devtoolset-8 (GCC 8.3.1) does not present the bug either.

Our concern is that the bug is not actually fixed though, and that moving
versions of GCC is like changing our codebase by 10 unrelated check-ins or
disabling -freciprocal-math. It is simply changing the state of the compile.
The bug may or may not be fixed.

I would like to help in any way I can. This build generates a binary that is
200MB w/o debug symbols. It is a lot of code. I do not think we can create a
smaller test case showing this behavior. I thought about doing a bisect of the
GCC repo, but even that might just be changing the state of GCC and not
actually showing the bug is fixed. 

It is a concerning bug. I can try to provide any further information that would
be useful. 

Thanks,
Dave Haufe

$ ./gcc -v
Using built-in specs.
COLLECT_GCC=./gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap
--enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-9/root/usr
--mandir=/opt/rh/devtoolset-9/root/usr/share/man
--infodir=/opt/rh/devtoolset-9/root/usr/share/info
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared
--enable-threads=posix --enable-checking=release --enable-multilib
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions
--enable-gnu-unique-object --enable-linker-build-id
--with-gcc-major-version-only --with-linker-hash-style=gnu
--with-default-libstdcxx-abi=gcc4-compatible --enable-plugin
--enable-initfini-array
--with-isl=/builddir/build/BUILD/gcc-9.3.1-20200408/obj-x86_64-redhat-linux/isl-install
--disable-libmpx --enable-gnu-indirect-function --with-tune=generic
--with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)

$ cat /etc/*release*
CentOS Linux release 7.9.2009 (Core)
Derived from Red Hat Enterprise Linux 7.9 (Source)
cat: /etc/lsb-release.d: Is a directory
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPO

[Bug lto/102649] GCC 9.3.1 LTO bug -- incorrect function call, bad stack arguments pushed

2021-10-11 Thread davidhaufegcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102649

--- Comment #2 from David Haufe ---
I had assumed this would be the response. Unfortunately the source code
involved is both large (1000+ object files in this build) and proprietary.
Because rolling forward in Git and rebuilding lets unrelated changes "fix" the
problem, it seems futile to develop an isolated test case.

I can provide the assembly for the functions that highlight the error if that
would be beneficial, though I am not sure how helpful it would be. Are there
any other best practices in a case like this one?