We use optimized XOR routines for software RAID. Unfortunately, the
compiler generated incorrect code when this was compiled for Red Hat 7.3 +
2.4.24 (this is normally kernel code). I later found that every version of
gcc I tested (up to FC4 - 4.0.0 20050519 (Red Hat 4.0.0-8)) has this
issue.
gcc -v on RH 7.3:
build-lin3> gcc -v
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.96/specs
gcc version 2.96 2731 (Red Hat Linux 7.3 2.96-110)
build-lin3> uname -a
Linux build-lin3 2.4.21-kdb #2 SMP Tue Apr 6 12:52:57 EDT 2004 i686 unknown
I've also tested on gcc 4.0.0:
rack-lin9$ gcc -v
Using built-in specs.
Target: i386-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-libgcj-multifile
--enable-languages=c,c++,objc,java,f95,ada --enable-java-awt=gtk
--with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --host=i386-redhat-linux
Thread model: posix
gcc version 4.0.0 20050519 (Red Hat 4.0.0-8)
rack-lin9$ uname -a
Linux rack-lin9 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:08:39 EDT 2005 i686 i686 i386 GNU/Linux
Compile command line when test fails:
gcc -o xor_fail -fomit-frame-pointer -O2 xor.c
Compile command line when test PASSES:
gcc -o xor xor.c
I'll attach the test program to the bug.
The generated code runs into problems in the loop:
    /* now perform the xor across a stride */
    for (offset = stride; offset < maxoffs; offset += 32) {
        /* load first strip unit */
        __asm__ __volatile__(
            "add %1, %0\n"
            "movaps 0(%0), %%xmm0\n"
            "movaps 16(%0), %%xmm1\n"
            : : "r" (bptr[0]), "r" (offset));
        /* now xor the next N-1 strip units */
        for (j = 1; j < num_of_buffers; j++) {
            __asm__ __volatile__(
                "add %1, %0\n"
                "xorps 0(%0), %%xmm0\n"
                "xorps 16(%0), %%xmm1\n"
                : : "r" (bptr[j]), "r" (offset));
        }
        /* now write out the result */
        __asm__ __volatile__(
            "add %1, %0\n"
            "movntps %%xmm0, 0(%0)\n"
            "movntps %%xmm1, 16(%0)\n"
            : : "r" (dest), "r" (offset));
    }
Specifically, in first loading the data:
    __asm__ __volatile__(
        "add %1, %0\n"
        "movaps 0(%0), %%xmm0\n"
        "movaps 16(%0), %%xmm1\n"
        : : "r" (bptr[0]), "r" (offset));
We end up referencing memory off the end of the array bptr[0]. This is
because the loop doesn't re-initialize %ebx on each iteration: the offset is
added to it again every time around, so %ebx ends up too large to access this
array. The loop jumps back to .L261, but .L261 is below movl (%ebp), %ebx.
        movl    (%ebp), %ebx
        .p2align 2
.L261:
        .stabn 68,0,168,.LM68-sse_multi_xor_gen
.LM68:
#APP
        add %edx, %ebx
        movaps 0(%ebx), %xmm0
        movaps 16(%ebx), %xmm1
        .stabn 68,0,175,.LM69-sse_multi_xor_gen
.LM69:
#NO_APP
        movl    $1, %ecx
        cmpl    %edi, %ecx
        jge     .L273
        .p2align 2
.L265:
        .stabn 68,0,176,.LM70-sse_multi_xor_gen
.LM70:
        movl    (%ebp,%ecx,4), %eax
#APP
        add %edx, %eax
        xorps 0(%eax), %xmm0
        xorps 16(%eax), %xmm1
        .stabn 68,0,175,.LM71-sse_multi_xor_gen
.LM71:
#NO_APP
        incl    %ecx
        cmpl    %edi, %ecx
        jl      .L265
.L273:
        .stabn 68,0,183,.LM72-sse_multi_xor_gen
.LM72:
        movl    88(%esp), %eax
#APP
        add %edx, %eax
        movntps %xmm0, 0(%eax)
        movntps %xmm1, 16(%eax)
        .stabn 68,0,166,.LM73-sse_multi_xor_gen
.LM73:
#NO_APP
        addl    $32, %edx
        cmpl    %esi, %edx
        jb      .L261
The workaround is to remove -fomit-frame-pointer, though I'm fairly
concerned, since the Linux kernel uses -fomit-frame-pointer for the kernel
sources.
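To tell a good build from a bad one without reading the assembler output, the
result of the SSE loop can be compared against a plain C XOR over the same
buffers. What follows is only a minimal sketch of such a check, not the
attached test program: the names xor_reference and first_mismatch are
hypothetical, and it assumes the buffers are plain byte arrays of equal
length, with len covering the same range the SSE loop writes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical reference implementation: XOR num_of_buffers source
 * buffers, byte by byte, into dest. No SSE and no inline asm, so the
 * result does not depend on how the compiler allocates registers.
 */
void xor_reference(unsigned char *dest, unsigned char *const *bptr,
                   int num_of_buffers, size_t len)
{
    size_t i;
    int j;

    assert(num_of_buffers > 0);
    for (i = 0; i < len; i++) {
        unsigned char acc = bptr[0][i];      /* first strip unit */

        for (j = 1; j < num_of_buffers; j++) /* remaining N-1 strip units */
            acc ^= bptr[j][i];
        dest[i] = acc;
    }
}

/*
 * Hypothetical helper: return the offset of the first byte that differs
 * between a and b, or len if the two buffers are identical.
 */
size_t first_mismatch(const unsigned char *a, const unsigned char *b,
                      size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (a[i] != b[i])
            return i;
    return len;
}
```

Running the SSE routine into one destination buffer and xor_reference into
another, then calling first_mismatch on the two, should return len for a
correct build. For a failing build it gives the offset of the first bad byte.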
--
Summary: Incorrect code generated for SSE2 based xor routine when
compiled with -O2 -fomit-frame-pointer
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: jeff at panasas dot com
CC: gcc-bugs at gcc dot gnu dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23909