We use some optimized XOR routines for software RAID. Unfortunately, the compiler generated incorrect code when this was compiled for Redhat 7.3 + 2.4.24 (this is normally kernel code). I later found out that all versions of gcc that I tested (up to FC4 - 4.0.0 20050519 (Red Hat 4.0.0-8)) had this issue.
gcc -v on RH 7.3: build-lin3> gcc -v Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.96/specs gcc version 2.96 20000731 (Red Hat Linux 7.3 2.96-110) build-lin3> uname -a Linux build-lin3 2.4.21-kdb #2 SMP Tue Apr 6 12:52:57 EDT 2004 i686 unknown I've also tested on gcc 4.0.0: rack-lin9$ gcc -v Using built-in specs. Target: i386-redhat-linux Configured with: ../configure --prefix=/usr --mandir=/usr/share/man -- infodir=/usr/share/info --enable-shared --enable-threads=posix --enable- checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind- exceptions --enable-libgcj-multifile --enable- languages=c,c++,objc,java,f95,ada --enable-java-awt=gtk --with-java- home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --host=i386-redhat-linux Thread model: posix gcc version 4.0.0 20050519 (Red Hat 4.0.0-8) rack-lin9$ uname -a Linux rack-lin9 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:08:39 EDT 2005 i686 i686 i386 GNU/Linux Compile command line when test fails: gcc -o xor_fail -fomit-frame-pointer -O2 xor.c Compile command line when test PASSES: gcc -o xor xor.c I'll attach the test program to the bug. The generated code runs into problems in the loop: /* now perform the xor across a stride */ for (offset = stride; offset < maxoffs; offset += 32) { /* load first strip unit */ __asm__ __volatile__( "add %1, %0\n" "movaps 0(%0), %%xmm0\n" "movaps 16(%0), %%xmm1\n" : : "r" (bptr[0]), "r" (offset)); /* now xor the next N-1 strip units */ for (j = 1; j < num_of_buffers; j++){ __asm__ __volatile__( "add %1, %0\n" "xorps 0(%0), %%xmm0\n" "xorps 16(%0), %%xmm1\n" : : "r" (bptr[j]), "r" (offset) ); } /* now write out the result */ __asm__ __volatile__( "add %1, %0\n" "movntps %%xmm0, 0(%0)\n" "movntps %%xmm1, 16(%0)\n" : : "r" (dest), "r" (offset) ); } Specifically, in first loading the data: __asm__ __volatile__( "add %1, %0\n" "movaps 0(%0), %%xmm0\n" "movaps 16(%0), %%xmm1\n" : : "r" (bptr[0]), "r" (offset)); We end up referencing memory off the end of the array bptr[0]. This is because the loop doesn't initialize %ebx and %ebx ends up being too large to access this array. The loop jumps to .L261, but .L261 is below movl (%ebp), % ebx. movl (%ebp), %ebx .p2align 2 .L261: .stabn 68,0,168,.LM68-sse_multi_xor_gen .LM68: #APP add %edx, %ebx movaps 0(%ebx), %xmm0 movaps 16(%ebx), %xmm1 .stabn 68,0,175,.LM69-sse_multi_xor_gen .LM69: #NO_APP movl $1, %ecx cmpl %edi, %ecx jge .L273 .p2align 2 .L265: .stabn 68,0,176,.LM70-sse_multi_xor_gen .LM70: movl (%ebp,%ecx,4), %eax #APP add %edx, %eax xorps 0(%eax), %xmm0 xorps 16(%eax), %xmm1 .stabn 68,0,175,.LM71-sse_multi_xor_gen .LM71: #NO_APP incl %ecx cmpl %edi, %ecx jl .L265 .L273: .stabn 68,0,183,.LM72-sse_multi_xor_gen .LM72: movl 88(%esp), %eax #APP add %edx, %eax movntps %xmm0, 0(%eax) movntps %xmm1, 16(%eax) .stabn 68,0,166,.LM73-sse_multi_xor_gen .LM73: #NO_APP addl $32, %edx cmpl %esi, %edx jb .L261 The workaround fix is to just remove -fomit-frame-pointer. Though I'm fairly concerned since the Linux kernel uses -fomit-frame-pointer for the kernel sources. -- Summary: Incorrect code generated for SSE2 based xor routine when compiled with -O2 -fomit-frame-pointer Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P2 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: jeff at panasas dot com CC: gcc-bugs at gcc dot gnu dot org http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23909