[Bug target/48781] New: gcc generate movdqa instructions on unaligned memory address when using -mtune=native -march=native

2011-04-26 Thread tanzhangxi at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48781

   Summary: gcc generate movdqa instructions on unaligned memory
address when using -mtune=native -march=native
   Product: gcc
   Version: 4.4.4
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: tanzhan...@gmail.com


GCC generated the movdaq instructions (requires 16-byte aligned memory address)
on the following simple code.

__uint128_t *mem_pool;
void test(unsigned int addr, unsigned int* data) {
 *(__uint128_t *)data = mem_pool[addr];
}

Both 'mem_pool' and 'data' are not 128-bit aligned.

 <_Z4testjPj>:
   0:   89 ff   mov%edi,%edi
   2:   48 8b 05 00 00 00 00mov0(%rip),%rax# 9 <_Z4testjPj+0x9>
   9:   48 c1 e7 04 shl$0x4,%rdi
   d:   66 0f 6f 04 07  movdqa (%rdi,%rax,1),%xmm0
  12:   66 0f 7f 06 movdqa %xmm0,(%rsi)
  16:   c3  retq   

I compiled this code with the following command:
g++44 -c test.cpp -O3 -mtune=native -march=native  -o test.o

My target machine is Intel Xeon 5150. It works if don't use  -mtune and -march.
The gcc is shipped with RHEL/Centos 5.6.

Using built-in specs.
Target: x86_64-redhat-linux6E

Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla
--enable-bootstrap --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --disable-gnu-unique-object
--enable-languages=c,c++,fortran --disable-libgcj
--with-mpfr=/builddir/build/BUILD/gcc-4.4.4-20100726/obj-x86_64-redhat-linux6E/mpfr-install/
--with-ppl=/builddir/build/BUILD/gcc-4.4.4-20100726/obj-x86_64-redhat-linux6E/ppl-install
--with-cloog=/builddir/build/BUILD/gcc-4.4.4-20100726/obj-x86_64-redhat-linux6E/cloog-install
--with-tune=generic --with-arch_32=i586 --build=x86_64-redhat-linux6E
Thread model: posix
gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) 

Btw, if this is not correct, what is the right attribute to decorate a
uint128_t pointer in order to make use of 16-bit aligned SSE2 instructions? The
__atrribute__((align(x))) only works for variables, but not the value pointed
by pointers?


[Bug target/48781] gcc generate movdqa instructions on unaligned memory address when using -mtune=native -march=native

2011-04-26 Thread tanzhangxi at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48781

--- Comment #2 from Zhangxi Tan  2011-04-26 
22:25:42 UTC ---
(In reply to comment #1)
> The alignment of __uint128_t is 16byte.  I think you are invoking undefined
> behavior by using a data type which increases the alignment.

_uint128_t does not assume a 16-byte alignment, right? Maybe using (unsigned
int *) for pointer 'data' is not an good example.

Changing it to (void *) and casting it to (__uint128_t *) has the same problem.
Seems like, it's the combination of -O3, -mtune and -march. Problem doesn't
show up if I don't use -O3.


[Bug c/50065] New: -Os, -O2, -O3 optimization breaks LD/ST ordering on 32-bit SPARC

2011-08-12 Thread tanzhangxi at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50065

 Bug #: 50065
   Summary: -Os, -O2, -O3 optimization breaks LD/ST ordering on
32-bit SPARC
Classification: Unclassified
   Product: gcc
   Version: 4.6.1
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: tanzhan...@gmail.com


Considering the following C code.

static inline int spinlock_trylock(char* lock)
{
int reg;
__asm__ __volatile__ ("ldstub [%1],%0" : "=r"(reg) : "r"(lock) : "memory",
"cc");
return reg;
}

static inline int spinlock_is_locked(char* lock)
{
int reg;
__asm__ __volatile__ ("ldub [%1],%0" : "=r"(reg) : "r"(lock) : "memory");
return reg;
}


static inline void spinlock_lock(char* lock)
{
while(spinlock_trylock(lock))
  while(spinlock_is_locked(lock));
}

static inline void spinlock_unlock(char* lock)
{
*(volatile unsigned char*)lock = 0;
}

char remap_lock;
int remap_barrier;

void inc_remap_barrier()
{

  spinlock_lock(&remap_lock);
  remap_barrier++;
  spinlock_unlock(&remap_lock);
}

A simple ++ counter is protected by a spinlock implemented with the ldstub
atomic instruction on SPARC.

If I use -Os/-O2/-O3 to optimize the code, e.g.
sparc-ramp-gcc -c -Os test.c -o test.o

gcc will generate the following code
 :
   0:   03 00 00 00 sethi  %hi(0), %g1
   4:   10 80 00 06 b  1c 
   8:   82 10 60 00 mov  %g1, %g1   ! 0 
   c:   c4 08 40 00 ldub  [ %g1 ], %g2
  10:   80 a0 a0 00 cmp  %g2, 0
  14:   12 bf ff fe bne  c 
  18:   01 00 00 00 nop 
  1c:   c4 68 40 00 ldstub  [ %g1 ], %g2
  20:   80 a0 a0 00 cmp  %g2, 0
  24:   12 bf ff fa bne  c 
  28:   05 00 00 00 sethi  %hi(0), %g2
  2c:   c0 28 40 00 clrb  [ %g1 ]
  30:   c6 00 a0 00 ld  [ %g2 ], %g3
  34:   86 00 e0 01 inc  %g3
  38:   81 c3 e0 08 retl 
  3c:   c6 20 a0 00 st  %g3, [ %g2 ]


instruction 2C, clrb [%g1] corresponds to inline function 'spinlock_unlock'
*(volatile unsigned char*)lock = 0;

This happens before the lock protected content 'remap_barrier++', i.e.

  30:   c6 00 a0 00 ld  [ %g2 ], %g3
  34:   86 00 e0 01 inc  %g3
  38:   81 c3 e0 08 retl 
  3c:   c6 20 a0 00 st  %g3, [ %g2 ] ---> use the branch delay slot

This is wrong and will cause serious lock issues under a multithreading
environment.

However, the same code works fine with -O1 and -O0

This problem happens with couple of cross GCC builds (4.3.2 / 4.6.1) at our
side with various configurations (with glibc or a bare metal setting).
It breaks with the following configurations:

1. 
COLLECT_GCC=sparc-ramp-gcc
COLLECT_LTO_WRAPPER=/home/charming/toolchain/sparc-ramp/libexec/gcc/sparc-ramp-elf/4.6.1/lto-wrapper
Target: sparc-ramp-elf
Configured with: /home/charming/toolchain/.build/src/gcc-4.6.1/configure
--build=i686-build_pc-linux-gnu --host=i686-build_pc-linux-gnu
--target=sparc-ramp-elf --prefix=/home/charming/toolchain/sparc-ramp
--with-local-prefix=/home/charming/toolchain/sparc-ramp/sparc-ramp-elf/sysroot
--disable-multilib --disable-libmudflap
--with-sysroot=/home/charming/toolchain/sparc-ramp/sparc-ramp-elf/sysroot
--with-newlib --enable-threads=no --disable-shared
--with-pkgversion='crosstool-NG 1.12.0' --disable-__cxa_atexit
--with-gmp=/home/charming/toolchain/.build/sparc-ramp-elf/build/static
--with-mpfr=/home/charming/toolchain/.build/sparc-ramp-elf/build/static
--with-mpc=/home/charming/toolchain/.build/sparc-ramp-elf/build/static
--with-ppl=/home/charming/toolchain/.build/sparc-ramp-elf/build/static
--with-cloog=/home/charming/toolchain/.build/sparc-ramp-elf/build/static
--with-libelf=/home/charming/toolchain/.build/sparc-ramp-elf/build/static
--enable-lto
--with-host-libstdcxx='-L/home/charming/toolchain/.build/sparc-ramp-elf/build/static/lib
-lpwl' --enable-target-optspace --disable-libgomp --disable-libmudflap
--disable-nls --enable-languages=c --with-cpu=v8
Thread model: single
gcc version 4.6.1 (crosstool-NG 1.12.0) 

2.
Using built-in specs.
COLLECT_GCC=sparc-ramp-gcc
COLLECT_LTO_WRAPPER=/home/xtan/toolchain/sparc-ramp/libexec/gcc/sparc-ramp-linux-gnu/4.6.1/lto-wrapper
Target: sparc-ramp-linux-gnu
Configured with: /home/xtan/toolchain/.build/src/gcc-4.6.1/configure
--build=x86_64-build_unknown-linux-gnu --host=x86_64-build_unknown-linux-gnu
--target=sparc-ramp-linux-gnu --prefix=/home/xtan/toolchain/sparc-ramp
--with-sysroot=/home/xtan/toolchain/sparc-ramp/sparc-ramp-linux-gnu/sysroot
--enable-languages=c,c++ --disable-multilib
--with-pkgversion=crosstool-NG-1.11.3 --enable-__cxa_atexit
--disable-libmudflap --disable-libgomp --disable-libssp
--with-gmp=/home/xtan/toolchain/.build/sparc-ramp-linux-gnu/build/static
--with-mpfr=/home/xtan/toolchain/.build/sparc-ramp-linux-gnu/build/static
--with-mpc=/home/xtan/toolchain/.build/sparc-ramp-linux-gnu/build/static
--with-ppl=/home/xtan/toolchain/.b

[Bug rtl-optimization/50065] -Os, -O2, -O3 optimization breaks LD/ST ordering on 32-bit SPARC

2011-08-13 Thread tanzhangxi at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50065

--- Comment #3 from Zhangxi Tan  2011-08-14 
00:57:07 UTC ---
The code is equivalent to

volatile unsigned char lock;
int remap_barrier;

while (atomic_test_and_set(lock)) {
   while (lock) {
 ;
   }
}
remap_barrier++;
lock = 0;

Eric: could you let me know you you think the code inside function  
spinlock_lock(&remap_lock) is a NOP? This is a suggested lock implementation by
the SPARC spec. Also, the arch_write_lock/unlock in the SPARC port of Linux
uses a very similar implementation.

Andrew: could you let me know in which version I can find this ipa-sra fix? At
least, the stable 4.6.1 doesn't work.


[Bug rtl-optimization/50065] -Os, -O2, -O3 optimization breaks LD/ST ordering on 32-bit SPARC

2011-08-13 Thread tanzhangxi at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50065

--- Comment #4 from Zhangxi Tan  2011-08-14 
01:30:33 UTC ---
I don't think this is an valid optimization.

There are only two memory models in SPARC32, TSO and PSO (not RMO in the 64-bit
v9). Both don't allow relaxing the read->write order, i.e.  'LD remap_barrier'
should always be executed before 'ST lock'.

This optimization violates the memory model, therefore should be prohibited.

In addition, I still(In reply to comment #2)
> > instruction 2C, clrb [%g1] corresponds to inline function 'spinlock_unlock'
> > *(volatile unsigned char*)lock = 0;
> > 
> > This happens before the lock protected content 'remap_barrier++', i.e.
> > 
> >   30:   c6 00 a0 00 ld  [ %g2 ], %g3
> >   34:   86 00 e0 01 inc  %g3
> >   38:   81 c3 e0 08 retl 
> >   3c:   c6 20 a0 00 st  %g3, [ %g2 ] ---> use the branch delay slot
> > 
> > This is wrong and will cause serious lock issues under a multithreading
> > environment.
> 
> On what grounds is this wrong exactly?  The end of the code is equivalent to:
> 
> volatile unsigned char lock;
> int remap_barrier;
> 
> remap_barrier++;
> lock = 0;
> 
> It is perfectly valid for an optimizing C compiler to swap the two lines.
> 
> You want something like:
> 
> static inline void spin_unlock(char *lock)
> {
> __asm__ __volatile__("stb %%g0, [%0]" : : "r" (lock) : "memory");
> }


[Bug rtl-optimization/50065] -Os, -O2, -O3 optimization breaks LD/ST ordering on 32-bit SPARC

2011-08-14 Thread tanzhangxi at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50065

--- Comment #8 from Zhangxi Tan  2011-08-14 
21:00:40 UTC ---
Thanks for the clear explanation.
I agree that a memory barrier would solve this issue.

Regarding the spinlock_unlock in linux, the regular arch_spin_unlock is
implemented with a single inline assembly. That will prevent the memory
reordering in C. However, for the 32-bit port the arch_write_unlock is still
defined as the following without a memory barrier in
arch/sparc/include/asm/spinlock_32.h

#define arch_write_unlock(rw)   do { (rw)->lock = 0; } while(0)

OTH, the 64-bit implemention is ok. Or did I miss something here.
Anyway, I think this is a separated issue from this thread.

(In reply to comment #6)
> > The code is equivalent to
> > 
> > volatile unsigned char lock;
> > int remap_barrier;
> > 
> > while (atomic_test_and_set(lock)) {
> >while (lock) {
> >  ;
> >}
> > }
> > remap_barrier++;
> > lock = 0;
> > 
> > Eric: could you let me know you you think the code inside function  
> > spinlock_lock(&remap_lock) is a NOP?
> 
> I don't, you simply misquoted, I wrote "the end of the code".  The first part
> of the spinlock implementation is correct, in particular you have the required
> memory barrier in spinlock_is_locked.  The second part is not correct, as you
> don't have the memory barrier in spinlock_unlock.
> 
> > Also, the arch_write_lock/unlock in the SPARC port of Linux uses a very 
> > similar implementation.
> 
> No, it precisely doesn't, it has the memory barrier in spinlock_unlock.