Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Paolo Bonzini

On 09/11/2011 09:00 PM, Geert Bosch wrote:

So, if I understand correctly, then operations using relaxed memory
order will still need fences, but indeed do not require any
optimization barrier. For memory_order_seq_cst we'll need a full
barrier, and for the others there is a partial barrier.


If you do not need an optimization barrier, you do not need a processor 
barrier either, and vice versa.  Optimizations are just another factor 
that can lead to reordered loads and stores.
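
As a minimal illustration of the point (a sketch, not from the original mail; 'data' and 'ready' are made-up names): a compiler barrier alone stops the compiler from reordering the stores, but a weakly ordered processor can still reorder them, so in practice you need both kinds of barrier or neither.

    int data;
    int ready;

    void publish (void)
    {
      data = 42;
      /* Optimization barrier: constrains only the compiler.  */
      __asm__ __volatile__ ("" ::: "memory");
      /* A weakly ordered CPU may still reorder the two stores, so a
         processor barrier such as __sync_synchronize() is needed as well
         (and __sync_synchronize() already implies the compiler barrier).  */
      ready = 1;
    }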


Paolo


Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Paolo Bonzini

On 09/12/2011 01:22 AM, Andrew MacLeod wrote:

You're right that using lock_test_and_set as an exchange is very wrong
because of the compiler barrier semantics, but I think this is
entirely a red herring in this case.  The same problem could happen
with a fetch_and_add or even a lock_release operation.


My point is that even once we get the right barriers in place, due to
its definition as acquire, this testcase could actually still fail, AND
the optimization is valid...


Ah, sure.


unless we decide to retroactively make
all the original sync routines seq_cst.


I've certainly seen code using lock_test_and_set to avoid asm for xchg. 
 That would be very much against the documentation with respect to the 
values of the second parameter, and that's also why clang introduced 
__sync_swap.  However, perhaps it makes sense to make lock_test_and_set 
provide sequential consistency.


Probably not so much for lock_release, which is quite clearly a 
store-release.
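
For reference, a minimal sketch of the documented usage (a test-and-set spin lock), where __sync_lock_test_and_set provides only an acquire barrier and __sync_lock_release is a store-release; using the builtin as a general seq-cst exchange with arbitrary values is the misuse discussed above:

    static volatile int lock;

    void spin_lock (void)
    {
      /* Acquire barrier only; some targets support storing only the value 1.  */
      while (__sync_lock_test_and_set (&lock, 1))
        while (lock)
          ;  /* spin on a plain read */
    }

    void spin_unlock (void)
    {
      /* Release barrier: a store-release of 0.  */
      __sync_lock_release (&lock);
    }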


Paolo


Incorrect optimized (-O2) linked list code with 4.3.2

2011-09-12 Thread pavan tc
Hi,

I would like to know if there have been issues with optimized linked
list code with GCC 4.3.2. [optimization flag: -O2]
The following is the inlined code that has the problem:

static inline void
list_add_tail (struct list_head *new, struct list_head *head)
{
        new->next = head;            /* new is inserted just before head...  */
        new->prev = head->prev;      /* ...i.e. after the current tail       */

        new->prev->next = new;       /* old tail now points to new           */
        new->next->prev = new;       /* head->prev now points to new         */
}

The above code has been used in the loop as below:

        pool = GF_CALLOC (count, padded_sizeof_type, gf_common_mt_long);
        if (!pool) {
                GF_FREE (mem_pool);
                return NULL;
        }

        for (i = 0; i < count; i++) {
                list = pool + (i * (padded_sizeof_type));
                INIT_LIST_HEAD (list);
                list_add_tail (list, &mem_pool->list);
        }

'&mem_pool->list' is used as the list head. mem_pool is a pointer to type:

struct mem_pool {
        struct list_head  list;
        int               hot_count;
        int               cold_count;
        gf_lock_t         lock;
        unsigned long     padded_sizeof_type;
        void             *pool;
        void             *pool_end;
        int               real_sizeof_type;
};

'list' is the new member being added to the tail of the list pointed to by head.
It is a pointer to type:
struct list_head {
        struct list_head *next;
        struct list_head *prev;
};
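
INIT_LIST_HEAD, used in the loop above, is not shown in this report; a typical kernel-style definition (an assumption, the project's actual macro may differ) would be:

    #define INIT_LIST_HEAD(head)                            \
            do {                                            \
                    (head)->next = (head)->prev = (head);   \
            } while (0)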

The generated assembly for the loop (with the inlined list_add_tail())
is as below:

   40a1c:   e8 0f 03 fd ff         callq  10d30 <__gf_calloc@plt>
   40a21:   48 85 c0               test   %rax,%rax
   40a24:   48 89 c7               mov    %rax,%rdi
   40a27:   0f 84 bf 00 00 00      je     40aec
   40a2d:   48 8b 73 08            mov    0x8(%rbx),%rsi
   40a31:   4d 8d 44 24 01         lea    0x1(%r12),%r8
   40a36:   31 c0                  xor    %eax,%eax
   40a38:   b9 01 00 00 00         mov    $0x1,%ecx
   40a3d:   0f 1f 00               nopl   (%rax)
   40a40:   49 0f af c5            imul   %r13,%rax        <=== loop start
   40a44:   48 8d 04 07            lea    (%rdi,%rax,1),%rax
   40a48:   48 89 18               mov    %rbx,(%rax)      # list->next = head
   40a4b:   48 89 06               mov    %rax,(%rsi)      # head->prev->next = list
   40a4e:   48 8b 10               mov    (%rax),%rdx      # rdx holds list->next
   40a51:   48 89 70 08            mov    %rsi,0x8(%rax)   # list->prev = head->prev
   40a55:   48 89 42 08            mov    %rax,0x8(%rdx)   # list->next->prev = list
   40a59:   48 89 c8               mov    %rcx,%rax
   40a5c:   48 83 c1 01            add    $0x1,%rcx
   40a60:   4c 39 c1               cmp    %r8,%rcx
   40a63:   75 db                  jne    40a40

In the assembly above, %rbx holds the address of 'head' and %rsi holds the
value of head->prev. The load of head->prev is done outside the loop: the
compiler treats it as a loop invariant, which is where, I think, the problem
is. That load should have been inside the loop, because list_add_tail()
updates head->prev on every iteration; instead, %rsi keeps the value of
head->prev that was read before the loop.
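
In C terms, the generated loop behaves roughly like the following hand-written sketch (not compiler output; variable names are taken from the source above, and 'head' stands for &mem_pool->list):

    /* "As-if" sketch of what GCC 4.3.2 emits at -O2.  */
    struct list_head *prev = head->prev;     /* hoisted load, kept in %rsi */
    for (i = 0; i < count; i++) {
            list = (struct list_head *)((char *) pool + (i * padded_sizeof_type));
            list->next = head;
            prev->next = list;               /* uses the stale prev every time */
            list->prev = prev;               /* every node gets the same prev  */
            head->prev = list;               /* list->next->prev = list        */
    }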

The following experiments eliminate the problem:

1. Using 'volatile' on the address that 'head' points to.
2. Using a function call (logging calls, for example) inside the loop.
3. Using plain libc calloc instead of GF_CALLOC.
[GF_CALLOC does some accounting when accounting is enabled and calls vanilla
libc calloc() otherwise.]

So anything that forces %rsi to be reloaded seems to correct the
behaviour (a sketch of a barrier-based workaround in that spirit follows
item 4 below).

4. Using gcc 4.4.3.
[The obvious solution would then be to use 4.4.3, but I would like to
understand whether this is a known problem with 4.3.2. Small programs
written to emulate this problem do not exhibit the erroneous behaviour.]
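
A sketch of such a barrier-based workaround (an assumption, not tested against the original tree): an optimization barrier inside the loop forces the compiler to re-read head->prev on every iteration, in the same spirit as experiments 1-3.

    for (i = 0; i < count; i++) {
            list = pool + (i * (padded_sizeof_type));
            INIT_LIST_HEAD (list);
            /* Optimization barrier: tells GCC that memory may have changed,
               so head->prev cannot be kept in a register across iterations.  */
            __asm__ __volatile__ ("" ::: "memory");
            list_add_tail (list, &mem_pool->list);
    }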

Please let me know if any more details about this behaviour are required.
I'll be glad to provide them.

TIA,
Pavan


Your site in Google

2011-09-12 Thread Votre Site
Hello,
We do intensive search-engine optimization of your website so that you come 
out first in Google.
See: http://www.PremierGoogle.com

We build your website, or rebuild your current one, making it even more 
attractive.

We offer accelerated training courses for your employees, your children, your 
parents or your friends. They last a few days.
The courses are: office software, Photoshop and graphic design, Excel, Access, 
English, website design, programming, or any other training you wish.
These accelerated courses allow you or your children to become competent 
quickly and to find work quickly.
They allow you or your employees to be more competent, qualified and faster 
at work.
See: http://www.Ecoleinforma.com

Thank you.

PremierGoogle.com / Ecoleinforma.com


An internal compiler error when building gcc4.6 using "-flto -fuse-linker-plugin" on Win7 mingw64 target

2011-09-12 Thread PcX



Hi, all

I reported it to GCC Bugzilla:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50351
When I use mingw64 gcc to build gcc 4.6.2 20110909 with "-flto
-fuse-linker-plugin", an internal compiler error comes out:
--
i686-w64-mingw32-gcc   -pipe -g0 -O2 -fomit-frame-pointer -finline-functions
-minline-all-stringops -flto  -DIN_GCC   -W -Wall -Wwrite-strings -Wcast-qual
-Wstrict-prototypes -Wmissing-prototypes -Wmissing-format-attribute -pedantic
-Wno-long-long -Wno-variadic-macros -Wno-overlength-strings
-Wold-style-definition -Wc++-compat   -DHAVE_CONFIG_H -Wl,-O1 -Wl,--as-needed
-s -Wl,--large-address-aware  -flto -fuse-linker-plugin -o cc1.exe c-lang.o
c-family/stub-objc.o attribs.o c-errors.o c-decl.o c-typeck.o c-convert.o
c-aux-info.o c-objc-common.o c-parser.o tree-mudflap.o c-family/c-common.o
c-family/c-cppbuiltin.o c-family/c-dump.o c-family/c-format.o
c-family/c-gimplify.o c-family/c-lex.o c-family/c-omp.o c-family/c-opts.o
c-family/c-pch.o c-family/c-ppoutput.o c-family/c-pragma.o
c-family/c-pretty-print.o c-family/c-semantics.o c-family/c-ada-spec.o i386-c.o
msformat-c.o \
  cc1-checksum.o main.o  libbackend.a ../libcpp/libcpp.a
../libdecnumber/libdecnumber.a ../libcpp/libcpp.a  /mingw/lib/libiconv.a
../libiberty/libiberty.a ../libdecnumber/libdecnumber.a -lstdc++ -lcloog-isl
-lisl -lppl_c -lppl  -lgmpxx -lmpc -lmpfr -lgmp  -lstdc++ -lz
In file included from :1011:0:
source/isl/constraints.c: In function
'cloog_constraint_set_defining_inequalities':
isl_constraint.c:170:15: warning: 'l' may be used uninitialized in this
function [-Wuninitialized]
source/isl/constraints.c:154:25: note: 'l' was declared here
isl_constraint.c:166:9: warning: 'u' may be used uninitialized in this function
[-Wuninitialized]
source/isl/constraints.c:153:25: note: 'u' was declared here
In file included from ../.././gcc/c-decl.c:6139:0,
 from ../.././gcc/c-decl.c:489,
 from :5284:
../.././libcpp/lex.c: In function 'search_line_sse2':
../.././libcpp/lex.c:370:15: internal compiler error: in convert_move, at
expr.c:326
Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.
lto-wrapper: e:\MyPack\MinGW\bin\i686-w64-mingw32-gcc.exe returned 1 exit
status
e:/mypack/mingw/bin/../lib/gcc/i686-w64-mingw32/4.6.2/../../../../i686-w64-mingw32/bin/ld.exe:
lto-wrapper failed
collect2: ld returned 1 exit status
make[2]: *** [cc1.exe] Error 1
make[2]: Leaving directory `/e/new/gcc/gcc4.6/build/host-i686-w64-mingw32/gcc'
make[1]: *** [all-gcc] Error 2
make[1]: Leaving directory `/e/new/gcc/gcc4.6/build'
make: *** [all] Error 2
---

my gcc is configured as:
---
Target: i686-w64-mingw32
Configured with: ./configure --prefix=/mingw --host=i686-w64-mingw32
--build=i686-w64-mingw32 --target=i686-w64-mingw32 --with-lto-plugin
--with-host-libstdcxx=-lstdc++ --disable-bootstrap --disable-werror
--with-arch=i686 --with-tune=generic --enable-languages=c,c++,fortran
--enable-libgomp --enable-threads=posix --enable-lto --with-system-zlib
--enable-libstdcxx-debug --enable-version-specific-runtime-libs
--enable-fully-dynamic-string --disable-sjlj-exceptions --with-dwarf2
--disable-symvers --enable-checking=release --enable-cloog-backend=isl
--enable-static --disable-shared --enable-cxx-flags='-fno-function-sections
-fno-data-sections' --enable-libquadmath-support --enable-libquadmath
--disable-multilib --disable-nls --disable-win32-registry
---

my binutils edition is :
2.21.53.20110910

Is this an LTO bug in gcc 4.6 or an ld bug?
The error looks like PR45475, but PR45475 doesn't use LTO.

Thanks.

--
Best Regards,
PcX



Re: An internal compiler error when building gcc4.6 using "-flto -fuse-linker-plugin" on Win7 mingw64 target

2011-09-12 Thread PcX

I tried gcc version 4.7.0 20110911 (experimental) to build GCC trunk; it also
has the problem, but shows a different error:


i686-w64-mingw32-gcc   -pipe -g0 -O2 -fomit-frame-pointer -finline-functions
-minline-all-stringops -flto  -DIN_GCC   -W -Wall -Wwrite-strings -Wcast-qual
-Wstrict-prototypes -Wmissing-prototypes -Wmissing-format-attribute -pedantic
-Wno-long-long -Wno-variadic-macros -Wno-overlength-strings
-Wold-style-definition -Wc++-compat   -DHAVE_CONFIG_H -Wl,-O1 -Wl,--as-needed
-s -Wl,--large-address-aware  -flto -fuse-linker-plugin  -o cc1.exe c-lang.o
c-family/stub-objc.o attribs.o c-errors.o c-decl.o c-typeck.o c-convert.o
c-aux-info.o c-objc-common.o c-parser.o tree-mudflap.o c-family/c-common.o
c-family/c-cppbuiltin.o c-family/c-dump.o c-family/c-format.o
c-family/c-gimplify.o c-family/c-lex.o c-family/c-omp.o c-family/c-opts.o
c-family/c-pch.o c-family/c-ppoutput.o c-family/c-pragma.o
c-family/c-pretty-print.o c-family/c-semantics.o c-family/c-ada-spec.o i386-c.o
msformat-c.o default-c.o \
  cc1-checksum.o main.o  libbackend.a libcommon-target.a libcommon.a
../libcpp/libcpp.a ../libdecnumber/libdecnumber.a libcommon.a
../libcpp/libcpp.a   ../libiberty/libiberty.a ../libdecnumber/libdecnumber.a
-lstdc++ -lcloog-isl -lisl -lppl_c -lppl  -lgmpxx -lmpc -lmpfr -lgmp  -lstdc++
-lz
lto1.exe: internal compiler error: in lto_reissue_options, at lto-opts.c:414
Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.
lto-wrapper: e:\MyPack\MinGW\bin\i686-w64-mingw32-gcc.exe returned 1 exit
status
e:/mypack/mingw/bin/../lib/gcc/i686-w64-mingw32/4.7.0/../../../../i686-w64-mingw32/bin/ld.exe:
lto-wrapper failed
collect2.exe: error: ld returned 1 exit status
make[2]: *** [cc1.exe] Error 1
make[2]: Leaving directory `/e/new/gcc/gcc4.6/build/gcc'
make[1]: *** [all-gcc] Error 2
make[1]: Leaving directory `/e/new/gcc/gcc4.6/build'
make: *** [all] Error 2




--
Best Regards,
PcX



Re: An internal compiler error when building gcc4.6 using "-flto -fuse-linker-plugin" on Win7 mingw64 target

2011-09-12 Thread Jonathan Wakely
On 12 September 2011 10:05, PcX wrote:
>
>
> Hi, all
>
>    I report it to gcc bugzilla
> :http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50351

Then there's no need to also email this list.

Discussion of the bug should take place in Bugzilla, not here.


[Ann] MELT plugin 0.9 rc1 for GCC 4.6

2011-09-12 Thread Basile Starynkevitch

Hello All,

It is my pleasure to announce release candidate 1 of the MELT plugin 0.9 for 
GCC 4.6.

MELT is a high-level, lispy domain-specific language for developing GCC extensions.


A release candidate 1 of MELT plugin 0.9 for gcc 4.6 is available, as a
gzipped source tar archive, from 
  http://gcc-melt.org/melt-0.9rc1-plugin-for-gcc-4.6.tgz 
of size 3767264 bytes and md5sum 1142d37a7e8b22b87257c7964be4dcd5 (September 
12th, 2011). It is extracted from MELT branch svn revision 178780. The version
number 0.9 of the MELT plugin is unrelated to the version of the GCC
compiler (4.6) for which it is supposed to work.


###
NEWS for 0.9 MELT plugin for gcc-4.6

September 2011: Release of MELT plugin 0.9 rc1 for gcc-4.6

New features:

Documentation is generated

The PLUGIN_PRE_GENERICIZE event is interfaced.

The build machinery and the binary module loading have been
significantly updated.  Module shared objects are named like
warmelt-macro.3461497d8ef7239dc1f2f132623e6dd5.quicklybuilt.so and
contain the md5sum of the concatenation of all the generated C files.
They also come in various flavors: quicklybuilt (the generated C is
compiled with -O0 and -DMELT_HAVE_DEBUG), optimized (the generated C
is compiled with -O1 and without -DMELT_HAVE_DEBUG), and debugnoline
(the generated C is compiled with -g and -DMELT_HAVE_DEBUG but without
#line directives).

Conceptually, a module is loaded by loading its +meltdesc.c
file. That file (e.g. warmelt-macro+meltdesc.c corresponding to
warmelt-macro.melt) should never be moved or even edited.  It is
parsed at module load time, and contains the various md5sum of
real generated C files.

New option -fplugin-arg-melt-workdir= sets the work directory
where all .c and .so files are generated.

The DISCR_BOX discriminant has been removed. Use containers instead.

Containers, that is, instances of class_container with the single field 
:container_value, are supported by syntactic macros, sugar and functions.
   (container V)
  =equivalent=   (instance class_container :container_value V)
   (content C)
  =equivalent=   (get_field :container_value C)
   (set_content C V)
  =equivalent=   (put_fields C :container_value V)
You can write exclaim instead of content, and there is a new syntactic 
sugar:
   !X

 is the same as (content X). The exclamation mark should be
 followed by spaces, letters, or a left parenthesis to be parsed as
 exclaim, that is, as the content macro above.

In patterns, ?(container ?v) means 
?(instance class_container :container_value ?v)

Fields can be accessed by their name, so
  (:F C)
   is the same as (get_field :F C).
   Hence (:container_value foo) is the same as !foo or
   (get_field :container_value foo).

   Experimental syntactic sugar: inside an s-expr, a macro string
   written ##{...}# is expanded as several components, not a single
   list.

   Slow boxed arithmetic operations are available (e.g. +iv takes two
   boxed integers and gives the boxed integer of their sum).

Many bug fixes.

The build system has been revamped. The generated .c files should be
available when running MELT.


Thanks to Pierre Vittet, Alexandre Lissy, Romain Geissler for
feedback, patches, suggestions.




The gcc-melt.org site also contains the HTML documentation that is produced 
when building the plugin.

Enjoy!

Regards.
-- 
Basile STARYNKEVITCH http://starynkevitch.net/Basile/
email: basilestarynkevitchnet mobile: +33 6 8501 2359
8, rue de la Faiencerie, 92340 Bourg La Reine, France
*** opinions {are only mines, sont seulement les miennes} ***


Re: GCC 4.7.0 Status Report (2011-09-09)

2011-09-12 Thread Aldy Hernandez
Jakub Jelinek  writes:

> In particular, is transactional-memory branch mergeable within
> a month and half, at least some parts of cxx-mem-model branch,
> bitfield lowering?  What is the status of lra, reload-2a, pph,

Torvald and I are looking into getting things merge ready, but...

The main problem seems to be that the TM community still hasn't formally
submitted the C++ changes to the standards committee, and is still
swinging between two alternative syntax implementations.  So unless this
gets resolved in the next month, transactional memory will have to sit
this one out.

Torvald, is this a fair assessment?


Re: GCC 4.7.0 Status Report (2011-09-09)

2011-09-12 Thread Jeff Law

On 09/09/2011 01:09 AM, Jakub Jelinek wrote:

bitfield lowering?  What is the status of lra, reload-2a, pph,
cilkplus, gupc (I assume at least some of these are 4.8+ material)?
The bits on reload-v2a provide range splitting and a second chance at 
assigning a hard reg for unallocated allocnos.  The code has been very 
stable and shows a small improvement on x86/x86_64.  The biggest concern 
I have with the code is the compile-time performance, which I know is 
bad, but I haven't done any serious measurements.


jeff



Re: Comparison of GCC-4.6.1 and LLVM-2.9 on x86/x86-64 targets

2011-09-12 Thread Vladimir Makarov

On 09/09/2011 07:30 PM, Lawrence Crowl wrote:

On 9/7/11, Vladimir Makarov  wrote:

Some people asked me to do a comparison of GCC-4.6 and LLVM-2.9 (both
released this spring), as I did a GCC-LLVM comparison last year.

You can find it on http://vmakarov.fedorapeople.org/spec under
2011 GCC-LLVM comparison tab entry.

The format of these graphs exaggerates differences.  The reason is
that our hind brains cannot help but compare the heights of bars and
ignore the non-zero bases.  In short, non-zero based graphs are lies.
So, please 0-base all the graphs.  The graphs should show compilation
time from 0 up, execution time from 0 up, SPEC score from 0 up, etc.
A consequence is that you will get rid of the "change" graphs.


Thanks for the feedback, Lawrence.

I've changed the graphs that did not start at 0.

In my mind, an interesting graph would be to plot the execution
time of the benchmarks as a function of the compile time of the
benchmarks.  This graph would show you, in particular, what you
buy or lose by changing compilers and/or optimization/debug levels.


I'll think about it.




Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Geert Bosch

On Sep 12, 2011, at 03:02, Paolo Bonzini wrote:

> On 09/11/2011 09:00 PM, Geert Bosch wrote:
>> So, if I understand correctly, then operations using relaxed memory
>> order will still need fences, but indeed do not require any
>> optimization barrier. For memory_order_seq_cst we'll need a full
>> barrier, and for the others there is a partial barrier.
> 
> If you do not need an optimization barrier, you do not need a processor 
> barrier either, and vice versa.  Optimizations are just another factor that 
> can lead to reordered loads and stores.

Assuming that statement is true, that would imply that even for relaxed 
ordering there has to be an optimization barrier. Clearly fences need to be 
used for any atomic accesses, including those with relaxed memory order.

Consider 4 threads and an atomic int x:

thread 1    thread 2    thread 3    thread 4
--------    --------    --------    --------
  x=1;       r1=x;       x=3;        r3=x;
  x=2;       r2=x;       x=4;        r4=x;

Even with relaxed memory ordering, all modifications to x have to occur in some 
particular total order, called  the modification order of x.

So, even if each thread preserves its store order, the modification order of x 
can be any of:
  1,2,3,4
  1,3,2,4
  1,3,4,2
  3,1,2,4
  3,1,4,2
  3,4,1,2

Because there is a single modification order for x, it would be an error for 
thread 2 and thread 4 to see a different update order.

So, if r1==2,r2==3 and r3==4,r4==1, that would be an error. However, without 
fences, this can easily happen on an SMP machine, even one with a nice memory 
model such as the x86.

IIUC, the relaxed memory model mostly seems to allow movement (by compiler and 
CPU) of unrelated memory operations, but still requires fences between 
subsequent atomic operations on the same object. 

In other words, while atomic operations with relaxed memory order on some 
atomic object X cannot be used to synchronize any operations on objects other 
than X, they themselves cannot cause data races.
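
For concreteness, here is a self-contained C11 sketch of the example (assuming a compiler with <stdatomic.h> and pthreads; thread numbering follows the table above). Even with memory_order_relaxed, both readers' outputs must be consistent with a single modification order of x.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int x;

    static void *writer_12 (void *arg)   /* thread 1: x=1; x=2; */
    {
      (void) arg;
      atomic_store_explicit (&x, 1, memory_order_relaxed);
      atomic_store_explicit (&x, 2, memory_order_relaxed);
      return NULL;
    }

    static void *writer_34 (void *arg)   /* thread 3: x=3; x=4; */
    {
      (void) arg;
      atomic_store_explicit (&x, 3, memory_order_relaxed);
      atomic_store_explicit (&x, 4, memory_order_relaxed);
      return NULL;
    }

    static void *reader (void *arg)      /* threads 2 and 4 */
    {
      (void) arg;
      int a = atomic_load_explicit (&x, memory_order_relaxed);
      int b = atomic_load_explicit (&x, memory_order_relaxed);
      printf ("%d %d\n", a, b);          /* both readers must agree with one
                                            modification order of x */
      return NULL;
    }

    int main (void)
    {
      pthread_t t[4];
      pthread_create (&t[0], NULL, writer_12, NULL);
      pthread_create (&t[1], NULL, reader, NULL);
      pthread_create (&t[2], NULL, writer_34, NULL);
      pthread_create (&t[3], NULL, reader, NULL);
      for (int i = 0; i < 4; i++)
        pthread_join (t[i], NULL);
      return 0;
    }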

  -Geert


Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Paolo Bonzini
On Mon, Sep 12, 2011 at 20:40, Geert Bosch  wrote:

> Assuming that statement is true, that would imply that even for relaxed
> ordering there has to be an optimization barrier. Clearly fences need to be
> used for any atomic accesses, including those with relaxed memory order.
>
> Consider 4 threads and an atomic int x:
>
> thread 1    thread 2    thread 3    thread 4
> --------    --------    --------    --------
>   x=1;       r1=x;       x=3;        r3=x;
>   x=2;       r2=x;       x=4;        r4=x;
>
> Even with relaxed memory ordering, all modifications to x have to occur in 
> some particular total order, called  the modification order of x.
>
> So, if r1==2,r2==3 and r3==4,r4==1, that would be an error. However,
> without fences, this can easily happen on an SMP machine, even one with
> a nice memory model such as the x86.

How?  (Honest question).  All stores are to the same location.  I
don't see how that can happen without processor fences, much less
without optimization fences.

Paolo


Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Andrew MacLeod

On 09/12/2011 02:40 PM, Geert Bosch wrote:



thread 1    thread 2    thread 3    thread 4
--------    --------    --------    --------
  x=1;       r1=x;       x=3;        r3=x;
  x=2;       r2=x;       x=4;        r4=x;

Even with relaxed memory ordering, all modifications to x have to occur in some 
particular total order, called  the modification order of x.

So, even if each thread preserves its store order, the modification order of x 
can be any of:
   1,2,3,4
   1,3,2,4
   1,3,4,2
   3,1,2,4
   3,1,4,2
   3,4,1,2

Because there is a single modification order for x, it would be an error for 
thread 2 and thread 4 to see a different update order.


Let's simplify it slightly.  The compiler can optimize away x=1 and x=3 
as dead stores (valid even on atomics!), leaving us with two possible 
modification orders:

 2,4 or 4,2

and what you are getting at is that you don't think we should ever see
r1==2, r2==4 and r3==4, r4==2.

Let's say the order of the writes turns out to be 2,4... is it possible 
for both writes to be travelling around some bus and for thread 4 to 
actually read the second one first, followed by the first one?   That 
would imply a lack of memory coherency in the system, wouldn't it? My 
simple understanding is that the hardware gives us this sort of minimum 
guarantee on all shared memory, which means we should never see that happen.


And if we can't see that, then I don't see how we can see your example... 
 *one* of those modification orders has to be what is actually written 
to x, and reads from that memory location will not be able to see 
something else. (I.e., if it was 1,2,3,4, then thread 4 would not be able 
to see r3==4, r4==1, thanks to memory coherency.)


Andrew



Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Ken Raeburn
On Sep 12, 2011, at 19:19, Andrew MacLeod wrote:
> lets say the order of the writes turns out to be  2,4...  is it possible for 
> both writes to be travelling around some bus and have thread 4 actually read 
> the second one first, followed by the first one?   It would imply a lack of 
> memory coherency in the system wouldn't it? My simple understanding is that 
> the hardware gives us this sort of minimum guarantee on all shared memory. 
> which means we should never see that happen.

According to section 8.2.3.5 "Intra-Processor Forwarding Is Allowed" of "Intel 
64 and IA-32 Architectures Software Developer's Manual" volume 3A, December 
2009, a processor can see its own store happening before another's, though the 
example works on two different memory locations.  If at least one of the 
threads reading the values was on the same processor as one of the writing 
threads, perhaps it could see the locally-issued store first, unless 
thread-switching is presumed to include a memory fence.  Consistency of order 
is guaranteed *from the point of view of other processors* (8.2.3.7), which is 
not necessarily the case here.  A total order across all processors is imposed 
for locked instructions (8.2.3.8), but I'm not sure whether their use is 
assumed here.  I'm still reading up on caching protocols, write-back memory, 
etc.  Still not sure either way whether the original example can work...

Ken


Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Andy Lutomirski
On 09/12/2011 05:30 PM, Ken Raeburn wrote:
> On Sep 12, 2011, at 19:19, Andrew MacLeod wrote:
>> lets say the order of the writes turns out to be  2,4...  is it possible for 
>> both writes to be travelling around some bus and have thread 4 actually read 
>> the second one first, followed by the first one?   It would imply a lack of 
>> memory coherency in the system wouldn't it? My simple understanding is that 
>> the hardware gives us this sort of minimum guarantee on all shared memory. 
>> which means we should never see that happen.
> 
> According to section 8.2.3.5 "Intra-Processor Forwarding Is Allowed" of 
> "Intel 64 and IA-32 Architectures Software Developer's Manual" volume 3A, 
> December 2009, a processor can see its own store happening before another's, 
> though the example works on two different memory locations.  If at least one 
> of the threads reading the values was on the same processor as one of the 
> writing threads, perhaps it could see the locally-issued store first, unless 
> thread-switching is presumed to include a memory fence.  Consistency of order 
> is guaranteed *from the point of view of other processors* (8.2.3.7), which 
> is not necessarily the case here.  A total order across all processors is 
> imposed for locked instructions (8.2.3.8), but I'm not sure whether their use 
> is assumed here.  I'm still reading up on caching protocols, write-back 
> memory, etc.  Still not sure either way whether the original example can 
> work...

Presumably any sensible operating system inserts a fence whenever it
switches between threads to prevent exactly this issue.  Otherwise it
could be nearly impossible to write correct code.

(TBH, it was never entirely clear to me that mfence is guaranteed to
flush the store buffer and force everything to be re-read from the
coherency domain, but if that's not true then it's pretty much
impossible to get this right.)

--Andy


Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Geert Bosch

On Sep 12, 2011, at 19:19, Andrew MacLeod wrote:

> Lets simplify it slightly.  The compiler can optimize away x=1 and x=3 as 
> dead stores (even valid on atomics!), leaving us with 2 modification orders..
> 2,4 or 4,2
> and what you are getting at is you don't think we should ever see
> r1==2, r2==4  and r3==4, r4==2
Right, I agree that the compiler can optimize away both the
double writes and double reads.
> 
> lets say the order of the writes turns out to be  2,4...  is it possible for 
> both writes to be travelling around some bus and have thread 4 actually read 
> the second one first, followed by the first one?   It would imply a lack of 
> memory coherency in the system wouldn't it? My simple understanding is that 
> the hardware gives us this sort of minimum guarantee on all shared memory. 
> which means we should never see that happen.

No, it is possible, and actually likely. Basically, the issue is write buffers. 
The coherency mechanisms come into play at a lower level in the hierarchy 
(typically at the last-level cache), which is why we need fences to start with 
to implement things like spin locks.

Threads running on the same CPU may share the same caches (think about a 
thread switch or hyper-threading). Now both processors may have a copy of the 
same cache line and both try to do a write to some location in that line. Then 
they'll both try to get exclusive access to the cache line. One CPU will 
succeed, the other will have a cache miss.

However, while all this is going on, the write is just sitting in a write 
buffer, and any references from the same processor will just get forwarded the 
value of the outstanding write. 

> And if we can't see that, then I don't see how we can see your example..  
> *one* of those modification orders has to be what is actually written to x, 
> and reads from that memory location will not be able to see an something 
> else. (ie, if it was 1,2,3,4  then thread 4 would not be able to see 
> r3==4,r4==1 thanks to memory coherency.

No, that's false. Even on systems with nice memory models, such as x86 and SPARC 
with a TSO model, you need a fence to ensure that a write followed by a load of 
the same location goes all the way to coherent memory and is not forwarded 
directly from the write buffer or L1 cache. The reason that fences are 
expensive is exactly that they require system-wide agreement.

 -Geert


Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Lawrence Crowl
On 9/9/11, Geert Bosch  wrote:
> To be honest, I can't quite see the use of completely unordered
> atomic operations, where we do not even prohibit compiler optimizations.
> It would seem if we guarantee that a variable will not be accessed
> concurrently from any other thread, we wouldn't need the operation
> to be atomic in the first place. That said, it's quite likely I'm
> missing something here.

memory_order_relaxed is useful in at least two situations.
First, when maintaining an atomic counter of events, you may not need
to synchronize with any other atomic variables.  Second, some
algorithms need more than one atomic operation, but are only
'effective' at one of them, and only that one needs to synchronize
other memory.
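
A minimal sketch of the first case (assuming C11 atomics): the counter needs atomicity, but no ordering with respect to any other memory.

    #include <stdatomic.h>

    static atomic_ulong events;

    void count_event (void)
    {
      /* Atomic, but no fence and no synchronizes-with relation.  */
      atomic_fetch_add_explicit (&events, 1, memory_order_relaxed);
    }

    unsigned long snapshot_events (void)
    {
      return atomic_load_explicit (&events, memory_order_relaxed);
    }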

-- 
Lawrence Crowl


Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Lawrence Crowl
On 9/11/11, Andrew MacLeod  wrote:
> On 09/09/2011 09:09 PM, Geert Bosch wrote:
>> For the C++0x atomic types there are:
>>
>> void A::store(C desired, memory_order order = memory_order_seq_cst)
>> volatile;
>> void A::store(C desired, memory_order order = memory_order_seq_cst);
>>
>> where the first variant (with order = memory_order_relaxed)
>> would allow fences to be omitted, while still preventing the compiler from
>> reordering memory accesses, IIUC.
>
> I thought the volatile tags were actually for type correctness, so the
> compiler wouldn't complain when used on volatile objects.  I.e., you
> can't call a non-volatile method on a volatile object, or something
> like that.

Volatile means the same thing with atomics that it does without them:
an external agent may affect the value, or writes
to the value may affect external agents.

The C++ standards committee has anticipated optimizing atomic
operations, and even eliminating them entirely when it eliminates
entire threads.  However, one cannot do that with volatile because
you might fail to coordinate with the external agent.

I think the most likely use of volatile atomics is in communicating
between two processes sharing memory.  Note, though, that such
atomics may need to be lock-free and/or address-free.
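
A rough sketch of that inter-process case (assuming POSIX shared memory and C11 atomics; error handling omitted, link with -lrt on older glibc). The atomic needs to be lock-free so it also works across address spaces:

    #include <stdatomic.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main (void)
    {
      int fd = shm_open ("/demo_flag", O_CREAT | O_RDWR, 0600);
      ftruncate (fd, sizeof (atomic_int));
      volatile atomic_int *flag = mmap (NULL, sizeof (atomic_int),
                                        PROT_READ | PROT_WRITE,
                                        MAP_SHARED, fd, 0);
      /* The compiler may not eliminate this store: another process
         (an "external agent") may be watching the flag.  */
      atomic_store (flag, 1);
      munmap ((void *) flag, sizeof (atomic_int));
      close (fd);
      return 0;
    }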

-- 
Lawrence Crowl


Re: should sync builtins be full optimization barriers?

2011-09-12 Thread Paolo Bonzini
On Tue, Sep 13, 2011 at 03:52, Geert Bosch  wrote:
> No, it is possible, and actually likely. Basically, the issue is write 
> buffers.
> The coherency mechanisms come into play at a lower level in the
> hierarchy (typically at the last-level cache), which is why we need fences
> to start with to implement things like spin locks.

You need fences on x86 to implement Peterson or Dekker spin locks, but
only because they involve write-read ordering to different memory
locations (I mention those spin lock algorithms because they do
not require locked memory accesses).  Write-write, read-read and,
for the same location, write-read ordering are guaranteed by the processor.
 The same goes for coherency, which is a looser property.
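
A sketch of the write-read pattern in question, the store-buffer litmus test behind Dekker/Peterson (a hypothetical standalone program, not from this thread): without a full barrier, r1 == 0 && r2 == 0 is observable even on x86.

    #include <pthread.h>
    #include <stdio.h>

    static volatile int X, Y;
    static int r1, r2;

    static void *thread_a (void *arg)
    {
      (void) arg;
      X = 1;
      /* __sync_synchronize ();   <-- needed to forbid r1 == 0 && r2 == 0 */
      r1 = Y;
      return NULL;
    }

    static void *thread_b (void *arg)
    {
      (void) arg;
      Y = 1;
      /* __sync_synchronize (); */
      r2 = X;
      return NULL;
    }

    int main (void)
    {
      pthread_t a, b;
      pthread_create (&a, NULL, thread_a, NULL);
      pthread_create (&b, NULL, thread_b, NULL);
      pthread_join (a, NULL);
      pthread_join (b, NULL);
      printf ("r1=%d r2=%d\n", r1, r2);
      return 0;
    }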

However, accesses in those spin lock algorithms are definitely
_not_ relaxed; not all of them, at least.

> No that's false. Even on systems with nice memory models, such as x86
> and SPARC with a TSO model, you need a fence to avoid that a write-load
> of the same location is forced to make it all the way to coherent memory
> and not forwarded directly from the write buffer or L1 cache.

Not sure about SPARC, but this is definitely false on x86.

Granted, even if you do not have to insert fences, those writes are likely
_not_ free.  The processor needs to do more than, say, on PPC, so I
wouldn't be surprised if conflicting memory accesses are quite a bit more
expensive on x86 than on PPC.  Recently, a colleague of mine tried
replacing optimization barriers with full barriers in one of two
threads implementing a ring buffer; that thread became 30% slower,
but the other thread sped up by basically the same amount.

Paolo