Re: Atomic builtins questions

2012-11-23 Thread Yvan Roux
Hi Ramana and Peter,


There is no issue in the first case. You are correct that the dmb’s there
> are to ensure the sequential consistency as you’d want to see with
> __ATOMIC_SEQ_CST in the call to the builtin. However what you must remember
> is that STRs are guaranteed to be single-copy atomic by the architecture on
> v7-a. In general on v7-a there is single-copy atomicity with loads and
> stores to byte, half-words (2 byte aligned) and word  (4 byte aligned)
> addresses i.e. ldrb, ldrh and ldr.
>

ok, but if I understand the ARM ARM atomicity chapter (A3.5.3) correctly,
single-copy atomicity is not guaranteed for unaligned accesses. So maybe I
missed something, but if it is allowed to call an atomic builtin on an
unaligned address, we could have an issue.
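
For illustration, a minimal sketch of what I mean (not from this thread; the
struct and function names are only examples):

#include <stdint.h>

/* Naturally aligned word: on v7-A the underlying ldr is single-copy atomic,
   so the builtin only has to add the barriers.  */
uint32_t
load_aligned (uint32_t *p)
{
  return __atomic_load_n (p, __ATOMIC_SEQ_CST);
}

/* A packed struct can put a 32-bit member at an unaligned offset; the
   builtin still accepts the address, but A3.5.3 gives no single-copy
   atomicity guarantee for the unaligned access.  */
struct __attribute__ ((packed)) unaligned_box
{
  uint8_t  tag;
  uint32_t value;   /* typically at offset 1 */
};

uint32_t
load_possibly_unaligned (struct unaligned_box *b)
{
  return __atomic_load_n (&b->value, __ATOMIC_SEQ_CST);
}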


On systems with LPAE enabled, ldrd and strd are also guaranteed to be
> atomic copies from 64 bit aligned addresses, thus in general for v7-a you
> should be using the ldrexd / strexd variant in this particular case. Thus
> the code for the first example should work correctly as is and **not**
> require an ldrex / strex implementation. 
>
> The second case always hurts my head – the short story is that you are
> right and it looks like a bug. The barrier needs to be after the load and
> not before because `Acquire’ semantics imply that no reads in the current
> thread dependent on the value currently loaded can be reordered before this
> load. What this means is that loads before this operation can percolate
> downwards towards the barrier.
>
>  The other clue to this is the definition around the `Release’ semantics,
> where no writes in the current thread can be reordered after this store.
> This implies that the write operation puts out barriers **before** the
> write which is what we do with __atomic_store_n (addr, 0,
> __ATOMIC_RELEASE); and that appears to be correct.
>


I agree.
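
For reference, my reading of what the generated v7-A sequences should look
like (a sketch, not actual compiler output; function names are only
examples):

#include <stdint.h>

int
load_acquire (int *p)
{
  /* expected: ldr ; dmb  -- the barrier belongs after the load.  */
  return __atomic_load_n (p, __ATOMIC_ACQUIRE);
}

void
store_release (int *p)
{
  /* expected: dmb ; str  -- the barrier belongs before the store, which
     matches the __atomic_store_n (addr, 0, __ATOMIC_RELEASE) case above.  */
  __atomic_store_n (p, 0, __ATOMIC_RELEASE);
}

uint64_t
load64_acquire (uint64_t *p)
{
  /* for the 64-bit case on v7-A this should use the exclusive (ldrexd)
     form rather than a plain ldrd, again with the barrier after it.  */
  return __atomic_load_n (p, __ATOMIC_ACQUIRE);
}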


>  However what’s not clear to me is why there is this deliberate choice in
> the default implementation of expand_atomic_load to put out a memory fence
> before the load and that seems to have percolated back down into the
> backend for atomic_loaddi. We should take this up upstream as a bug.
>

It is not clear to me either; I would have written expand_atomic_load like
this:

/* Only a sequentially consistent load needs a fence before the access.  */
if (model == MEMMODEL_SEQ_CST)
  expand_mem_thread_fence (model);

emit_move_insn (target, mem);

/* Acquire (and seq-cst) semantics put the fence after the load.  */
if (model == MEMMODEL_SEQ_CST || model == MEMMODEL_ACQUIRE)
  expand_mem_thread_fence (model);

I'll ask the question upstream.

Yvan


[ACTIVITY] 19-23 November 2012

2012-11-23 Thread Matthew Gretton-Dann
== Blueprints ==

                              Initial       Current       Actual
initial-aarch64-backport      31 Oct 2012   30 Nov 2012
aarch64-baremetal-testing     31 Oct 2012   30 Nov 2012
fix-gcc-multiarch-testing     31 Dec 2012   31 Dec 2012
backport-fma-intrinsic        31 Dec 2012   31 Dec 2012
fused-multiply-add-support    31 Dec 2012   31 Dec 2012
gcc-investigate-lra-for-arm   31 Dec 2012   31 Dec 2012

== Progress ==

 * Admin
   * Interviewing
 * Investigate patches for literal pool layout bug
   * Took longer than expected as the 'simple' fix is wrong due to GCC not
 knowing how large instructions actually are.
   * Patch posted upstream
 * Post triplet backport patches upstream
 * Other bug issues
   * Including an issue running SPEC2K on x86 with recent trunk

== Next Week ==

 * Run HOT/COLD partitioning benchmarks
   * Analyse ARM results
   * Run on x86_64 to see what actual benefit we could get
 * initial-aarch64-backport & aarch64-baremetal-testing
   * Finish documentation
 * gcc-investigate-lra-for-arm
   * Analyse benchmarks
 * fix-gcc-multiarch-testing
   * Come up with strawman proposal for updating testsuite to handle
 testing with varying command-line options.

== Future ==

 * backport-fma-intrinsic & fused-multiply-add-support
   * Backport patches once fix-gcc-multiarch-testing has been done.

--
Matthew Gretton-Dann
Linaro Toolchain Working Group
matthew.gretton-d...@linaro.org



RE: Atomic builtins questions

2012-11-23 Thread Ramana Radhakrishnan
Yvan,

That is correct. The addresses need to be aligned as per the restrictions in
the architecture. Yes, we could have an issue, but software writers need to
deal with it because IIUC you either have a huge performance penalty (x86 /
lock prefix) or correctness issues (ARM, PowerPC).
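
One way for software to uphold its side of the bargain is to make the
alignment requirement explicit at the type level - a quick sketch (the names
are only examples):

#include <stdint.h>

/* Force (or just document) natural 8-byte alignment so the 64-bit atomic
   builtins stay on the architecturally atomic path: ldrd under LPAE, or
   the ldrexd / strexd sequence otherwise.  */
typedef uint64_t aligned_u64 __attribute__ ((aligned (8)));

static aligned_u64 ticks;

uint64_t
read_ticks (void)
{
  return __atomic_load_n (&ticks, __ATOMIC_SEQ_CST);
}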

Cheers,
Ramana



[ACTIVITY] report week 46

2012-11-23 Thread Peter Maydell
 * meetings/discussions
   + upcoming architecture & h/w features
   + virtio patchset review/discussion
   + 1-2-1 Matt, Michael
   + arndale board secondary CPU boot
   + naming conventions for v8 qemu
   + v7 KVM blueprint status
   + posture assessment
   + KVM call
 * thinking through possible corner cases for KVM with stage2
   faults that don't tell the hypervisor the offending IPA
 * rebased qemu-linaro on qemu 1.3 rc0
 * fixed upstream regression in "-usb" command line

KVM blueprint progress tracker:
http://ex.seabright.co.nz/helpers/backlog?group_by=topic&colour_by=state&projects=linaro-kvm

-- PMM



[ACTIVITY] 19-23 November 2012

2012-11-23 Thread Christophe Lyon
== Progress ==
* Turn off 64-bit bitops in Neon: initial implementation under
benchmarking.
  Currently it modifies the handling of: add, sub, and, or, xor, shifts,
  not. In some cases the generated code is quite a bit larger, so it will
  need careful benchmarking.

* Started looking at the "disable peeling" blueprint. Reading GCC source code
  to get more familiar with that area.

* Internal support