Re: [PATCH, AArch64 v2 05/11] aarch64: Emit LSE st instructions

2018-10-31 Thread Will Deacon
Hi Richard,

On Wed, Oct 31, 2018 at 10:27:59AM +, Richard Henderson wrote:
> On 10/30/18 8:32 PM, James Greenhalgh wrote:
> > On Tue, Oct 02, 2018 at 11:19:09AM -0500, Richard Henderson wrote:
> >> When the result of an operation is not used, we can ignore the
> >> result by storing to XZR.  For two of the memory models, using
> >> XZR with LD<op> has a preferred assembler alias, ST<op>.
> > 
> > ST<op> has different semantics to LD<op>; in particular, ST<op> is not
> > ordered by a DMB LD, so this could weaken the LDADD and break C11 semantics.
> > 
> > The relevant Arm Arm text is:
> > 
> >   If the destination register is not one of WZR or XZR, LDADDA and
> >   LDADDAL load from memory with acquire semantics
> > 
> >   LDADDL and LDADDAL store to memory with release semantics.
> > 
> >   LDADD has no memory ordering requirements.
> > 
> > I'm taking this to mean that even if the result is unused, using XZR is not
> > a valid transformation; it weakens the expected acquire semantics to
> > unordered.
> > 
> > The example I have from Will Deacon on an internal bug database is:
> > 
> >   P0 (atomic_int* y,atomic_int* x) {
> > atomic_store_explicit(x,1,memory_order_relaxed);
> > atomic_thread_fence(memory_order_release);
> > atomic_store_explicit(y,1,memory_order_relaxed);
> >   }
> > 
> >   P1 (atomic_int* y,atomic_int* x) {
> > int r0 = atomic_fetch_add_explicit(y,1,memory_order_relaxed);
> > atomic_thread_fence(memory_order_acquire);
> > int r1 = atomic_load_explicit(x,memory_order_relaxed);
> >   }
> > 
> >   The outcome where y == 2 and P1 has r0 = 1 and r1 = 0 is illegal.
> > 
> > This example comes from a while back in my memory, so I'm copying Will for
> > any more detailed questions.
> > 
> > My impression is that this transformation is not safe, and so the patch is
> > not OK.
> 
> Here's a revised patch.
> 
> Use ST<op> for relaxed and release orderings; retain the (non-xzr) scratch
> register for other orderings.  But the scratch need not be early-clobber,
> since there's no mid-point of half-consumed inputs.
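For context, the transformation only matters when the result of the
read-modify-write is discarded.  A minimal illustration of the two cases
(my own sketch, not code from the patch; the instructions named in the
comments are the intended mapping, not actual compiler output):

  #include <stdatomic.h>

  void count_event (atomic_int *counter)
  {
    /* Result discarded: a candidate for the ST<op> alias, e.g.
       "stadd w1, [x0]", when the ordering permits it.  */
    atomic_fetch_add_explicit (counter, 1, memory_order_relaxed);
  }

  int count_event_and_read (atomic_int *counter)
  {
    /* Result used: must keep the LD<op> form with a real destination
       register, e.g. "ldadd w1, w2, [x0]".  */
    return atomic_fetch_add_explicit (counter, 1, memory_order_relaxed);
  }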

The example test above uses relaxed atomics in conjunction with an acquire
fence, so I don't think we can actually use ST<op> at all without a change
to the language specification. I previously allocated P0861 for this purpose
but never got a chance to write it up...

Perhaps the issue is a bit clearer with an additional thread (not often I
say that!):


P0 (atomic_int* y,atomic_int* x) {
  atomic_store_explicit(x,1,memory_order_relaxed);
  atomic_thread_fence(memory_order_release);
  atomic_store_explicit(y,1,memory_order_relaxed);
}

P1 (atomic_int* y,atomic_int* x) {
  atomic_fetch_add_explicit(y,1,memory_order_relaxed);  // STADD
  atomic_thread_fence(memory_order_acquire);
  int r0 = atomic_load_explicit(x,memory_order_relaxed);
}

P2 (atomic_int* y) {
  int r1 = atomic_load_explicit(y,memory_order_relaxed);
}


My understanding is that it is forbidden for r0 == 0 and r1 == 2 after
this test has executed. However, if the relaxed add in P1 compiles to
STADD and the subsequent acquire fence is compiled as DMB LD, then we
don't have any ordering guarantees in P1 and the forbidden result could
be observed.
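For anyone who wants to experiment, a compilable version of the test is
sketched below (the threading harness and file layout are mine).  The key
step in the argument: r1 == 2 is only possible if P1's add was applied on
top of P0's store to y, and in that case the release fence in P0 and the
acquire fence in P1 synchronize, so P1's load of x must see 1.  Note that a
single run will almost never expose the reordering, even on a CPU where the
compiler emits STADD.

  /* Compile with: cc -O2 -pthread litmus.c  (file name is arbitrary).  */
  #include <stdatomic.h>
  #include <pthread.h>
  #include <stdio.h>

  static atomic_int x, y;
  static int r0, r1;

  static void *p0 (void *arg)
  {
    atomic_store_explicit (&x, 1, memory_order_relaxed);
    atomic_thread_fence (memory_order_release);
    atomic_store_explicit (&y, 1, memory_order_relaxed);
    return NULL;
  }

  static void *p1 (void *arg)
  {
    atomic_fetch_add_explicit (&y, 1, memory_order_relaxed);  /* STADD? */
    atomic_thread_fence (memory_order_acquire);
    r0 = atomic_load_explicit (&x, memory_order_relaxed);
    return NULL;
  }

  static void *p2 (void *arg)
  {
    r1 = atomic_load_explicit (&y, memory_order_relaxed);
    return NULL;
  }

  int main (void)
  {
    pthread_t t[3];
    pthread_create (&t[0], NULL, p0, NULL);
    pthread_create (&t[1], NULL, p1, NULL);
    pthread_create (&t[2], NULL, p2, NULL);
    for (int i = 0; i < 3; i++)
      pthread_join (t[i], NULL);
    /* Forbidden by C11: r1 == 2 together with r0 == 0.  */
    printf ("r0=%d r1=%d\n", r0, r1);
    return 0;
  }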

Will


Re: [PATCH, AArch64 v2 05/11] aarch64: Emit LSE st instructions

2018-10-31 Thread Will Deacon
On Wed, Oct 31, 2018 at 04:38:53PM +, Richard Henderson wrote:
> On 10/31/18 3:04 PM, Will Deacon wrote:
> > The example test above uses relaxed atomics in conjunction with an acquire
> > fence, so I don't think we can actually use ST<op> at all without a change
> > to the language specification. I previously allocated P0861 for this purpose
> > but never got a chance to write it up...
> > 
> > Perhaps the issue is a bit clearer with an additional thread (not often I
> > say that!):
> > 
> > 
> > P0 (atomic_int* y,atomic_int* x) {
> >   atomic_store_explicit(x,1,memory_order_relaxed);
> >   atomic_thread_fence(memory_order_release);
> >   atomic_store_explicit(y,1,memory_order_relaxed);
> > }
> > 
> > P1 (atomic_int* y,atomic_int* x) {
> >   atomic_fetch_add_explicit(y,1,memory_order_relaxed);  // STADD
> >   atomic_thread_fence(memory_order_acquire);
> >   int r0 = atomic_load_explicit(x,memory_order_relaxed);
> > }
> > 
> > P2 (atomic_int* y) {
> >   int r1 = atomic_load_explicit(y,memory_order_relaxed);
> > }
> > 
> > 
> > My understanding is that it is forbidden for r0 == 0 and r1 == 2 after
> > this test has executed. However, if the relaxed add in P1 compiles to
> > STADD and the subsequent acquire fence is compiled as DMB LD, then we
> > don't have any ordering guarantees in P1 and the forbidden result could
> > be observed.
> 
> I suppose I don't understand exactly what you're saying.

Apologies, I'm probably not explaining things very well. I'm trying to
avoid getting into the C11 memory model relations if I can help it, hence
the example.

> I can see that, yes, if you split the fetch-add from the acquire in P1 you get
> the incorrect results you describe.  But isn't that a bug in the test itself?

Per the C11 memory model, the test above is well-defined and if r1 == 2
then it is required that r0 == 1. With your proposal, this is not guaranteed
for AArch64, and it would be possible to end up with r1 == 2 and r0 == 0.

> Why would not the only correct version have
> 
> P1 (atomic_int* y, atomic_int* x) {
>   atomic_fetch_add_explicit(y, 1, memory_order_acquire);
>   int r0 = atomic_load_explicit(x, memory_order_relaxed);
> }
> 
> at which point we won't use STADD for the fetch-add, but LDADDA.

That would indeed work correctly, but the problem is that the C11 memory
model doesn't rule out the previous test: it is portable, well-defined code,
so the compiler still has to get it right.

> If the problem is more fundamental than this, would you have another go at
> explaining?  In particular, I don't see the difference between
> 
>   ldadd   val, scratch, [base]
>   vs
>   stadd   val, [base]
> 
> and
> 
>   ldaddl  val, scratch, [base]
>   vs
>   staddl  val, [base]
> 
> where both pairs of instructions have the same memory ordering semantics.
> Currently we are always producing the ld version of each pair.

Aha, maybe this is the problem. An acquire fence on AArch64 is implemented
using a DMB LD instruction, which orders prior reads against subsequent
reads and writes. However, the architecture says:

  | The ST<op> instructions, and LD<op> instructions where the destination
  | register is WZR or XZR, are not regarded as doing a read for the purpose
  | of a DMB LD barrier.

so an ST<op> atomic is not affected by a subsequent acquire fence, whereas
an LD<op> atomic is.

Does that help at all?

Will


Re: [RFC][AArch64] Add support for system register based stack protector canary access

2019-01-10 Thread Will Deacon
On Thu, Jan 10, 2019 at 03:49:27PM +, James Greenhalgh wrote:
> On Mon, Dec 03, 2018 at 03:55:36AM -0600, Ramana Radhakrishnan wrote:
> > For quite some time the kernel guys (more specifically Ard) have been
> > talking about using a system register (sp_el0) and an offset from it
> > for canary-based access. This patchset adds support for a new set of
> > command-line options, similar to how powerpc has done this.
> > 
> > I don't intend to change the defaults in userland; we've discussed this
> > for userland in the past, and as far as glibc and userland are concerned
> > we stick to the options as they currently exist. The system register
> > option is really for the kernel to use along with an offset, as they
> > control their ABI and this is a decision for them to make.
> > 
> > I did consider sticking this all under an mcmodel=kernel-small option but
> > thought that would be a bit too aggressive. There is very little error
> > checking I can do on the system register being used; really, the
> > assembler would barf quite quickly if things go wrong. I've
> > managed to rebuild Ard's kernel tree with an additional patch that
> > I will send to him. I haven't managed to boot this kernel.
> > 
> > There was an additional question asked about the performance
> > characteristics of this, but it's a security feature and the kernel
> > doesn't have the luxury of a hidden symbol. Further, since the kernel
> > uses sp_el0 for access everywhere, if they choose to use the same
> > register I don't think the performance characteristics would be too bad;
> > but that's a decision for the kernel folks to make when taking the
> > feature into the kernel.
> > 
> > I still need to add some tests and documentation in invoke.texi but
> > this is at the stage where it would be nice for some other folks
> > to look at this.
> > 
> > The difference in code generated is as below.
> > 
> > extern void bar (char *);
> > int foo (void)
> > {
> >char a[100];
> >bar (&a);
> > }
> > 
> > $GCC -O2  -fstack-protector-strong  vs 
> > -mstack-protector-guard-reg=sp_el0 -mstack-protector-guard=sysreg 
> > -mstack-protector-guard-offset=1024 -fstack-protector-strong
> > 
> > 
> > --- tst.s   2018-12-03 09:46:21.174167443 +
> > +++ tst.s.1 2018-12-03 09:46:03.546257203 +
> > @@ -15,15 +15,14 @@
> > mov x29, sp
> > str x19, [sp, 16]
> > .cfi_offset 19, -128
> > -   adrp x19, __stack_chk_guard
> > -   add x19, x19, :lo12:__stack_chk_guard
> > -   ldr x0, [x19]
> > -   str x0, [sp, 136]
> > -   mov x0,0
> > +   mrs x19, sp_el0
> > add x0, sp, 32
> > +   ldr x1, [x19, 1024]
> > +   str x1, [sp, 136]
> > +   mov x1,0
> > bl  bar
> > ldr x0, [sp, 136]
> > -   ldr x1, [x19]
> > +   ldr x1, [x19, 1024]
> > eor x1, x0, x1
> > cbnz x1, .L5
> > 
> > 
> > 
> > 
> > I will be afk tomorrow and the day after, but this is to elicit some comments
> > and for Ard to try this out with his kernel patches.
> > 
> > Thoughts ?
> 
> I didn't see an answer on list to Ard's questions about the command-line logic.

FWIW: the kernel-side is now merged upstream in 5.0-rc1:

http://git.kernel.org/linus/0a1213fa7432

where we ended up checking for the presence of all three options to be
on the safe side.
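For readers unfamiliar with the kernel side: the point of the sysreg guard
is that it turns the canary into a per-task value, because the arm64 kernel
keeps the current task pointer in sp_el0 while running in kernel mode.  A
rough C picture (hypothetical structure layout, with the offset chosen only
to line up with the -mstack-protector-guard-offset=1024 example above):

  /* Sketch only; the real task_struct layout and canary offset are
     determined by the kernel, not by this snippet.  */
  struct task_like
  {
    char other_fields[1024];     /* ... so the canary sits at offset 1024 */
    unsigned long stack_canary;  /* value the prologue loads and the
                                    epilogue compares against             */
  };

  /* With -mstack-protector-guard=sysreg -mstack-protector-guard-reg=sp_el0
     -mstack-protector-guard-offset=1024, each protected function reads the
     canary from *(unsigned long *) (SP_EL0 + 1024) rather than from the
     global __stack_chk_guard symbol, as the diff above shows.  */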

Will


Re: [PATCH] ARM: Weaker memory barriers

2014-03-11 Thread Will Deacon

Hi John,

On Tue, Mar 11, 2014 at 02:54:18AM +, John Carr wrote:
> A comment in arm/sync.md notes "We should consider issuing a inner
> shareability zone barrier here instead."  Here is my first attempt
> at a patch to emit weaker memory barriers.  Three instructions seem
> to be relevant for user mode code on my Cortex A9 Linux box:
> 
> dmb ishst, dmb ish, dmb sy
> 
> I believe these correspond to a release barrier, a full barrier
> with respect to other CPUs, and a full barrier that also orders
> relative to I/O.

Not quite; DMB ISHST only orders writes with other writes, so loads can move
across it in both directions. That means it's not sufficient for releasing a
lock, for example.



I gave a presentation at ELCE about the various ARMv7 barrier options (from
a kernel perspective):

  https://www.youtube.com/watch?v=6ORn6_35kKo



> Consider this a request for comments on whether the approach is correct.
> I haven't done any testing yet (beyond eyeballing the assembly output).

You'll probably find that a lot of ARM micro-architectures treat them
equivalently, which makes this a good source of potentially latent bugs if
you get the wrong options. When I added these options to the kernel, I
tested with a big/little A7/A15 setup using a CCI400 (the cache-coherent
interconnect) to try and tickle things a bit better, but it's by no means
definitive.

>  (define_insn "*memory_barrier"
>[(set (match_operand:BLK 0 "" "")
> - (unspec:BLK [(match_dup 0)] UNSPEC_MEMORY_BARRIER))]
> -  "TARGET_HAVE_MEMORY_BARRIER"
> + (unspec:BLK
> +  [(match_dup 0) (match_operand:SI 1 "const_int_operand")]
> +  UNSPEC_MEMORY_BARRIER))]
> +  "TARGET_HAVE_DMB || TARGET_HAVE_MEMORY_BARRIER"
>{
>  if (TARGET_HAVE_DMB)
>{
> - /* Note we issue a system level barrier. We should consider issuing
> -a inner shareabilty zone barrier here instead, ie. "DMB ISH".  */
> - /* ??? Differentiate based on SEQ_CST vs less strict?  */
> - return "dmb\tsy";
> +switch (INTVAL(operands[1]))
> + {
> + case MEMMODEL_RELEASE:
> +  return "dmb\tishst";

As stated above, this isn't correct.

> + case MEMMODEL_CONSUME:

You might be able to build this one out of  + ISB (in
lieu of a true address dependency). However, given the recent monolithic C11
thread that took over this list and others, let's play it safe for now and
stick with DMB ISH.

> + case MEMMODEL_ACQUIRE:
> + case MEMMODEL_ACQ_REL:
> +   return "dmb\tish";

I think you probably *always* want ISH for GCC. You assume normal, cacheable,
coherent memory, right? That also mirrors the first comment you delete
above.

> + case MEMMODEL_SEQ_CST:
> +   return "dmb\tsy";

Again, ISH here should be fine. SY is for things like non-coherent devices,
which I don't think GCC has a concept of (the second comment you delete
doesn't make any sense and has probably tricked you).
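Putting those suggestions together, the mapping comes out as "DMB ISH for
everything that needs a barrier at all".  A standalone sketch (not the
actual sync.md change; the MEMMODEL_* names mirror GCC's internal enum but
are redefined here so the snippet compiles on its own):

  enum memmodel
  {
    MEMMODEL_RELAXED, MEMMODEL_CONSUME, MEMMODEL_ACQUIRE,
    MEMMODEL_RELEASE, MEMMODEL_ACQ_REL, MEMMODEL_SEQ_CST
  };

  static const char *
  arm_fence_insn (enum memmodel model)
  {
    switch (model)
      {
      case MEMMODEL_RELAXED:
        return "";         /* no barrier required */
      default:
        /* Acquire, release, acq_rel, seq_cst, and (playing it safe)
           consume all become an inner-shareable full barrier.  "dmb ishst"
           orders only stores against stores, so it is too weak for a
           release, and "dmb sy" buys nothing extra for normal, cacheable,
           coherent memory.  */
        return "dmb\tish";
      }
  }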

Hope that helps,

Will


Re: [PATCH] ARM: Weaker memory barriers

2014-03-12 Thread Will Deacon
On Tue, Mar 11, 2014 at 09:12:53PM +, John Carr wrote:
> Will Deacon  wrote:
> > On Tue, Mar 11, 2014 at 02:54:18AM +, John Carr wrote:
> > > A comment in arm/sync.md notes "We should consider issuing a inner
> > > shareability zone barrier here instead."  Here is my first attempt
> > > at a patch to emit weaker memory barriers.  Three instructions seem
> > > to be relevant for user mode code on my Cortex A9 Linux box:
> > > 
> > > dmb ishst, dmb ish, dmb sy
> > > 
> > > I believe these correspond to a release barrier, a full barrier
> > > with respect to other CPUs, and a full barrier that also orders
> > > relative to I/O.
> > 
> > Not quite; DMB ISHST only orders writes with other writes, so loads can move
> > across it in both directions. That means it's not sufficient for releasing a
> > lock, for example.
> 
> Release in this context doesn't mean "lock release".  I understand
> it to mean release in the specific context of the C++11 memory model.
> (Similarly, if you're arguing standards compliance, "inline" really
> means "relax the one definition rule for this function.")
> 
> I don't see a prohibition on moving non-atomic loads across a release
> store.  Can you point to an analysis that shows a full barrier is needed?

Well, you can use acquire/release to implement a lock easily enough. For
example, try feeding the following to cppmem:

  int main() {
int x = 0, y = 0;
atomic_int z = 0;

{{{ { r1 = x; y = 1;
  z.store(1, memory_order_release); }
||| { r0 = z.load(memory_order_acquire).readsvalue(1);
  r1 = y; x = 1;}
}}}

return 0;
  }

There is one consistent execution, which requires the first thread to have
r1 == 0 (i.e. read x as zero) and the second thread to have r1 == 1 (i.e.
read y as 1).
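For those without cppmem to hand, the same example in plain C11 looks
roughly like this (my translation, not from the original mail; the spin
loop stands in for the readsvalue(1) constraint):

  #include <stdatomic.h>
  #include <pthread.h>

  static int x, y;             /* non-atomic data                      */
  static atomic_int z;
  static int t0_r1, t1_r1;

  static void *t0 (void *arg)
  {
    t0_r1 = x;
    y = 1;
    atomic_store_explicit (&z, 1, memory_order_release);
    return NULL;
  }

  static void *t1 (void *arg)
  {
    while (atomic_load_explicit (&z, memory_order_acquire) != 1)
      ;                        /* wait until the release store is seen */
    t1_r1 = y;
    x = 1;
    return NULL;
  }

  int main (void)
  {
    pthread_t a, b;
    pthread_create (&a, NULL, t0, NULL);
    pthread_create (&b, NULL, t1, NULL);
    pthread_join (a, NULL);
    pthread_join (b, NULL);
    /* The only outcome C11 allows: t0_r1 == 0 and t1_r1 == 1.  */
    return 0;
  }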

If we implement store-release using DMB ISHST, the assembly code would look
something like the following (I've treated the atomic accesses like normal
load/store instructions for clarity, since they don't affect the ordering
here):

  T0:

  LDR r1, [x]
  STR #1, [y]
  DMB ISHST
  STR #1, [z]

  T1:

  LDR r0, [z] // Reads 1
  DMB ISH
  LDR r1, [y]
  STR #1, [x]

The problem with this is that the LDR in T0 can be re-ordered *past* the
rest of the sequence, potentially resulting in r1 == 1, which is forbidden.
It's just like reading from a shared, lock-protected data structure without
the lock held.

> If we assume that gcc is used to generate code for processes running
> within a single inner shareable domain, then we can start by demoting
> "dmb sy" to "dmb ish" for the memory barrier with no other change.

I'm all for such a change.

> If a store-store barrier has no place in the gcc atomic memory model,
> that supports my hypothesis that a twisty maze of ifdefs is superior to
> a "portable" attractive nuisance.

I don't understand your point here.

Will