Optimization Option Question

2018-12-19 Thread Tangnianyao (ICT)
Greetings All,
I am dealing with compile optimization comparison between arm64 and intel 
platform, with g++ (version 4.9.4).

Compile the following C++ code,

uint32 Witness::getEntityVolatileDataUpdateFlags(Entity* otherEntity)
{
 uint32 flags = UPDATE_FLAG_NULL;


 if (otherEntity->controlledBy() && pEntity_->id() == 
otherEntity->controlledBy()->id())
   return flags;

 ...
}

This code performs successive memory loads at the entry of the function, 
where the latter load depends on the result of the former one.
Compiling on an Intel x86-64 platform, we find that g++ places one load 
instruction in front of the stack-push instructions of the function call, 
which can save some time waiting for the former load to finish on an 
out-of-order processor.  We use the optimization options -O1, 
-fpartial-inlining, -fschedule-insns2, -ftree-pre.
On the arm64 platform, we use the same optimization options to compile the 
same code but see no similar result: no load instruction is placed before 
the stack-push instructions.  Nor do we get that result using -O2, -O3, or 
-Ofast on arm64.

Does g++ perform a similar optimization on arm64?
If so, which optimization options should I use?

Thanks,
-Nianyao Tang



Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Richard Biener
On Tue, 18 Dec 2018, Jeff Law wrote:

> On 12/18/18 8:36 AM, Richard Biener wrote:
> > 
> > Hi,
> > 
> > in the past weeks I've been looking into prototyping both spectre V1 
> > (speculative array bound bypass) diagnostics and mitigation in an
architecture independent manner to assess feasibility and some kind
> > of upper bound on the performance impact one can expect.
> > https://lists.llvm.org/pipermail/llvm-dev/2018-March/122085.html is
> > an interesting read in this context as well.
> > 
> > For simplicity I have implemented mitigation on GIMPLE right before
> > RTL expansion and have chosen TLS to do mitigation across function
> > boundaries.  Diagnostics sit in the same place but both are not in
> > any way dependent on each other.
> > 
> > The mitigation strategy chosen is that of tracking speculation
> > state via a mask that can be used to zero parts of the addresses
> > that leak the actual data.  That's similar to what aarch64 does
> > with -mtrack-speculation (but oddly there's no mitigation there).
> > 
> > I've optimized things to the point that is reasonable when working
> > target independent on GIMPLE but I've only looked at x86 assembly
> > and performance.  I expect any "final" mitigation if we choose to
> > implement and integrate such would be after RTL expansion since
> > RTL expansion can end up introducing quite some control flow whose
> > speculation state is not properly tracked by the prototype.
> > 
> > I'm cut&pasting single-runs of SPEC INT 2006/2017 here, the runs
> > were done with -O2 [-fspectre-v1={2,3}] where =2 is function-local
mitigation, and =3 does mitigation globally, passing the state
> > via TLS memory.
> > 
> > The following was measured on a Haswell desktop CPU:
> [ ... ]
> Interesting.  So we'd been kicking this issue around a bit internally.
> 
> The number of packages where we'd want to turn this on was very small
> and thus it was difficult to justify burning resources in this space.
> LLVM might be an option for those limited packages, but LLVM is missing
> other security things we don't want to lose (such as stack clash
> mitigation).
> 
> In the end we punted for the immediate future.  We'll almost certainly
> revisit at some point and your prototype would obviously factor into the
> calculus around future decisions.
> 
> [ ... ]
> 
> 
> > 
> > 
> > The patch relies heavily on RTL optimizations for DCE purposes.  At the
> > same time we rely on RTL not statically computing the mask (RTL has no
> > conditional constant propagation).  Full instrumentation of the classic
> > Spectre V1 testcase
> Right. But it does do constant propagation into arms of conditionals as
> well as jump threading.  I'd fear they might compromise things.

jump threading shouldn't be an issue since that elides the conditional.
I didn't see constant propagation into arms of conditionals happening.
We don't do that on GIMPLE either ;)  I guess I have avoided this
by making the condition data dependent on the mask.  That is, I
transform

  if (a > b)

to

  mask = a > b ? -1 : 0;
  if (mask)
...

so one needs to replace the condition with the conditional on the
computed mask.
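For illustration, the effect of this transform at the source level could
be sketched like so (hypothetical helper, not the GIMPLE the prototype
actually emits):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the mask-based mitigation described above: instead of
   branching on the bounds check directly, derive an all-ones/all-zeros
   mask from it, branch on the mask itself, and AND the mask into the
   index, so that a misspeculated execution of the load reads index 0
   rather than the attacker-controlled one.  */
static int
array_read_masked (const int *a, size_t idx, size_t len)
{
  /* mask = -1 when in bounds, 0 otherwise.  */
  uintptr_t mask = (idx < len) ? (uintptr_t) -1 : 0;
  if (mask)
    return a[idx & mask];  /* under misspeculation this reads a[0] */
  return 0;
}
```

Of course, at the C level the compiler is free to simplify this away
again; doing the transform late (at or after RTL expansion) is what
keeps the data dependence intact.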

But yes, for a "final" solution that also gives more control to
targets I thought of allowing (with a fallback doing something like the above)
the targets to supply a set-mask-and-jump pattern combining
conditional, mask generation and jump.  I guess those would look
similar to the -fwrapv plusv patterns we have in i386.md.

> Obviously we'd need to look further into those issues.  But even if they
> do, something like what you've done may mitigate enough vulnerable
> sequences that it's worth doing, even if there's some gaps due to "over"
> optimization in the RTL space.

Yeah.  Note I was just lazy and thus didn't elide useless loads/stores
of the TLS var for adjacent calls or avoid instrumenting cases
where there will be no uses of the mask, etc.  With some simple
(even non-LCM) insertion optimization the dependence on dce/dse
can be avoided.

Richard.

> [  ... ]
> 
> > 
> > so the generated GIMPLE was "tuned" for reasonable x86 assembler outcome.
> > 
> > Patch below for reference (and your own testing in case you are curious).
> > I do not plan to pursue this further at this point.
> Understood.  Thanks for posting it.  We're not currently working in this
> space, but again, we may re-evaluate that stance in the future.


Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Richard Biener
On Tue, 18 Dec 2018, Richard Earnshaw (lists) wrote:

> On 18/12/2018 15:36, Richard Biener wrote:
> > 
> > Hi,
> > 
> > in the past weeks I've been looking into prototyping both spectre V1 
> > (speculative array bound bypass) diagnostics and mitigation in an
> > architecture independent manner to assess feasibility and some kind
> > of upper bound on the performance impact one can expect.
> > https://lists.llvm.org/pipermail/llvm-dev/2018-March/122085.html is
> > an interesting read in this context as well.
> 
> Interesting, thanks for posting this.
> 
> > 
> > For simplicity I have implemented mitigation on GIMPLE right before
> > RTL expansion and have chosen TLS to do mitigation across function
> > boundaries.  Diagnostics sit in the same place but both are not in
> > any way dependent on each other.
> 
> We considered using TLS for propagating the state across call-boundaries
> on AArch64, but rejected it for several reasons.
> 
> - It's quite expensive to have to set up the TLS state in every function;
> - It requires some global code to initialize the state variable - that's
> kind of ABI;

The cost is probably target dependent - on x86 it's simply a $fs based
load/store.  For initialization a static initializer seemed to work
for me (but honestly I didn't do any testing besides running the
testsuite for correctness - so at least the mask wasn't zero initialized).
Note the LLVM people use an inverted mask and cancel values by
OR-ing -1 instead of AND-ing 0.  At least default zero-initialization
should be possible with TLS vars.

That said, my choice of TLS was to make this trivially work across
targets - if a target can do better then it should.  And of course
the target may not have any TLS support besides emultls which would
be prohibitively expensive.
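The two cancellation conventions contrasted above can be sketched side
by side (illustrative helpers, taken from neither implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Two equivalent ways of cancelling a value on a misspeculated path:
   - this prototype: the mask is all-ones when not speculating; values
     are AND-ed with it, so a zero mask wipes the value;
   - LLVM: the mask is inverted (zero when not speculating); values are
     OR-ed with it, so an all-ones mask saturates the value - and the
     mask variable's safe state is 0, which plain zero-initialized TLS
     provides by default.  */
static uintptr_t
cancel_and (uintptr_t v, uintptr_t mask)
{
  return v & mask;      /* mask == 0 while speculating => v cancelled */
}

static uintptr_t
cancel_or (uintptr_t v, uintptr_t inv_mask)
{
  return v | inv_mask;  /* inv_mask == -1 while speculating => v saturated */
}
```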

> - It also seems likely to be vulnerable to Spectre variant 4 - unless
> the CPU can always correctly store-to-load forward the speculation
> state, then you have the situation where the load may see an old value
> of the state - and that's almost certain to say "we're not speculating".
> 
> The last one is really the killer here.

Hmm, as far as I understood v4 only happens when store-forwarding
doesn't work.  And I hope it doesn't fail "randomly" but works
reliably when all accesses to the memory are aligned and have
the same size as is the case with these compiler-generated TLS
accesses.  But yes, if that's not guaranteed then using memory
doesn't work at all.  Not sure what else target independent there
is though that doesn't break the ABI like simply adding another
parameter.  And even adding a parameter might not work in case
there's only stack passing and V4 happens on the stack accesses...

> > 
> > The mitigation strategy chosen is that of tracking speculation
> > state via a mask that can be used to zero parts of the addresses
> > that leak the actual data.  That's similar to what aarch64 does
> > with -mtrack-speculation (but oddly there's no mitigation there).
> 
> We rely on the user inserting the new builtin, which we can more
> effectively optimize if the compiler is generating speculation state
> tracking data.  That doesn't preclude a full solution at a later date,
> but it looked like it was likely overkill for protecting every load and
> safely pruning the loads is not an easy problem to solve.  Of course,
> the builtin does require the programmer to do some work to identify
> which memory accesses might be vulnerable.
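The builtin-based approach quoted above (the AArch64 proposal, which
later landed as __builtin_speculation_safe_value in GCC 9) can be
sketched as follows; the guard lets the sketch compile on compilers
without the builtin, in which case it provides no protection:

```c
#include <assert.h>
#include <stddef.h>

/* Guarded use of the builtin so this sketch compiles everywhere;
   without the builtin it degrades to the plain (unprotected) value.  */
#ifndef __has_builtin
# define __has_builtin(x) 0
#endif
#if __has_builtin (__builtin_speculation_safe_value)
# define SPEC_SAFE(v, failval) __builtin_speculation_safe_value ((v), (failval))
#else
# define SPEC_SAFE(v, failval) (v)
#endif

/* The programmer marks the index that must not be used while
   misspeculating; on a misspeculated path it is forced to failval.  */
static int
load_guarded (const int *a, size_t idx, size_t len)
{
  if (idx < len)
    return a[SPEC_SAFE (idx, (size_t) 0)];
  return 0;
}
```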

My main question was how on earth the -mtrack-speculation overhead
is reasonable for the very few expected explicit builtin uses...

Richard.

> R.
> 
> 
> > 
> > I've optimized things to the point that is reasonable when working
> > target independent on GIMPLE but I've only looked at x86 assembly
> > and performance.  I expect any "final" mitigation if we choose to
> > implement and integrate such would be after RTL expansion since
> > RTL expansion can end up introducing quite some control flow whose
> > speculation state is not properly tracked by the prototype.
> > 
> > I'm cut&pasting single-runs of SPEC INT 2006/2017 here, the runs
> > were done with -O2 [-fspectre-v1={2,3}] where =2 is function-local
> > mitigation, and =3 does mitigation globally, passing the state
> > via TLS memory.
> > 
> > The following was measured on a Haswell desktop CPU:
> > 
> > -O2 vs. -O2 -fspectre-v1=2
> > 
> >                  Estimated                   Estimated
> >                Base   Base      Base       Peak   Peak      Peak
> > Benchmarks     Ref.   Run Time  Ratio      Ref.   Run Time  Ratio
> > -------------- -----  --------  -----      -----  --------  -----
> > 400.perlbench   9770    245     39.8 *      9770    452     21.6 *  184%
> > 401.bzip2       9650    378     25.5 *      9650    726     13.3 *  192%
> > 403.gcc         8050    236     34.2 *      8050    352     22.8 *  149%
> > 429.mcf         9120

Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Florian Weimer
* Richard Biener:

> The cost is probably target dependent - on x86 it's simply a $fs based
> load/store.

Do you need to reserve something in the TCB for this?

Thanks,
Florian


Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Richard Biener
On Wed, 19 Dec 2018, Florian Weimer wrote:

> * Richard Biener:
> 
> > The cost is probably target dependent - on x86 it's simply a $fs based
> > load/store.
> 
> Do you need to reserve something in the TCB for this?

No idea.  But I figured using TLS with the patch only works when
optimizing and not with -O0.  Huh.  Anyway, it should be equivalent
to what presence of

__thread void *_SV1MASK = (void *)-1l;

requires (plus I make that symbol GNU_UNIQUE).  I see this
allocates -1 in the .tdata section marked TLS.

Richard.

-- 
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)


For OpenACC libgomp entry points, redefine the "int device" argument to "unsigned int flags" (was: OpenACC 2.6 "host_data" construct, "if_present" clause)

2018-12-19 Thread Thomas Schwinge
Hi!

On Wed, 19 Dec 2018 00:24:00 +0100, I wrote:
> OpenACC 2.6 adds a new clause to the "host_data" construct:
> 2.8.3. "if_present clause".  Gergő (in CC) is working on that.
> 
> When an 'if_present' clause appears on the directive, the compiler
> will only change the address of any variable or array which appears
> in _var-list_ that is present on the current device.
> 
> So, basically:
> 
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -1130,13 +1130,17 @@ gomp_map_vars_async (struct gomp_device_descr *devicep,
>else if ((kind & typemask) == GOMP_MAP_USE_DEVICE_PTR)
>  {
>cur_node.host_start = (uintptr_t) hostaddrs[i];
>cur_node.host_end = cur_node.host_start;
>splay_tree_key n = gomp_map_lookup (mem_map, &cur_node);
>if (n == NULL)
>  {
> +  if ([...])
> +/* No error, continue using the host address.  */
> +continue;
>gomp_mutex_unlock (&devicep->lock);
>gomp_fatal ("use_device_ptr pointer wasn't mapped");
>  }
> 
> Note that this clause applies to *all* "use_device"
> ("GOMP_MAP_USE_DEVICE_PTR") clauses present on the "host_data" construct,
> so it's just a single bit flag for the construct.
> 
> Do you suggest we add a new mapping kind
> "GOMP_MAP_USE_DEVICE_PTR_IF_PRESENT" for that after all?  And, any preference about
> the specific value to use?  Gergő proposed:
> 
> --- a/include/gomp-constants.h
> +++ b/include/gomp-constants.h
> @@ -80,6 +80,10 @@ enum gomp_map_kind
>  GOMP_MAP_DEVICE_RESIDENT =   (GOMP_MAP_FLAG_SPECIAL_1 | 1),
>  /* OpenACC link.  */
>  GOMP_MAP_LINK =  (GOMP_MAP_FLAG_SPECIAL_1 | 2),
> +/* Like GOMP_MAP_USE_DEVICE_PTR below, translate a host to a device
> +   address.  If translation fails because the target is not mapped,
> +   continue using the host address. */
> +GOMP_MAP_USE_DEVICE_PTR_IF_PRESENT = (GOMP_MAP_FLAG_SPECIAL_1 | 3),
>  /* Allocate.  */
>  GOMP_MAP_FIRSTPRIVATE =  (GOMP_MAP_FLAG_SPECIAL | 0),
>  /* Similarly, but store the value in the pointer rather than
> 
> Or, I had the idea that we could avoid that, instead continue using
> "GOMP_MAP_USE_DEVICE_PTR", and transmit the "if_present" flag through the
> "int device" argument of "GOACC_data_start" (making sure that old
> executables continue to function as before).  For OpenACC, that argument
> is only ever set to "GOMP_DEVICE_ICV" or "GOMP_DEVICE_HOST_FALLBACK" (for
> "if" clause evaluating to "false"), so has some bits to spare for that.
> However, I've not been able to convince myself that this solution would
be much prettier than adding a new mapping kind...  ;-)

Having thought about it some more, the idea doesn't actually seem so bad
anymore.  :-) Just don't think of it as 'merging stuff into "int
device"', but rather 'for OpenACC libgomp entry points, redefine the "int
device" argument to "unsigned int flags"' -- see attached WIP (for GCC
trunk, testing).

Jakub, what do you think?


For the "if_present" clause, we'd then initialize "tree flags" not to
zero but to "omp_find_clause (OMP_CLAUSE_IF_PRESENT) ?
GOACC_FLAG_IF_PRESENT : 0" or similar, and then handle that in libgomp.


Grüße
 Thomas


From 1a45ffa3da47f1ec2e62badc31ae31454f4351d9 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge 
Date: Wed, 19 Dec 2018 13:37:00 +0100
Subject: [PATCH] For OpenACC libgomp entry points, redefine the "int device"
 argument to "unsigned int flags"

---
 gcc/builtin-types.def  | 10 +++---
 gcc/fortran/types.def  | 10 +++---
 gcc/omp-builtins.def   | 10 +++---
 gcc/omp-expand.c   | 68 +-
 gcc/tree-ssa-structalias.c |  4 +--
 include/gomp-constants.h   |  3 ++
 libgomp/libgomp_g.h| 12 +++
 libgomp/oacc-parallel.c| 53 -
 8 files changed, 133 insertions(+), 37 deletions(-)

diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def
index 685b22f975a1..f92580d5c649 100644
--- a/gcc/builtin-types.def
+++ b/gcc/builtin-types.def
@@ -684,6 +684,8 @@ DEF_FUNCTION_TYPE_5 (BT_FN_BOOL_VPTR_PTR_I16_INT_INT,
 		 BT_BOOL, BT_VOLATILE_PTR, BT_PTR, BT_I16, BT_INT, BT_INT)
 DEF_FUNCTION_TYPE_5 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR,
 		 BT_VOID, BT_INT, BT_SIZE, BT_PTR, BT_PTR, BT_PTR)
+DEF_FUNCTION_TYPE_5 (BT_FN_VOID_UINT_SIZE_PTR_PTR_PTR,
+		 BT_VOID, BT_UINT, BT_SIZE, BT_PTR, BT_PTR, BT_PTR)
 DEF_FUNCTION_TYPE_5 (BT_FN_VOID_OMPFN_PTR_UINT_UINT_UINT,
 		 BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR, BT_UINT, BT_UINT,
 		 BT_UINT)
@@ -822,12 +824,12 @@ DEF_FUNCTION_TYPE_VAR_5 (BT_FN_INT_STRING_SIZE_INT_SIZE_CONST_STRING_VAR,
 DEF_FUNCTION_TYPE_VAR_5 (BT_FN_INT_INT_INT_INT_INT_INT_VAR,
 			 BT_INT, BT_INT, BT_INT, BT_INT, BT_INT, BT_INT)
 
-DEF_F

Re: Optimization Option Question

2018-12-19 Thread David Brown
On 19/12/18 09:10, Tangnianyao (ICT) wrote:
> Greetings All,
> I am dealing with compile optimization comparison between arm64 and intel 
> platform, with g++ (version 4.9.4).
> 
> Compile the following c++ code,
> 
> uint32 Witness::getEntityVolatileDataUpdateFlags(Entity* otherEntity)
> {
>  uint32 flags = UPDATE_FLAG_NULL;
> 
> 
>  if (otherEntity->controlledBy() && pEntity_->id() == 
> otherEntity->controlledBy()->id())
>return flags;
> 
>  ...
> }
> 
> This code performs successive memory loads at the entry of the function, 
> where the latter load depends on the result of the former one.
> Compiling on an Intel x86-64 platform, we find that g++ places one load 
> instruction in front of the stack-push instructions of the function call, 
> which can save some time waiting for the former load to finish on an 
> out-of-order processor.  We use the optimization options -O1, 
> -fpartial-inlining, -fschedule-insns2, -ftree-pre.
> On the arm64 platform, we use the same optimization options to compile the 
> same code but see no similar result: no load instruction is placed before 
> the stack-push instructions.  Nor do we get that result using -O2, -O3, or 
> -Ofast on arm64.
> 
> Does g++ perform a similar optimization on arm64?
> If so, which optimization options should I use?
> 

I think this is the wrong list - you probably want the gcc-help list.

Have you tried using a more recent version of gcc?  AArch64 was new in
gcc 4.8 - major new architectures usually take a few versions before
the compiler generates well-optimised code for them.




Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Florian Weimer
* Richard Biener:

> On Wed, 19 Dec 2018, Florian Weimer wrote:
>
>> * Richard Biener:
>> 
>> > The cost is probably target dependent - on x86 it's simply a $fs based
>> > load/store.
>> 
>> Do you need to reserve something in the TCB for this?
>
> No idea.  But I figured using TLS with the patch only works when
> optimizing and not with -O0.  Huh.  Anyway, it should be equivalent
> to what presence of
>
> __thread void *_SV1MASK = (void *)-1l;
>
> requires (plus I make that symbol GNU_UNIQUE).  I see this
> allocates -1 in the .tdata section marked TLS.

Oh.  That's going to be substantially worse for PIC, even with the
initial-exec model, especially on architectures which do not have
arbitrary PC-relative loads.  Which is why I'm asking about the TCB
reservation.

Thanks,
Florian


Re: For OpenACC libgomp entry points, redefine the "int device" argument to "unsigned int flags" (was: OpenACC 2.6 "host_data" construct, "if_present" clause)

2018-12-19 Thread Jakub Jelinek
On Wed, Dec 19, 2018 at 01:46:06PM +0100, Thomas Schwinge wrote:
> > Or, I had the idea that we could avoid that, instead continue using
> > "GOMP_MAP_USE_DEVICE_PTR", and transmit the "if_present" flag through the
> > "int device" argument of "GOACC_data_start" (making sure that old
> > executables continue to function as before).  For OpenACC, that argument
> > is only ever set to "GOMP_DEVICE_ICV" or "GOMP_DEVICE_HOST_FALLBACK" (for
> > "if" clause evaluating to "false"), so has some bits to spare for that.
> > However, I've not been able to convince myself that this solution would
> > be much prettier than adding a new mapping kind...  ;-)
> 
> Having thought about it some more, the idea doesn't actually seem so bad
> anymore.  :-) Just don't think of it as 'merging stuff into "int
> device"', but rather 'for OpenACC libgomp entry points, redefine the "int
> device" argument to "unsigned int flags"' -- see attached WIP (for GCC
> trunk, testing).
> 
> Jakub, what do you think?

So, what values were you passing in before as the argument?  Just 0 or -1
or something similar and nothing else?  Just wondering if the change isn't
an ABI change.
In OpenMP we are passing a device number (from the device clause); or -1 if
no device clause was used and so the ICV should be checked; or -2 if it is
"if (false)" and therefore should always be host fallback.

Jakub


Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Richard Biener
On Wed, 19 Dec 2018, Florian Weimer wrote:

> * Richard Biener:
> 
> > On Wed, 19 Dec 2018, Florian Weimer wrote:
> >
> >> * Richard Biener:
> >> 
> >> > The cost is probably target dependent - on x86 it's simply a $fs based
> >> > load/store.
> >> 
> >> Do you need to reserve something in the TCB for this?
> >
> > No idea.  But I figured using TLS with the patch only works when
> > optimizing and not with -O0.  Huh.  Anyway, it should be equivalent
> > to what presence of
> >
> > __thread void *_SV1MASK = (void *)-1l;
> >
> > requires (plus I make that symbol GNU_UNIQUE).  I see this
> > allocates -1 in the .tdata section marked TLS.
> 
> Oh.  That's going to be substantially worse for PIC, even with the
> initial-exec model, especially on architectures which do not have
> arbitrary PC-relative loads.  Which is why I'm asking about the TCB
> reservation.

Sure, if we'd ever deploy this in production placing this in the
TCB for glibc targets might be beneficial.  But as said the
current implementation was just an experiment intended to be
maximum portable.  I suppose the dynamic loader takes care
of initializing the TCB data?

I would expect most targets to use tricks with the stack pointer
for passing around the mask in any case (using the MSB is something
that was suggested, for example).

Richard.

> Thanks,
> Florian
> 
> 



Re: For OpenACC libgomp entry points, redefine the "int device" argument to "unsigned int flags"

2018-12-19 Thread Thomas Schwinge
Hi Jakub!

On Wed, 19 Dec 2018 14:38:38 +0100, Jakub Jelinek  wrote:
> On Wed, Dec 19, 2018 at 01:46:06PM +0100, Thomas Schwinge wrote:
> > > Or, I had the idea that we could avoid that, instead continue using
> > > "GOMP_MAP_USE_DEVICE_PTR", and transmit the "if_present" flag through the
> > > "int device" argument of "GOACC_data_start" (making sure that old
> > > executables continue to function as before).  For OpenACC, that argument
> > > is only ever set to "GOMP_DEVICE_ICV" or "GOMP_DEVICE_HOST_FALLBACK" (for
> > > "if" clause evaluating to "false"), so has some bits to spare for that.
> > > However, I've not been able to convince myself that this solution would
> > > be much prettier than adding a new mapping kind...  ;-)
> > 
> > Having thought about it some more, the idea doesn't actually seem so bad
> > anymore.  :-) Just don't think of it as 'merging stuff into "int
> > device"', but rather 'for OpenACC libgomp entry points, redefine the "int
> > device" argument to "unsigned int flags"' -- see attached WIP (for GCC
> > trunk, testing).
> > 
> > Jakub, what do you think?
> 
> So, what values were you passing in before as the argument?  Just 0 or -1
> or something similar and nothing else?  Just wondering if the change isn't
> an ABI change.
> In OpenMP we are passing a device number (from the device clause); or -1 if
> no device clause was used and so the ICV should be checked; or -2 if it is
> "if (false)" and therefore should always be host fallback.

Right.  For OpenACC, there's no "device" clause, so we only ever passed
in "GOMP_DEVICE_ICV" (default), or "GOMP_DEVICE_HOST_FALLBACK" ("if
(false)" clause).  Therefore, the libgomp "resolve_legacy_flags" function
is added to make sure that these two values (as used by old executables)
continue to work as before (with new libgomp).  (And, we have to make
sure that no (new) "GOACC_FLAG_*" combination ever results in these
values; will document that.)
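A sketch of what such a legacy translation could look like; the legacy
values -1/-2 are the ones named in this thread, while the GOACC_FLAG_*
bit value is hypothetical and may differ from the actual patch:

```c
#include <assert.h>

/* Legacy values passed by old executables, per the discussion.  */
#define GOMP_DEVICE_ICV           -1
#define GOMP_DEVICE_HOST_FALLBACK -2

/* Hypothetical new flag bit; the value in the actual patch may differ.  */
#define GOACC_FLAG_HOST_FALLBACK  (1U << 0)

/* Sketch of resolve_legacy_flags: old executables still pass
   GOMP_DEVICE_ICV / GOMP_DEVICE_HOST_FALLBACK in the reinterpreted
   argument, so map those two values onto the new flags representation;
   anything else is already a new-style flags word.  */
static unsigned int
resolve_legacy_flags (int arg)
{
  if (arg == GOMP_DEVICE_ICV)
    return 0;                       /* default: no flags set */
  if (arg == GOMP_DEVICE_HOST_FALLBACK)
    return GOACC_FLAG_HOST_FALLBACK;
  return (unsigned int) arg;
}
```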

I'm currently doing a verification run of the libgomp testsuite (had one
detail botched up in the patch that I sent).

And just to make sure: as recently discussed in a different thread, we
don't have to support the case of new executables being dynamically
linked against old libgomp versions (where the latter wouldn't understand
the new "flags" value "GOACC_FLAG_HOST_FALLBACK" and would ignore
that flag).


Grüße
 Thomas


Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Florian Weimer
* Richard Biener:

> Sure, if we'd ever deploy this in production placing this in the
> TCB for glibc targets might be beneficial.  But as said the
> current implementation was just an experiment intended to be
> maximum portable.  I suppose the dynamic loader takes care
> of initializing the TCB data?

Yes, the dynamic linker will initialize it.  If you need 100% reliable
initialization with something that is not zero, it's going to be tricky
though.  Initial-exec TLS memory has this covered, but in the TCB, we
only have zeroed-out reservations today.

Thanks,
Florian


Re: For OpenACC libgomp entry points, redefine the "int device" argument to "unsigned int flags"

2018-12-19 Thread Jakub Jelinek
On Wed, Dec 19, 2018 at 02:59:54PM +0100, Thomas Schwinge wrote:
> Right.  For OpenACC, there's no "device" clause, so we only ever passed
> in "GOMP_DEVICE_ICV" (default), or "GOMP_DEVICE_HOST_FALLBACK" ("if
> (false)" clause).  Therefore, the libgomp "resolve_legacy_flags" function
> is added to make sure that these two values (as used by old executables)
> continue to work as before (with new libgomp).  (And, we have to make
> sure that no (new) "GOACC_FLAG_*" combination ever results in these
> values; will document that.)
> 
> I'm currently doing a verification run of the libgomp testsuite (had one
> detail botched up in the patch that I sent).
> 
> And just to make sure: as recently discussed in a different thread, we
> don't have to support the case of new executables being dynamically
> linked against old libgomp versions (where the latter wouldn't understand
> the new "flags" value "GOACC_FLAG_HOST_FALLBACK" and would ignore
> that flag).

LGTM then in principle.

Jakub


Re: For OpenACC libgomp entry points, redefine the "int device" argument to "unsigned int flags"

2018-12-19 Thread Jakub Jelinek
On Wed, Dec 19, 2018 at 03:03:42PM +0100, Jakub Jelinek wrote:
> On Wed, Dec 19, 2018 at 02:59:54PM +0100, Thomas Schwinge wrote:
> > Right.  For OpenACC, there's no "device" clause, so we only ever passed
> > in "GOMP_DEVICE_ICV" (default), or "GOMP_DEVICE_HOST_FALLBACK" ("if
> > (false)" clause).  Therefore, the libgomp "resolve_legacy_flags" function
> > is added to make sure that these two values (as used by old executables)
> > continue to work as before (with new libgomp).  (And, we have to make
> > sure that no (new) "GOACC_FLAG_*" combination ever results in these
> > values; will document that.)
> > 
> > I'm currently doing a verification run of the libgomp testsuite (had one
> > detail botched up in the patch that I sent).
> > 
> > And just to make sure: as recently discussed in a different thread, we
> > don't have to support the case of new executables being dynamically
> > linked against old libgomp versions (where the latter wouldn't understand
> > the new "flags" value "GOACC_FLAG_HOST_FALLBACK" and would ignore
> > that flag).
> 
> LGTM then in principle.

Or keep it int and use an inverted bitmask, so that when a bit is 1 it
represents the default state, and when it is 0 it is something different.
Since before you only ever passed -1 and -2, and we only support two's
complement, the host fallback test would be (flags & 1) == 0.
Then you don't need to transform from legacy to non-legacy at runtime.
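Sketched out, this inverted-bitmask scheme reduces the host-fallback
check to a single bit test, with no legacy translation (illustrative
only):

```c
#include <assert.h>

/* On two's complement, the legacy values -1 (GOMP_DEVICE_ICV, all bits
   set) and -2 (GOMP_DEVICE_HOST_FALLBACK, all bits set except bit 0)
   differ only in bit 0.  With the convention "bit set = default state",
   old executables' values are valid flags words as-is, and the host
   fallback test needs no runtime legacy translation at all.  */
static int
is_host_fallback (int flags)
{
  return (flags & 1) == 0;  /* bit 0 clear => host fallback */
}
```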

Jakub


Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Peter Bergner
On 12/19/18 7:59 AM, Florian Weimer wrote:
> * Richard Biener:
> 
>> Sure, if we'd ever deploy this in production placing this in the
>> TCB for glibc targets might be beneficial.  But as said the
>> current implementation was just an experiment intended to be
>> maximum portable.  I suppose the dynamic loader takes care
>> of initializing the TCB data?
> 
> Yes, the dynamic linker will initialize it.  If you need 100% reliable
> initialization with something that is not zero, it's going to be tricky
> though.  Initial-exec TLS memory has this covered, but in the TCB, we
> only have zeroed-out reservations today.

We have non-zero initialized TCB entries on powerpc*-linux which are used
for the GCC __builtin_cpu_is() and __builtin_cpu_supports() builtin
functions.  Tulio would know the magic that was used to get them set up.

Peter





Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Richard Earnshaw (lists)
On 19/12/2018 11:25, Richard Biener wrote:
> On Tue, 18 Dec 2018, Richard Earnshaw (lists) wrote:
> 
>> On 18/12/2018 15:36, Richard Biener wrote:
>>>
>>> Hi,
>>>
>>> in the past weeks I've been looking into prototyping both spectre V1 
>>> (speculative array bound bypass) diagnostics and mitigation in an
>>> architecture independent manner to assess feasibility and some kind
>>> of upper bound on the performance impact one can expect.
>>> https://lists.llvm.org/pipermail/llvm-dev/2018-March/122085.html is
>>> an interesting read in this context as well.
>>
>> Interesting, thanks for posting this.
>>
>>>
>>> For simplicity I have implemented mitigation on GIMPLE right before
>>> RTL expansion and have chosen TLS to do mitigation across function
>>> boundaries.  Diagnostics sit in the same place but both are not in
>>> any way dependent on each other.
>>
>> We considered using TLS for propagating the state across call-boundaries
>> on AArch64, but rejected it for several reasons.
>>
>> - It's quite expensive to have to set up the TLS state in every function;
>> - It requires some global code to initialize the state variable - that's
>> kind of ABI;
> 
> The cost is probably target dependent - on x86 it's simply a $fs based
> load/store.  For initialization a static initializer seemed to work
> for me (but honestly I didn't do any testing besides running the
> testsuite for correctness - so at least the mask wasn't zero initialized).
> Note the LLVM people use an inverted mask and cancel values by
> OR-ing -1 instead of AND-ing 0.  At least default zero-initialization
> should be possible with TLS vars.
> 
> That said, my choice of TLS was to make this trivially work across
> targets - if a target can do better then it should.  And of course
> the target may not have any TLS support besides emultls which would
> be prohibitively expensive.
> 
>> - It also seems likely to be vulnerable to Spectre variant 4 - unless
>> the CPU can always correctly store-to-load forward the speculation
>> state, then you have the situation where the load may see an old value
>> of the state - and that's almost certain to say "we're not speculating".
>>
>> The last one is really the killer here.
> 
> Hmm, as far as I understood v4 only happens when store-forwarding
> doesn't work.  And I hope it doesn't fail "randomly" but works
> reliably when all accesses to the memory are aligned and have
> the same size as is the case with these compiler-generated TLS
> accesses.  But yes, if that's not guaranteed then using memory
> doesn't work at all.  

The problem is that you can't prove this through realistic testing.
Architecturally, the result has to come out the same in the end in that
if the load does bypass the store, eventually the hardware has to replay
the instruction with the correct data and cancel any operations that
were dependent on the earlier execution.  Only side-channel data will be
left after that.

> Not sure what else target independent there
> is though that doesn't break the ABI like simply adding another
> parameter.  And even adding a parameter might not work in case
> there's only stack passing and V4 happens on the stack accesses...

Yep, exactly.

> 
>>>
>>> The mitigation strategy chosen is that of tracking speculation
>>> state via a mask that can be used to zero parts of the addresses
>>> that leak the actual data.  That's similar to what aarch64 does
>>> with -mtrack-speculation (but oddly there's no mitigation there).
>>
>> We rely on the user inserting the new builtin, which we can more
>> effectively optimize if the compiler is generating speculation state
>> tracking data.  That doesn't preclude a full solution at a later date,
>> but it looked like it was likely overkill for protecting every load and
>> safely pruning the loads is not an easy problem to solve.  Of course,
>> the builtin does require the programmer to do some work to identify
>> which memory accesses might be vulnerable.
> 
> My main question was how on earth the -mtrack-speculation overhead
> is reasonable for the very few expected explicit builtin uses...

Ultimately that will depend on what the user wants and the level of
protection needed.  The builtin gives the choice: get a hard barrier if
tracking has not been enabled, with a very high hit at the point of
execution; or take a much lower hit at that point if tracking has been
enabled.  That's a trade-off between how often you hit the barrier vs
how much you hit the tracking events to no benefit.
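The value-clamping idea behind a user-inserted speculation builtin can be sketched portably in C (this is a conceptual sketch, not the actual builtin lowering):

```c
#include <stdint.h>
#include <stddef.h>

/* Derive an all-ones/zero mask from the bound check and AND it into
   the index, so a load executed under a mispredicted bound check
   sees index 0 instead of an attacker-controlled value.  */
static inline size_t clamp_index(size_t i, size_t len)
{
    size_t in_bounds = (size_t)(i < len);   /* 1 or 0 */
    size_t mask = (size_t)0 - in_bounds;    /* all-ones iff in bounds */
    return i & mask;
}

int array_read(const int *a, size_t i, size_t len)
{
    if (i < len)
        return a[clamp_index(i, len)];
    return -1;
}
```

Written naively like this, an optimizer that proves `i < len` on the guarded path can fold the mask to all-ones, which is exactly the failure mode Earnshaw reports against -fspectre-v1=2; a real implementation needs either a hard barrier or compiler-maintained tracking state the optimizers cannot see through.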

Your code, however, doesn't work at present.  This example shows that
the mitigation code is just optimized away by the rtl passes, at least
for -fspectre-v1=2.

int f (int a, int b, int c, char *d)
{
  if (a > 10)
return 0;

  if (b > 64)
return 0;

  if (c > 96)
return 0;

  return d[a] + d[b] + d[c];
}

It works ok at level 3 because then the compiler can't prove the logical
truth of the speculation variable on the path from TLS memory and that's
sufficient to defeat the optimization.

Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Florian Weimer
* Peter Bergner:

> On 12/19/18 7:59 AM, Florian Weimer wrote:
>> * Richard Biener:
>> 
>>> Sure, if we'd ever deploy this in production placing this in the
>>> TCB for glibc targets might be beneficial.  But as said the
>>> current implementation was just an experiment intended to be
>>> maximum portable.  I suppose the dynamic loader takes care
>>> of initializing the TCB data?
>> 
>> Yes, the dynamic linker will initialize it.  If you need 100% reliable
>> initialization with something that is not zero, it's going to be tricky
>> though.  Initial-exec TLS memory has this covered, but in the TCB, we
>> only have zeroed-out reservations today.
>
> We have non-zero initialized TCB entries on powerpc*-linux which are used
> for the GCC __builtin_cpu_is() and __builtin_cpu_supports() builtin
> functions.  Tulio would know the magic that was used to get them setup.

Yes, there's a special symbol, __parse_hwcap_and_convert_at_platform, to
verify that the dynamic linker sets up the TCB as required.  This way,
binaries which need the feature will fail to run on older loaders.  This
is why I said it's a bit tricky to implement this.  It's even more
complicated if you want to backport this into released glibcs, where we
normally do not accept ABI changes (not even ABI additions).

Thanks,
Florian


Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Richard Biener
On Wed, 19 Dec 2018, Florian Weimer wrote:

> * Peter Bergner:
> 
> > On 12/19/18 7:59 AM, Florian Weimer wrote:
> >> * Richard Biener:
> >> 
> >>> Sure, if we'd ever deploy this in production placing this in the
> >>> TCB for glibc targets might be beneifical.  But as said the
> >>> current implementation was just an experiment intended to be
> >>> maximum portable.  I suppose the dynamic loader takes care
> >>> of initializing the TCB data?
> >> 
> >> Yes, the dynamic linker will initialize it.  If you need 100% reliable
> >> initialization with something that is not zero, it's going to be tricky
> >> though.  Initial-exec TLS memory has this covered, but in the TCB, we
> >> only have zeroed-out reservations today.
> >
> > We have non-zero initialized TCB entries on powerpc*-linux which are used
> > for the GCC __builtin_cpu_is() and __builtin_cpu_supports() builtin
> > functions.  Tulio would know the magic that was used to get them setup.
> 
> Yes, there's a special symbol, __parse_hwcap_and_convert_at_platform, to
> verify that the dynamic linker sets up the TCB as required.  This way,
> binaries which need the feature will fail to run on older loaders.  This
> is why I said it's a bit tricky to implement this.  It's even more
> complicated if you want to backport this into released glibcs, where we
> normally do not accept ABI changes (not even ABI additions).

It's easy to change the mitigation scheme to use a zero for the
non-speculated path, you'd simply replace ands with zero by
ors with -1.  For address parts that gets you some possible overflows
you do not want though.

Richard.
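The zero-initializable variant Richard describes (the LLVM-style inverted mask) can be sketched as follows, with hypothetical names:

```c
#include <stdint.h>

/* Inverted-mask variant: the per-thread state is 0 when not
   speculating, so plain zero-initialized TLS suffices, and all-ones
   under misspeculation.  Data values are poisoned by OR-ing the
   state in.  */
_Thread_local uintptr_t misspec = 0;

static inline uintptr_t poison_value(uintptr_t v)
{
    return v | misspec;   /* v normally, all-ones under misspeculation */
}
```

The overflow caveat is visible here: OR-ing all-ones into an address component produces a wild address rather than the harmless zero the AND scheme gives.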


Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Richard Biener
On Wed, 19 Dec 2018, Richard Earnshaw (lists) wrote:

> On 19/12/2018 11:25, Richard Biener wrote:
> > On Tue, 18 Dec 2018, Richard Earnshaw (lists) wrote:
> > 
> >> On 18/12/2018 15:36, Richard Biener wrote:
> >>>
> >>> Hi,
> >>>
> >>> in the past weeks I've been looking into prototyping both spectre V1 
> >>> (speculative array bound bypass) diagnostics and mitigation in an
> >>> architecture independent manner to assess feasibility and some kind
> >>> of upper bound on the performance impact one can expect.
> >>> https://lists.llvm.org/pipermail/llvm-dev/2018-March/122085.html is
> >>> an interesting read in this context as well.
> >>
> >> Interesting, thanks for posting this.
> >>
> >>>
> >>> For simplicity I have implemented mitigation on GIMPLE right before
> >>> RTL expansion and have chosen TLS to do mitigation across function
> >>> boundaries.  Diagnostics sit in the same place but both are not in
> >>> any way dependent on each other.
> >>
> >> We considered using TLS for propagating the state across call-boundaries
> >> on AArch64, but rejected it for several reasons.
> >>
> >> - It's quite expensive to have to set up the TLS state in every function;
> >> - It requires some global code to initialize the state variable - that's
> >> kind of ABI;
> > 
> > The cost is probably target dependent - on x86 it's simply a $fs based
> > load/store.  For initialization a static initializer seemed to work
> > for me (but honestly I didn't do any testing besides running the
> > testsuite for correctness - so at least the mask wasn't zero initialized).
> > Note the LLVM people use an inverted mask and cancel values by
> > OR-ing -1 instead of AND-ing 0.  At least default zero-initialization
> > should be possible with TLS vars.
> > 
> > That said, my choice of TLS was to make this trivially work across
> > targets - if a target can do better then it should.  And of course
> > the target may not have any TLS support besides emultls which would
> > be prohibitively expensive.
> > 
> >> - It also seems likely to be vulnerable to Spectre variant 4 - unless
> >> the CPU can always correctly store-to-load forward the speculation
> >> state, then you have the situation where the load may see an old value
> >> of the state - and that's almost certain to say "we're not speculating".
> >>
> >> The last one is really the killer here.
> > 
> > Hmm, as far as I understood v4 only happens when store-forwarding
> > doesn't work.  And I hope it doesn't fail "randomly" but works
> > reliably when all accesses to the memory are aligned and have
> > the same size as is the case with these compiler-generated TLS
> > accesses.  But yes, if that's not guaranteed then using memory
> > doesn't work at all.  
> 
> The problem is that you can't prove this through realistic testing.
> Architecturally, the result has to come out the same in the end in that
> if the load does bypass the store, eventually the hardware has to replay
> the instruction with the correct data and cancel any operations that
> were dependent on the earlier execution.  Only side-channel data will be
> left after that.
> 
> > Not sure what else target independent there
> > is though that doesn't break the ABI like simply adding another
> > parameter.  And even adding a parameter might not work in case
> > there's only stack passing and V4 happens on the stack accesses...
> 
> Yep, exactly.
> 
> > 
> >>>
> >>> The mitigation strategy chosen is that of tracking speculation
> >>> state via a mask that can be used to zero parts of the addresses
> >>> that leak the actual data.  That's similar to what aarch64 does
> >>> with -mtrack-speculation (but oddly there's no mitigation there).
> >>
> >> We rely on the user inserting the new builtin, which we can more
> >> effectively optimize if the compiler is generating speculation state
> >> tracking data.  That doesn't preclude a full solution at a later date,
> >> but it looked like it was likely overkill for protecting every load and
> >> safely pruning the loads is not an easy problem to solve.  Of course,
> >> the builtin does require the programmer to do some work to identify
> >> which memory accesses might be vulnerable.
> > 
> > My main question was how on earth the -mtrack-speculation overhead
> > is reasonable for the very few expected explicit builtin uses...
> 
> Ultimately that will depend on what the user wants and the level of
> protection needed.  The builtin gives the choice: get a hard barrier if
> tracking has not been enabled, with a very high hit at the point of
> execution; or take a much lower hit at that point if tracking has been
> enabled.  That's a trade-off between how often you hit the barrier vs
> how much you hit the tracking events to no benefit.
> 
> Your code, however, doesn't work at present.  This example shows that
> the mitigation code is just optimized away by the rtl passes, at least
> for -fspectre-v1=2.
> 
> int f (int a, int b, int c, char *d)
> {
>   if (a > 10)
> return 0;
> 
> 

Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Richard Earnshaw (lists)
On 19/12/2018 17:17, Richard Biener wrote:
> On Wed, 19 Dec 2018, Florian Weimer wrote:
> 
>> * Peter Bergner:
>>
>>> On 12/19/18 7:59 AM, Florian Weimer wrote:
 * Richard Biener:

> Sure, if we'd ever deploy this in production placing this in the
> TCB for glibc targets might be beneficial.  But as said the
> current implementation was just an experiment intended to be
> maximum portable.  I suppose the dynamic loader takes care
> of initializing the TCB data?

 Yes, the dynamic linker will initialize it.  If you need 100% reliable
 initialization with something that is not zero, it's going to be tricky
 though.  Initial-exec TLS memory has this covered, but in the TCB, we
 only have zeroed-out reservations today.
>>>
>>> We have non-zero initialized TCB entries on powerpc*-linux which are used
>>> for the GCC __builtin_cpu_is() and __builtin_cpu_supports() builtin
>>> functions.  Tulio would know the magic that was used to get them setup.
>>
>> Yes, there's a special symbol, __parse_hwcap_and_convert_at_platform, to
>> verify that the dynamic linker sets up the TCB as required.  This way,
>> binaries which need the feature will fail to run on older loaders.  This
>> is why I said it's a bit tricky to implement this.  It's even more
>> complicated if you want to backport this into released glibcs, where we
>> normally do not accept ABI changes (not even ABI additions).
> 
> It's easy to change the mitigation scheme to use a zero for the
> non-speculated path, you'd simply replace ands with zero by
> ors with -1.  For address parts that gets you some possible overflows
> you do not want though.

And you have to invert the value before using it as a mask.

R.

> 
> Richard.
> 



Re: Spectre V1 diagnostic / mitigation

2018-12-19 Thread Richard Biener
On Wed, 19 Dec 2018, Richard Earnshaw (lists) wrote:

> On 19/12/2018 17:17, Richard Biener wrote:
> > On Wed, 19 Dec 2018, Florian Weimer wrote:
> > 
> >> * Peter Bergner:
> >>
> >>> On 12/19/18 7:59 AM, Florian Weimer wrote:
>  * Richard Biener:
> 
> > Sure, if we'd ever deploy this in production placing this in the
> > TCB for glibc targets might be beneficial.  But as said the
> > current implementation was just an experiment intended to be
> > maximum portable.  I suppose the dynamic loader takes care
> > of initializing the TCB data?
> 
>  Yes, the dynamic linker will initialize it.  If you need 100% reliable
>  initialization with something that is not zero, it's going to be tricky
>  though.  Initial-exec TLS memory has this covered, but in the TCB, we
>  only have zeroed-out reservations today.
> >>>
> >>> We have non-zero initialized TCB entries on powerpc*-linux which are used
> >>> for the GCC __builtin_cpu_is() and __builtin_cpu_supports() builtin
> >>> functions.  Tulio would know the magic that was used to get them setup.
> >>
> >> Yes, there's a special symbol, __parse_hwcap_and_convert_at_platform, to
> >> verify that the dynamic linker sets up the TCB as required.  This way,
> >> binaries which need the feature will fail to run on older loaders.  This
> >> is why I said it's a bit tricky to implement this.  It's even more
> >> complicated if you want to backport this into released glibcs, where we
> >> normally do not accept ABI changes (not even ABI additions).
> > 
> > It's easy to change the mitigation scheme to use a zero for the
> > non-speculated path, you'd simply replace ands with zero by
> > ors with -1.  For address parts that gets you some possible overflows
> > you do not want though.
> 
> And you have to invert the value before using it as a mask.

For the cases where an OR doesn't work, yes.  Fortunately most targets
have an and-not instruction.  (x86 doesn't unless you have BMI)

Richard.
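Used for address masking, the inverted state needs an invert-and-AND, which a single and-not instruction (AArch64 BIC, x86 ANDN with BMI) covers:

```c
#include <stdint.h>

/* With the inverted mask (0 = not speculating, -1 = misspeculating),
   masking an address part takes an and-not: one instruction on
   targets with BIC/ANDN, a NOT plus an AND elsewhere.  */
static inline uintptr_t mask_address(uintptr_t addr, uintptr_t misspec)
{
    return addr & ~misspec;   /* addr normally, 0 under misspeculation */
}
```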


For libgomp OpenACC entry points, redefine the "device" argument to "flags"

2018-12-19 Thread Thomas Schwinge
Hi Jakub!

On Wed, 19 Dec 2018 15:18:12 +0100, Jakub Jelinek  wrote:
> On Wed, Dec 19, 2018 at 03:03:42PM +0100, Jakub Jelinek wrote:
> > On Wed, Dec 19, 2018 at 02:59:54PM +0100, Thomas Schwinge wrote:
> > > Right.  For OpenACC, there's no "device" clause, so we only ever passed
> > > in "GOMP_DEVICE_ICV" (default), or "GOMP_DEVICE_HOST_FALLBACK" ("if
> > > (false)" clause).  Therefore, the libgomp "resolve_legacy_flags" function
> > > was added to make sure that these two values (as used by old executables)
> > > continue to work as before (with new libgomp).  (And, we have to make
> > > sure that no (new) "GOACC_FLAG_*" combination ever results in these
> > > values; will document that.)

> > LGTM then in principle.
> 
> Or keep it int and use inverted bitmask, thus when bit is 1, it represents
> the default state and when bit is 0, it is something different from it.

Ha, I too had that idea after thinking some more about the "-1" and "-2"
values/representations...  :-)

> If you passed before just -1 and -2 and because we are only supporting two's
> complement, the host fallback test would be (flags & 1) == 0.

I structured that a bit more conveniently.  That's especially useful once
additional flags are added, where you want to just do "flags |= [flag]", etc.

> Then you don't need to at runtime transform from legacy to non-legacy.

Right.
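Jakub's two's-complement observation can be checked in a few lines (a sketch only; the final flag names and encoding in the committed patch may differ):

```c
/* Legacy callers passed device = -1 (GOMP_DEVICE_ICV, the default)
   or -2 (host fallback).  In two's complement, -1 is all-ones and
   -2 is all-ones with bit 0 clear, so with an inverted bitmask in
   which a set bit 0 means "default, no host fallback", one test
   covers both legacy and new encodings.  */
static int host_fallback_p(int flags)
{
    return (flags & 1) == 0;
}
```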

Is the attached OK for trunk?  If approving this patch, please respond
with "Reviewed-by: NAME " so that your effort will be recorded in
the commit log, see .

For your review convenience, here's the "gcc/omp-expand.c" changes with
"--ignore-space-change" (as I slightly restructured OpenACC vs. OpenMP
code paths):

@@ -7536,49 +7536,62 @@ expand_omp_target (struct omp_region *region)
 
   clauses = gimple_omp_target_clauses (entry_stmt);
 
-  /* By default, the value of DEVICE is GOMP_DEVICE_ICV (let runtime
- library choose) and there is no conditional.  */
-  cond = NULL_TREE;
-  device = build_int_cst (integer_type_node, GOMP_DEVICE_ICV);
-
-  c = omp_find_clause (clauses, OMP_CLAUSE_IF);
-  if (c)
-cond = OMP_CLAUSE_IF_EXPR (c);
-
+  device = NULL_TREE;
+  tree goacc_flags = NULL_TREE;
+  if (is_gimple_omp_oacc (entry_stmt))
+{
+  /* By default, no GOACC_FLAGs are set.  */
+  goacc_flags = integer_zero_node;
+}
+  else
+{
   c = omp_find_clause (clauses, OMP_CLAUSE_DEVICE);
   if (c)
{
-  /* Even if we pass it to all library function calls, it is currently only
-	 defined/used for the OpenMP target ones.  */
-  gcc_checking_assert (start_ix == BUILT_IN_GOMP_TARGET
-  || start_ix == BUILT_IN_GOMP_TARGET_DATA
-  || start_ix == BUILT_IN_GOMP_TARGET_UPDATE
-  || start_ix == BUILT_IN_GOMP_TARGET_ENTER_EXIT_DATA);
-
  device = OMP_CLAUSE_DEVICE_ID (c);
  clause_loc = OMP_CLAUSE_LOCATION (c);
}
   else
+   {
+ /* By default, the value of DEVICE is GOMP_DEVICE_ICV (let runtime
+library choose).  */
+ device = build_int_cst (integer_type_node, GOMP_DEVICE_ICV);
  clause_loc = gimple_location (entry_stmt);
+   }
 
   c = omp_find_clause (clauses, OMP_CLAUSE_NOWAIT);
   if (c)
flags_i |= GOMP_TARGET_FLAG_NOWAIT;
+}
 
+  /* By default, there is no conditional.  */
+  cond = NULL_TREE;
+  c = omp_find_clause (clauses, OMP_CLAUSE_IF);
+  if (c)
+cond = OMP_CLAUSE_IF_EXPR (c);
+  /* If we found the clause 'if (cond)', build:
+ OpenACC: goacc_flags = (cond ? goacc_flags : flags | GOACC_FLAG_HOST_FALLBACK)
+ OpenMP: device = (cond ? device : GOMP_DEVICE_HOST_FALLBACK) */
+  if (cond)
+{
+  tree *tp;
+  if (is_gimple_omp_oacc (entry_stmt))
+   tp = &goacc_flags;
+  else
+   {
  /* Ensure 'device' is of the correct type.  */
  device = fold_convert_loc (clause_loc, integer_type_node, device);
 
-  /* If we found the clause 'if (cond)', build
- (cond ? device : GOMP_DEVICE_HOST_FALLBACK).  */
-  if (cond)
-{
+ tp = &device;
+   }
+
   cond = gimple_boolify (cond);
 
   basic_block cond_bb, then_bb, else_bb;
   edge e;
   tree tmp_var;
 
-  tmp_var = create_tmp_var (TREE_TYPE (device));
+  tmp_var = create_tmp_var (TREE_TYPE (*tp));
   if (offloaded)
e = split_block_after_labels (new_bb);
   else
@@ -7601,10 +7614,17 @@ expand_omp_target (struct omp_region *region)
   gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
 
   gsi = gsi_start_bb (then_bb);
-  stmt = gimple_build_assign (tmp_var, device);
+  stmt = gimple_build_assign (tmp_var