Re: %fs and %gs segments on x86/x86-64

2015-07-03 Thread Richard Biener
On Thu, Jul 2, 2015 at 5:57 PM, Armin Rigo  wrote:
> Hi all,
>
> I implemented support for %fs and %gs segment prefixes on the x86 and
> x86-64 platforms, in what turns out to be a small patch.
>
> For those not familiar with them, at least on x86-64, %fs and %gs are
> two special registers whose contents a user program can ask to be
> added to the address in any memory-accessing machine instruction.
> This is done with a one-byte instruction prefix, "%fs:" or "%gs:".
> The actual value stored in these two registers cannot be modified
> quickly (at least before the Haswell CPU), but the general idea is
> that they are rarely modified.
> Speed-wise, though, an instruction like "movq %gs:(%rdx), %rax" runs
> at the same speed as a "movq (%rdx), %rax" would.  (I failed to
> measure any difference, but I guess that the instruction is one more
> byte in length, which means that a large quantity of them would tax
> the instruction caches a bit more.)
>
> For reference, the pthread library on x86-64 uses %fs to point to
> thread-local variables.  There are a number of special modes in gcc to
> already produce instructions like "movq %fs:(16), %rax" to load
> thread-local variables (declared with __thread).  However, this
> support is special-case only.  The %gs register is free to use.  (On
> x86, %gs is used by pthread and %fs is free to use.)
>
>
> So what I did is to add the __seg_fs and __seg_gs address spaces.  They
> are used like this, for example:
>
> typedef __seg_gs struct myobject_s {
> int a, b, c;
> } myobject_t;
>
> You can then use variables of type "struct myobject_s *o1" as regular
> pointers, and "myobject_t *o2" as %gs-based pointers.  Accesses to
> "o2->a" are compiled to instructions that use the %gs prefix; accesses
> to "o1->a" are compiled as usual.  These two pointer types are
> incompatible.  The way you obtain %gs-based pointers, or control the
> value of %gs itself, is out of the scope of gcc; you do that by using
> the correct system calls and by manual arithmetic.  There is no
> automatic conversion; the C code can contain casts between the three
> address spaces (regular, %fs and %gs) which, like regular pointer
> casts, are no-ops.
>
>
> My motivation comes from the PyPy-STM project ("removing the Global
> Interpreter Lock" for this Python interpreter).  In this project, I
> want *almost all* pointer manipulations to resolve to different
> addresses depending on which thread runs the code.  The idea is to use
> mmap() tricks to ensure that the actual memory usage remains
> reasonable, by sharing most of the pages (but not all of them) between
> each thread's "segment".  So most accesses to a %gs-prefixed address
> actually access the same physical memory in all threads; but not all
> of them.  This gives me a dynamic way to have a large quantity of data
> which every thread can read, and by changing occasionally the mapping
> of a single page, I can make some changes be thread-local, i.e.
> invisible to other threads.
>
> Of course, the same effect can be achieved in other ways, like
> declaring a regular "__thread intptr_t base;" and adding the "base"
> explicitly to every pointer access.  Clearly, this would have a large
> performance impact.  The %gs solution comes at almost no cost.  The
> patched gcc is able to compile the hundreds of MBs of (generated) C
> code with systematic %gs usage and seems to work well (with one
> exception, see below).
>
>
> Is there interest in that?  And if so, how to progress?

It's nice to have the ability to test address-space issues on a
commonly available target at least (not sure if adding runtime
testcases is easy though).

> * The patch included here is very minimal.  It is against the
> gcc_5_1_0_release branch but adapting it to "trunk" should be
> straightforward.
>
> * I'm unclear if target_default_pointer_address_modes_p() should
> return "true" or not in this situation: i386-c.c now defines more than
> the default address mode, but the new ones also use pointers of the
> same standard size.
>
> * One case in which this patched gcc miscompiles code is found in the
> attached bug1.c/bug1.s.  (This case almost never occurs in PyPy-STM,
> so I could work around it easily.)  I think that some early, pre-RTL
> optimization is to "blame" here, possibly getting confused because the
> nonstandard address spaces also use the same size for pointers.  Of
> course it is also possible that I messed up somewhere, or that the
> whole idea is doomed because many optimizations make a similar
> assumption.  Hopefully not: it is the only issue I encountered.

Hmm, without being able to dive into it with a debugger it's hard to tell ;)
You might want to open a bugreport in bugzilla for this at least.

> * The extra byte needed for the "%gs:" prefix is not explicitly
> accounted for.  Is it only by chance that I did not observe gcc
> underestimating how large the code it writes is, and then e.g. use
> jump instructions that would be rejected by the assembler?

Yes, I think you are j

Re: making the new if-converter not mangle IR that is already vectorizer-friendly

2015-07-03 Thread Alan Lawrence

Abe wrote:


In other words, the problem I was concerned about is not going to be triggered by, 
e.g., "if (c)  x = ..."
without an attached "else  x = ..." merely because 'x' is global/static in a 
multithreaded program without enough locking.

The only remaining case to consider is if some code being compiled takes the address of 
something thread-local and then "gives"
that pointer to another thread.  Even for _that_ extreme case, Sebastian says 
that the gimplifier will detect this
"address has been taken" situation and do the right thing, such that the new 
if-converter also does the right thing.


Great :). I don't understand much/anything about how gcc deals with 
thread-locals, but everything before that all sounds good...



[Alan wrote:]


Can you give an example?


The test cases in the GCC tree at "gcc.dg/vect/pr61194.c" and 
"gcc.dg/vect/vect-mask-load-1.c"
currently show the new if-converter "converting" something that's 
already vectorizer-friendly...

> [snip]

However, TTBOMK the vectorizer already "understands" that in cases where its 
input looks like:

   x = c ? y : z;

... and 'y' and 'z' are both pure [side-effect-free] -- which requires, among 
other things, that they be non-"volatile" --
it may vectorize a loop containing code like the preceding, ignoring for this 
particular instance the C mandate
that only one of {y, z} be evaluated...


My understanding is that any decision as to whether one or both of 'y' and 'z' is 
evaluated (when 'evaluation' involves doing any work, e.g. a load) has already 
been encoded into the gimple/tree IR.  Thus, if we are to evaluate only one of 
'y' or 'z' in your example, the IR will (prior to if-conversion) contain basic 
blocks and control flow that jump around the one that is not evaluated.


This appears to be the case in pr61194.c: prior to if-conversion, the IR for the 
loop in barX is


  <bb ...>:
  # i_16 = PHI <...>
  # ivtmp_21 = PHI <...>
  _5 = x[i_16];
  _6 = _5 > 0.0;
  _7 = w[i_16];
  _8 = _7 < 0.0;
  _9 = _6 & _8;
  if (_9 != 0)
    goto <bb ...>;
  else
    goto <bb ...>;

  <bb ...>:
  iftmp.0_10 = z[i_16];
  goto <bb ...>;

  <bb ...>:
  iftmp.0_11 = y[i_16];

  <bb ...>:
  # iftmp.0_2 = PHI <...>
  z[i_16] = iftmp.0_2;
  i_13 = i_16 + 1;
  ivtmp_20 = ivtmp_21 - 1;
  if (ivtmp_20 != 0)
    goto <bb ...>;
  else
    goto <bb ...>;

  <bb ...>:
  goto <bb ...>;

which clearly contains (unvectorizable!) control flow. Without 
-ftree-loop-if-convert-stores, if-conversion leaves this alone, and 
vectorization fails (i.e. the vectorizer bails out because the loop has >2 basic 
blocks). With -ftree-loop-if-convert-stores, if-conversion produces


  <bb ...>:
  # i_16 = PHI <...>
  # ivtmp_21 = PHI <...>
  _5 = x[i_16];
  _6 = _5 > 0.0;
  _7 = w[i_16];
  _8 = _7 < 0.0;
  _9 = _6 & _8;
  iftmp.0_10 = z[i_16]; // <== here
  iftmp.0_11 = y[i_16]; // <== here
  iftmp.0_2 = _9 ? iftmp.0_10 : iftmp.0_11;
  z[i_16] = iftmp.0_2;
  i_13 = i_16 + 1;
  ivtmp_20 = ivtmp_21 - 1;
  if (ivtmp_20 != 0)
    goto <bb ...>;
  else
    goto <bb ...>;

  <bb ...>:
  goto <bb ...>;

where I have commented the conditional loads that have become unconditional. 
(Hence, "-ftree-loop-if-convert-stores" looks misnamed -- it affects how the 
if-conversion phase converts loads, too; please correct me if I misunderstand, 
Richard?)  This contains no control flow, and so is vectorizable.


(This is all without your scratchpad patch, of course.) IOW this being 
vectorized, or not, relies upon the preceding if-conversion phase removing the 
control flow.


HTH
Alan



Possible issue with ARC gcc 4.8

2015-07-03 Thread Vineet Gupta
Hi,

I have the following test case (reduced from Linux kernel sources) and it seems
gcc is optimizing away the first loop iteration.

arc-linux-gcc -c -O2 star-9000857057.c -fno-branch-count-reg --save-temps -mA7

--->8-
static inline int __test_bit(unsigned int nr, const volatile unsigned long *addr)
{
 unsigned long mask;

 addr += nr >> 5;
#if 0
 nr &= 0x1f;
#endif
 mask = 1UL << nr;
 return ((mask & *addr) != 0);
}

int foo (int a, unsigned long *p)
{
  int i;
  for (i = 63; i >= 0; i--)
    {
      if (!(__test_bit(i, p)))
        continue;
      a += i;
    }
  return a;
}
--->8-

gcc generates following

--->8-
.global foo
.type   foo, @function
foo:
ld_s r2,[r1,4]  <- dead code
mov_s r2,63 
.align 4
.L2:
sub r2,r2,1<-SUB first
cmp r2,-1
jeq.d [blink]
lsr r3,r2,5   <- BUG: first @mask is (1 << 62) NOT (1 << 63)
.align 2
.L4:
ld.as r3,[r1,r3]
bbit0.nd r3,r2,@.L2
add_s r0,r0,r2
sub r2,r2,1
cmp r2,-1
bne.d @.L4
lsr r3,r2,5
j_s [blink]
.size   foo, .-foo
.ident  "GCC: (ARCv2 ISA Linux uClibc toolchain arc-2015.06-rc1-21-g21b2c4b83dfa) 4.8.4"
--->8-

For the initial 32 loop iterations, this test is effectively doing a 64-bit
operation in a 32-bit regime, e.g. (1 << 63).  Is this supposed to be undefined,
truncated to zero, or port-specific?

If it truncates to zero, then the generated code below is not correct, as it
needs to elide not just the first iteration (corresponding to i = 63) but
iterations 63..32.

Further, the ARCompact ISA provides that instructions involving bit-position
operands (BSET, BCLR, LSL) can take any number whatsoever, but the core will only
use the lower 5 bits (so the bitpos is clamped to 0..31 without needing to do
that in code).

So is this a gcc bug, or some spec misinterpretation?

TIA,
-Vineet


Re: Possible issue with ARC gcc 4.8

2015-07-03 Thread Richard Biener
On Fri, Jul 3, 2015 at 3:10 PM, Vineet Gupta  wrote:
> [snip]
>
> So is this a gcc bug, or some spec misinterpretation?

It is the C language standard that says that shifts like this invoke
undefined behavior: the shift amount must be non-negative and smaller than
the width of the promoted left operand (C11 6.5.7p3).

Richard.

> TIA,
> -Vineet


Proposed AAPCS update - parameter passing types with modified alignments

2015-07-03 Thread Richard Earnshaw
Since it may take some time before an official update to the ARM AAPCS
document can be made, I'm publishing a proposed change here for advanced
notice.  Alan will follow up with some GCC patches shortly to implement
these changes.

The proposed changes should deal with types that have been either
under-aligned (packed) or over-aligned by language extensions or
language defined alignment modifiers.  They work by assuming that the
values passed to a procedure are *copies* of values and that these
copies can safely have alignments that differ from both the source of
the copy and also from the target use inside the called procedure (in
the latter case a second copy to suitably aligned memory might be
necessary).

Since the ABI has not previously defined rules for parameter passing of
values with alignment modifiers it is possible that existing
implementations will not be 100% compatible with all these rules.
Modifying the compiler to conform may result in a silent code-generation
change.  (There should be no change for types that are naturally
aligned).  We believe this should be very rare and because the ABI has
not previously sanctioned such types they are unlikely to appear at
shared library boundaries.  It may help if compilers could emit a
warning should they detect that a parameter may cause such a change in
behaviour.

R.

Definitions used in this description:


'Alignment-Adjusted'

An Alignment-Adjusted type is a type to which a language alignment
modifier has been applied.


'Member Alignment'

The Member Alignment of an element of an aggregate type is the
alignment of that member /after/ the application of any language
alignment modifiers to that member.


'Natural Alignment'

The Natural Alignment of an aggregate type is the maximum of each of
the Member Alignments of the 'top-level' members of the aggregate
type (i.e. before any alignment adjustment of the entire aggregate is
applied).

The Natural Alignment of all fundamental data types is that specified
in Table 1 of the AAPCS.





For the purposes of passing alignment-adjusted types as parameters,
the following rules are proposed as additions to Stage B of Section
5.5 of the AAPCS:



* Values of alignment-adjusted types passed as parameters are /copies/
  of the actual values used in the call list.

* The copies have alignments defined as follows:

 - For fundamental types, the alignment is the natural alignment of
   that type (after any promotions).

 - For aggregate types the copy of the aggregate type has 4-byte
   alignment if its natural alignment is <= 4 and 8-byte alignment if
   its natural alignment is >=8.

* The alignment of the copy is used for applying the argument
  marshalling rules.


--

Similar changes will be published for AArch64, except that the lower and
upper bounds for the natural alignment checks are changed from (4, 8)
to (8, 16) bytes.


Re: %fs and %gs segments on x86/x86-64

2015-07-03 Thread Jay
FYI similarly, fs: is special on NT/x86 & gs: is special on NT/amd64. 


In both cases they point to "mostly private builtin" thread locals, and from 
there the "publically extensible" thread locals -- TlsGetValue & __declspec(thread) 
-- are accessed, as well as the x86 exception-handling frame chain, which is just 
another builtin thread local.


(fs: retains the same meaning for NT/x86-on-ia64-or-amd64 as on native NT/x86.)

 - Jay

On Jul 3, 2015, at 1:29 AM, Richard Biener  wrote:

> On Thu, Jul 2, 2015 at 5:57 PM, Armin Rigo  wrote:
>> Hi all,
>> 
>> I implemented support for %fs and %gs segment prefixes on the x86 and
>> x86-64 platforms, in what turns out to be a small patch.
>> 
>> For those not familiar with it, at least on x86-64, %fs and %gs are
>> two special registers that a user program can ask be added to any
>> address machine ...