%fs and %gs segments on x86/x86-64

2015-07-02 Thread Armin Rigo
Hi all,

I implemented support for %fs and %gs segment prefixes on the x86 and
x86-64 platforms, in what turns out to be a small patch.

For those not familiar with them, at least on x86-64, %fs and %gs are
two special registers that a user program can ask to have added to the
address in any memory-accessing machine instruction.  This is done
with a one-byte instruction prefix, "%fs:" or "%gs:".  The value
stored in these two registers cannot be modified quickly (at least
before the Haswell CPU), but the general idea is that they are rarely
modified.
Speed-wise, though, an instruction like "movq %gs:(%rdx), %rax" runs
at the same speed as a "movq (%rdx), %rax" would.  (I failed to
measure any difference, but I guess that the instruction is one more
byte in length, which means that a large quantity of them would tax
the instruction caches a bit more.)

For reference, the pthread library on x86-64 uses %fs to point to
thread-local variables.  There are a number of special modes in gcc to
already produce instructions like "movq %fs:(16), %rax" to load
thread-local variables (declared with __thread).  However, this
support is special-case only.  The %gs register is free to use.  (On
x86, %gs is used by pthread and %fs is free to use.)


So what I did is to add the __seg_fs and __seg_gs address spaces.
They are used like this, for example:

typedef __seg_gs struct myobject_s {
    int a, b, c;
} myobject_t;

You can then use variables of type "struct myobject_s *o1" as regular
pointers, and "myobject_t *o2" as %gs-based pointers.  Accesses to
"o2->a" are compiled to instructions that use the %gs prefix; accesses
to "o1->a" are compiled as usual.  These two pointer types are
incompatible.  The way you obtain %gs-based pointers, or control the
value of %gs itself, is outside the scope of gcc; you do that by using
the correct system calls and by manual arithmetic.  There is no
automatic conversion; the C code can contain casts between the three
address spaces (regular, %fs and %gs) which, like regular pointer
casts, are no-ops.


My motivation comes from the PyPy-STM project ("removing the Global
Interpreter Lock" for this Python interpreter).  In this project, I
want *almost all* pointer manipulations to resolve to different
addresses depending on which thread runs the code.  The idea is to use
mmap() tricks to ensure that the actual memory usage remains
reasonable, by sharing most of the pages (but not all of them) between
each thread's "segment".  So most accesses to a %gs-prefixed address
actually access the same physical memory in all threads; but not all
of them.  This gives me a dynamic way to have a large quantity of data
which every thread can read, and by occasionally changing the mapping
of a single page, I can make some changes thread-local, i.e.
invisible to other threads.

Of course, the same effect can be achieved in other ways, like
declaring a regular "__thread intptr_t base;" and adding the "base"
explicitly to every pointer access.  Clearly, this would have a large
performance impact.  The %gs solution comes at almost no cost.  The
patched gcc is able to compile the hundreds of MBs of (generated) C
code with systematic %gs usage and seems to work well (with one
exception, see below).


Is there interest in that?  And if so, how to progress?

* The patch included here is very minimal.  It is against the
gcc_5_1_0_release branch but adapting it to "trunk" should be
straightforward.

* I'm unclear on whether target_default_pointer_address_modes_p()
should return "true" in this situation: i386-c.c now defines more than
the default address mode, but the new ones also use pointers of the
same standard size.

* One case in which this patched gcc miscompiles code is found in the
attached bug1.c/bug1.s.  (This case almost never occurs in PyPy-STM,
so I could work around it easily.)  I think that some early, pre-RTL
optimization is to "blame" here, possibly getting confused because the
nonstandard address spaces also use the same size for pointers.  Of
course it is also possible that I messed up somewhere, or that the
whole idea is doomed because many optimizations make a similar
assumption.  Hopefully not: it is the only issue I encountered.

* The extra byte needed for the "%gs:" prefix is not explicitly
accounted for.  Is it only by chance that I did not observe gcc
underestimating how large the code it writes is, and then e.g. use
jump instructions that would be rejected by the assembler?

* For completeness: this is very similar to clang's
__attribute__((address_space(256))), but a few details differ.  (Also,
not to discredit other projects on a competitor's mailing list,
but I had to fix three distinct bugs in llvm before I could use it.
This contributes to me having more trust in gcc...)


Links for more info about pypy-stm:

* http://morepypy.blogspot.ch/2015/03/pypy-stm-251-released.html
* https://bitbucket.org/pypy/stmgc/src/use-gcc/gcc-seg-gs/
* https://bitbucket.org/pypy/stmgc/src/use-gcc/c8/stmgc.h


Thanks!

Armin.

Re: %fs and %gs segments on x86/x86-64

2015-07-04 Thread Armin Rigo
Hi Richard,

On 3 July 2015 at 10:29, Richard Biener wrote:
> It's nice to have the ability to test address-space issues on a
> commonly available target at least (not sure if adding runtime
> testcases is easy though).

It should be easy to add testcases that run only on CPUs with the
"fsgsbase" feature, using __builtin_ia32_wrgsbase64().  Either that,
or we have to rely on the Linux-specific system call
arch_prctl(ARCH_SET_GS).  Which option is preferred?  Of course
we can also try both depending on what is available.

Once %gs can be set, the test case can be as simple as setting it to
the address of some 4096-bytes array and checking that various ways to
access small %gs-based addresses really access the array instead of
segfaulting.

>> * One case in which this patched gcc miscompiles code is found in the
>> attached bug1.c/bug1.s.
>
> Hmm, without being able to dive into it with a debugger it's hard to tell ;)
> You might want to open a bugreport in bugzilla for this at least.

Ok, I will.  For reference, I'm not sure why you are not able to dive
into it with a debugger: the gcc patch and the test were included as
attachments...

>> * The extra byte needed for the "%gs:" prefix is not explicitly
>> accounted for.  Is it only by chance that I did not observe gcc
>> underestimating how large the code it writes is, and then e.g. use
>> jump instructions that would be rejected by the assembler?
>
> Yes, I think you are just lucky here.

Note that I suspect gcc makes overestimates that end up compensating;
otherwise pure luck would likely have run out before the end of the
hundreds of MBs of C code.  But I agree it is still a bug.  I will
look into it more.


A bientôt,

Armin.


Re: %fs and %gs segments on x86/x86-64

2015-07-09 Thread Armin Rigo
Hi all,

Here is an updated patch (attached) for __seg_fs and __seg_gs:

* added a target hook "default_pointer_address_modes" to avoid
disabling a few gcc optimizations which, according to my reading of
the documentation, should continue to work even in the presence of
multiple address spaces as long as they all use the same mode for
pointers.

* account for the extra byte in "%gs:(...)" addresses.

* added one test case (better than none!) using "scan-assembler".  If
people agree that this is the style of test that we need here, then I
could add more of them.

The diff is against trunk.  The tests don't all pass; the failures
really seem unrelated, but I guess I should grab the same revision
without the patch, compile it, try to run all the tests on the same
machine, and compare the list of failures... it just takes a serious
amount of time to do so...

I also reported the bug I got previously
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66768) and it seems to
already occur on other targets with address spaces.


A bientôt,

Armin.
Index: gcc/config/i386/i386-c.c
===================================================================
--- gcc/config/i386/i386-c.c(revision 225561)
+++ gcc/config/i386/i386-c.c(working copy)
@@ -576,6 +576,9 @@ ix86_target_macros (void)
   ix86_tune,
   ix86_fpmath,
   cpp_define);
+
+  cpp_define (parse_in, "__SEG_FS");
+  cpp_define (parse_in, "__SEG_GS");
 }
 
 
@@ -590,6 +593,9 @@ ix86_register_pragmas (void)
   /* Update pragma hook to allow parsing #pragma GCC target.  */
   targetm.target_option.pragma_parse = ix86_pragma_target_parse;
 
+  c_register_addr_space ("__seg_fs", ADDR_SPACE_SEG_FS);
+  c_register_addr_space ("__seg_gs", ADDR_SPACE_SEG_GS);
+
 #ifdef REGISTER_SUBTARGET_PRAGMAS
   REGISTER_SUBTARGET_PRAGMAS ();
 #endif
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c  (revision 225561)
+++ gcc/config/i386/i386.c  (working copy)
@@ -16059,6 +16059,18 @@ ix86_print_operand (FILE *file, rtx x, i
  fputs (" PTR ", file);
}
 
+  switch (MEM_ADDR_SPACE(x))
+   {
+   case ADDR_SPACE_SEG_FS:
+ fputs (ASSEMBLER_DIALECT == ASM_ATT ? "%fs:" : "fs:", file);
+ break;
+   case ADDR_SPACE_SEG_GS:
+ fputs (ASSEMBLER_DIALECT == ASM_ATT ? "%gs:" : "gs:", file);
+ break;
+   default:
+ break;
+   }
+
   x = XEXP (x, 0);
   /* Avoid (%rip) for call operands.  */
   if (CONSTANT_ADDRESS_P (x) && code == 'P'
@@ -26133,6 +26145,7 @@ ix86_attr_length_address_default (rtx_in
   for (i = recog_data.n_operands - 1; i >= 0; --i)
 if (MEM_P (recog_data.operand[i]))
   {
+   int addr_space, len;
 constrain_operands_cached (insn, reload_completed);
 if (which_alternative != -1)
  {
@@ -26148,7 +26161,15 @@ ix86_attr_length_address_default (rtx_in
if (*constraints == 'X')
  continue;
  }
-   return memory_address_length (XEXP (recog_data.operand[i], 0), false);
+
+   len = memory_address_length (XEXP (recog_data.operand[i], 0), false);
+
+   /* account for one byte segment prefix for SEG_FS/SEG_GS addr spaces */
+   addr_space = MEM_ADDR_SPACE(recog_data.operand[i]);
+   if (addr_space != ADDR_SPACE_GENERIC)
+ len++;
+
+   return len;
   }
   return 0;
 }
@@ -52205,6 +52226,126 @@ ix86_operands_ok_for_move_multiple (rtx
   return true;
 }
 
+
+/*** FS/GS segment register addressing mode ***/
+
+static machine_mode
+ix86_addr_space_pointer_mode (addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+ as == ADDR_SPACE_SEG_FS ||
+ as == ADDR_SPACE_SEG_GS);
+  return ptr_mode;
+}
+
+/* Return the appropriate mode for a named address address.  */
+static machine_mode
+ix86_addr_space_address_mode (addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+ as == ADDR_SPACE_SEG_FS ||
+ as == ADDR_SPACE_SEG_GS);
+  return Pmode;
+}
+
+/* Named address space version of valid_pointer_mode.  */
+static bool
+ix86_addr_space_valid_pointer_mode (machine_mode mode, addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+ as == ADDR_SPACE_SEG_FS ||
+ as == ADDR_SPACE_SEG_GS);
+  return targetm.valid_pointer_mode (mode);
+}
+
+/* Like ix86_legitimate_address_p, except with named addresses.  */
+static bool
+ix86_addr_space_legitimate_address_p (machine_mode mode, rtx x,
+ bool reg_ok_strict, addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+ as == ADDR_SPACE_SEG_FS ||
+ as == ADDR_SPACE_SEG_GS);
+  return ix86_legitimate_address_p (mode, x, reg_ok_strict);
+}
+
+/* Named address space version of LEGITIMIZE_ADDRESS.  */
+static rtx
+ix86_addr_space_legitimize_address (rtx x, rt

Re: GCC/JIT and precise garbage collection support?

2015-07-10 Thread Armin Rigo
Hi David, hi Basile,

On 10 July 2015 at 03:53, David Malcolm wrote:
> FWIW PyPy (an implementation of Python) defaults to using true GC, and
> could benefit from GC support in GCC; currently PyPy has a nasty hack
> for locating on-stack GC roots, by compiling to assembler, then carving
> up the assembler with regexes to build GC metadata.

A first note: write barriers, stack walking, and so on can all be
implemented manually.  The only thing that cannot be implemented
easily is stack maps.

Here's how the PyPy hacks work in more detail, in case there is
interest.  It might be possible to do it cleanly with minimal changes
in GCC (hopefully?).

The goal: when a garbage collection occurs, we need to locate and
possibly change the GC pointers in the stack.  (They may have been
originally in callee-saved registers, saved by some callee.)  So this
is about writing some "stack map" that describes where the values are
around all calls in the stack.  To do that, we put in the C sources "v
= pypy_asm_gcroot(v);" for all GC-pointer variables after each call
(at least each call that can recursively end up collecting):


/* The following pseudo-instruction is used by --gcrootfinder=asmgcc
   just after a call to tell gcc to put a GCROOT mark on each gc-pointer
   local variable.  All such local variables need to go through a "v =
   pypy_asm_gcroot(v)".  The old value should not be used any more by
   the C code; this prevents the following case from occurring: gcc
   could make two copies of the local variable (e.g. one in the stack
   and one in a register), pass one to GCROOT, and later use the other
   one.  In practice the pypy_asm_gcroot() is often a no-op in the final
   machine code and doesn't prevent most optimizations. */

/* With gcc, getting the asm() right was tricky, though.  The asm() is
   not volatile so that gcc is free to delete it if the output variable
   is not used at all.  We need to prevent gcc from moving the asm()
   *before* the call that could cause a collection; this is the purpose
   of the (unused) __gcnoreorderhack input argument.  Any memory input
   argument would have this effect: as far as gcc knows the call
   instruction can modify arbitrary memory, thus creating the order
   dependency that we want. */

#define pypy_asm_gcroot(p) ({void*_r; \
asm ("/* GCROOT %0 */" : "=g" (_r) :   \
 "0" (p), "m" (__gcnoreorderhack));\
_r; })


This puts a comment in the .s file, which we post-process.  The goal
of this post-processing is to find the GCROOT comments, see what value
they mention, and track where this value comes from at the preceding
call.  This is the messy part, because the value can often move
around, sometimes across jumps.

We also track if and where the callee-saved registers end up being saved.

At the end we generate some static data: a map from every CALL
location to a list of GC pointers which are live across this call,
written out as a list of callee-saved registers and stack locations.
This static data is read by custom platform-specific code in the stack
walker.

This works well enough because, from gcc's point of view, all GC
pointers after a CALL are only used as arguments to "v2 =
pypy_asm_gcroot(v)".  GCC is not allowed to do things like precompute
offsets inside GC objects---because v2 != v (which is true if the GC
moved the object) and v2 is only created by the pypy_asm_gcroot()
after the call.

The drawback of this "asm" statement (besides being detached from the
CALL) is that, even though we say "=g", a stack pointer will often be
loaded into a register just before the "asm" and spilled again to a
(likely different) stack location afterwards.  This creates some
pointless data movements.  This seems to degrade performance by at
most a few percent, so it's fine for us.

So what would a GCC-supported solution look like?  Maybe a single
builtin that does a call and at the same time "marks" some local
variables (for read/write).  It would be enough if a CALL emitted from
this built-in were immediately followed by an assembler
pseudo-instruction that describes the location of all the local
variables listed (plus context information: the current stack frame's
depth, and where callee-saved registers have been saved).  This would
mean the user of this builtin still needs to come up with custom tools
to post-process the assembler, but it is probably the simplest and
most flexible solution.  I may be wrong about thinking any of this
would be easy, though...


A bientôt,

Armin.


Re: GCC/JIT and precise garbage collection support?

2015-07-10 Thread Armin Rigo
Hi David,

On 10 July 2015 at 16:11, David Malcolm wrote:
> AIUI, we have CALL_INSN instructions all the way through the RTL phase
> of the backend, so we can identify which locations in the generated code
> are calls; presumably we'd need at each CALL_INSN to determine somehow
> which RTL expressions tagged as being GC-aware are live (perhaps a
> mixture of registers and fp-offset expressions?)
>
> So presumably we could use that information (maybe in the final pass) to
> write out some metadata describing for each %pc callsite the relevant GC
> roots.
>
> Armin: does this sound like what you need?

Not quite.  I can understand that you're trying to find some solution
with automatic discovery of the live variables of a "GC pointer" type
and so on.  This is more than we need, and if
we had that, then we'd need to work harder to remove the extra stuff.
We only want the end result: attach to each CALL_INSN a list of
variables which should be stored in the stack map for that call, and
be ready to see these locations be modified from outside across the
call if a GC occurs.


A bientôt,

Armin.