%fs and %gs segments on x86/x86-64
Hi all, I implemented support for %fs and %gs segment prefixes on the x86 and x86-64 platforms, in what turns out to be a small patch. For those not familiar with it, at least on x86-64, %fs and %gs are two special registers that a user program can ask be added to any address machine instruction. This is done with a one-byte instruction prefix, "%fs:" or "%gs:". The actual value stored in these two registers cannot quickly be modified (at least before the Haswell CPU), but the general idea is that they are rarely modified. Speed-wise, though, an instruction like "movq %gs:(%rdx), %rax" runs at the same speed as a "movq (%rdx), %rax" would. (I failed to measure any difference, but I guess that the instruction is one more byte in length, which means that a large quantity of them would tax the instruction caches a bit more.) For reference, the pthread library on x86-64 uses %fs to point to thread-local variables. There are a number of special modes in gcc to already produce instructions like "movq %fs:(16), %rax" to load thread-local variables (declared with __thread). However, this support is special-case only. The %gs register is free to use. (On x86, %gs is used by pthread and %fs is free to use.) So what I did is to add the __seg_fs and __seg_gs address spaces. It is used like this, for example: typedef __seg_gs struct myobject_s { int a, b, c; } myobject_t; You can then use variables of type "struct myobject_s *o1" as regular pointers, and "myobject_t *o2" as %gs-based pointers. Accesses to "o2->a" are compiled to instructions that use the %gs prefix; accesses to "o1->a" are compiled as usual. These two pointer types are incompatible. The way you obtain %gs-based pointers, or control the value of %gs itself, is out of the scope of gcc; you do that by using the correct system calls and by manual arithmetic. There is no automatic conversion; the C code can contain casts between the three address spaces (regular, %fs and %gs) which, like regular pointer casts, are no-ops. My motivation comes from the PyPy-STM project ("removing the Global Interpreter Lock" for this Python interpreter). In this project, I want *almost all* pointer manipulations to resolve to different addresses depending on which thread runs the code. The idea is to use mmap() tricks to ensure that the actual memory usage remains reasonable, by sharing most of the pages (but not all of them) between each thread's "segment". So most accesses to a %gs-prefixed address actually access the same physical memory in all threads; but not all of them. This gives me a dynamic way to have a large quantity of data which every thread can read, and by changing occasionally the mapping of a single page, I can make some changes be thread-local, i.e. invisible to other threads. Of course, the same effect can be achieved in other ways, like declaring a regular "__thread intptr_t base;" and adding the "base" explicitly to every pointer access. Clearly, this would have a large performance impact. The %gs solution comes at almost no cost. The patched gcc is able to compile the hundreds of MBs of (generated) C code with systematic %gs usage and seems to work well (with one exception, see below). Is there interest in that? And if so, how to progress? * The patch included here is very minimal. It is against the gcc_5_1_0_release branch but adapting it to "trunk" should be straightforward. 
* I'm unclear if target_default_pointer_address_modes_p() should return "true" or not in this situation: i386-c.c now defines more than the default address mode, but the new ones also use pointers of the same standard size.

* One case in which this patched gcc miscompiles code is found in the attached bug1.c/bug1.s. (This case almost never occurs in PyPy-STM, so I could work around it easily.) I think that some early, pre-RTL optimization is to "blame" here, possibly getting confused because the nonstandard address spaces also use the same size for pointers. Of course it is also possible that I messed up somewhere, or that the whole idea is doomed because many optimizations make a similar assumption. Hopefully not: it is the only issue I encountered.

* The extra byte needed for the "%gs:" prefix is not explicitly accounted for. Is it only by chance that I did not observe gcc underestimating how large the code it writes is, and then, e.g., using jump instructions that would be rejected by the assembler?

* For completeness: this is very similar to clang's __attribute__((address_space(256))), but a few details differ. (Also, not to discredit other projects on a competitor's mailing list, but I had to fix three distinct bugs in llvm before I could use it. It contributes to me having more trust in gcc...)

Links for more info about pypy-stm:

* http://morepypy.blogspot.ch/2015/03/pypy-stm-251-released.html
* https://bitbucket.org/pypy/stmgc/src/use-gcc/gcc-seg-gs/
* https://bitbucket.org/pypy/stmgc/src/use-gcc/c8/stmgc.h

Thanks!
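Appendix, for illustration only: a minimal sketch of how the two kinds of pointers described above might be used with the patched gcc. The function names are made up, and the actual setup of %gs is left to the appropriate system call.

/* sketch only: assumes the patched gcc providing __seg_gs */
typedef __seg_gs struct myobject_s { int a, b, c; } myobject_t;

int read_both (struct myobject_s *o1, myobject_t *o2)
{
  int x = o1->a;     /* compiled as a normal load, e.g. "movl (%rdi), %eax" */
  int y = o2->a;     /* compiled with the prefix, e.g. "movl %gs:(%rsi), %eax" */
  return x + y;
}

/* casts between the address spaces are explicit and compile to no-ops */
myobject_t *as_gs_pointer (struct myobject_s *p)
{
  return (myobject_t *) p;
}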
Re: %fs and %gs segments on x86/x86-64
Hi Richard,

On 3 July 2015 at 10:29, Richard Biener wrote:
> It's nice to have the ability to test address-space issues on a
> commonly available target at least (not sure if adding runtime
> testcases is easy though).

It should be easy to add testcases that run only on CPUs with the "fsgsbase" feature, using __builtin_ia32_wrgsbase64(). Either that, or we have to rely on the Linux-specific system call arch_prctl(ARCH_SET_GS). Which option is preferred? Of course we can also try both depending on what is available. Once %gs can be set, the test case can be as simple as setting it to the address of some 4096-byte array and checking that various ways to access small %gs-based addresses really access the array instead of segfaulting. (A rough sketch of such a test is appended at the end of this message.)

>> * One case in which this patched gcc miscompiles code is found in the
>> attached bug1.c/bug1.s.
>
> Hmm, without being able to dive into it with a debugger it's hard to tell ;)
> You might want to open a bugreport in bugzilla for this at least.

Ok, I will. For reference, I'm not sure why you are not able to dive into it with a debugger: the gcc patch and the test were included as attachments...

>> * The extra byte needed for the "%gs:" prefix is not explicitly
>> accounted for. Is it only by chance that I did not observe gcc
>> underestimating how large the code it writes is, and then e.g. use
>> jump instructions that would be rejected by the assembler?
>
> Yes, I think you are just lucky here.

Note that I suspect gcc makes overestimates elsewhere that end up compensating; otherwise pure luck would likely have run out before the end of the hundreds of MBs of C code. But I agree it is still a bug. I will look into it more.

A bientôt,
Armin.
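Here is the rough sketch I have in mind, using the arch_prctl() variant (a sketch only, untested as written; it assumes the patched gcc providing __seg_gs, and the __builtin_ia32_wrgsbase64() variant would look similar):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/prctl.h>

typedef __seg_gs char gs_char;   /* a char in the %gs address space */

static char buffer[4096];

int main (void)
{
  /* point %gs at the start of the 4096-byte array */
  if (syscall (SYS_arch_prctl, ARCH_SET_GS, buffer) != 0)
    return 0;                    /* cannot set %gs here: skip the test */

  buffer[10] = 42;

  /* a small %gs-based address must access the array, not segfault */
  gs_char *p = (gs_char *) 10;
  if (*p != 42)
    __builtin_abort ();
  return 0;
}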
Re: %fs and %gs segments on x86/x86-64
Hi all,

Here is an updated patch (attached) for __seg_fs and __seg_gs:

* added a target hook "default_pointer_address_modes" to avoid disabling a few gcc optimizations which, according to my reading of the documentation, should continue to work even in the presence of multiple address spaces, as long as they all use the same mode for pointers.

* account for the extra byte in "%gs:(...)" addresses.

* added one test case (better than none!) using "scan-assembler". If people agree that this is the style of test that we need here, then I could add more of them. (A rough sketch of the shape of such a test is appended after the patch below.)

The diff is against trunk. The tests don't all pass; the failures really seem unrelated, but I guess I should grab the same revision without the patch, compile it, try to run all the tests on the same machine, and compare the list of failures... it just takes a serious amount of time to do so...

I also reported the bug I got previously (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66768) and it seems to occur already on other targets with address spaces.

A bientôt,
Armin.

Index: gcc/config/i386/i386-c.c
===================================================================
--- gcc/config/i386/i386-c.c	(revision 225561)
+++ gcc/config/i386/i386-c.c	(working copy)
@@ -576,6 +576,9 @@ ix86_target_macros (void)
 			      ix86_tune,
 			      ix86_fpmath,
 			      cpp_define);
+
+  cpp_define (parse_in, "__SEG_FS");
+  cpp_define (parse_in, "__SEG_GS");
 }
@@ -590,6 +593,9 @@ ix86_register_pragmas (void)
   /* Update pragma hook to allow parsing #pragma GCC target.  */
   targetm.target_option.pragma_parse = ix86_pragma_target_parse;
+  c_register_addr_space ("__seg_fs", ADDR_SPACE_SEG_FS);
+  c_register_addr_space ("__seg_gs", ADDR_SPACE_SEG_GS);
+
 #ifdef REGISTER_SUBTARGET_PRAGMAS
   REGISTER_SUBTARGET_PRAGMAS ();
 #endif
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 225561)
+++ gcc/config/i386/i386.c	(working copy)
@@ -16059,6 +16059,18 @@ ix86_print_operand (FILE *file, rtx x, i
 	  fputs (" PTR ", file);
 	}
+      switch (MEM_ADDR_SPACE(x))
+	{
+	case ADDR_SPACE_SEG_FS:
+	  fputs (ASSEMBLER_DIALECT == ASM_ATT ? "%fs:" : "fs:", file);
+	  break;
+	case ADDR_SPACE_SEG_GS:
+	  fputs (ASSEMBLER_DIALECT == ASM_ATT ? "%gs:" : "gs:", file);
+	  break;
+	default:
+	  break;
+	}
+
       x = XEXP (x, 0);
       /* Avoid (%rip) for call operands.  */
       if (CONSTANT_ADDRESS_P (x) && code == 'P'
@@ -26133,6 +26145,7 @@ ix86_attr_length_address_default (rtx_in
   for (i = recog_data.n_operands - 1; i >= 0; --i)
     if (MEM_P (recog_data.operand[i]))
       {
+	int addr_space, len;
 	constrain_operands_cached (insn, reload_completed);
 	if (which_alternative != -1)
 	  {
@@ -26148,7 +26161,15 @@ ix86_attr_length_address_default (rtx_in
 	    if (*constraints == 'X')
 	      continue;
 	  }
-	return memory_address_length (XEXP (recog_data.operand[i], 0), false);
+
+	len = memory_address_length (XEXP (recog_data.operand[i], 0), false);
+
+	/* account for one byte segment prefix for SEG_FS/SEG_GS addr spaces */
+	addr_space = MEM_ADDR_SPACE(recog_data.operand[i]);
+	if (addr_space != ADDR_SPACE_GENERIC)
+	  len++;
+
+	return len;
       }
   return 0;
 }
@@ -52205,6 +52226,126 @@ ix86_operands_ok_for_move_multiple (rtx
   return true;
 }
+
+/*** FS/GS segment register addressing mode ***/
+
+static machine_mode
+ix86_addr_space_pointer_mode (addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+	      as == ADDR_SPACE_SEG_FS ||
+	      as == ADDR_SPACE_SEG_GS);
+  return ptr_mode;
+}
+
+/* Return the appropriate mode for a named address address.  */
+static machine_mode
+ix86_addr_space_address_mode (addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+	      as == ADDR_SPACE_SEG_FS ||
+	      as == ADDR_SPACE_SEG_GS);
+  return Pmode;
+}
+
+/* Named address space version of valid_pointer_mode.  */
+static bool
+ix86_addr_space_valid_pointer_mode (machine_mode mode, addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+	      as == ADDR_SPACE_SEG_FS ||
+	      as == ADDR_SPACE_SEG_GS);
+  return targetm.valid_pointer_mode (mode);
+}
+
+/* Like ix86_legitimate_address_p, except with named addresses.  */
+static bool
+ix86_addr_space_legitimate_address_p (machine_mode mode, rtx x,
+				      bool reg_ok_strict, addr_space_t as)
+{
+  gcc_assert (as == ADDR_SPACE_GENERIC ||
+	      as == ADDR_SPACE_SEG_FS ||
+	      as == ADDR_SPACE_SEG_GS);
+  return ix86_legitimate_address_p (mode, x, reg_ok_strict);
+}
+
+/* Named address space version of LEGITIMIZE_ADDRESS.  */
+static rtx
+ix86_addr_space_legitimize_address (rtx x, rt
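For reference, the "scan-assembler" style of test mentioned above would be roughly of this shape (a sketch only; the test actually attached to the patch may differ in its details):

/* { dg-do compile } */
/* { dg-options "-O2" } */

typedef __seg_gs struct myobject_s { int a, b, c; } myobject_t;

int load_b (myobject_t *o)
{
  return o->b;
}

/* { dg-final { scan-assembler "%gs:" } } */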
Re: GCC/JIT and precise garbage collection support?
Hi David, hi Basile,

On 10 July 2015 at 03:53, David Malcolm wrote:
> FWIW PyPy (an implementation of Python) defaults to using true GC, and
> could benefit from GC support in GCC; currently PyPy has a nasty hack
> for locating on-stack GC roots, by compiling to assembler, then carving
> up the assembler with regexes to build GC metadata.

A first note: write barriers, stack walking, and so on can all be implemented manually. The only thing that cannot be implemented easily is stack maps. Here, in more detail, is how the PyPy hacks work, in case there is interest. It might be possible to do it cleanly with minimal changes in GCC (hopefully?).

The goal: when a garbage collection occurs, we need to locate and possibly change the GC pointers in the stack. (They may have originally been in callee-saved registers, saved by some callee.) So this is about writing some "stack map" that describes where the values are around all calls in the stack.

To do that, we put "v = pypy_asm_gcroot(v);" in the C sources for all GC-pointer variables after each call (at least each call that can recursively end up collecting):

/* The following pseudo-instruction is used by --gcrootfinder=asmgcc
   just after a call to tell gcc to put a GCROOT mark on each gc-pointer
   local variable.  All such local variables need to go through a
   "v = pypy_asm_gcroot(v)".  The old value should not be used any more
   by the C code; this prevents the following case from occurring: gcc
   could make two copies of the local variable (e.g. one in the stack
   and one in a register), pass one to GCROOT, and later use the other
   one.  In practice the pypy_asm_gcroot() is often a no-op in the final
   machine code and doesn't prevent most optimizations. */

/* With gcc, getting the asm() right was tricky, though.  The asm() is
   not volatile so that gcc is free to delete it if the output variable
   is not used at all.  We need to prevent gcc from moving the asm()
   *before* the call that could cause a collection; this is the purpose
   of the (unused) __gcnoreorderhack input argument.  Any memory input
   argument would have this effect: as far as gcc knows the call
   instruction can modify arbitrary memory, thus creating the order
   dependency that we want. */

#define pypy_asm_gcroot(p) ({void *_r;         \
       asm ("/* GCROOT %0 */" : "=g" (_r) :    \
            "0" (p), "m" (__gcnoreorderhack)); \
       _r; })

This puts a comment in the .s file, which we post-process. The goal of this post-processing is to find the GCROOT comments, see what value they mention, and track where this value comes from at the preceding call. This is the messy part, because the value can often move around, sometimes across jumps. We also track if and where the callee-saved registers end up being saved. At the end we generate some static data: a map from every CALL location to a list of GC pointers which are live across this call, written out as a list of callee-saved registers and stack locations. This static data is read by custom platform-specific code in the stack walker.

This works well enough because, from gcc's point of view, all GC pointers after a CALL are only used as arguments to "v2 = pypy_asm_gcroot(v)". GCC is not allowed to do things like precompute offsets inside GC objects---because v2 != v (which is true if the GC moved the object) and v2 is only created by the pypy_asm_gcroot() after the call.
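In the generated C sources, the pattern around a call therefore looks roughly like this (a simplified sketch with made-up names, repeating the macro from above so that the snippet is self-contained):

/* sketch only: every GC pointer that is live across a call goes through
   the GCROOT pseudo-instruction just after that call, and only the
   returned value is used afterwards */
long __gcnoreorderhack;

#define pypy_asm_gcroot(p) ({void *_r;         \
       asm ("/* GCROOT %0 */" : "=g" (_r) :    \
            "0" (p), "m" (__gcnoreorderhack)); \
       _r; })

struct object_s { long field; };
extern struct object_s *call_that_may_collect (struct object_s *arg);

long example (struct object_s *a, struct object_s *b)
{
  struct object_s *res = call_that_may_collect (a);

  /* 'a' and 'b' may have been moved by the GC during the call */
  a = pypy_asm_gcroot (a);
  b = pypy_asm_gcroot (b);

  return a->field + b->field + res->field;
}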
The drawback of this "asm" statement (besides being detached from the CALL) is that, even though we say "=g", a pointer that lives in the stack will often be loaded into a register just before the "asm" and spilled again to a (likely different) stack location afterwards. This creates some pointless data movement. It seems to degrade performance by at most a few percent, so it's fine for us.

So what would a GCC-supported solution look like? Maybe a single builtin that does a call and at the same time "marks" some local variables (for read/write). It would be enough if a CALL emitted from this builtin were immediately followed by an assembler pseudo-instruction describing the location of all the local variables listed (plus context information: the current stack frame's depth, and where the callee-saved registers have been saved). This would mean the user of this builtin still needs to come up with custom tools to post-process the assembler, but it is probably the simplest and most flexible solution. I may be wrong about thinking any of this would be easy, though...

A bientôt,
Armin.
Re: GCC/JIT and precise garbage collection support?
Hi David,

On 10 July 2015 at 16:11, David Malcolm wrote:
> AIUI, we have CALL_INSN instructions all the way through the RTL phase
> of the backend, so we can identify which locations in the generated code
> are calls; presumably we'd need at each CALL_INSN to determine somehow
> which RTL expressions tagged as being GC-aware are live (perhaps a
> mixture of registers and fp-offset expressions?)
>
> So presumably we could use that information (maybe in the final pass) to
> write out some metadata describing for each %pc callsite the relevant GC
> roots.
>
> Armin: does this sound like what you need?

Not quite. I can understand that you're trying to find some solution with automatic discovery of the live variables of a "GC pointer" type and so on. This is more than we need, and if we had that, then we'd need to work harder to remove the extra stuff. We only want the end result: attach to each CALL_INSN a list of variables which should be stored in the stack map for that call, and be ready to see these locations be modified from outside across the call if a GC occurs.

A bientôt,
Armin.