Uninitialized stack gaps and conservative garbage collection
Greetings! GCL is a lisp system that compiles to native object code via the intermediary of gcc. It uses a conservative garbage collection algorithm, which means that it walks the C stack to find likely objects in use by automatic variables, and holds on to these. This works quite well in practice. For very large systems, the likelihood of holding onto objects which should be collected increases. In looking into this, it has come to my attention that regardless of how carefully the C programmer initializes variables on the stack, gcc will quite commonly allocate extra space inaccessbile via any C variable. These 'stack gaps' can wind up permanently preventing a large portion of user memory from ever being collected with this algorithm. Ideally, I'd like to be able to have each C function carefully initialize a contiguous stack to minimize or eliminate this problem. Even better would be a gcc switch which would simply zero out each stack frame at the beginning of each function, though this could be expensive in terms of performance. Advice is most appreciated. I'd most like the solution to be portable. I know about -mpreferred-stack-boundary, but I am unsure if its absense on other architectures means there is no stack padding there anyway. Here is my ridiculous kludge which nevertheless works at least here: void wipe_stack(VOL void *) __attribute__ ((noinline)); void wipe_stack(VOL void *l) { #if CSTACK_DIRECTION == -1 if (l>(void *)&l) bzero((void *)&l,l-(void *)&l); #else l+=sizeof(l); if ((void *)&l>l) bzero(l,(void *)&l-l); #endif } object fLfuncall(object fun,...) { va_list ap; object *new; int i,n = VFUN_NARGS-1; if (n>=65) FEerror("arg limit exceeded",0); new=ZALLOCA(n*sizeof(*new)); /* There are 3 unused words on top of esp before and after the alloca, as well as extra esp allocation in the alloca itself, even though the alloca call is padded to 16 bytes. So let's wipe the crud from the stack... */ wipe_stack(&n); ... } Take care, -- Camm Maguire[EMAIL PROTECTED] == "The earth is but one country, and mankind its citizens." -- Baha'u'llah
Re: [Fwd: Uninitialized stack gaps and conservative garbage collection]
Greetings! "Boehm, Hans" <[EMAIL PROTECTED]> writes: > The collector also tries to address this with a complicated hack that > occasionally checks for opportune times to clear sections of the > stack that are not currently in use. (See GC_clear_stack() > in misc.c.) > BTW, where is the official repository of your source these days? > I think this is currently much more sophisticated in the single-threaded > case. It may be possible to enhance this code to deal with the > problems that were observed. But it has some inherent limitations. > It only rarely gets control, and it can't clear areas in frames that > are always live when it gets control. It should handle deep recursions > somewhat acceptably. > I have no problem with accidental junk on the stack, nor values left over after function return. The issue to me is that when one pushes up the stack a second time, one does not wipe out the old values, but rather they could fall into 'stack holes'. For GCL, this is most critical above toplevel, as one never clears this stack again once entered. It also appears important in alloca, which appears to 16byte pad the stack increment regardless of -mprefered-stack-boundary. Beyond this, I think the best strategy, if any additional strategy is needed, is to randomly allocate a little zeroed out stack in certain key functions just to ensure that large loops don't preserve the location of the stack holes forever. Given the stack clearing algorithm, my other question is how to portably implement it. Here again is my trial function, which appears quite dangerous. It assumes that anything above the address of the first argument passed to a function belongs to the parent stack frame, for example. void wipe_stack(VOL void *) __attribute__ ((noinline)); void wipe_stack(VOL void *l) { #if CSTACK_DIRECTION == -1 if (l>(void *)&l) bzero((void *)&l,l-(void *)&l); #else l+=sizeof(l); if ((void *)&l>l) bzero(l,(void *)&l-l); #endif } void clear_c_stack(VOL unsigned) __attribute__ ((noinline)); void clear_c_stack(VOL unsigned n) { void *v=OBJNULL; alloca(n); wipe_stack(&v); } void funcall(object fun) { object VOL sfirst=NULL; wipe_stack(&sfirst); { object temporary=OBJNULL; object x=OBJNULL; object * VOL top=NULL; object *lex=NULL; bds_ptr old_bds_top=NULL; VOL bool b=0; bool c=0; DEBUG_AVMA }} Raymond Toy writes: >On the sparc port, this area can be zeroed out with appropriate >optimization settings. I ran some tests using Eric Marsden's >cl-bench. If the stack is always cleared, the cost of some >benchmarks go up, but some go down, because the cost of GC is >decreased. (The benchmarks include GC time.) Thanks for the tip -- does this use gcc, and if so, what is the relevant switch? Take care, > Hans > > > -Original Message- > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] > > On Behalf Of Bryce McKinlay > > Sent: Wednesday, May 25, 2005 6:55 AM > > To: Ranjit Mathew > > Cc: [EMAIL PROTECTED] > > Subject: Re: [Fwd: Uninitialized stack gaps and conservative > > garbage collection] > > > > > > Ranjit Mathew wrote: > > > > >Interesting stuff found on the GCC list. Doesn't this > > >affect GCJ as well where we use the conservative > > >Boehm GC? > > > > > > > > > Original Message > > >Subject: Uninitialized stack gaps and conservative garbage collection > > >Date: Tue, 24 May 2005 17:40:46 -0400 > > >From: Camm Maguire <[EMAIL PROTECTED]> > > >Newsgroups: gmane.comp.gcc.devel,gmane.lisp.gcl.devel > > > > > >Greetings! GCL is a lisp system that compiles to native object code > > >via the intermediary of gcc. It uses a conservative garbage > > collection > > >algorithm, which means that it walks the C stack to find > > likely objects > > >in use by automatic variables, and holds on to these. This > > works quite > > >well in practice. > > > > > >For very large systems, the likelihood of holding onto objects which > > >should be collected increases. In looking into this, it has > > come to my > > >attention that regardless of how carefully the C programmer > > initializes > > >variables on the stack, gcc will quite commonly allocate extra space > > >inaccessbile via any C variable. These 'stack gaps' can wind up > > >permanently preventing a large portion of user memory from > > ever being > > >collected with this algorithm. > > > > > > > >
Re: [Fwd: Uninitialized stack gaps and conservative garbage collection]
Greetings! Raymond Toy <[EMAIL PROTECTED]> writes: > Camm Maguire wrote: > > Raymond Toy writes: > > > >>On the sparc port, this area can be zeroed out with appropriate > >>optimization settings. I ran some tests using Eric Marsden's > >>cl-bench. If the stack is always cleared, the cost of some > >>benchmarks go up, but some go down, because the cost of GC is > >>decreased. (The benchmarks include GC time.) > > Thanks for the tip -- does this use gcc, and if so, what is the > > You know, of course, that CMUCL is a native compiler that doesn't use > gcc. I hand-wrote the assembly code to clear the stack area. :-) > :-) I forgot. I think I've discovered a way to do the same from :within C, but I don't like it. Right now we've addressed most issues :by pushing the stack mark origin to the frame just above toplevel -- at least a toplevel gc should be able to reclaim all memory. The stress test here is repeatedly making a list the size of the whole heap :-). Take care, > Ray > > > -- Camm Maguire[EMAIL PROTECTED] == "The earth is but one country, and mankind its citizens." -- Baha'u'llah
Incremental gcc
Greetings! GCL is a lisp compiler system which outputs C code normally compiled by gcc into an object, which is then loaded and relocated into the running GCL image. In lisp, compiling is a very incremental process, with many, often thousands of small functions compiled one at a time. GCL/gcc compilation speed would be greatly improved if gcc could be run in some sort of incremental mode, perhaps piping from stdin and outputting assemly only on stdout, if there was some way to instruct gcc to flush intermediary results. gas, which requires a seekable bfd device realistically on some sort of file system, could be run separately. Is anything like this conceivable? To Robert Dodier -- yes I agree, filesystem i/o is negligible as it is all cached anyway on typical modern machines. To Robert Boyer -- what an idea! Eventually, the proper way is likely to write GCL as a gcc frontend, as previously discussed briefly on both lists. Now is not the time, however, alas. Take care, Robert Boyer <[EMAIL PROTECTED]> writes: > Let me make the following zany suggestion. The first time gcl starts gcc, do > it under gdb. Give the gcc command two files, not one, with one of them > being the cmpinclude.h and the other being the ordinary user's gazonk.c; I > think gcc can work this way on several files to be compiled to one. Before > running, in gdb, put in a break point trap for the opening of files; in this > way, trap the opening of the typical, second user file. When that file > opening point is reached after starting gcc under gdb, force feed GDB a > machine code "fork" instruction. Keep one of two forked processes alive but > waiting as the "warm start" version for use later, and use the other fork to > continue compiling gazonk.c as usual. When the second fork later starts up, > force feed it the new correct file and force it to fork itself. I couldn't > do any of this myself in 100 years, but maybe it would be possible for > someone! Just ignore this message as soon as it sounds insane. > > Bob > > > > > -- Camm Maguire[EMAIL PROTECTED] == "The earth is but one country, and mankind its citizens." -- Baha'u'llah
Re: Incremental gcc
Greetings, and thanks for your reply! Mike Stump <[EMAIL PROTECTED]> writes: > On Mar 25, 2006, at 9:14 PM, Camm Maguire wrote: > > Greetings! GCL is a lisp compiler system which outputs C code normally > > compiled by gcc into an object, which is then loaded and relocated > > into the running GCL image. In lisp, compiling is a very incremental > > process, with many, often thousands of small functions compiled one at > > a time. GCL/gcc compilation speed would be greatly improved if gcc > > could be run in some sort of incremental mode > > Ignoring an ideal world for a second, if you have large amounts of headers > that are stable that are needed for compilation, be sure to > precompile them. After that, batch up all code to be compiled into larger > unit and compile it all in one go, as it saves a substantial Thanks! We've done this just recently to some benefit. > amount of time (2x faster maybe), this way startup costs are amortized. > This is often impossible to arrange at the compiler level in lisp. > Longer term, it would be nice to have someone from your camp layout where > the time is spent and what changes might be worth while in gcc to > make it more suitable for that type of work. This would be interesting, how does one benchmark gcc performance in this way? > > You can run the compiler via stdin and stdout: > > $ gcc -x c - -o - -S > int i; > main() { } > ^D .section __TEXT,__text,regular,pure_instructions > .section __TEXT,__picsymbolstub1,symbol_stubs,pure_instructions,32 > .machine ppc > .text > .align 2 > .globl _main > _main: > stmw r30,-8(r1) > stwu r1,-48(r1) > mr r30,r1 > lwz r1,0(r1) > lmw r30,-8(r1) > blr > .comm _i,4 > .subsections_via_symbols > > if you want, so that isn't very hard to do. The sub-problem of an as that > can operate this way is off-topic for this list, so I won't This was the most promising. If I could run gcc as a pipe with assembler only output, all I need is a 'flush instruction on stdin to get the assemly of the function(s) input thus far. Then the pipe should be ready for another function definition followed by a flush, etc. Is there such? > address it. The assembly problem is different, as you state. Among other reasons, the output must be seekable as it is bfd based. Coincidentally, GCL's binary input is bfd based, so it should be possible to incorporate the assembler and leave the output in memory. > > Long term, might be nice to have a way to wire in the assemblers into gcc > directly. If you guys feel ambitious, that might be doable; but > since gc is driven by people that want to code, you'll have to find someone > that wants to code it up. The java folks might be interested > in JITifying the compiler as well, if enough people are interested in doing > that, it might be an interesting direction to take gcc, though > building in as strikes me as a prerequisite. > Thanks again. There is interest, but it might be a ways off before sufficient time becomes available. Take care, > > -- Camm Maguire[EMAIL PROTECTED] == "The earth is but one country, and mankind its citizens." -- Baha'u'llah
Re: clear_cache on Alpha architecture not implemented?
Greetings, and thanks foro looking into this! Witold Baryluk writes: > On 05-03 09:19, Camm Maguire wrote: >> Greetings! The build succeeded, which, alas, means the failure at >> >> http://buildd.debian-ports.org/status/fetch.php?pkg=axiom&arch=alpha&ver=20120301-3&stamp=1335722507 >> >> is again unreproducible, like the kfree_amd64 and ppc examples. Sigh. >> Are there known differences in the cpu cache clearing instructions >> between imago and your machine? >> >> In any case, can one upload by hand builds to debian-ports like one can >> into the official repository? >> >> Take care, >> -- >> Camm Maguire c...@maguirefamily.org >> == >> "The earth is but one country, and mankind its citizens." -- Baha'u'llah > > After lots of investigation, it looks that imb (I-cache memory barrier, > used for flushing instruction cache on single cpu), is not emited or > called when doing __builtin___clear_cache! > > It shouldn't be hard to add, for example using (define_expand > "clear_cache" ... in gcc/config/alpha/alpha.md in gcc, or in > libgcc/config/alpha/linux.h > as #define CLEAR_INSN_CACHE(beg, end) ... > > Either of them should just emit, "call_pal 0x86". It is better than > "imb" because imb is not implemented in Tru64 assembler, and better than > "call_pal pal_imb", because pal_imb (which is equal 134), requires > include . > > Beyond that it would be naccassary to update kernel, to detect that user > code called imb, by checking if proper flag in HWPCB structure is set, > and call smp_imb() if naccassary on multi-processor system (probably > when rescheduling this process on different cpu) and clear hwpcb flag, > becuause imb invalidated Icache only on running cpu. > > > Build of axiom package successed on my machine probably by luck, and by > the fact it is single CPU machine with smaller cache than imago (and > older subarchitecture, thus probably less agressive caching). > > What alpha maintainers thinks? Pending a true fix, here is the gcl situation when built on alpha: checking for main in -lX11... yes checking for xdr_double... yes checking __builtin___clear_cache... yes and the relvant gcl code: #define PAL_imb 134 #define imb() \ __asm__ __volatile__ ("call_pal %0 #imb" : : "i" (PAL_imb) : "memory") #define CLEAR_CACHE imb() #ifdef HAVE_BUILTIN_CLEAR_CACHE static int clear_protect_memory(object memory) { void *p,*pe; int i; p=(void *)((unsigned long)memory->cfd.cfd_start & ~(PAGESIZE-1)); pe=(void *)((unsigned long)(memory->cfd.cfd_start+memory->cfd.cfd_size) & ~(PAGESIZE-1)) + PAGESIZE-1; i=mprotect(p,pe-p,PROT_READ|PROT_WRITE|PROT_EXEC); __builtin___clear_cache((void *)memory->cfd.cfd_start,(void *)memory->cfd.cfd_start+memory->cfd.cfd_size); return i; } #endif #ifdef HAVE_BUILTIN_CLEAR_CACHE massert(!clear_protect_memory(memory)); #else #ifdef CLEAR_CACHE CLEAR_CACHE; #endif #endif The goal was to exercise the very helpful gcc __builtin___clear_cache support, and to avoid having to maintain our own assembler for all the different cpus in this regard. Clearly, it is easy to revert this on a per architecture basis if absolutely necessary. If gcc does or does not plan on fixing this, please let me know so gcl can adjust as needed. Take care, -- Camm Maguirec...@maguirefamily.org == "The earth is but one country, and mankind its citizens." -- Baha'u'llah
Re: clear_cache on Alpha architecture not implemented?
Greetings, and thanks for this very helpful synopsis. I'm wondering if there is a simple configure time test to detect when this has been fixed. If I just aborted using __builtin___clear_cache if it is in fact a noop on alpha, ppc, ppc64, and ia64, would this suffice? Take care, Richard Henderson writes: > On 05/03/2012 10:51 AM, Camm Maguire wrote: >> The goal was to exercise the very helpful gcc __builtin___clear_cache >> support, and to avoid having to maintain our own assembler for all the >> different cpus in this regard. Clearly, it is easy to revert this on a >> per architecture basis if absolutely necessary. If gcc does or does not >> plan on fixing this, please let me know so gcl can adjust as needed. > > While we can probably fix this, you should know that __builtin_clear_cache > is highly tied to the implementation of trampolines for the target. Thus > there are at least 3 targets that do not handle this "properly": > > For alpha, we emit imb directly during the trampoline_init target hook. > > For powerpc32, the libgcc routine __clear_cache is unimplemented, but the > cache flushing for trampolines is inside the __trampoline_setup routine. > > For powerpc64 and ia64, the ABI for function calls allows trampolines to > be implemented without emitting any insns, and thus the icache need not be > flushed at all. And thus we never bothered implementing > __builtin_clear_cache. > > So, the fact of the matter is that you can't reliably use this builtin for > arbitrary targets for any gcc version up to 4.7. Feel free to submit an > enhancement request via bugzilla so that we can remember to address this > for gcc 4.8. > -- Camm Maguirec...@maguirefamily.org == "The earth is but one country, and mankind its citizens." -- Baha'u'llah
Re: clear_cache on Alpha architecture not implemented?
Greetings! As a followup, I should note that sh4 and hppa are also broken. I currently have this in configure.in case $use in sh4*) ;; #FIXME hppa*) ;; #FIXME *) AC_MSG_CHECKING(__builtin___clear_cache) AC_TRY_COMPILE([], [void *v,*ve; __builtin___clear_cache(v,ve); ], [AC_DEFINE(HAVE_BUILTIN_CLEAR_CACHE) AC_MSG_RESULT(yes)], AC_MSG_RESULT(no));; esac Perhaps I should just add alpha, powerpc, ppc64, and ia64 to the exception list and deal with this in the future if memory serves. Take care, Richard Henderson writes: > On 05/03/2012 10:51 AM, Camm Maguire wrote: >> The goal was to exercise the very helpful gcc __builtin___clear_cache >> support, and to avoid having to maintain our own assembler for all the >> different cpus in this regard. Clearly, it is easy to revert this on a >> per architecture basis if absolutely necessary. If gcc does or does not >> plan on fixing this, please let me know so gcl can adjust as needed. > > While we can probably fix this, you should know that __builtin_clear_cache > is highly tied to the implementation of trampolines for the target. Thus > there are at least 3 targets that do not handle this "properly": > > For alpha, we emit imb directly during the trampoline_init target hook. > > For powerpc32, the libgcc routine __clear_cache is unimplemented, but the > cache flushing for trampolines is inside the __trampoline_setup routine. > > For powerpc64 and ia64, the ABI for function calls allows trampolines to > be implemented without emitting any insns, and thus the icache need not be > flushed at all. And thus we never bothered implementing > __builtin_clear_cache. > > So, the fact of the matter is that you can't reliably use this builtin for > arbitrary targets for any gcc version up to 4.7. Feel free to submit an > enhancement request via bugzilla so that we can remember to address this > for gcc 4.8. > > > > r~ > > > > -- Camm Maguirec...@maguirefamily.org == "The earth is but one country, and mankind its citizens." -- Baha'u'llah
Re: clear_cache on Alpha architecture not implemented?
Greetings! Last followup: Nothing here should affect any kfreebsd_amd64 machine, right? My understanding is that these do not require cache flushing. Thus the failure: https://buildd.debian.org/status/fetch.php?pkg=acl2&arch=kfreebsd-amd64&ver=4.3-1&stamp=1326315213 mv *saved_acl2.gcl saved_acl2 /usr/bin/make mini-proveall make[1]: Entering directory `/build/buildd-acl2_4.3-1-kfreebsd-amd64-SScGlk/acl2-4.3' Aborted make[1]: *** [mini-proveall] Error 134 make[1]: Leaving directory `/build/buildd-acl2_4.3-1-kfreebsd-amd64-SScGlk/acl2-4.3' make: *** [debian/mini-proveall.out] Error 2 which has shown up on * Host name:fasch.debian.org * Architecture:kfreebsd-amd64 * Distribution:Debian GNU/Linux * Sponsor: + Greek Research and Technology Network (GRNET) (hosting) + Nordic Gaming (hardware) * Processor:PowerEdge 1855 and * Host name:fano.debian.org * Architecture:kfreebsd-amd64 * Distribution:Debian GNU/kFreeBSD * Sponsor: + University of British Columbia - Department of Electrical and Computer Engineering (hosting) + Hewlett-Packard (hardware) but is not reproducible on * Host name:asdfasdf.debian.net * Architecture:kfreebsd-amd64 * Distribution:Debian GNU/kFreeBSD * Sponsor: + ETH Zurich - Department of Physics (hosting+hw) * Processor:AMD Sempron 3000+ 1600 MHz is not related? Thanks so much! Richard Henderson writes: > On 05/03/2012 10:51 AM, Camm Maguire wrote: >> The goal was to exercise the very helpful gcc __builtin___clear_cache >> support, and to avoid having to maintain our own assembler for all the >> different cpus in this regard. Clearly, it is easy to revert this on a >> per architecture basis if absolutely necessary. If gcc does or does not >> plan on fixing this, please let me know so gcl can adjust as needed. > > While we can probably fix this, you should know that __builtin_clear_cache > is highly tied to the implementation of trampolines for the target. Thus > there are at least 3 targets that do not handle this "properly": > > For alpha, we emit imb directly during the trampoline_init target hook. > > For powerpc32, the libgcc routine __clear_cache is unimplemented, but the > cache flushing for trampolines is inside the __trampoline_setup routine. > > For powerpc64 and ia64, the ABI for function calls allows trampolines to > be implemented without emitting any insns, and thus the icache need not be > flushed at all. And thus we never bothered implementing > __builtin_clear_cache. > > So, the fact of the matter is that you can't reliably use this builtin for > arbitrary targets for any gcc version up to 4.7. Feel free to submit an > enhancement request via bugzilla so that we can remember to address this > for gcc 4.8. > > > > r~ > > > > -- Camm Maguirec...@maguirefamily.org == "The earth is but one country, and mankind its citizens." -- Baha'u'llah