Hi, Camm -- Congratulations! I did timings using ACL2 Version 6.3 (which was released just this morning), and here are the results, which indeed show some speed-up.
; GCL 2.6.8 CLtL1 30144.603u 1176.089s 1:19:27.64 656.9% 0+0k 484552+4273744io 381pf+0w ; GCL 2.6.10pre CLtL1 ; (from today: git commit 7e321ad46dbb96449960e5c4cd6605b2af3a971c) 29950.747u 1029.940s 1:18:17.05 659.5% 0+0k 155464+6549560io 5pf+0w I've included both of the above results in the timing results for various Lisps, here: http://www.cs.utexas.edu/users/moore/acl2/v6-3/new.html#performance Mostly I don't have much insight from your extensive data to add beyond what you are seeing. But I might have a useful comment or two. For starters, consider these two lines from your message: -188.79 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2 554.34 141 3726 3 -127.23 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2 1039.93 198 12635 4 I think this says that 2.6.10pre is 188.79 seconds faster than 2.6.8 for the first book (188.79 is (- 554.34 365.55)) and 44.1 seconds faster for the second. I suspect that these proofs generate rather large amounts of "output", which we're not seeing but which is being stored by ACL2's "gag-mode" feature for potential future replay (more on that below). So maybe there's an unusually large amount of garbage, and your improvements for garbage collection and/or reduction of cons words are helping. (The figure of -44.1 for "r2.6.10pre vs r2.6.10pre.widecons" on elementary-bounders might back that up, unless I'm reading it backwards!) (Details about the "gag-mode" issue above: I plan to fix that problem in the future, so that storing that output doesn't happen when proof output is inhibited, as is the case during book certification. Actually, I made such a fix just to the above elementary-bounders book so that I could certify it on my Mac, using 2.6.8 with its limited maxpages. But your tests probably don't use that fix. When I install a fix in the ACL2 sources I'll send you a patch to play with, if you ask me to do so.) Here's another that I can probably explain: -107.32 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23 20 210 2 174.55 87 3042 2 I think that one makes VERY heavy use of fixnums, especially the form (def::ung rectest (x) ...), without declarations or such to help. Finally, thanks for the fix to save-gprof.lsp, which I haven't looked at for many years. I'm pretty sure that instead of adding the argument :unix, the added argument should be the name of the current os, which in the ACL2 package is (os (w *the-live-state*)). If you think I'm wrong, please let me know. Note that this expression does indeed evaluate to :unix on a linux machine. Thanks -- -- Matt From: Camm Maguire <[email protected]> Cc: [email protected] Date: Tue, 01 Oct 2013 15:04:27 -0400 Greetings! And thank you so much for this very helpful report! This post is long, so feel free to skip over any uninteresting detail. The short news is, 2.6.10pre is now even or better with 2.6.8. ============================================================================= In the analysis below, several regressions have been run collecting the following statistics: i=$(find books -name "*.cert.out"); for j in $i; do awk '/seconds runtime/ {k=$(NF-2);} /\[S?GC for/ {i++;a=gensub(".*T=([0-9]*).*","\\1","g")+0;j+=a} /\[SGC on/ {l++} END {print k,i,j,l,m}' m=$j $j;done | sort -n giving the runtime seconds, the number of gc calls, the amount of gc time, and the number of sgc on/off calls, which can be compared using (e.g. files r2.6.10pre vs r2.6.8) join <(awk '{print $5,$1,$2,$3,$4}' r2.6.10pre|sort) <(awk '{print $5,$1,$2,$3,$4}' r2.6.8|sort) | awk '{print $2-$6,$0}' | sort -n in turn giving a report like -188.79 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2 554.34 141 3726 3 -127.23 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2 1039.93 198 12635 4 -84.02 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23 20 210 2 151.25 113 3797 4 ... 16.2 books/misc/misc2/reverse-by-separation.cert.out 187.42 26 467 2 171.22 44 817 2 16.56 books/workshops/2013/greve-slind/defung/defung-stress.cert.out 486.32 50 1378 2 469.76 103 2498 3 20.79 books/centaur/defrstobj/basic-tests.cert.out 462.67 51 1471 2 441.88 80 1868 3 Perhaps your knowledge of the various books and their algorithms might shed light as to why certain factors produce these particular results in the discussion below. I'll be placing the results at http://people.debian.org/~camm/acl2/ with hopefully self-explanatory names. ============================================================================= Performance issues: ======================================= 1) beginning relblock allocation Issue: when gcl saves an image, it eliminates the hole and minimizes the relblock space to make the disk image smaller. All gc statistics used in balancing the heap and reducing overall gc time (e.g. si::*optimize-maximum-pages*) are reset to 0. This seems to be appropriate, as the type of calculation involved in building an image is not necessarily representative of that done is using it. So on startup, one has somewhat of a 'shrink wrapped' image. It takes some time for gcl to gather new statistics and expand the heap from this state. The idea is to expand each page type so that its size vis a vis the rate of allocation is equivalent to the others. Thus, if for some reason one starts with a larger allocation of a given type by fiat, the gc will be triggered by the other types, and they will scale according to the fiat type. As 2.6.8 did not shrink wrap the relblock, it starts with a very large allocation, causing the heap to grow more quickly and save gc time relative to 2.6.10. Current status: At startup, 2.6.10 now scales the new hole size as a set fraction of the number of (dynamically determined) available pages, and the relblock size as a multiple of the non-relocatable heap. Both of these operations cost nothing, but are somewhat ad-hoc. The functions #'si::set-starting-hole-divisor and #'si::set-starting-relblock-heap-multiple are provided to tune if necessary the defaults of 10 and 2. 2) cons size Issue: 2.6.9 and forward defaults to a two word cons, whereas 2.6.8 has a three word cons, the first word being a type word. This is to save space and restore the build on more limited 32bit machines, which has now been achieved. (acl2 6.2 is now in Debian testing). It should also help with memory bandwidth. There is however an extra branch required in typing an object, and more if immediate fixnums are in force. In principle, this should easily be dominated by the cost of referencing the pointer, but this should be tested. Current status: we have an extra configure switch --enable-widecons ; which defaults to "no". Thankfully widecons is a net loss of some 7 min, with a profile like (r2.6.10pre vs r2.6.10pre.widecons): -44.1 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2 956.80 114 7651 2 -15.9 books/centaur/vl/transforms/xf-sizing.cert.out 134.27 36 790 2 150.17 37 880 2 -11.77 books/centaur/regression/common.cert.out 177.72 66 2074 2 189.49 51 1516 2 ... 15.44 books/centaur/defrstobj/basic-tests.cert.out 462.67 51 1471 2 447.23 52 1537 2 17.25 books/workshops/2004/legato/support/generic-theory-tail-recursion-mult.cert.out 194.77 34 624 2 177.52 31 566 2 33.19 books/workshops/2013/greve-slind/defung/defung-stress.cert.out 486.32 50 1378 2 453.13 56 1483 2 3) immediate fixnums Issue: 2.6.9 and forward support immediate fixnums, the rationale having been to lower memory requirements as described above. These provide faster inlined arithmetic and comparisons, though require extra branching when typing an object. When a two word cons and immediate fixnums are both present, up to four branches can be required in a typing, as immediate fixnums can now appear in the cdr of the cons inspected to determine the type. We should see if this is a win too. Current status: We now have the following configure switches --enable-immfix ; defaults to yes, can be disabled with --disable-immfix --enable-fastimmfix=xx ; try to get at least an xx bit wide fixnum table centered ; on the NULL address, which has no boxing cost ; default is 64, meaning use high memory ; immediate fixnums requiring arithmetic to box. --enable-safecdr ; do not place immediate fixnums in cdr, but ; boxed versions instead, and speed up typing ; accordingly. Defaults to "no" --enable-safecdrdbg ; debug the above algorithm and error on failure The dominant fixnum cost is allocation, and this has already been virtually eliminated in acl2 via its use of #'allocate-bigger-fixnum-range. So for acl2, its really just the typing cost vs. the arithmetic acceleration and saving of a pointer dereference. --disable-immfix is a slight net gain (30 sec), with a profile like (r2.6.10pre vs r2.6.10pre.no-immfix): -107.32 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23 20 210 2 174.55 87 3042 2 -10.62 books/unicode/utf8-decode.cert.out 116.08 26 325 2 126.70 24 316 2 -9.86 books/centaur/tutorial/intro.cert.out 159.80 68 1905 2 169.66 60 1793 2 ... 12.08 books/workshops/1999/ste/inference.cert.out 149.30 20 287 2 137.22 24 294 2 13.54 books/misc/misc2/reverse-by-separation.cert.out 187.42 26 467 2 173.88 27 463 2 16.07 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2 349.48 88 1950 2 I anticipate this is not likely to hold up when other immediate fixnum accelerations are backported from master, e.g. accelerating eql. Several of these experiments are still running. I'll post more if desired. 4) general gc: Several inner gc loops have been optimized, significantly speeding up contiguous gc in particular, and to a lesser extent, relocatable. All contiguous gc used to also collect relocatable by default -- this is now separated for efficiency. Traditionally, (si::gbc t) collected everything, (si::gbc 1) relocatable, and (si::gbc nil) cells only. This is still the case, with (si::gbc 0) added to collect contiguous only. echo '(time (progn (setq si::*optimize-maximum-pages* nil)(dotimes (i 1000) (si::gbc ???))))' | ./saved_ansi_gcl |grep "^gbc time" ??? 2.6.8 2.6.10pre === ===== ========= t 9.500 7.510 1 7.250 6.980 0 ----- 6.699 nil 7.210 6.050 5) contblock allocation some sgc bit logic has changed, particularly regarding contiguous blocks, making the large contiguous allocation I recommended earlier now possibly obsolete. I have not tested this, and leaving it in should cause minimal harm. When I get around to it, I really need to implement a trampoline at the end of .data for all pc32 relocations, rendering the whole issue mute. This will likely complicate many platforms, however. Finally, I've used your save-gprof.lsp file, and have found one bug: diff -u save-gprof.lsp.ori save-gprof.lsp --- save-gprof.lsp.ori 2013-10-01 13:56:55.739771000 -0500 +++ save-gprof.lsp 2013-10-01 13:57:15.712921000 -0500 @@ -302,7 +302,7 @@ `(setq ,sym-initial-cbd (,sym-pathname-os-to-unix - (namestring (truename "")) + (namestring (truename "")) :unix ,sym-state)) `(,sym-f-put-global ',sym-cbd Take care, -- Camm Maguire [email protected] ========================================================================== "The earth is but one country, and mankind its citizens." -- Baha'u'llah _______________________________________________ Gcl-devel mailing list [email protected] https://lists.gnu.org/mailman/listinfo/gcl-devel
