Re: [Gcl-devel] Omnibus gcl/acl2 performance post

Matt Kaufmann Tue, 01 Oct 2013 19:23:49 -0700

Hi, Camm --

Congratulations!  I did timings using ACL2 Version 6.3 (which was
released just this morning), and here are the results, which indeed
show some speed-up.


; GCL 2.6.8 CLtL1
30144.603u 1176.089s 1:19:27.64 656.9%  0+0k 484552+4273744io 381pf+0w

; GCL 2.6.10pre CLtL1
; (from today: git commit 7e321ad46dbb96449960e5c4cd6605b2af3a971c)
29950.747u 1029.940s 1:18:17.05 659.5%  0+0k 155464+6549560io 5pf+0w

I've included both of the above results in the timing results for
various Lisps, here:

http://www.cs.utexas.edu/users/moore/acl2/v6-3/new.html#performance

Mostly I don't have much insight from your extensive data to add
beyond what you are seeing.  But I might have a useful comment or two.
For starters, consider these two lines from your message:

-188.79 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2 
554.34 141 3726 3
-127.23 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2 
1039.93 198 12635 4

I think this says that 2.6.10pre is 188.79 seconds faster than 2.6.8
for the first book (188.79 is (- 554.34 365.55)) and 44.1 seconds
faster for the second.  I suspect that these proofs generate rather
large amounts of "output", which we're not seeing but which is being
stored by ACL2's "gag-mode" feature for potential future replay (more
on that below).  So maybe there's an unusually large amount of
garbage, and your improvements for garbage collection and/or reduction
of cons words are helping.  (The figure of -44.1 for "r2.6.10pre vs
r2.6.10pre.widecons" on elementary-bounders might back that up, unless
I'm reading it backwards!)

(Details about the "gag-mode" issue above: I plan to fix that problem
in the future, so that storing that output doesn't happen when proof
output is inhibited, as is the case during book certification.
Actually, I made such a fix just to the above elementary-bounders book
so that I could certify it on my Mac, using 2.6.8 with its limited
maxpages.  But your tests probably don't use that fix.  When I install
a fix in the ACL2 sources I'll send you a patch to play with, if you
ask me to do so.)

Here's another that I can probably explain:

-107.32 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23 20 
210 2 174.55 87 3042 2

I think that one makes VERY heavy use of fixnums, especially the form
(def::ung rectest (x) ...), without declarations or such to help.

Finally, thanks for the fix to save-gprof.lsp, which I haven't looked
at for many years.  I'm pretty sure that instead of adding the
argument :unix, the added argument should be the name of the current
os, which in the ACL2 package is (os (w *the-live-state*)).  If you
think I'm wrong, please let me know.  Note that this expression does
indeed evaluate to :unix on a linux machine.

Thanks --
-- Matt
   From: Camm Maguire <[email protected]>
   Cc: [email protected]
   Date: Tue, 01 Oct 2013 15:04:27 -0400

   Greetings!  And thank you so much for this very helpful report!

   This post is long, so feel free to skip over any uninteresting detail.
   The short news is, 2.6.10pre is now even or better with 2.6.8.

   =============================================================================
   In the analysis below, several regressions have been run collecting
   the following statistics:

   i=$(find books -name "*.cert.out");
   for j in $i; do 
     awk '/seconds runtime/ {k=$(NF-2);} 
          /\[S?GC for/ {i++;a=gensub(".*T=([0-9]*).*","\\1","g")+0;j+=a}
          /\[SGC on/ {l++} 
          END {print k,i,j,l,m}' m=$j $j;done | sort -n

   giving the runtime seconds, the number of gc calls, the amount of gc
   time, and the number of sgc on/off calls, which can be compared using
   (e.g. files r2.6.10pre vs r2.6.8)

   join <(awk '{print $5,$1,$2,$3,$4}' r2.6.10pre|sort) <(awk '{print 
$5,$1,$2,$3,$4}' r2.6.8|sort) | awk '{print $2-$6,$0}' | sort -n

   in turn giving a report like

   -188.79 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2 
554.34 141 3726 3
   -127.23 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2 
1039.93 198 12635 4
   -84.02 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23 20 
210 2 151.25 113 3797 4
   ...
   16.2 books/misc/misc2/reverse-by-separation.cert.out 187.42 26 467 2 171.22 
44 817 2
   16.56 books/workshops/2013/greve-slind/defung/defung-stress.cert.out 486.32 
50 1378 2 469.76 103 2498 3
   20.79 books/centaur/defrstobj/basic-tests.cert.out 462.67 51 1471 2 441.88 
80 1868 3

   Perhaps your knowledge of the various books and their algorithms might
   shed light as to why certain factors produce these particular results
   in the discussion below.

   I'll be placing the results at http://people.debian.org/~camm/acl2/
   with hopefully self-explanatory names.
   =============================================================================


   Performance issues:
   =======================================

   1) beginning relblock allocation

   Issue: when gcl saves an image, it eliminates the hole and minimizes
   the relblock space to make the disk image smaller.  All gc statistics
   used in balancing the heap and reducing overall gc time
   (e.g. si::*optimize-maximum-pages*) are reset to 0.  This seems to be
   appropriate, as the type of calculation involved in building an image
   is not necessarily representative of that done is using it.  So on
   startup, one has somewhat of a 'shrink wrapped' image.  It takes some
   time for gcl to gather new statistics and expand the heap from this
   state.  The idea is to expand each page type so that its size vis a
   vis the rate of allocation is equivalent to the others.  Thus, if for
   some reason one starts with a larger allocation of a given type by
   fiat, the gc will be triggered by the other types, and they will scale
   according to the fiat type.  As 2.6.8 did not shrink wrap the
   relblock, it starts with a very large allocation, causing the heap to
   grow more quickly and save gc time relative to 2.6.10.

   Current status: At startup, 2.6.10 now scales the new hole size as
   a set fraction of the number of (dynamically determined) available
   pages, and the relblock size as a multiple of the non-relocatable
   heap.  Both of these operations cost nothing, but are somewhat
   ad-hoc.  The functions #'si::set-starting-hole-divisor and
   #'si::set-starting-relblock-heap-multiple are provided to tune if
   necessary the defaults of 10 and 2.

   2) cons size

   Issue: 2.6.9 and forward defaults to a two word cons, whereas 2.6.8
   has a three word cons, the first word being a type word.  This is to
   save space and restore the build on more limited 32bit machines, which
   has now been achieved. (acl2 6.2 is now in Debian testing).  It should
   also help with memory bandwidth.  There is however an extra branch
   required in typing an object, and more if immediate fixnums are in
   force.  In principle, this should easily be dominated by the cost of
   referencing the pointer, but this should be tested.

   Current status: we have an extra configure switch

   --enable-widecons ; which defaults to "no".  

   Thankfully widecons is a net loss of some 7 min, with a profile like
   (r2.6.10pre vs r2.6.10pre.widecons):

   -44.1 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2 
956.80 114 7651 2
   -15.9 books/centaur/vl/transforms/xf-sizing.cert.out 134.27 36 790 2 150.17 
37 880 2
   -11.77 books/centaur/regression/common.cert.out 177.72 66 2074 2 189.49 51 
1516 2
   ...
   15.44 books/centaur/defrstobj/basic-tests.cert.out 462.67 51 1471 2 447.23 
52 1537 2
   17.25 
books/workshops/2004/legato/support/generic-theory-tail-recursion-mult.cert.out 
194.77 34 624 2 177.52 31 566 2
   33.19 books/workshops/2013/greve-slind/defung/defung-stress.cert.out 486.32 
50 1378 2 453.13 56 1483 2


   3) immediate fixnums

   Issue: 2.6.9 and forward support immediate fixnums, the rationale
   having been to lower memory requirements as described above.  These
   provide faster inlined arithmetic and comparisons, though require
   extra branching when typing an object.  When a two word cons and
   immediate fixnums are both present, up to four branches can be
   required in a typing, as immediate fixnums can now appear in the cdr
   of the cons inspected to determine the type.  We should see if this is
   a win too.

   Current status: We now have the following configure switches

   --enable-immfix ; defaults to yes, can be disabled with  --disable-immfix

   --enable-fastimmfix=xx ; try to get at least an xx bit wide fixnum table 
centered
                          ; on the NULL address, which has no boxing cost
                          ; default is 64, meaning use high memory
                          ; immediate fixnums requiring arithmetic to box.

   --enable-safecdr       ; do not place immediate fixnums in cdr, but
                          ; boxed versions instead, and speed up typing
                          ; accordingly.  Defaults to "no"
   --enable-safecdrdbg    ; debug the above algorithm and error on failure


   The dominant fixnum cost is allocation, and this has already been
   virtually eliminated in acl2 via its use of
   #'allocate-bigger-fixnum-range.  So for acl2, its really just the
   typing cost vs. the arithmetic acceleration and saving of a pointer
   dereference. 

   --disable-immfix is a slight net gain (30 sec), with a profile like
   (r2.6.10pre vs r2.6.10pre.no-immfix):

   -107.32 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23 
20 210 2 174.55 87 3042 2
   -10.62 books/unicode/utf8-decode.cert.out 116.08 26 325 2 126.70 24 316 2
   -9.86 books/centaur/tutorial/intro.cert.out 159.80 68 1905 2 169.66 60 1793 2
   ...
   12.08 books/workshops/1999/ste/inference.cert.out 149.30 20 287 2 137.22 24 
294 2
   13.54 books/misc/misc2/reverse-by-separation.cert.out 187.42 26 467 2 173.88 
27 463 2
   16.07 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2 
349.48 88 1950 2

   I anticipate this is not likely to hold up when other immediate fixnum
   accelerations are backported from master, e.g. accelerating eql. 

   Several of these experiments are still running.  I'll post more if desired.

   4) general gc:

   Several inner gc loops have been optimized, significantly speeding up
   contiguous gc in particular, and to a lesser extent, relocatable.  All
   contiguous gc used to also collect relocatable by default -- this is
   now separated for efficiency.  Traditionally, (si::gbc t) collected
   everything, (si::gbc 1) relocatable, and (si::gbc nil) cells only.
   This is still the case, with (si::gbc 0) added to collect contiguous
   only.   

   echo '(time (progn (setq si::*optimize-maximum-pages* nil)(dotimes (i 1000) 
(si::gbc ???))))' | ./saved_ansi_gcl |grep "^gbc time"

   ???    2.6.8  2.6.10pre
   ===    =====  =========
   t      9.500  7.510
   1      7.250  6.980
   0      -----  6.699
   nil    7.210  6.050

   5) contblock allocation

   some sgc bit logic has changed, particularly regarding contiguous
   blocks, making the large contiguous allocation I recommended earlier
   now possibly obsolete.  I have not tested this, and leaving it in
   should cause minimal harm.

   When I get around to it, I really need to implement a trampoline at
   the end of .data for all pc32 relocations, rendering the whole issue
   mute.  This will likely complicate many platforms, however.

   Finally, I've used your save-gprof.lsp file, and have found one bug:

   diff -u save-gprof.lsp.ori save-gprof.lsp
   --- save-gprof.lsp.ori       2013-10-01 13:56:55.739771000 -0500
   +++ save-gprof.lsp   2013-10-01 13:57:15.712921000 -0500
   @@ -302,7 +302,7 @@

                                        `(setq ,sym-initial-cbd
                                               (,sym-pathname-os-to-unix
   -                                            (namestring (truename ""))
   +                                            (namestring (truename "")) :unix
                                                ,sym-state))
                                        `(,sym-f-put-global
                                          ',sym-cbd


   Take care,
   -- 
   Camm Maguire                                     [email protected]
   ==========================================================================
   "The earth is but one country, and mankind its citizens."  --  Baha'u'llah


_______________________________________________
Gcl-devel mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/gcl-devel

Re: [Gcl-devel] Omnibus gcl/acl2 performance post

Reply via email to