Re: IRA for GCC 4.4

J.C. Pizarro Sun, 27 Apr 2008 12:45:29 -0700

On Fri 25 Apr 2008 22:22:55 -0500, Peter Bergner <[EMAIL PROTECTED]> wrote:
> On Thu, 2008-04-24 at 20:23 -0400, Vladimir Makarov wrote:
> > Hi, Peter.  The last time I looked at the conflict builder
> > (ra-conflict.c), I did not see the compressed matrix.  Is it in the
> > trunk?  What should I look at?
>
> Yes, the compressed bit matrix was committed as revision 129037 on
> October 5th, so it's been there a while.  Note that the old square
> bit matrix was used not only for testing for conflicts, but also for
> visiting an allocno's neighbors.  The new code (and all compilers I've
> worked on/with), use a {,compressed} upper triangular bit matrix for
> testing for conflicts and an adjacency list for visiting neighbors.
>
> The code that allocates and initializes the compressed bit matrix is in
> global.c.  If you remember how a upper triangular bit matrix works, it's
> just one big bit vector, where the bit number that represents the conflict
> between allocnos LOW and HIGH is given by either of these two functions:
>
>   1) bitnum = f(HIGH) + LOW
>   2) bitnum = f(LOW) + HIGH
>
> where:
>
>   1) f(HIGH) = (HIGH * (HIGH - 1)) / 2
>   2) f(LOW) = LOW * (max_allocno - LOW) + (LOW * (LOW - 1)) / 2 - LOW - 1
>
> As mentioned in some of the conflict graph bit matrix literature (actually,
> they only mention expression #1 above), the expensive functions f(HIGH) and
> f(LOW) can be precomputed and stored in an array, so to access the conflict
> graph bits only takes a load and an addition.  Below is an example bit matrix
> with initialized array:
>
>                 0    1    2     3    4    5    6    7    8    9   10   11
> ------        -------------------------------------------------------------
> | -1 |      0 |    |  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 |
> ------        -------------------------------------------------------------
> |  9 |      1 |    |    | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
> ------        -------------------------------------------------------------
> | 18 |      2 |    |    |    | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
> ------        -------------------------------------------------------------
> | 26 |      3 |    |    |    |    | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 |
> ------        -------------------------------------------------------------
> | 33 |      4 |    |    |    |    |    | 38 | 39 | 40 | 41 | 42 | 43 | 44 |
> ------        -------------------------------------------------------------
> | 39 |      5 |    |    |    |    |    |    | 45 | 46 | 47 | 48 | 49 | 50 |
> ------        -------------------------------------------------------------
> | 44 |      6 |    |    |    |    |    |    |    | 51 | 52 | 53 | 54 | 55 |
> ------        -------------------------------------------------------------
> | 48 |      7 |    |    |    |    |    |    |    |    | 56 | 57 | 58 | 59 |
> ------        -------------------------------------------------------------
> | 51 |      8 |    |    |    |    |    |    |    |    |    | 60 | 61 | 62 |
> ------        -------------------------------------------------------------
> | 53 |      9 |    |    |    |    |    |    |    |    |    |    | 63 | 64 |
> ------        -------------------------------------------------------------
> | 54 |     10 |    |    |    |    |    |    |    |    |    |    |    | 65 |
> ------        -------------------------------------------------------------
> | NA |     11 |    |    |    |    |    |    |    |    |    |    |    |    |
> ------        -------------------------------------------------------------
>
> As an example, if we look at the interference between allocnos 8 and 10, we
> compute "array[8] + 10" = "51 + 10" = "61", which if you look above, you will
> see is the correct bit number for that interference bit.
>
> The difference between a compressed upper triangular bit matrix from a 
> standard
> upper triangular bit matrix like the one above, is we eliminate space from the
> bit matrix for conflicts we _know_ can never exist.  The easiest case to 
> catch,
> and the only one we catch at the moment, is that allocnos that are "local" to 
> a
> basic block B1 cannot conflict with allocnos that are local to basic block B2,
> where B1 != B2.  To take advantage of this fact, I updated the code in 
> global.c
> to sort the allocnos such that all "global" allocnos (allocnos that are live 
> in
> more than one basic block) are given smaller allocno numbers than the "local"
> allocnos, and all allocnos for a given basic block are grouped together in a
> contiguous range to allocno numbers.  The sorting is accomplished by:
>
>   /* ...so we can sort them in the order we want them to receive
>      their allocnos.  */
>   qsort (reg_allocno, max_allocno, sizeof (int), regno_compare);
>
> Once we have them sorted, our conceptual view of the compressed bit matrix
> will now look like:
>
>                   G    G    G    B0   B0   B0   B1   B1   B2   B2   B2   B2
>
>                   0    1    2     3    4    5    6    7    8    9   10   11
> ------          -------------------------------------------------------------
> | -1 |    G   0 |    |  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 |
> ------          -------------------------------------------------------------
> |  9 |    G   1 |    |    | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
> ------          -------------------------------------------------------------
> | 18 |    G   2 |    |    |    | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
> ------          -------------------------------------------------------------
> | 26 |    B0  3 |    |    |    |    | 30 | 31 |  * |  * |  * |  * |  * |  * |
> ------          -------------------------------------------------------------
> | 27 |    B0  4 |    |    |    |    |    | 32 |  * |  * |  * |  * |  * |  * |
> ------          -------------------------------------------------------------
> | NA |    B0  5 |    |    |    |    |    |    |  * |  * |  * |  * |  * |  * |
> ------          -------------------------------------------------------------
> | 26 |    B1  6 |    |    |    |    |    |    |    | 33 |  * |  * |  * |  * |
> ------          -------------------------------------------------------------
> | NA |    B1  7 |    |    |    |    |    |    |    |    |  * |  * |  * |  * |
> ------          -------------------------------------------------------------
> | 25 |    B2  8 |    |    |    |    |    |    |    |    |    | 34 | 35 | 36 |
> ------          -------------------------------------------------------------
> | 27 |    B2  9 |    |    |    |    |    |    |    |    |    |    | 37 | 38 |
> ------          -------------------------------------------------------------
> | 28 |    B2 10 |    |    |    |    |    |    |    |    |    |    |    | 39 |
> ------          -------------------------------------------------------------
> | NA |    B2 11 |    |    |    |    |    |    |    |    |    |    |    |    |
> ------          -------------------------------------------------------------
>
> I have augmented the figure showing which allocnos are global (G) and which
> are local (B0, B1 and B2).  The "*" slots in the matrix correspond to 
> conflicts
> that cannot exist between the local allocnos.  By sorting the allocnos, we
> have grouped the unneeded space together to make it easy to eliminate.
> You'll also notice the initialized array now has different values than before.
> This is how we conceptually "skip" over the empty space we have not allocated.
>
> As an example, if we look at the interference between allocnos 8 and 10, we
> compute "array[8] + 10" = "25 + 10" = "35", which if you look above, you will
> see is the correct bit number for that interference bit.  The code that
> computes the precomputed bit matrix values as well as the number of bits in
> the bit matrix is the following code, also from global.c:
>
>   max_bitnum = 0;
>   max_global_bitnum = 0;
>   for (i = 0; i < (size_t) max_allocno; i++)
>     {
>       int regno = allocno[i].reg;
>       int blk = regno_basic_block (regno);
>       int row_size = --num_allocnos_per_blk[blk];
>       reg_allocno[regno] = (int) i;
>       partial_bitnum[i] = (row_size > 0) ? max_bitnum - ((int) i + 1) : -1;
>       max_bitnum += row_size;
>       if (blk == 0)
>         max_global_bitnum += row_size;
>     }
>
> Once that is finished, we're ready to allocate the smaller bit vector used for
> the bit matrix.  The other benefit is that the accesses to this compressed
> bit matrix are exactly like those of a standard upper triangular bit matrix.
> A third benefit is that since all of the conflict bits for allocnos from the
> same basic block are consecutive in the bit vector, we should get slightly
> better cache usage than if they we're sorted that way.
>
> The amount of space savings over a standard upper triangular bit matrix
> depends on the structure of the function we're compiling.  Functions with
> all global allocnos or functions with one basic block (eg, fpppp from SPEC97)
> cause us to degrade into a standard upper triangular bit matrix.  OTOH,
> functions like 177.mesa/src/get.c:gl_GetBooleanv() from SPEC2000 show
> huge wins.  For gl_GetBooleanv(), we allocate 201613 bytes for the square
> bit matrix, 100727 bytes for the standard upper triangular bit matrix and
> 1722 bytes for the compressed upper triangular bit matrix.  That means
> the compressed bit matrix is 117 times smaller than the square bit matrix
> and nearly 60 times smaller than the standard upper triangular bit matrix.
>
>
> I am thinking about how we can compress the matrix even further.  If I had
> region info, I should be able to easily sort the global allonos into
> groups, such that global allocnos in different groups do not conflict.
> Kenny had some region building code, but he said it was costly to compute.
> I may play around with it, if for no other reason than just to see what
> type of space savings it would buy.  I'm also trying to come up with some
> ideas on how we can sort allocnos that are local to the same basic block,
> such that we can get space savings for them as well.  If anyone has ideas,
> let me know! :)
>
>
>
> > I have also another question.  I saw that sparset was used for the
> > conflict builder.  I tried that too when I worked on YARA project.  I
> > even wanted to contribute a generic sparset implementation.  But I found
> > that in general case bitmaps are not worse the sparse sets and much
> > better if we take  a needed space into account.  May be you have another
> > impression?  It would be very interesting for me to hear it.  I found
> > that bitmaps have more advanced design than sparsets.  I always wanted
> > to find inventors the bitmaps but never tracked them down.
>
> The sparseset representation is only used for the allocnos_live set.
> The old version was a bit vector to match up with the square bit matrix.
> Since I changed that to save space, I had to reimplement allocnos_live.
> Danny suggested I use a bitmap for that set and I tried it, but I found
> for the particular usage of allocnos_live, a sparseset was noticeably
> faster than bitmaps.  I'll note that the main operations on allocnos_live
> are to add allocnos to the set, remove allonos to the set and iterate over
> the members of the set and occasionally clear the entire set.  These are
> all O(1) operations for the sparseset with fairly low constant factors
> too.  I didn't look too closely, but I'm guessing that the main problem
> with bitmaps for this type of usage was the slower iterating over all
> of the members of the set versus the sparseset.
>
> Obviously, bitmaps are much better than sparsesets wrt space usage, so
> you have to use them sparingly.  You wouldn't want an array of these
> things! :)  But there are use cases where they work very very well.
> The currently "live" set is one such use.  Another use I have found
> where they work well is in the needLoad set used by Briggs' allocator.
>
> Whether you want/should use a sparseset really depends on the number
> and type of set operations your particular usage will see.  I'm sure
> there are many usage cases where bitmaps are superior to sparsesets,
> just like there are usage cases where sparsesets are superior.  I know
> that sounds like a cop-out, but it really does depend on how you're
> going to use it.
>
>
> Peter


Don't be stupid!

I don't need this dirty thing of "compressed upper triangulars".

Instead, i should use "chained multi upper triangulars" and "rectangulars",
 so i will have more control in its traversing and will require lesser
 computation because i don't check the known "don't cares" * in the big
 "compressed upper triangular".

Here below is the semantic interpretation of the 12 allocnos {x}:

G  {0} =  G_0     B0 {3} = B0_0     B1 {6} = B1_0     B2 {8} = B2_0
G  {1} =  G_1     B0 {4} = B0_1     B1 {7} = B1_1     B2 {9} = B2_1
G  {2} =  G_2     B0 {5} = B0_2                       B2{10} = B2_2
                                                      B2{11} = B2_3

There are 4 "chained upper triangulars" cut{G}, cut{B0}, cut{B1} and cut{B2}
and 1 "rectangular" between Global and Local BBs's allocnos r{G,{B0,B1,B2}}:

cut{G}:          G_0  G_1  G_2

                   0    1    2
------          ----------------
| -1 |   G_0  0 |    |  0 |  1 |
------          ----------------
|  0 |   G_1  1 |    |    |  2 |
------          ----------------
| NA |   G_2  2 |    |    |    |
------          ----------------

cut{B0}:         B0_0 B0_1 B0_2

                   0    1    2
------          ----------------
| -1 |  B0_0  0 |    |  0 |  1 |
------          ----------------
|  0 |  B0_1  1 |    |    |  2 |
------          ----------------
| NA |  B0_2  2 |    |    |    |
------          ----------------

cut{B1}:         B1_0 B1_1

                   0    1
------          -----------
| -1 |  B1_0  0 |    |  0 |
------          -----------
| NA |  B1_1  1 |    |    |
------          -----------

cut{B2}:         B2_0 B2_1 B2_2 B2_3

                   0    1    2    3
------          ---------------------
| -1 |  B2_0  0 |    |  0 |  1 |  2 |
------          ---------------------
|  1 |  B2_1  1 |    |    |  3 |  4 |
------          ---------------------
|  2 |  B2_2  2 |    |    |    |  5 |
------          ---------------------
| NA |  B2_3  1 |    |    |    |    |
------          ---------------------

r{G,{B0,B1,B2}}: B0_0 B0_1 B0_2  B1_0 B1_1 B2_0 B2_1 B2_2 B2_3

                   0    1    2    3    4    5    6    7    8
                ----------------------------------------------
         G_0  0 |  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |
                ----------------------------------------------
         G_1  1 |  9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
                ----------------------------------------------
         G_2  2 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 |
                ----------------------------------------------

This unique rectangular can be parted to many local rectangulars,
 where each BB can have its own local rectangular r{Bx,G}.
 It's useful when there are increases or decreases in the number of BBs.

   J.C.Pizarro

Re: IRA for GCC 4.4

Reply via email to