On Fri 25 Apr 2008 22:22:55 -0500, Peter Bergner <[EMAIL PROTECTED]> wrote:
> On Thu, 2008-04-24 at 20:23 -0400, Vladimir Makarov wrote:
> > Hi, Peter. The last time I looked at the conflict builder
> > (ra-conflict.c), I did not see the compressed matrix. Is it in the
> > trunk? What should I look at?
>
> Yes, the compressed bit matrix was committed as revision 129037 on
> October 5th, so it's been there a while. Note that the old square
> bit matrix was used not only for testing for conflicts, but also for
> visiting an allocno's neighbors. The new code (and all compilers I've
> worked on/with), use a {,compressed} upper triangular bit matrix for
> testing for conflicts and an adjacency list for visiting neighbors.
>
> The code that allocates and initializes the compressed bit matrix is in
> global.c. If you remember how a upper triangular bit matrix works, it's
> just one big bit vector, where the bit number that represents the conflict
> between allocnos LOW and HIGH is given by either of these two functions:
>
> 1) bitnum = f(HIGH) + LOW
> 2) bitnum = f(LOW) + HIGH
>
> where:
>
> 1) f(HIGH) = (HIGH * (HIGH - 1)) / 2
> 2) f(LOW) = LOW * (max_allocno - LOW) + (LOW * (LOW - 1)) / 2 - LOW - 1
>
> As mentioned in some of the conflict graph bit matrix literature (actually,
> they only mention expression #1 above), the expensive functions f(HIGH) and
> f(LOW) can be precomputed and stored in an array, so to access the conflict
> graph bits only takes a load and an addition. Below is an example bit matrix
> with initialized array:
>
> 0 1 2 3 4 5 6 7 8 9 10 11
> ------ -------------------------------------------------------------
> | -1 | 0 | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
> ------ -------------------------------------------------------------
> | 9 | 1 | | | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
> ------ -------------------------------------------------------------
> | 18 | 2 | | | | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
> ------ -------------------------------------------------------------
> | 26 | 3 | | | | | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 |
> ------ -------------------------------------------------------------
> | 33 | 4 | | | | | | 38 | 39 | 40 | 41 | 42 | 43 | 44 |
> ------ -------------------------------------------------------------
> | 39 | 5 | | | | | | | 45 | 46 | 47 | 48 | 49 | 50 |
> ------ -------------------------------------------------------------
> | 44 | 6 | | | | | | | | 51 | 52 | 53 | 54 | 55 |
> ------ -------------------------------------------------------------
> | 48 | 7 | | | | | | | | | 56 | 57 | 58 | 59 |
> ------ -------------------------------------------------------------
> | 51 | 8 | | | | | | | | | | 60 | 61 | 62 |
> ------ -------------------------------------------------------------
> | 53 | 9 | | | | | | | | | | | 63 | 64 |
> ------ -------------------------------------------------------------
> | 54 | 10 | | | | | | | | | | | | 65 |
> ------ -------------------------------------------------------------
> | NA | 11 | | | | | | | | | | | | |
> ------ -------------------------------------------------------------
>
> As an example, if we look at the interference between allocnos 8 and 10, we
> compute "array[8] + 10" = "51 + 10" = "61", which if you look above, you will
> see is the correct bit number for that interference bit.
>
> The difference between a compressed upper triangular bit matrix from a
> standard
> upper triangular bit matrix like the one above, is we eliminate space from the
> bit matrix for conflicts we _know_ can never exist. The easiest case to
> catch,
> and the only one we catch at the moment, is that allocnos that are "local" to
> a
> basic block B1 cannot conflict with allocnos that are local to basic block B2,
> where B1 != B2. To take advantage of this fact, I updated the code in
> global.c
> to sort the allocnos such that all "global" allocnos (allocnos that are live
> in
> more than one basic block) are given smaller allocno numbers than the "local"
> allocnos, and all allocnos for a given basic block are grouped together in a
> contiguous range to allocno numbers. The sorting is accomplished by:
>
> /* ...so we can sort them in the order we want them to receive
> their allocnos. */
> qsort (reg_allocno, max_allocno, sizeof (int), regno_compare);
>
> Once we have them sorted, our conceptual view of the compressed bit matrix
> will now look like:
>
> G G G B0 B0 B0 B1 B1 B2 B2 B2 B2
>
> 0 1 2 3 4 5 6 7 8 9 10 11
> ------ -------------------------------------------------------------
> | -1 | G 0 | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
> ------ -------------------------------------------------------------
> | 9 | G 1 | | | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
> ------ -------------------------------------------------------------
> | 18 | G 2 | | | | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
> ------ -------------------------------------------------------------
> | 26 | B0 3 | | | | | 30 | 31 | * | * | * | * | * | * |
> ------ -------------------------------------------------------------
> | 27 | B0 4 | | | | | | 32 | * | * | * | * | * | * |
> ------ -------------------------------------------------------------
> | NA | B0 5 | | | | | | | * | * | * | * | * | * |
> ------ -------------------------------------------------------------
> | 26 | B1 6 | | | | | | | | 33 | * | * | * | * |
> ------ -------------------------------------------------------------
> | NA | B1 7 | | | | | | | | | * | * | * | * |
> ------ -------------------------------------------------------------
> | 25 | B2 8 | | | | | | | | | | 34 | 35 | 36 |
> ------ -------------------------------------------------------------
> | 27 | B2 9 | | | | | | | | | | | 37 | 38 |
> ------ -------------------------------------------------------------
> | 28 | B2 10 | | | | | | | | | | | | 39 |
> ------ -------------------------------------------------------------
> | NA | B2 11 | | | | | | | | | | | | |
> ------ -------------------------------------------------------------
>
> I have augmented the figure showing which allocnos are global (G) and which
> are local (B0, B1 and B2). The "*" slots in the matrix correspond to
> conflicts
> that cannot exist between the local allocnos. By sorting the allocnos, we
> have grouped the unneeded space together to make it easy to eliminate.
> You'll also notice the initialized array now has different values than before.
> This is how we conceptually "skip" over the empty space we have not allocated.
>
> As an example, if we look at the interference between allocnos 8 and 10, we
> compute "array[8] + 10" = "25 + 10" = "35", which if you look above, you will
> see is the correct bit number for that interference bit. The code that
> computes the precomputed bit matrix values as well as the number of bits in
> the bit matrix is the following code, also from global.c:
>
> max_bitnum = 0;
> max_global_bitnum = 0;
> for (i = 0; i < (size_t) max_allocno; i++)
> {
> int regno = allocno[i].reg;
> int blk = regno_basic_block (regno);
> int row_size = --num_allocnos_per_blk[blk];
> reg_allocno[regno] = (int) i;
> partial_bitnum[i] = (row_size > 0) ? max_bitnum - ((int) i + 1) : -1;
> max_bitnum += row_size;
> if (blk == 0)
> max_global_bitnum += row_size;
> }
>
> Once that is finished, we're ready to allocate the smaller bit vector used for
> the bit matrix. The other benefit is that the accesses to this compressed
> bit matrix are exactly like those of a standard upper triangular bit matrix.
> A third benefit is that since all of the conflict bits for allocnos from the
> same basic block are consecutive in the bit vector, we should get slightly
> better cache usage than if they we're sorted that way.
>
> The amount of space savings over a standard upper triangular bit matrix
> depends on the structure of the function we're compiling. Functions with
> all global allocnos or functions with one basic block (eg, fpppp from SPEC97)
> cause us to degrade into a standard upper triangular bit matrix. OTOH,
> functions like 177.mesa/src/get.c:gl_GetBooleanv() from SPEC2000 show
> huge wins. For gl_GetBooleanv(), we allocate 201613 bytes for the square
> bit matrix, 100727 bytes for the standard upper triangular bit matrix and
> 1722 bytes for the compressed upper triangular bit matrix. That means
> the compressed bit matrix is 117 times smaller than the square bit matrix
> and nearly 60 times smaller than the standard upper triangular bit matrix.
>
>
> I am thinking about how we can compress the matrix even further. If I had
> region info, I should be able to easily sort the global allonos into
> groups, such that global allocnos in different groups do not conflict.
> Kenny had some region building code, but he said it was costly to compute.
> I may play around with it, if for no other reason than just to see what
> type of space savings it would buy. I'm also trying to come up with some
> ideas on how we can sort allocnos that are local to the same basic block,
> such that we can get space savings for them as well. If anyone has ideas,
> let me know! :)
>
>
>
> > I have also another question. I saw that sparset was used for the
> > conflict builder. I tried that too when I worked on YARA project. I
> > even wanted to contribute a generic sparset implementation. But I found
> > that in general case bitmaps are not worse the sparse sets and much
> > better if we take a needed space into account. May be you have another
> > impression? It would be very interesting for me to hear it. I found
> > that bitmaps have more advanced design than sparsets. I always wanted
> > to find inventors the bitmaps but never tracked them down.
>
> The sparseset representation is only used for the allocnos_live set.
> The old version was a bit vector to match up with the square bit matrix.
> Since I changed that to save space, I had to reimplement allocnos_live.
> Danny suggested I use a bitmap for that set and I tried it, but I found
> for the particular usage of allocnos_live, a sparseset was noticeably
> faster than bitmaps. I'll note that the main operations on allocnos_live
> are to add allocnos to the set, remove allonos to the set and iterate over
> the members of the set and occasionally clear the entire set. These are
> all O(1) operations for the sparseset with fairly low constant factors
> too. I didn't look too closely, but I'm guessing that the main problem
> with bitmaps for this type of usage was the slower iterating over all
> of the members of the set versus the sparseset.
>
> Obviously, bitmaps are much better than sparsesets wrt space usage, so
> you have to use them sparingly. You wouldn't want an array of these
> things! :) But there are use cases where they work very very well.
> The currently "live" set is one such use. Another use I have found
> where they work well is in the needLoad set used by Briggs' allocator.
>
> Whether you want/should use a sparseset really depends on the number
> and type of set operations your particular usage will see. I'm sure
> there are many usage cases where bitmaps are superior to sparsesets,
> just like there are usage cases where sparsesets are superior. I know
> that sounds like a cop-out, but it really does depend on how you're
> going to use it.
>
>
> Peter
Don't be stupid!
I don't need this dirty thing of "compressed upper triangulars".
Instead, i should use "chained multi upper triangulars" and "rectangulars",
so i will have more control in its traversing and will require lesser
computation because i don't check the known "don't cares" * in the big
"compressed upper triangular".
Here below is the semantic interpretation of the 12 allocnos {x}:
G {0} = G_0 B0 {3} = B0_0 B1 {6} = B1_0 B2 {8} = B2_0
G {1} = G_1 B0 {4} = B0_1 B1 {7} = B1_1 B2 {9} = B2_1
G {2} = G_2 B0 {5} = B0_2 B2{10} = B2_2
B2{11} = B2_3
There are 4 "chained upper triangulars" cut{G}, cut{B0}, cut{B1} and cut{B2}
and 1 "rectangular" between Global and Local BBs's allocnos r{G,{B0,B1,B2}}:
cut{G}: G_0 G_1 G_2
0 1 2
------ ----------------
| -1 | G_0 0 | | 0 | 1 |
------ ----------------
| 0 | G_1 1 | | | 2 |
------ ----------------
| NA | G_2 2 | | | |
------ ----------------
cut{B0}: B0_0 B0_1 B0_2
0 1 2
------ ----------------
| -1 | B0_0 0 | | 0 | 1 |
------ ----------------
| 0 | B0_1 1 | | | 2 |
------ ----------------
| NA | B0_2 2 | | | |
------ ----------------
cut{B1}: B1_0 B1_1
0 1
------ -----------
| -1 | B1_0 0 | | 0 |
------ -----------
| NA | B1_1 1 | | |
------ -----------
cut{B2}: B2_0 B2_1 B2_2 B2_3
0 1 2 3
------ ---------------------
| -1 | B2_0 0 | | 0 | 1 | 2 |
------ ---------------------
| 1 | B2_1 1 | | | 3 | 4 |
------ ---------------------
| 2 | B2_2 2 | | | | 5 |
------ ---------------------
| NA | B2_3 1 | | | | |
------ ---------------------
r{G,{B0,B1,B2}}: B0_0 B0_1 B0_2 B1_0 B1_1 B2_0 B2_1 B2_2 B2_3
0 1 2 3 4 5 6 7 8
----------------------------------------------
G_0 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
----------------------------------------------
G_1 1 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
----------------------------------------------
G_2 2 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 |
----------------------------------------------
This unique rectangular can be parted to many local rectangulars,
where each BB can have its own local rectangular r{Bx,G}.
It's useful when there are increases or decreases in the number of BBs.
J.C.Pizarro