On Fri 25 Apr 2008 22:22:55 -0500, Peter Bergner <[EMAIL PROTECTED]> wrote: > On Thu, 2008-04-24 at 20:23 -0400, Vladimir Makarov wrote: > > Hi, Peter. The last time I looked at the conflict builder > > (ra-conflict.c), I did not see the compressed matrix. Is it in the > > trunk? What should I look at? > > Yes, the compressed bit matrix was committed as revision 129037 on > October 5th, so it's been there a while. Note that the old square > bit matrix was used not only for testing for conflicts, but also for > visiting an allocno's neighbors. The new code (and all compilers I've > worked on/with), use a {,compressed} upper triangular bit matrix for > testing for conflicts and an adjacency list for visiting neighbors. > > The code that allocates and initializes the compressed bit matrix is in > global.c. If you remember how a upper triangular bit matrix works, it's > just one big bit vector, where the bit number that represents the conflict > between allocnos LOW and HIGH is given by either of these two functions: > > 1) bitnum = f(HIGH) + LOW > 2) bitnum = f(LOW) + HIGH > > where: > > 1) f(HIGH) = (HIGH * (HIGH - 1)) / 2 > 2) f(LOW) = LOW * (max_allocno - LOW) + (LOW * (LOW - 1)) / 2 - LOW - 1 > > As mentioned in some of the conflict graph bit matrix literature (actually, > they only mention expression #1 above), the expensive functions f(HIGH) and > f(LOW) can be precomputed and stored in an array, so to access the conflict > graph bits only takes a load and an addition. Below is an example bit matrix > with initialized array: > > 0 1 2 3 4 5 6 7 8 9 10 11 > ------ ------------------------------------------------------------- > | -1 | 0 | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | > ------ ------------------------------------------------------------- > | 9 | 1 | | | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | > ------ ------------------------------------------------------------- > | 18 | 2 | | | | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | > ------ ------------------------------------------------------------- > | 26 | 3 | | | | | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | > ------ ------------------------------------------------------------- > | 33 | 4 | | | | | | 38 | 39 | 40 | 41 | 42 | 43 | 44 | > ------ ------------------------------------------------------------- > | 39 | 5 | | | | | | | 45 | 46 | 47 | 48 | 49 | 50 | > ------ ------------------------------------------------------------- > | 44 | 6 | | | | | | | | 51 | 52 | 53 | 54 | 55 | > ------ ------------------------------------------------------------- > | 48 | 7 | | | | | | | | | 56 | 57 | 58 | 59 | > ------ ------------------------------------------------------------- > | 51 | 8 | | | | | | | | | | 60 | 61 | 62 | > ------ ------------------------------------------------------------- > | 53 | 9 | | | | | | | | | | | 63 | 64 | > ------ ------------------------------------------------------------- > | 54 | 10 | | | | | | | | | | | | 65 | > ------ ------------------------------------------------------------- > | NA | 11 | | | | | | | | | | | | | > ------ ------------------------------------------------------------- > > As an example, if we look at the interference between allocnos 8 and 10, we > compute "array[8] + 10" = "51 + 10" = "61", which if you look above, you will > see is the correct bit number for that interference bit. > > The difference between a compressed upper triangular bit matrix from a > standard > upper triangular bit matrix like the one above, is we eliminate space from the > bit matrix for conflicts we _know_ can never exist. The easiest case to > catch, > and the only one we catch at the moment, is that allocnos that are "local" to > a > basic block B1 cannot conflict with allocnos that are local to basic block B2, > where B1 != B2. To take advantage of this fact, I updated the code in > global.c > to sort the allocnos such that all "global" allocnos (allocnos that are live > in > more than one basic block) are given smaller allocno numbers than the "local" > allocnos, and all allocnos for a given basic block are grouped together in a > contiguous range to allocno numbers. The sorting is accomplished by: > > /* ...so we can sort them in the order we want them to receive > their allocnos. */ > qsort (reg_allocno, max_allocno, sizeof (int), regno_compare); > > Once we have them sorted, our conceptual view of the compressed bit matrix > will now look like: > > G G G B0 B0 B0 B1 B1 B2 B2 B2 B2 > > 0 1 2 3 4 5 6 7 8 9 10 11 > ------ ------------------------------------------------------------- > | -1 | G 0 | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | > ------ ------------------------------------------------------------- > | 9 | G 1 | | | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | > ------ ------------------------------------------------------------- > | 18 | G 2 | | | | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | > ------ ------------------------------------------------------------- > | 26 | B0 3 | | | | | 30 | 31 | * | * | * | * | * | * | > ------ ------------------------------------------------------------- > | 27 | B0 4 | | | | | | 32 | * | * | * | * | * | * | > ------ ------------------------------------------------------------- > | NA | B0 5 | | | | | | | * | * | * | * | * | * | > ------ ------------------------------------------------------------- > | 26 | B1 6 | | | | | | | | 33 | * | * | * | * | > ------ ------------------------------------------------------------- > | NA | B1 7 | | | | | | | | | * | * | * | * | > ------ ------------------------------------------------------------- > | 25 | B2 8 | | | | | | | | | | 34 | 35 | 36 | > ------ ------------------------------------------------------------- > | 27 | B2 9 | | | | | | | | | | | 37 | 38 | > ------ ------------------------------------------------------------- > | 28 | B2 10 | | | | | | | | | | | | 39 | > ------ ------------------------------------------------------------- > | NA | B2 11 | | | | | | | | | | | | | > ------ ------------------------------------------------------------- > > I have augmented the figure showing which allocnos are global (G) and which > are local (B0, B1 and B2). The "*" slots in the matrix correspond to > conflicts > that cannot exist between the local allocnos. By sorting the allocnos, we > have grouped the unneeded space together to make it easy to eliminate. > You'll also notice the initialized array now has different values than before. > This is how we conceptually "skip" over the empty space we have not allocated. > > As an example, if we look at the interference between allocnos 8 and 10, we > compute "array[8] + 10" = "25 + 10" = "35", which if you look above, you will > see is the correct bit number for that interference bit. The code that > computes the precomputed bit matrix values as well as the number of bits in > the bit matrix is the following code, also from global.c: > > max_bitnum = 0; > max_global_bitnum = 0; > for (i = 0; i < (size_t) max_allocno; i++) > { > int regno = allocno[i].reg; > int blk = regno_basic_block (regno); > int row_size = --num_allocnos_per_blk[blk]; > reg_allocno[regno] = (int) i; > partial_bitnum[i] = (row_size > 0) ? max_bitnum - ((int) i + 1) : -1; > max_bitnum += row_size; > if (blk == 0) > max_global_bitnum += row_size; > } > > Once that is finished, we're ready to allocate the smaller bit vector used for > the bit matrix. The other benefit is that the accesses to this compressed > bit matrix are exactly like those of a standard upper triangular bit matrix. > A third benefit is that since all of the conflict bits for allocnos from the > same basic block are consecutive in the bit vector, we should get slightly > better cache usage than if they we're sorted that way. > > The amount of space savings over a standard upper triangular bit matrix > depends on the structure of the function we're compiling. Functions with > all global allocnos or functions with one basic block (eg, fpppp from SPEC97) > cause us to degrade into a standard upper triangular bit matrix. OTOH, > functions like 177.mesa/src/get.c:gl_GetBooleanv() from SPEC2000 show > huge wins. For gl_GetBooleanv(), we allocate 201613 bytes for the square > bit matrix, 100727 bytes for the standard upper triangular bit matrix and > 1722 bytes for the compressed upper triangular bit matrix. That means > the compressed bit matrix is 117 times smaller than the square bit matrix > and nearly 60 times smaller than the standard upper triangular bit matrix. > > > I am thinking about how we can compress the matrix even further. If I had > region info, I should be able to easily sort the global allonos into > groups, such that global allocnos in different groups do not conflict. > Kenny had some region building code, but he said it was costly to compute. > I may play around with it, if for no other reason than just to see what > type of space savings it would buy. I'm also trying to come up with some > ideas on how we can sort allocnos that are local to the same basic block, > such that we can get space savings for them as well. If anyone has ideas, > let me know! :) > > > > > I have also another question. I saw that sparset was used for the > > conflict builder. I tried that too when I worked on YARA project. I > > even wanted to contribute a generic sparset implementation. But I found > > that in general case bitmaps are not worse the sparse sets and much > > better if we take a needed space into account. May be you have another > > impression? It would be very interesting for me to hear it. I found > > that bitmaps have more advanced design than sparsets. I always wanted > > to find inventors the bitmaps but never tracked them down. > > The sparseset representation is only used for the allocnos_live set. > The old version was a bit vector to match up with the square bit matrix. > Since I changed that to save space, I had to reimplement allocnos_live. > Danny suggested I use a bitmap for that set and I tried it, but I found > for the particular usage of allocnos_live, a sparseset was noticeably > faster than bitmaps. I'll note that the main operations on allocnos_live > are to add allocnos to the set, remove allonos to the set and iterate over > the members of the set and occasionally clear the entire set. These are > all O(1) operations for the sparseset with fairly low constant factors > too. I didn't look too closely, but I'm guessing that the main problem > with bitmaps for this type of usage was the slower iterating over all > of the members of the set versus the sparseset. > > Obviously, bitmaps are much better than sparsesets wrt space usage, so > you have to use them sparingly. You wouldn't want an array of these > things! :) But there are use cases where they work very very well. > The currently "live" set is one such use. Another use I have found > where they work well is in the needLoad set used by Briggs' allocator. > > Whether you want/should use a sparseset really depends on the number > and type of set operations your particular usage will see. I'm sure > there are many usage cases where bitmaps are superior to sparsesets, > just like there are usage cases where sparsesets are superior. I know > that sounds like a cop-out, but it really does depend on how you're > going to use it. > > > Peter
Don't be stupid! I don't need this dirty thing of "compressed upper triangulars". Instead, i should use "chained multi upper triangulars" and "rectangulars", so i will have more control in its traversing and will require lesser computation because i don't check the known "don't cares" * in the big "compressed upper triangular". Here below is the semantic interpretation of the 12 allocnos {x}: G {0} = G_0 B0 {3} = B0_0 B1 {6} = B1_0 B2 {8} = B2_0 G {1} = G_1 B0 {4} = B0_1 B1 {7} = B1_1 B2 {9} = B2_1 G {2} = G_2 B0 {5} = B0_2 B2{10} = B2_2 B2{11} = B2_3 There are 4 "chained upper triangulars" cut{G}, cut{B0}, cut{B1} and cut{B2} and 1 "rectangular" between Global and Local BBs's allocnos r{G,{B0,B1,B2}}: cut{G}: G_0 G_1 G_2 0 1 2 ------ ---------------- | -1 | G_0 0 | | 0 | 1 | ------ ---------------- | 0 | G_1 1 | | | 2 | ------ ---------------- | NA | G_2 2 | | | | ------ ---------------- cut{B0}: B0_0 B0_1 B0_2 0 1 2 ------ ---------------- | -1 | B0_0 0 | | 0 | 1 | ------ ---------------- | 0 | B0_1 1 | | | 2 | ------ ---------------- | NA | B0_2 2 | | | | ------ ---------------- cut{B1}: B1_0 B1_1 0 1 ------ ----------- | -1 | B1_0 0 | | 0 | ------ ----------- | NA | B1_1 1 | | | ------ ----------- cut{B2}: B2_0 B2_1 B2_2 B2_3 0 1 2 3 ------ --------------------- | -1 | B2_0 0 | | 0 | 1 | 2 | ------ --------------------- | 1 | B2_1 1 | | | 3 | 4 | ------ --------------------- | 2 | B2_2 2 | | | | 5 | ------ --------------------- | NA | B2_3 1 | | | | | ------ --------------------- r{G,{B0,B1,B2}}: B0_0 B0_1 B0_2 B1_0 B1_1 B2_0 B2_1 B2_2 B2_3 0 1 2 3 4 5 6 7 8 ---------------------------------------------- G_0 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ---------------------------------------------- G_1 1 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | ---------------------------------------------- G_2 2 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | ---------------------------------------------- This unique rectangular can be parted to many local rectangulars, where each BB can have its own local rectangular r{Bx,G}. It's useful when there are increases or decreases in the number of BBs. J.C.Pizarro