Hi Andi and Richi,

sorry for the late reply. While looking for a testcase where the DP algorithm performs better than Andi's greedy one, I found out some new things about bit test switch lowering and I had to think them through. I write about what I found below.
On Thu 2024-11-21 10:01:38, Andi Kleen wrote:
> On Fri, Nov 15, 2024 at 10:43:57AM +0100, Filip Kastl wrote:
> > Hi,
> >
> > Andi's greedy bit test finding algorithm was reverted. I found a fix for the
> > problem that caused the revert. I made this patch to reintroduce the greedy
> > alg into GCC. However I think we should keep the old slow but more powerful
> > algorithm so I added a limit on the number of cases of the switch statement
> > that decides which of the two algorithms to use.
>
> Do we actually have a case where the DP algorithm is better?
>
> In the bootstrap comparison greedy does produce less or the same clusters

It seems reasonable to me to want a bit test cluster finding algorithm that produces the minimal number of clusters (I count both the bit test clusters and the simple clusters which correspond to single cases of the original switch). I actually found out that, with respect to this metric, neither the DP algorithm nor the greedy algorithm is optimal.

I came up with this testcase

int f0();
int f1();
int f2();
int f3();
int f4();

int main(int a)
{
  switch (a)
    {
    case 0:
    case 2:
    case 4:
    case 6:
      return f0();
    case 8:
      return f1();
    case 10:
    case 14:
    case 16:
    case 18:
      return f2();
    case 12:
      return f3();
    case 20:
      return f4();
    }
  return -1;
}

where the DP algorithm manages to cover all cases with two bit test clusters but the greedy algorithm only finds one bit test cluster (and 4 simple clusters, for a total of 5 clusters). But if you reverse the cases (meaning you map case i to case 20 - i), you get this testcase

int f0();
int f1();
int f2();
int f3();
int f4();

int main(int a)
{
  switch (a)
    {
    case 20:
    case 18:
    case 16:
    case 14:
      return f0();
    case 12:
      return f1();
    case 10:
    case 6:
    case 4:
    case 2:
      return f2();
    case 8:
      return f3();
    case 0:
      return f4();
    }
  return -1;
}

where the situation is reversed: greedy manages to cover all cases with two bit test clusters but DP only manages to find one bit test cluster. This surprised me.
I thought that the DP algorithm was optimal and that the greedy algorithm being better on some cases could be explained by differences in how the two algorithms use the 'is_beneficial' heuristic. But that's not the case. I looked more thoroughly into the code of the DP algorithm. I'm now pretty convinced that the DP algorithm could give an optimal solution but simply isn't configured to do so. It first looks for the minimal number of clusters, and only after it finds the clusters does it go through them and check that each is 'is_beneficial'. It breaks the non-'is_beneficial' ones down into simple clusters. Instead, the algorithm could look for the minimal number of clusters such that each of them is 'is_beneficial' in the first place. So we could have an optimal bit test clustering algorithm; we just don't have one at the moment.

Since we are currently in stage 3, I don't expect we want to go and make the DP algorithm optimal now. We can do that in the next stage 1. However, I think we still want to fix PR117091. I think we need to keep the limit I introduced at least for the jump table clustering, because that is still at least O(n^2). For the bit test clustering we can either use both the DP algorithm and the greedy algorithm, as I do in my patch, or I guess we could just use the greedy algorithm. Since neither of the two is optimal, it isn't really clear which one is better. On one hand, the greedy algorithm is faster and Andi says he saw it performing better. On the other hand, I would like to make the DP algorithm optimal and use it in GCC 16, and it seems simpler to me to keep the DP algorithm than to remove it and add it back later. What do you think, Andi and Richi? I myself slightly prefer keeping the DP, but I would be fine with either option.

> > +	  k = 0;
> > +	  while (i + k < l)
> > +	    {
> > +	      cluster *end_cluster = clusters[i + k];
> > +
> > +	      /* Does value range fit into the BITS_PER_WORD window?  */
> > +	      HOST_WIDE_INT w = cluster::get_range (start_cluster->get_low (),
> > +						    end_cluster->get_high ());
> > +	      if (w == 0 || w > BITS_PER_WORD)
> > +		break;
> > +
> > +	      /* Check for max # of targets.  */
> > +	      if (targets.elements () == m_max_case_bit_tests
> > +		  && !targets.contains (end_cluster->m_case_bb))
> > +		break;
> > +
> > +	      targets.add (end_cluster->m_case_bb);
> > +	      k++;
> > +	    }
> >
> > +	  if (is_beneficial (k, targets.elements ()))
>
> Equivalent to the old test in DP would be k + 1
> I'm not sure it makes a lot of difference though.
>

I've double checked and I'm pretty sure that k is used here the same way it is used in the DP. In both cases, is_beneficial (unsigned, unsigned) ends up being called with the number of cases to be covered by a bit test cluster as its first argument. In the DP code this is harder to see because it first calls is_beneficial (vec<cluster> &, unsigned, unsigned), which then calls is_beneficial (unsigned, unsigned), and there is also some +/-1 weirdness with the end of the sequence of cases.

> -Andi

Cheers,
Filip Kastl

P.S.: Inspired by Andi's algorithm, I think I found a way to speed up the DP from O(n^2) to O(n * BITS_PER_WORD), so effectively O(n)! So in GCC 16 we could have an optimal *and* fast bit test clustering algorithm. Btw, sorry for being sceptical about the existence of an optimal O(n) algorithm in the PR117091 bugzilla report. Though for jump table clustering we still only have O(n^2) (or O(n^3)? I'm not sure) and I don't think we can use the "clusters are at most BITS_PER_WORD long" trick there.