On Thu, 5 Dec 2024, Filip Kastl wrote:
> Hi Andi and Richi,
>
> sorry for the late reply. While looking for a testcase where the DP algorithm
> performs better than Andi's greedy one I found out some new things about bit
> test switch lowering and I had to think them through. I write about what I
> found bellow.
>
> On Thu 2024-11-21 10:01:38, Andi Kleen wrote:
> > On Fri, Nov 15, 2024 at 10:43:57AM +0100, Filip Kastl wrote:
> > > Hi,
> > >
> > > Andi's greedy bit test finding algorithm was reverted. I found a fix for
> > > the
> > > problem that caused the revert. I made this patch to reintroduce the
> > > greedy
> > > alg into GCC. However I think we should keep the old slow but more
> > > powerful
> > > algorithm so I added a limit on the number of cases of the switch
> > > statement
> > > that decides which of the two algorithms to use.
> >
> > Do we actually have a case where the DP algorithm is better?
> >
> > In the bootstrap comparison greedy does produce less or the same clusters
> >
>
> It seems reasonable to me to want a bit test cluster finding algorithm that
> produces the minimal number of clusters (I count both the bit test clusters
> and
> simple clusters which correspond to single cases of the original switch). I
> actually found out that, with respect to this metric, neither the DP algorithm
> nor the greedy algorithm are optimal.
>
> I came up with this testcase
>
> int f0();
> int f1();
> int f2();
> int f3();
> int f4();
>
> int main(int a)
> {
> switch (a)
> {
> case 0:
> case 2:
> case 4:
> case 6:
> return f0();
> case 8:
> return f1();
> case 10:
> case 14:
> case 16:
> case 18:
> return f2();
> case 12:
> return f3();
> case 20:
> return f4();
> }
> return -1;
> }
>
> where the DP algorithm manages to cover all cases with two bit test clusters
> but the greedy algorithm only finds one bit test cluster (and 4 simple
> clusters
> for a total of 5 clusters).
>
> But if you reverse the cases (meaning you map case i to case 20 - i), you get
> this testcase
>
> int f0();
> int f1();
> int f2();
> int f3();
> int f4();
>
> int main(int a)
> {
> switch (a)
> {
> case 20:
> case 18:
> case 16:
> case 14:
> return f0();
> case 12:
> return f1();
> case 10:
> case 6:
> case 4:
> case 2:
> return f2();
> case 8:
> return f3();
> case 0:
> return f4();
> }
> return -1;
> }
>
> where the situation is reversed: Greedy manages to cover all cases with two
> bit test clusters but DP only manages to find one bit test cluster.
>
> This surprised me. I thought that the DP algorithm is optimal and that the
> greedy algorithm being better on some cases could be explained by the
> differences in how the two algorithms used the 'is_beneficial' heuristic. But
> that's not the case.
>
> I looked more thoroughly into the code of the DP algorithm. I'm now pretty
> convinced the DP algorithm could give optimal solution but simply isn't
> configured to do so. It first looks for minimal number of clusters and only
> after it finds the clusters it goes through them and checks that they are
> 'is_beneficial'. It breaks down the non-'is_beneficial' ones into simple
> clusters. Instead of this the algorithm could just look for the minimal
> number
> of clusters such that each of them is 'is_beneficial' in the first place. So
> we could have an optimal bit test clustering algorithm. We just don't at the
> moment.
>
> Since we are currently in stage 3, I don't expect we want to go make the DP
> algorithm optimal. We can do that in the next stage 1. However, I think we
> still want to fix PR117091. I think we need to keep the limit I introduced at
> least for the jump table clustering because that is still at least O(n^2).
> For
> the bit test clustering we can either use both the DP algorithm and the greedy
> algorithm as I do in my patch or I guess we could just use the greedy
> algorithm. Since none of the two is optimal, it isn't really clear which one
> is better. On one hand the greedy algorithm is faster and Andi claims that he
> saw it performing better. On the other hand, I would like to make the DP
> algorithm optimal and use it in GCC 16 and it seems to me simpler to keep the
> DP algorithm than to remove it and add it again later.
>
> What do you think, Andi and Richi? I myself slightly prefer keeping the DP
> but
> I would be fine with either option.
I think we can keep both, though I have no strong opinion.
Richard.
> >
> > > + k = 0;
> > > + while (i + k < l)
> > > + {
> > > + cluster *end_cluster = clusters[i + k];
> > > +
> > > + /* Does value range fit into the BITS_PER_WORD window? */
> > > + HOST_WIDE_INT w = cluster::get_range (start_cluster->get_low (),
> > > + end_cluster->get_high ());
> > > + if (w == 0 || w > BITS_PER_WORD)
> > > + break;
> > > +
> > > + /* Check for max # of targets. */
> > > + if (targets.elements () == m_max_case_bit_tests
> > > + && !targets.contains (end_cluster->m_case_bb))
> > > + break;
> > > +
> > > + targets.add (end_cluster->m_case_bb);
> > > + k++;
> > > + }
> > >
> > > + if (is_beneficial (k, targets.elements ()))
> >
> > Equivalent to the old test in DP would be k + 1
> > I'm not sure it makes a lot of difference though.
> >
>
> I've double checked and I'm pretty sure that the k is used here the same way
> it
> is used in DP. In both cases, is_beneficial (unsigned, unsigned) ends up
> being
> called with the number of cases to be covered by a bit test cluster as the
> first argument. In the DP code this is harder to see because it first calls
> is_beneficial (vec<cluster> &, unsigned, unsigned) which then calls
> is_beneficial (unsigned, unsigned) and also there's some +/-1 weirdness with
> the end of the sequence of cases.
>
> >
> > -Andi
>
> Cheers,
> Filip Kastl
>
> P.S.: Inspired by Andi's algorithm I think I found a way to speed up the DP
> from O(n^2) to O(n * BITS_PER_WORD) so effectively O(n)! So in GCC 16 we
> could
> have an optimal *and* fast bit test clustering algorithm.
>
> Btw, sorry for being sceptical about the existence of an optimal O(n)
> algorithm
> in the PR117091 bugzilla report. Though for jump table clustering we still
> have only O(n^2) (or O(n^3)? not sure) and I don't think we can use the
> "clusters are at most BITS_PER_WORD long" trick there.
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)