Hi Andi and Richi,

sorry for the late reply. While looking for a testcase where the DP algorithm performs better than Andi's greedy one, I found out some new things about bit test switch lowering and I had to think them through. I write about what I found below.
On Thu 2024-11-21 10:01:38, Andi Kleen wrote:
> On Fri, Nov 15, 2024 at 10:43:57AM +0100, Filip Kastl wrote:
> > Hi,
> >
> > Andi's greedy bit test finding algorithm was reverted. I found a fix for the
> > problem that caused the revert. I made this patch to reintroduce the greedy
> > alg into GCC. However I think we should keep the old slow but more powerful
> > algorithm so I added a limit on the number of cases of the switch statement
> > that decides which of the two algorithms to use.
>
> Do we actually have a case where the DP algorithm is better?
>
> In the bootstrap comparison greedy does produce less or the same clusters

It seems reasonable to me to want a bit test cluster finding algorithm that produces the minimal number of clusters (I count both the bit test clusters and the simple clusters which correspond to single cases of the original switch). I actually found out that, with respect to this metric, neither the DP algorithm nor the greedy algorithm is optimal.

I came up with this testcase

int f0();
int f1();
int f2();
int f3();
int f4();

int main(int a)
{
  switch (a)
    {
    case 0:
    case 2:
    case 4:
    case 6:
      return f0();
    case 8:
      return f1();
    case 10:
    case 14:
    case 16:
    case 18:
      return f2();
    case 12:
      return f3();
    case 20:
      return f4();
    }
  return -1;
}

where the DP algorithm manages to cover all cases with two bit test clusters but the greedy algorithm only finds one bit test cluster (and 4 simple clusters, for a total of 5 clusters). But if you reverse the cases (meaning you map case i to case 20 - i), you get this testcase

int f0();
int f1();
int f2();
int f3();
int f4();

int main(int a)
{
  switch (a)
    {
    case 20:
    case 18:
    case 16:
    case 14:
      return f0();
    case 12:
      return f1();
    case 10:
    case 6:
    case 4:
    case 2:
      return f2();
    case 8:
      return f3();
    case 0:
      return f4();
    }
  return -1;
}

where the situation is reversed: greedy manages to cover all cases with two bit test clusters but DP only manages to find one bit test cluster. This surprised me.
I thought that the DP algorithm was optimal and that the greedy algorithm being better on some cases could be explained by differences in how the two algorithms use the 'is_beneficial' heuristic. But that's not the case. I looked more thoroughly into the code of the DP algorithm. I'm now pretty convinced that the DP algorithm could give an optimal solution but simply isn't configured to do so. It first looks for the minimal number of clusters, and only after it finds the clusters does it go through them and check that each is 'is_beneficial'. It breaks the non-'is_beneficial' ones down into simple clusters. Instead, the algorithm could look for the minimal number of clusters such that each of them is 'is_beneficial' in the first place. So we could have an optimal bit test clustering algorithm; we just don't have one at the moment.

Since we are currently in stage 3, I don't expect we want to go and make the DP algorithm optimal now. We can do that in the next stage 1. However, I think we still want to fix PR117091. I think we need to keep the limit I introduced at least for the jump table clustering, because that is still at least O(n^2). For the bit test clustering we can either use both the DP algorithm and the greedy algorithm, as I do in my patch, or I guess we could just use the greedy algorithm. Since neither of the two is optimal, it isn't really clear which one is better. On one hand, the greedy algorithm is faster and Andi says he saw it performing better. On the other hand, I would like to make the DP algorithm optimal and use it in GCC 16, and it seems simpler to me to keep the DP algorithm than to remove it and add it back later. What do you think, Andi and Richi? I myself slightly prefer keeping the DP, but I would be fine with either option.

> > +	  k = 0;
> > +	  while (i + k < l)
> > +	    {
> > +	      cluster *end_cluster = clusters[i + k];
> > +
> > +	      /* Does value range fit into the BITS_PER_WORD window?  */
> > +	      HOST_WIDE_INT w = cluster::get_range (start_cluster->get_low (),
> > +						    end_cluster->get_high ());
> > +	      if (w == 0 || w > BITS_PER_WORD)
> > +		break;
> > +
> > +	      /* Check for max # of targets.  */
> > +	      if (targets.elements () == m_max_case_bit_tests
> > +		  && !targets.contains (end_cluster->m_case_bb))
> > +		break;
> > +
> > +	      targets.add (end_cluster->m_case_bb);
> > +	      k++;
> > +	    }
> >
> > +	  if (is_beneficial (k, targets.elements ()))
>
> Equivalent to the old test in DP would be k + 1
> I'm not sure it makes a lot of difference though.
>

I've double checked and I'm pretty sure that k is used here the same way it is used in the DP. In both cases, is_beneficial (unsigned, unsigned) ends up being called with the number of cases to be covered by a bit test cluster as its first argument. In the DP code this is harder to see because it first calls is_beneficial (vec<cluster> &, unsigned, unsigned), which then calls is_beneficial (unsigned, unsigned), and there is also some +/-1 weirdness with the end of the sequence of cases.

> -Andi

Cheers,
Filip Kastl

P.S.: Inspired by Andi's algorithm, I think I found a way to speed up the DP from O(n^2) to O(n * BITS_PER_WORD), so effectively O(n)! So in GCC 16 we could have an optimal *and* fast bit test clustering algorithm. Btw, sorry for being sceptical about the existence of an optimal O(n) algorithm in the PR117091 bugzilla report. Though for jump table clustering we still only have O(n^2) (or O(n^3)? I'm not sure) and I don't think we can use the "clusters are at most BITS_PER_WORD long" trick there.