On Wed, 24 Mar 2021, guojiufu wrote:
> On 2021-03-24 15:55, Richard Biener wrote:
> > On Wed, Mar 24, 2021 at 3:55 AM guojiufu <[email protected]> wrote:
> >>
> >> On 2021-03-23 16:25, Richard Biener via Gcc wrote:
> >> > On Tue, Mar 23, 2021 at 4:33 AM guojiufu <[email protected]>
> >> > wrote:
> >> >>
> >> >> On 2021-03-22 16:31, Jakub Jelinek via Gcc wrote:
> >> >> > On Mon, Mar 22, 2021 at 09:22:26AM +0100, Richard Biener via Gcc
> >> >> > wrote:
> >> >> >> Better than doing loop versioning is to enhance SCEV (and thus also
> >> >> >> dependence analysis) to track the extra conditions they need to handle
> >> >> >> such cases, similar to how niter analysis computes its 'assumptions'
> >> >> >> condition.  That allows the versioning to be done when there's an
> >> >> >> actual beneficial transform (like vectorization) rather than just
> >> >> >> upfront for the eventual chance that there'll be any.  Ideally such a
> >> >> >> transform would then choose IVs in its transformed copy that are
> >> >> >> analyzable w/o repeating such a versioning exercise for the next
> >> >> >> transform.
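> >> >> >>
> >> >> >> To illustrate with a made-up example:
> >> >> >>
> >> >> >>   void f (double *a, unsigned int start, unsigned long n)
> >> >> >>   {
> >> >> >>     /* The 64bit address IV for a[i] is derived from the 32bit 'i';
> >> >> >>        its evolution is affine only under the assumption that 'i'
> >> >> >>        does not wrap (roughly, n <= UINT_MAX here).  Recording that
> >> >> >>        condition instead of giving up would let a later transform
> >> >> >>        version the loop only when it actually pays off.  */
> >> >> >>     for (unsigned int i = start; i < n; ++i)
> >> >> >>       a[i] = 0.;
> >> >> >>   }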
> >> >> >
> >> >> > And it might be beneficial to perform some type promotion/demotion
> >> >> > pass, either early during vectorization or separately before
> >> >> > vectorization on a loop copy guarded with the ifns e.g. ifconv uses too.
> >> >> > Find out what type sizes the loop uses, first try to demote
> >> >> > computations to narrower types in the vectorized loop candidate
> >> >> > (e.g. if something is computed in a wider type only to have the result
> >> >> > demoted to a narrower type), then pick the widest type size still in
> >> >> > use in the loop (ok, this assumes we don't mix multiple vector sizes
> >> >> > in the loop, but currently our vectorizer doesn't do that) and try to
> >> >> > promote computations that could be promoted to that type size.  We
> >> >> > already do something like that partially during vect patterns for bool
> >> >> > types, but not for other types, I think.
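> >> >> >
> >> >> > As a rough illustration (made-up source), the demotion case is e.g.
> >> >> >
> >> >> >   void f (short *d, short *s1, short *s2, int n)
> >> >> >   {
> >> >> >     for (int i = 0; i < n; i++)
> >> >> >       /* The addition is computed in int due to promotion, but only the
> >> >> >          narrow result is stored, so the vectorized loop candidate could
> >> >> >          do it in short.  */
> >> >> >       d[i] = s1[i] + s2[i];
> >> >> >   }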
> >> >> >
> >> >> > Jakub
> >> >>
> >> >> Thanks for the suggestions!
> >> >>
> >> >> Enhancing SCEV could help other optimizations and improve performance
> >> >> in some cases.  And one of the direct aims of using a '64bit type' is
> >> >> to eliminate the conversions, even in cases which are not easy to
> >> >> optimize through ifconv/vectorization, for example:
> >> >>
> >> >>   unsigned int i = 0;
> >> >>   while (a[i] > 1e-3)
> >> >>     i++;
> >> >>
> >> >>   unsigned int i = 0;
> >> >>   while (p1[i] == p2[i] && p1[i] != '\0')
> >> >>     i++;
> >> >>
> >> >> Or should we only do versioning on the type for this kind of loop?
> >> >> Any suggestions?
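> >> >>
> >> >> For instance (a rough sketch, assuming a hypothetical upper bound n),
> >> >> the versioned form I imagine would look something like:
> >> >>
> >> >>   #include <limits.h>
> >> >>
> >> >>   unsigned int
> >> >>   foo (double *a, unsigned int n)
> >> >>   {
> >> >>     if (n < UINT_MAX)  /* condition making the widened copy safe */
> >> >>       {
> >> >>         /* versioned copy: 64bit IV, so a[i] needs no zero-extension */
> >> >>         unsigned long i = 0;
> >> >>         while (a[i] > 1e-3 && i < n)
> >> >>           i++;
> >> >>         return i;
> >> >>       }
> >> >>     else
> >> >>       {
> >> >>         /* original loop kept as the fallback */
> >> >>         unsigned int i = 0;
> >> >>         while (a[i] > 1e-3 && i < n)
> >> >>           i++;
> >> >>         return i;
> >> >>       }
> >> >>   }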
> >> >
> >> > But the "optimization" resulting from such versioning is hard to
> >> > determine upfront, which means we'll pay quite a big code size cost
> >> > for an unknown, questionable gain.  What's the particular optimization
> >>
> >> Right.  The code size increase is a big pain for large loops.  If the gain
> >> is not significant, this optimization may not be profitable.
> >>
> >> > in the above cases? Note that for example for
> >> >
> >> >   unsigned int i = 0;
> >> >   while (a[i] > 1e-3)
> >> >     i++;
> >> >
> >> > you know that when 'i' wraps then the loop will not terminate. There's
> >>
> >> Thanks :)  The code would be "while (a[i] > 1e-3 && i < n)", so the upper
> >> bound is checkable.  Otherwise, the optimization to avoid the zext is not
> >> applicable.
> >>
> >> > the address computation i * sizeof (T) which is done in a larger type
> >> > to avoid overflow, so we have &a + zext (i) * 8 - is that the operation
> >> > that is 'slow' for you?
> >>
> >> This is the point: the "zext(i)" is the instruction that I want to
> >> eliminate; that is the direct goal of the optimization.
> >>
> >> Whether the gain of eliminating the 'zext' is visible, and whether the
> >> code size increase is small enough, is a trade-off that needs to be
> >> evaluated.  It may only be acceptable if the loop is very small; then
> >> eliminating the 'zext' would help save runtime, and the code size
> >> increase would not be big.
> >
> > OK, so I indeed think that the desire to micro-optimize a 'zext' doesn't
> > make versioning a good trade-off.  The micro-architecture had better not
> > make that overly slow (I'd expect an extra latency comparable to the
> > multiply or add on the &a + zext(i) * 8 instruction chain).
>
> Agreed, I understand your point.  The concern is that some
> micro-architectures do not handle this very well yet.  I tested the above
> example code:
>
>   unsigned i = 0;
>   while (a[i] > 1e-3 && i < n)
>     i++;
>
> and there is a ~30% performance improvement when using "long i" instead of
> "unsigned i" on ppc64le and x86.  It seems those instructions are not
> optimized that well on some platforms.  So, I'm wondering whether we need
> to do this in GCC.
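>
> (I.e. the variant I compared against is simply:
>
>   long i = 0;
>   while (a[i] > 1e-3 && i < n)
>     i++;
>
> which needs no zero-extension in the a[i] address computation.)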
On x86 I see indexed addressing modes being used, which should be fine.
Compilable testcase:
unsigned foo (double *a, unsigned n)
{
  unsigned i = 0;
  while (a[i] > 1e-3 && i < n)
    i++;
  return i;
}
ppc64le seems to do some odd unrolling/peeling or whatnot; I have a hard
time following its assembly ... ah, -fno-unroll-loops "helps" and produces
.L5:
        lfd %f0,0(%r9)
        addi %r3,%r3,1
        addi %r9,%r9,8
        rldicl %r3,%r3,0,32
        fcmpu %cr0,%f0,%f12
        bnglr %cr0
        bdnz .L5
which looks pretty good to me.  I suppose the rldicl is the
zero-extension, but the IVs are already 64-bit and the
zero-extension should be sunk to the loop exit instead.
> >
> > OTOH making SCEV analysis not give up but instead record the constraints
> > under which its solution is valid is a very good and useful thing to do.
>
> Thanks!  Enhancing SCEV could help in a few cases, especially when other
> optimizations are enabled.
>
> Thanks again for your suggestions!
>
> BR.
> Jiufu Guo.
>
> >
> > Richard.
> >
> >> Thanks again for your very helpful comments!
> >>
> >> BR.
> >> Jiufu Guo.
> >>
> >> >
> >> > Richard.
> >> >
> >> >> BR.
> >> >> Jiufu Guo.
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)