https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118945

--- Comment #9 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Andrew Waterman from comment #8)
> >  In fact, I'd be rather surprised to see anything preferring tail 
> > undisturbed.
> 
> Right.  To be precise, microarchitectures without register renaming
> absolutely do prefer to leave the tail undisturbed.  But that's why the ISA
> defines the agnostic mode in such a way that undisturbed is a valid
> implementation of agnostic.  (The in-order microarchitectures I've worked on
> simply ignore the tail-/mask-agnostic setting; the state bits that control
> the mode are essentially vestigial.)
> 
> Since no plausible implementation will benefit from being in undisturbed
> mode, we don't need to consider that aspect of the problem, but...
> 
> > I prefer fewer "vsetvli" (which allows more fusion) by default.
> 
> ...but here's the rub.  Implementations that don't benefit from the agnostic
> setting would definitely prefer to avoid the extra setvl instructions, not
> because they're expensive, but because they're not free.
> 
> > Some designs aren't sensitive to the number of vsetvls and I would expect 
> > that over time that's where high performance designs will land over time.
> 
> Low-performance ones, too.  (Making vset[i]vli fast is more of an
> engineering cost than a silicon cost.)  But the instructions still have to
> be fetched and decoded, and registers have to be read and written, so the
> perf cost will converge on that of, say, an ADDI instruction, which is to
> say cheap but not zero.  For narrow-issue machines, this does matter.
> 
> > Obviously for your design you'll want to set the knob which says "minimize 
> > vsetvls" as opposed to "avoid false dependencies by preferring tail 
> > agnostic". That's easily handled by putting the data in the tuning 
> > structure for each design.
> 
> And so this is the right answer :)

In my uarch, "vsetvli" is cheap but not zero-cost, which is pretty similar to
ADDI. As Andrew said, for an in-order microarchitecture you can't ignore the
cost of "vsetvli". That's why I prefer to keep the original "vsetvli" strategy
(which is fusing as many "vsetvli"s as possible) by default.

For example, you should test on the K1 Banana Pi which is better: "keep
agnostic but more vsetvli" vs. "allow aggressive fusion into a single
undisturbed vsetvli".

Also, the example shown in the PR is not an appropriate basis for making a
decision here, since it produces just 1 vsetvli when you disable aggressive
fusion into undisturbed, which does not seem very costly.

I think we should consider many more situations and consider them carefully.
For example:

vsetvli ... e8 mf8 ta ma (demand ratio)
...
vsetvli zero zero e32 mf2 tu ma (demand ratio)
...
vsetvli zero zero e64 m1 ta ma (demand SEW and LMUL)
...
vsetvli zero zero e64 m1 ta mu (demand ratio)
...
vsetvli zero zero e16 mf4 tu mu (demand ratio)
...
vsetvli zero zero e32 mf2 ta ma (demand ratio)
...
vsetvli zero zero e8 mf8 ta ma (demand ratio)

With the current strategy, these 7 "vsetvli"s are fused into 1 single "vsetvli":

vsetvli ... e64 m1 tu mu

However, if you keep agnostic and do not allow the fusion, you will end up
with 6 more "vsetvli"s. I don't think this codegen can be better on any
micro-architecture design.
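
To make the fusion above concrete, here is a rough Python sketch of how the
seven demands collapse into one configuration. It is only a toy model, not the
actual GCC vsetvl pass: the names (SEQ, fuse, fmt) are hypothetical, and it
assumes that (a) a "demand ratio" instruction accepts any SEW/LMUL pair with
the same SEW/LMUL ratio, and (b) undisturbed is a valid implementation of
agnostic, as Andrew noted above.

```python
from fractions import Fraction

# Hypothetical encoding of the sequence above:
# (SEW, LMUL, tail policy, mask policy, what the instruction demands)
SEQ = [
    (8,  Fraction(1, 8), "ta", "ma", "ratio"),
    (32, Fraction(1, 2), "tu", "ma", "ratio"),
    (64, Fraction(1),    "ta", "ma", "sew_lmul"),
    (64, Fraction(1),    "ta", "mu", "ratio"),
    (16, Fraction(1, 4), "tu", "mu", "ratio"),
    (32, Fraction(1, 2), "ta", "ma", "ratio"),
    (8,  Fraction(1, 8), "ta", "ma", "ratio"),
]


def ratio(sew, lmul):
    """SEW/LMUL ratio; it determines VLMAX for a given VLEN."""
    return Fraction(sew) / lmul


def fuse(seq):
    """Return one (sew, lmul, tail, mask) satisfying every demand, or None."""
    # All entries must share the same SEW/LMUL ratio.
    if len({ratio(s, l) for s, l, *_ in seq}) != 1:
        return None
    # Entries demanding an exact SEW and LMUL must agree with each other.
    exact = {(s, l) for s, l, _, _, d in seq if d == "sew_lmul"}
    if len(exact) > 1:
        return None
    sew, lmul = exact.pop() if exact else seq[0][:2]
    # Assumption: undisturbed also satisfies an agnostic request, so a single
    # "tu" or "mu" demand forces that policy for the fused instruction.
    tail = "tu" if any(t == "tu" for _, _, t, _, _ in seq) else "ta"
    mask = "mu" if any(m == "mu" for _, _, _, m, _ in seq) else "ma"
    return sew, lmul, tail, mask


def fmt(cfg):
    """Format a fused config in the same style as the listing above."""
    sew, lmul, tail, mask = cfg
    l = f"m{lmul.numerator}" if lmul >= 1 else f"mf{lmul.denominator}"
    return f"e{sew} {l} {tail} {mask}"


if __name__ == "__main__":
    print("vsetvli ...", fmt(fuse(SEQ)))
```

All seven entries have a SEW/LMUL ratio of 64, so in this model only the one
"demand SEW and LMUL" entry pins the concrete e64 m1 choice, and the tu/mu
demands decide the policy bits.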
