> > Hi,
> > I plan to commit some retuning of znver3 codegen that is based on real
> > hardware benchmarks.  It turns out that there are not too many changes
> > necessary since Zen3 is quite a smooth upgrade to Zen2.  In summary:
> > 
> >  - some instructions (like idiv) have shorter latencies.  Adjusting
> >    costs reduces code size a bit but seems within noise in benchmark
> >    (since our cost calculation is quite off anyway because it does not
> >    account for register pressure and parallelism, which make a huge
> >    difference here)
> >  - gather instructions are still microcoded but a lot faster than in
> >    znver1/znver2 and it turns out they are now beneficial for a few TSVC
> >    benchmarks, so I plan to enable them.
> 
> Can we get a copy of this benchmark to try?
> We need to check on bigger benchmarks like SPEC also.

Yes, I am also running SPEC.  However, for basic instruction selection
tuning smaller benchmarks are doing quite well.  In general if there are
relatively natural loops where gather helps, I think we should enable it
and try to fix possible regressions (I did not see one in SPEC runs, but
I plan to do more benchmarking this week).
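
To make "natural loops where gather helps" a bit more concrete, here is a
minimal sketch of the kind of indirect-load loop I have in mind.  The
function name and signature are made up for illustration and are not taken
from TSVC verbatim.  Whether the vectorizer actually emits a gather for it
depends on the tuning flags and the cost model, so treat that part as an
assumption to verify in the assembly.

```c
#include <stddef.h>

/* Hypothetical kernel in the style of the TSVC indirect-addressing loops:
   each iteration reads b[] through an index array.  The b[idx[i]] access
   is a candidate for a gather instruction when the loop is vectorized
   (assumption: built with something like -O3 -march=znver3).  */
void
gather_add (float *restrict a, const float *restrict b,
            const int *restrict idx, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] += b[idx[i]];
}
```

Comparing the generated code and the runtime with gather enabled and
disabled on loops of this shape is the sort of measurement I mean.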

I did some work on TSVC mostly because zen3 seems a very smooth update to
zen2 for instruction selection (which is already happy with almost
everything, especially for scalar code) and vectorizer costs seem to be
the place where we have the most room for improvement.

I briefly analyzed all tsvc kernels where we regress compared to clang,
aocc and icc.  You can search for tsvc in bugzilla.  Richard also wrote some
observations there.  These are related to missing features rather than
the cost model, however.

One problem with tsvc is that it is FP only.  I hacked it for integer but
it would be nice to have something else as well.
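
For illustration only, one way such an integer variant could be set up is
to write the kernels against a typedef and switch it at build time.  This
is a sketch under my own assumptions, not the actual hack: the names
elem_t, ELEM_INT and muladd are hypothetical.

```c
#include <stddef.h>

/* Hypothetical switch: build with -DELEM_INT to get the integer variant,
   otherwise the kernel operates on float.  */
#ifdef ELEM_INT
typedef int elem_t;
#else
typedef float elem_t;
#endif

/* Simple multiply-add kernel written against elem_t, so the same source
   can exercise either the FP or the integer vectorizer costs.  */
void
muladd (elem_t *restrict a, const elem_t *restrict b,
        const elem_t *restrict c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] = a[i] + b[i] * c[i];
}
```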
> 
> > 
> >    It seems we missed revisiting this for znver2 tuning.
> >    I think even for znver2 it may make sense to re-enable them, so I
> >    will benchmark this as well.
> >  - memcpy/memset expansion seems to work the same way as for znver2,
> >    so I am keeping the same changes.
> >  - instruction scheduler is already modified in trunk to some degree
> >    reflecting new units.  Problem with instruction scheduling is that
> >    it treats zen as an in-order CPU and is unlikely to fill all
> >    execution resources this way.
> >    We may want to try to model the out-of-order nature in a similar way
> >    as LLVM does, but on the other hand the current scheduling logic seems
> >    to do mostly fine (i.e. not worse than llvm's).  What matters is
> >    to schedule for long latencies and just after branch boundaries
> >    where the simplified model seems to do just fine.
> 
> So we can keep the existing model for znver3 for GCC 11 ?

I think so - I experimented with making the model a bit more precise and
it does not seem to add any performance improvements and makes the
automaton a lot bigger.  The existing model already handles the updated
zen3 latencies...

I think the only possible improvement here would be to start modelling
explicitly the out-of-order nature, but even then I am not sure how much
benefit that can bring (given that we are limited to relatively small
basic blocks and do not have a lot of information needed to model the
execution precisely).  Do you have some opinions on this?

Honza
