> -----Original Message-----
> From: Segher Boessenkool <seg...@kernel.crashing.org>
> Sent: Monday, June 30, 2025 6:23 AM
> To: Cui, Lili <lili....@intel.com>
> Cc: ubiz...@gmail.com; gcc-patches@gcc.gnu.org; Liu, Hongtao
> <hongtao....@intel.com>; richard.guent...@gmail.com; Michael Matz
> <m...@suse.de>
> Subject: Re: [PATCH V3] x86: Enable separate shrink wrapping
> 
> Hi!
> 
> On Tue, Jun 17, 2025 at 10:03:28PM +0800, Cui, Lili wrote:
> > Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No
> > performance regression was observed.
> > For O2 multi-copy :
> > 511.povray_r improved by 2.8% on ZNVER5.
> > 511.povray_r improved by 4.2% on EMR
> 
> No huge improvement, but none was expected anyway, x86 is a target with
> only few registers.
> 
> > Tested against SPEC CPU 2017, this change always has a net-positive
> > effect on the dynamic instruction count.  See the following table for
> > the breakdown on how this reduces the number of dynamic instructions
> > per workload on a like-for-like (with/without this commit):
> >
> > instruction count   base            with commit     (commit-base)/commit
> > 502.gcc_r           98666845943     96891561634     -1.80%
> > 526.blender_r       6.21226E+11     6.12992E+11     -1.33%
> > 520.omnetpp_r       1.1241E+11      1.11093E+11     -1.17%
> > 500.perlbench_r     1271558717      1263268350      -0.65%
> > 523.xalancbmk_r     2.20103E+11     2.18836E+11     -0.58%
> > 531.deepsjeng_r     2.73591E+11     2.72114E+11     -0.54%
> > 500.perlbench_r     64195557393     63881512409     -0.49%
> > 541.leela_r         2.99097E+11     2.98245E+11     -0.29%
> > 548.exchange2_r     1.27976E+11     1.27784E+11     -0.15%
> > 527.cam4_r          88981458425     88887334679     -0.11%
> > 554.roms_r          2.60072E+11     2.59809E+11     -0.10%
> 
> The spec tests most representative for real-life code are perl and gcc, so
> those are nice results :-)
> 
> This is code size, dynamic or static does not matter much here.  Nice to see
> it improve anyway :-)
> 
> > +  /* Don't mess with the following registers.  */
> > +  if (frame_pointer_needed)
> > +    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
> 
> What is that about?  Isn't that one of the bigger possible wins?

Good question!  Initially, I looked at other architectures and excluded the
hard frame pointer, but after reconsidering, I think your point makes sense.
When the hard frame pointer is needed, we typically emit push %rbp and
mov %rsp, %rbp at the very start of the prologue, which leaves no room for
separate shrink wrapping.  But if the function itself also uses rbp, there
might be room for optimization.  I took out these two lines and ran some
tests, and everything seems fine.  I will do more testing and try to find a
case where the optimization actually applies.
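
As a starting point for that search, here is a minimal sketch of the kind
of function shape that separate shrink wrapping targets in general.  It is
a hypothetical example: the names helper and sum_or_zero are made up and
come from neither the patch nor the testsuite.  The early return makes no
calls, while the loop keeps several values live across a call, which
usually means callee-saved registers.  Whether rbp itself ends up as a
separately wrapped component here is an assumption that still has to be
checked, for example by comparing the assembly from -O2
-fno-omit-frame-pointer with and without -fshrink-wrap-separate.

```c
#include <stddef.h>

/* Hypothetical helper.  noinline (a GCC/Clang attribute) keeps a real
   call in the slow path, so values live across it have to sit in
   callee-saved registers or be spilled.  */
__attribute__((noinline)) long helper(long x)
{
    return x * 3 + 1;
}

/* Hypothetical function with a cheap early exit and an expensive path.  */
long sum_or_zero(const long *v, size_t n)
{
    /* Fast path: no calls and nothing live across a call.  */
    if (v == NULL || n == 0)
        return 0;

    /* Slow path: acc, i, v and n are all live across helper().  */
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += helper(v[i]);
    return acc;
}
```

If the saves and restores of the registers used by the loop show up only
on the slow path, the function is a candidate test case; if not, the
shape needs adjusting.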

Thanks,
Lili.

> 
> Anyway, nice to see SWS finally used for x86 as well!
> 
> 
> Segher
