Re: [PATCH] [x86_64]: Zhaoxin yongfeng enablement

2023-10-30 Thread Mayshao-oc
>On Fri, Oct 27, 2023 at 12:20 PM mayshao  wrote:
>>
>> On 2023/10/26 17:34, Uros Bizjak wrote:
>> > On Wed, Oct 25, 2023 at 8:43 AM mayshao  wrote:
>> >>
>> >> Hi all:
>> >>  This patch enables -march/-mtune=yongfeng, costs and tunings are set 
>> >> according to the characteristics of the processor. We add a new md file 
>> >> to describe yongfeng processor.
>> >>
>> >>  Bootstrapped /regtested X86_64.
>> >>
>> >>  Ok for trunk?
>> >> BR
>> >> Mayshao
>> >> gcc/ChangeLog:
>> >>
>> >>  * common/config/i386/cpuinfo.h (get_zhaoxin_cpu): Recognize 
>> >> yongfeng.
>> >>  * common/config/i386/i386-common.cc: Add yongfeng.
>> >>  * common/config/i386/i386-cpuinfo.h (enum processor_subtypes): 
>> >> Add ZHAOXIN_FAM7H_YONGFENG.
>> >>  * config.gcc: Add yongfeng.
>> >>  * config/i386/driver-i386.cc (host_detect_local_cpu): Let 
>> >> -march=native
>> >>  recognize yongfeng processors.
>> >>  * config/i386/i386-c.cc (ix86_target_macros_internal): Add 
>> >> yongfeng.
>> >>  * config/i386/i386-options.cc (m_YONGFENG): New definition.
>> >>  (m_ZHAOXIN): Ditto.
>> >>  * config/i386/i386.h (enum processor_type): Add 
>> >> PROCESSOR_YONGFENG.
>> >>  * config/i386/i386.md: Add yongfeng.
>> >>  * config/i386/lujiazui.md: Fix typo.
>> >>  * config/i386/x86-tune-costs.h (struct processor_costs): Add 
>> >> yongfeng costs.
>> >>  * config/i386/x86-tune-sched.cc (ix86_issue_rate): Add yongfeng.
>> >>  (ix86_adjust_cost): Ditto.
>> >>  * config/i386/x86-tune.def (X86_TUNE_SCHEDULE): Replace 
>> >> m_LUJIAZUI by m_ZHAOXIN.
>> >>  (X86_TUNE_PARTIAL_REG_DEPENDENCY): Ditto.
>> >>  (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY): Ditto.
>> >>  (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): Ditto.
>> >>  (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Ditto.
>> >>  (X86_TUNE_MOVX): Ditto.
>> >>  (X86_TUNE_MEMORY_MISMATCH_STALL): Ditto.
>> >>  (X86_TUNE_FUSE_CMP_AND_BRANCH_32): Ditto.
>> >>  (X86_TUNE_FUSE_CMP_AND_BRANCH_64): Ditto.
>> >>  (X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS): Ditto.
>> >>  (X86_TUNE_FUSE_ALU_AND_BRANCH): Ditto.
>> >>  (X86_TUNE_ACCUMULATE_OUTGOING_ARGS): Ditto.
>> >>  (X86_TUNE_USE_LEAVE): Ditto.
>> >>  (X86_TUNE_PUSH_MEMORY): Ditto.
>> >>  (X86_TUNE_LCP_STALL): Ditto.
>> >>  (X86_TUNE_INTEGER_DFMODE_MOVES): Ditto.
>> >>  (X86_TUNE_OPT_AGU): Ditto.
>> >>  (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): Ditto.
>> >>  (X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES): Ditto.
>> >>  (X86_TUNE_USE_SAHF): Ditto.
>> >>  (X86_TUNE_USE_BT): Ditto.
>> >>  (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI): Ditto.
>> >>  (X86_TUNE_ONE_IF_CONV_INSN): Ditto.
>> >>  (X86_TUNE_AVOID_MFENCE): Ditto.
>> >>  (X86_TUNE_EXPAND_ABS): Ditto.
>> >>  (X86_TUNE_USE_SIMODE_FIOP): Ditto.
>> >>  (X86_TUNE_USE_FFREEP): Ditto.
>> >>  (X86_TUNE_EXT_80387_CONSTANTS): Ditto.
>> >>  (X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL): Ditto.
>> >>  (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL): Ditto.
>> >>  (X86_TUNE_SSE_TYPELESS_STORES): Ditto.
>> >>  (X86_TUNE_SSE_LOAD0_BY_PXOR): Ditto.
>> >>  (X86_TUNE_USE_GATHER_2PARTS): Add m_YONGFENG.
>> >>  (X86_TUNE_USE_GATHER_4PARTS): Ditto.
>> >>  (X86_TUNE_USE_GATHER_8PARTS): Ditto.
>> >>  (X86_TUNE_AVOID_128FMA_CHAINS): Ditto.
>> >>  * doc/extend.texi: Add details about yongfeng.
>> >>  * doc/invoke.texi: Ditto.
>> >>  * config/i386/yongfeng.md: New file for describing yongfeng 
>> >> processor.
>> >>
>> >> gcc/testsuite/ChangeLog:
>> >>
>> >>  * g++.target/i386/mv32.C: Handle new march.
>> >>  * gcc.target/i386/funcspec-56.inc: Ditto.
>> >
>> > LGTM.
>> >
>> > There are a couple of comments that 

Re: [PATCH] [x86_64] Zhaoxin lujiazui enablement

2022-03-28 Thread Mayshao-oc
On Sun, Mar 27, 2022 at 5:15 PM Uros Bizjak  wrote:
> On Fri, Mar 25, 2022 at 3:08 AM MayShao  wrote:
> >
> > Hi Uros,
> >
> > This patch fixes the Zhaoxin CPU Vendor ID detection problem
> > and adds Zhaoxin "lujiazui" processor support and tuning.
> >
> > Currently gcc can't recognize Zhaoxin CPUs (Vendor ID "CentaurHauls" and
> > "Shanghai") and wrongly identifies Zhaoxin "lujiazui" as Intel core2 or
> > i386, which is confusing for users.
> >
> > This patch enables -march/-mtune=lujiazui. Lujiazui is a Zhaoxin family 7
> > processor.
> > Costs and tunings are set according to the characteristics of the processor.
> > We add a new md file to describe the lujiazui pipeline.
> >
> > Testing :
> > Bootstrap is ok, and no regressions for i386/x86-64 testsuite.
> >
> > OK for master?
>
> This patch is not a bugfix, so it will have to wait for a next stage 1
> to reopen.
>
> Uros.
>
Yes, thanks for the reminder.
Please help to review this patch again
when the next stage 1 reopens.
I have contributed to glibc before; do I need to
re-sign the FSF copyright assignment for this patch?


May
> >
> > Background:
> > Related Zhaoxin linux kernel patch can be found at:
> >  https://lore.kernel.org/lkml/01042674b2f741b2aed1f797359bd...@zhaoxin.com/
> >
> > Related Zhaoxin glibc patch can be found at:
> >  
> > https://sourceware.org/git/?p=glibc.git;a=commit;h=32ac0b988466785d6e3cc1dffc364bb26fc63193
> >
> > gcc/ChangeLog:
> >
> >* common/config/i386/cpuinfo.h (get_zhaoxin_cpu): Detect
> >the cpu type of ZHAOXIN processors.
> >(cpu_indicator_init): Handle ZHAOXIN processors.
> >* common/config/i386/i386-common.cc: Add lujiazui.
> >* common/config/i386/i386-cpuinfo.h (enum processor_vendor): Add
> >VENDOR_ZHAOXIN.
> >(enum processor_types): Add ZHAOXIN_FAM7H.
> >(enum processor_subtypes):Add ZHAOXIN_FAM7H_LUJIAZUI.
> >* config.gcc: Add -march=lujiazui.
> >* config/i386/cpuid.h (signature_SHANGHAI_ebx): New definition
> >for ZHAOXIN.
> >(signature_SHANGHAI_ecx): Likewise.
> >(signature_SHANGHAI_edx): Likewise.
> >* config/i386/driver-i386.cc (host_detect_local_cpu): Let
> >-march=native recognize lujiazui processor.
> >* config/i386/i386-c.cc (ix86_target_macros_internal): Add
> >lujiazui def_or_undef.
> >* config/i386/i386-options.cc (m_LUJIAZUI): New definition.
> >* config/i386/i386.h (enum processor_type): Add PROCESSOR_LUJIAZUI.
> >* config/i386/i386.md: Add lujiazui cpu and include new md file.
> >* config/i386/x86-tune-costs.h (struct processor_costs): Add
> >lujiazui_cost.
> >* config/i386/x86-tune-sched.cc (ix86_issue_rate): Add lujiazui.
> >(ix86_adjust_cost): Likewise.
> >* config/i386/x86-tune.def (X86_TUNE_SCHEDULE): Enable for lujiazui.
> >(X86_TUNE_PARTIAL_REG_DEPENDENCY): Likewise.
> >(X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY): Likewise.
> >(X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): Likewise.
> >(X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Likewise.
> >(X86_TUNE_MOVX): Likewise.
> >(X86_TUNE_MEMORY_MISMATCH_STALL): Likewise.
> >(X86_TUNE_FUSE_CMP_AND_BRANCH_32): Likewise.
> >(X86_TUNE_FUSE_CMP_AND_BRANCH_64): Likewise.
> >(X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS): Likewise.
> >(X86_TUNE_FUSE_ALU_AND_BRANCH): Likewise.
> >(X86_TUNE_ACCUMULATE_OUTGOING_ARGS): Likewise.
> >(X86_TUNE_USE_LEAVE): Likewise.
> >(X86_TUNE_PUSH_MEMORY): Likewise.
> >(X86_TUNE_LCP_STALL): Likewise.
> >(X86_TUNE_USE_INCDEC): Likewise.
> >(X86_TUNE_INTEGER_DFMODE_MOVES): Likewise.
> >(X86_TUNE_OPT_AGU): Likewise.
> >(X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): Likewise.
> >(X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES): Likewise.
> >(X86_TUNE_USE_SAHF): Likewise.
> >(X86_TUNE_USE_BT): Likewise.
> >(X86_TUNE_AVOID_FALSE_DEP_FOR_BMI): Likewise.
> >(X86_TUNE_ONE_IF_CONV_INSN): Likewise.
> >(X86_TUNE_AVOID_MFENCE): Likewise.
> >(X86_TUNE_EXPAND_ABS): Likewise.
> >(X86_TUNE_USE_SIMODE_FIOP): Likewise.
> >(X86_TUNE_USE_FFREEP): Likewise.
> >(X86_TUNE_EXT_80387_CONSTANTS): Likewise.
> >(X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL): Likewise.
> >(X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL): Likewise.
> >(X86_TUNE_SSE_TYPELESS_STORES): Likewise.
> >(X86_TUNE_SSE_LOAD0_BY_PXOR): Likewise.
> >(X86_TUNE_USE_GATHER): Likewise.
> >* doc/extend.texi: Add lujiazui.
> >* doc/invoke.texi: Add details about lujiazui.
> >* config/i386/lujiazui.md: New file for describing lujiazui pipeline.
> >
> > gcc/testsuite/ChangeLog:
> >
> >* gcc.target/i386/funcspec-56.inc: Handle new march.
> >* g++.target/i386/mv31.C: New test for -march=lujiazui.
> > ---
> >  gcc/common/c

Re: [PATCH] [x86_64] Zhaoxin lujiazui enablement

2022-10-26 Thread Mayshao-oc


Hi Martin:
Thanks for your patch. I have answered your questions below.

> Hello.

> I noticed this patch set which is kind of related to 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107364.

> And I have a couple of questions:

>1) I noticed you drop AVX and F16C features for the newly added "lujiazui". 
>Why do you need it?
>  I would expect these features would be properly detected by cpuid?

Yes, these features can be detected by cpuid, and functionally they work 
correctly. Their performance, however, still needs improvement, so we 
decided to drop them for now and add them back once performance meets 
our expectations.

> 2) If you really need it, can you please test for me the attached patch? It 
> should come up
>  with a new function.

I have tested the patch; it's OK.

> 3) Have question about:

> else if (vendor == signature_CENTAUR_ebx && family < 0x07)
>cpu_model->__cpu_vendor = VENDOR_CENTAUR;
> else if (vendor == signature_SHANGHAI_ebx
>   || vendor == signature_CENTAUR_ebx)

> Are there any signature_CENTAUR_ebx models with family == 0x7 ?
> Similarly, are there any signature_SHANGHAI_ebx modes with family < 0x7 ?

Yes, both cases exist in our products.
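
The vendor/family split quoted above can be sketched as a small standalone function. This is only an illustration, not the code in cpuinfo.h: `pack4`, `classify_vendor` and the `VK_*` names are hypothetical, and it assumes the 12-byte CPUID vendor strings are "CentaurHauls" and "  Shanghai  " (space padded), so that EBX holds the first four characters.

```c
#include <stdint.h>

/* Hypothetical helper: pack four ASCII characters the way CPUID leaf 0
   returns them in a 32-bit register (first character in the low byte).  */
uint32_t pack4(const char *s)
{
  return (uint32_t)(unsigned char)s[0]
         | ((uint32_t)(unsigned char)s[1] << 8)
         | ((uint32_t)(unsigned char)s[2] << 16)
         | ((uint32_t)(unsigned char)s[3] << 24);
}

/* Illustrative result type; these are not the GCC enum names.  */
enum vendor_kind { VK_OTHER, VK_CENTAUR, VK_ZHAOXIN };

/* Mirrors the quoted condition: a "CentaurHauls" part below family 7
   stays Centaur; "Shanghai" parts and "CentaurHauls" parts with
   family >= 7 are treated as Zhaoxin.  */
enum vendor_kind classify_vendor(uint32_t vendor_ebx, unsigned family)
{
  const uint32_t centaur_ebx = pack4("Cent");   /* assumed "CentaurHauls" */
  const uint32_t shanghai_ebx = pack4("  Sh");  /* assumed "  Shanghai  " */

  if (vendor_ebx == centaur_ebx && family < 0x07)
    return VK_CENTAUR;
  if (vendor_ebx == shanghai_ebx || vendor_ebx == centaur_ebx)
    return VK_ZHAOXIN;
  return VK_OTHER;
}
```

With this reading, a "Shanghai" part with family < 7 is still classified as Zhaoxin, which matches the answer that both combinations exist.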

> Thanks,
> Martin

BR
Mayshao


Re: [PATCH] [x86_64] Zhaoxin lujiazui enablement

2022-10-27 Thread Mayshao-oc




>>
>> Hi Martin:
>> Thanks for your patch. I have answered your questions below.

>Hi.

>:)

>>
>>> Hello.
>>
>>> I noticed this patch set which is kind of related to 
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107364.
>>
>>> And I have a couple of questions:
>>
>>>1) I noticed you drop AVX and F16C features for the newly added "lujiazui". 
>>>Why do you need it?
>>>  I would expect these features would be properly detected by cpuid?
>>
>> Yes, these features can be detected by cpuid, and functionally they work 
>> correctly. Their performance, however, still needs improvement, so we 
>> decided to drop them for now and add them back once performance meets 
>> our expectations.

> I see. So theoretically you can increase costs of the corresponding insns and 
> that could be dropped now?
> But I'm not a costing expert.

I am new to gcc and have a lot to learn. On LTO and PGO, I have read some of 
the material you and hubicka shared, and it has helped me a lot. Since this is 
a performance issue, using the cost model to solve it is a good idea, and 
disabling AVX entirely seems like overkill. But the cost model needs 
appropriate cost values; it is challenging to pick the numbers and even more 
challenging to justify them. Our current approach has a pitfall when it comes 
to AVX intrinsic functions (e.g. _mm256_loadu_pd); passing -mavx explicitly 
overcomes this.
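
To illustrate the intrinsics pitfall: code that calls AVX intrinsics needs AVX enabled for that code even when the -march value leaves it off. The message above suggests -mavx for the whole file; the sketch below uses a different technique, GCC's per-function `target("avx")` attribute, with a run-time check. The function names are hypothetical, and this is a sketch under those assumptions rather than a recommended pattern for lujiazui.

```c
#include <immintrin.h>

/* Built for AVX regardless of the command-line -march, via the GCC
   target attribute (an alternative to compiling the file with -mavx).  */
__attribute__((target("avx")))
static double sum4_avx(const double *p)
{
  double lanes[4];
  __m256d v = _mm256_loadu_pd(p);   /* unaligned 256-bit load */
  _mm256_storeu_pd(lanes, v);
  return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Portable fallback for CPUs that do not report AVX.  */
static double sum4_scalar(const double *p)
{
  return p[0] + p[1] + p[2] + p[3];
}

/* Hypothetical dispatcher: take the AVX path only when the run-time
   CPU feature check reports AVX.  */
double sum4(const double *p)
{
  if (__builtin_cpu_supports("avx"))
    return sum4_avx(p);
  return sum4_scalar(p);
}
```

Called as `sum4(a)` with `double a[4] = {1.0, 2.0, 3.0, 4.0}`, both paths add the four elements in the same order, so the result does not depend on which path is taken.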

>>
>>> 2) If you really need it, can you please test for me the attached patch? It 
>>> should come up
>>>  with a new function.
>>
>> I have tested the patch; it's OK.

> Good, I'm going to install it.

>>
>>> 3) Have question about:
>>
>>> else if (vendor == signature_CENTAUR_ebx && family < 0x07)
>>>cpu_model->__cpu_vendor = VENDOR_CENTAUR;
>>> else if (vendor == signature_SHANGHAI_ebx
>>>   || vendor == signature_CENTAUR_ebx)
>>
>>> Are there any signature_CENTAUR_ebx models with family == 0x7 ?
>>> Similarly, are there any signature_SHANGHAI_ebx modes with family < 0x7 ?
>>
>> Yes, both cases exist in our products.

> Good. Then we miss a CPU features detection for (vendor == 
> signature_CENTAUR_ebx && family < 0x07)
> aka https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107364. But it's not worth 
> it as it's a legacy hardware,
> right?

Yes. For legacy hardware we need to keep it working correctly, but we don't 
spend much time tuning its performance.

> Cheers,
> Martin

>>
>>> Thanks,
>> Martin
>>
>> BR
>> Mayshao



Re: [PATCH] [x86_64]: Zhaoxin lujiazui enablement

2022-05-17 Thread Mayshao-oc
> On Tue, May 17, 2022 at 5:15 AM mayshao  wrote:
>> Hi Uros:
>> This patch fixes the Zhaoxin CPU vendor ID detection problem and adds 
>> Zhaoxin "lujiazui" processor support.
>> Currently gcc can't recognize Zhaoxin CPUs (vendor ID "CentaurHauls" 
>> and "Shanghai") if the user passes the -march=native option, which is 
>> confusing for users.
>> This patch enables -march=native on Zhaoxin family 7 processors and 
>> -march/-mtune=lujiazui; costs and tunings are set according to the 
>> characteristics of the processor. We add a new md file to describe the 
>> lujiazui pipeline.
>> Testing:
>> Bootstrap is ok, and no regressions for i386/x86-64 testsuite.
>> Ok for master?
>> Background:
>> Related Zhaoxin linux kernel patch can be found at:
>> https://lore.kernel.org/lkml/01042674b2f741b2aed1f797359bd...@zhaoxin.com/
>> Related Zhaoxin glibc patch can be found at:
>> https://sourceware.org/git/?p=glibc.git;a=commit;h=32ac0b988466785d6e3cc1dffc364bb26fc63193
>> gcc/ChangeLog:
> The entries below are suspiciously empty - please fill in the details.

Sorry for forgetting this. The updated patch follows. Thanks.

* common/config/i386/cpuinfo.h (get_zhaoxin_cpu): Detect
the specific type of Zhaoxin CPU, and return Zhaoxin CPU name.
(cpu_indicator_init): Handle Zhaoxin processors.
* common/config/i386/i386-common.cc: Add lujiazui.
* common/config/i386/i386-cpuinfo.h (enum processor_vendor): Add
VENDOR_ZHAOXIN.
(enum processor_types): Add ZHAOXIN_FAM7H.
(enum processor_subtypes): Add ZHAOXIN_FAM7H_LUJIAZUI.
* config.gcc: Add lujiazui.
* config/i386/cpuid.h (signature_SHANGHAI_ebx): Add
signatures for Zhaoxin.
(signature_SHANGHAI_ecx): Ditto.
(signature_SHANGHAI_edx): Ditto.
* config/i386/driver-i386.cc (host_detect_local_cpu): Let
-march=native recognize lujiazui processors.
* config/i386/i386-c.cc (ix86_target_macros_internal): Add lujiazui.
* config/i386/i386-options.cc (m_LUJIAZUI): New definition.
* config/i386/i386.h (enum processor_type): Ditto.
* config/i386/i386.md: Add lujiazui.
* config/i386/x86-tune-costs.h (struct processor_costs): Add
lujiazui costs.
* config/i386/x86-tune-sched.cc (ix86_issue_rate): Add lujiazui.
(ix86_adjust_cost): Ditto.
* config/i386/x86-tune.def (X86_TUNE_SCHEDULE): Add lujiazui tunings.
(X86_TUNE_PARTIAL_REG_DEPENDENCY): Ditto.
(X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY): Ditto.
(X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY): Ditto.
(X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY): Ditto.
(X86_TUNE_MOVX): Ditto.
(X86_TUNE_MEMORY_MISMATCH_STALL): Ditto.
(X86_TUNE_FUSE_CMP_AND_BRANCH_32): Ditto.
(X86_TUNE_FUSE_CMP_AND_BRANCH_64): Ditto.
(X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS): Ditto.
(X86_TUNE_FUSE_ALU_AND_BRANCH): Ditto.
(X86_TUNE_ACCUMULATE_OUTGOING_ARGS): Ditto.
(X86_TUNE_USE_LEAVE): Ditto.
(X86_TUNE_PUSH_MEMORY): Ditto.
(X86_TUNE_LCP_STALL): Ditto.
(X86_TUNE_USE_INCDEC): Ditto.
(X86_TUNE_INTEGER_DFMODE_MOVES): Ditto.
(X86_TUNE_OPT_AGU): Ditto.
(X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB): Ditto.
(X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES): Ditto.
(X86_TUNE_USE_SAHF): Ditto.
(X86_TUNE_USE_BT): Ditto.
(X86_TUNE_AVOID_FALSE_DEP_FOR_BMI): Ditto.
(X86_TUNE_ONE_IF_CONV_INSN): Ditto.
(X86_TUNE_AVOID_MFENCE): Ditto.
(X86_TUNE_EXPAND_ABS): Ditto.
(X86_TUNE_USE_SIMODE_FIOP): Ditto.
(X86_TUNE_USE_FFREEP): Ditto.
(X86_TUNE_EXT_80387_CONSTANTS): Ditto.
(X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL): Ditto.
(X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL): Ditto.
(X86_TUNE_SSE_TYPELESS_STORES): Ditto.
(X86_TUNE_SSE_LOAD0_BY_PXOR): Ditto.
* doc/extend.texi: Add details about lujiazui.
* doc/invoke.texi: Add details about lujiazui.
* config/i386/lujiazui.md: Introduce lujiazui cpu and include new md file.

gcc/testsuite/ChangeLog:

* gcc.target/i386/funcspec-56.inc: Test -arch=lujiazui and -tune=lujiazui.
* g++.target/i386/mv32.C: Ditto.

>> * common/config/i386/cpuinfo.h (get_zhaoxin_cpu):
>> (cpu_indicator_init):
>> * common/config/i386/i386-common.cc:
>> * common/config/i386/i386-cpuinfo.h (enum processor_vendor):
>> (enum processor_types):
>> (enum processor_subtypes):
>> * config.gcc:
>> * config/i386/cpuid.h (signature_SHANGHAI_ebx):
>> (signature_SHANGHAI_ecx):
>> (signature_SHANGHAI_edx):
>> * config/i386/driver-i386.cc (host_detect_local_cpu):
>> * config/i386/i386-c.cc (ix86_target_macros_internal):
>> * config/i386/i386-options.cc (m_LUJIAZUI):
>> * config/i386/i386.h (enum processor_type):
>> * config/i386/i386.md:
>> * config/i386/x86-tune-costs.h (struct processor_costs):
>> * config/i386/x86-tune-sched.cc (ix86_issue_rate):
>> (ix86_adjust_cost):
>> * config/i386/x86-tune.def (X86_TUNE_SCHEDULE):
>> (X86_TUNE_PARTIAL_REG_DEPENDENCY):
>> (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY):
>> (X86_TUNE_SSE_PARTIAL_REG_FP_CONVERTS_DEPENDENCY):
>> (X86_TUNE_SSE_PARTIAL_REG_CONVERTS_DEPENDENCY):
>> (X86_TUNE_MOVX):
>> (X86_TUNE_MEMOR

Re: [PATCH] i386: correct division modeling in lujiazui.md

2022-12-20 Thread Mayshao-oc
>Ping. If there are any questions or concerns about the patch, please let me
>know: I'm interested in continuing this cleanup at least for older AMD models.
>
Thanks for your patch.
We are running the SPEC CPU 2017 benchmark to get performance numbers, which 
takes some time. 
Once we have the results, we will give feedback right away. 
BR 
Mayshao
>I noticed I had an extra line in my Changelog:
>
>>  (lua_sseicvt_si): Ditto.
>
>It got there accidentally and I will drop it.
>
>Alexander
>
>On Fri, 9 Dec 2022, Alexander Monakov wrote:
>
>> Model the divider in Lujiazui processors as a separate automaton to 
>> significantly reduce the overall model size. This should also result 
>> in improved accuracy, as pipe 0 should be able to accept new 
>> instructions while the divider is occupied.
>> 
>> It is unclear why integer divisions are modeled as if pipes 0-3 are 
>> all occupied. I've opted to keep a single-cycle reservation of all 
>> four pipes together, so GCC should continue trying to pack 
>> instructions around a division accordingly.
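
The accuracy argument above can be illustrated with a toy occupancy model. This is not GCC's automaton code; the unit masks and function names are hypothetical. It only contrasts the old reservation shape (all four pipes held for the whole division) with the new one (pipes held for one cycle, then only the divider).

```c
/* Hypothetical unit masks for the toy model.  */
enum { P0 = 1, P1 = 2, P2 = 4, P3 = 8, DIV = 16 };
#define ALL_PIPES (P0 | P1 | P2 | P3)

/* Old shape, "p0p1p2p3*N": all four pipes are held for N cycles.
   busy[] must have at least n_cycles entries.  */
void reserve_old(unsigned busy[], int n_cycles)
{
  for (int i = 0; i < n_cycles; i++)
    busy[i] |= ALL_PIPES;
}

/* New shape, "p0p1p2p3,div*N": the pipes are held for one cycle, then
   only the divider for N cycles.  busy[] must have at least
   n_cycles + 1 entries.  */
void reserve_new(unsigned busy[], int n_cycles)
{
  busy[0] |= ALL_PIPES;
  for (int i = 1; i <= n_cycles; i++)
    busy[i] |= DIV;
}

/* First cycle in [0, horizon) where `unit` is free, or horizon if none.  */
int first_free_cycle(const unsigned busy[], int horizon, unsigned unit)
{
  for (int i = 0; i < horizon; i++)
    if (!(busy[i] & unit))
      return i;
  return horizon;
}
```

For a 21-cycle division, the old shape keeps pipe 0 busy until cycle 21, while the new shape frees it at cycle 1, which is the behaviour the patch description aims for.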
>> 
>> Currently top three symbols in insn-automata.o are:
>> 
>> 106102 r lujiazui_core_check
>> 106102 r lujiazui_core_transitions
>> 196123 r lujiazui_core_min_issue_delay
>> 
>> This patch shrinks all lujiazui tables to:
>> 
>> 3 r lujiazui_decoder_min_issue_delay
>> 20 r lujiazui_decoder_transitions
>> 32 r lujiazui_agu_min_issue_delay
>> 126 r lujiazui_agu_transitions
>> 304 r lujiazui_div_base
>> 352 r lujiazui_div_check
>> 352 r lujiazui_div_transitions
>> 1152 r lujiazui_core_min_issue_delay
>> 1592 r lujiazui_agu_translate
>> 1592 r lujiazui_core_translate
>> 1592 r lujiazui_decoder_translate
>> 1592 r lujiazui_div_translate
>> 3952 r lujiazui_div_min_issue_delay
>> 9216 r lujiazui_core_transitions
>> 
>> This continues the work on reducing i386 insn-automata.o size started 
>> with similar fixes for division and multiplication instructions in 
>> znver.md [1][2]. I plan to submit corresponding fixes for 
>> b[td]ver[123].md as well.
>> 
>> [1] https://inbox.sourceware.org/gcc-patches/23c795d6-403c-5927-e610-f0f1215f5...@ispras.ru/T/#m36e069d43d07d768d4842a779e26b4a0915cc543
>> [2] https://inbox.sourceware.org/gcc-patches/20221101162637.14238-1-amonako...@ispras.ru/
>> 
>> gcc/ChangeLog:
>> 
>>  PR target/87832
>>  * config/i386/lujiazui.md (lujiazui_div): New automaton.
>>  (lua_div): New unit.
>>  (lua_idiv_qi): Correct unit in the reservation.
>>  (lua_idiv_qi_load): Ditto.
>>  (lua_idiv_hi): Ditto.
>>  (lua_idiv_hi_load): Ditto.
>>  (lua_idiv_si): Ditto.
>>  (lua_idiv_si_load): Ditto.
>>  (lua_idiv_di): Ditto.
>>  (lua_idiv_di_load): Ditto.
>>  (lua_fdiv_SF): Ditto.
>>  (lua_fdiv_SF_load): Ditto.
>>  (lua_fdiv_DF): Ditto.
>>  (lua_fdiv_DF_load): Ditto.
>>  (lua_fdiv_XF): Ditto.
>>  (lua_fdiv_XF_load): Ditto.
>>  (lua_ssediv_SF): Ditto.
>>  (lua_ssediv_load_SF): Ditto.
>>  (lua_ssediv_V4SF): Ditto.
>>  (lua_ssediv_load_V4SF): Ditto.
>>  (lua_ssediv_V8SF): Ditto.
>>  (lua_ssediv_load_V8SF): Ditto.
>>  (lua_ssediv_SD): Ditto.
>>  (lua_ssediv_load_SD): Ditto.
>>  (lua_ssediv_V2DF): Ditto.
>>  (lua_ssediv_load_V2DF): Ditto.
>>  (lua_ssediv_V4DF): Ditto.
>>  (lua_ssediv_load_V4DF): Ditto.
>>  (lua_sseicvt_si): Ditto.
>> ---
>>  gcc/config/i386/lujiazui.md | 58 
>> +++--
>>  1 file changed, 30 insertions(+), 28 deletions(-)
>> 
>> diff --git a/gcc/config/i386/lujiazui.md b/gcc/config/i386/lujiazui.md 
>> index 9046c09f2..58a230c70 100644
>> --- a/gcc/config/i386/lujiazui.md
>> +++ b/gcc/config/i386/lujiazui.md
>> @@ -19,8 +19,8 @@
>>  
>>  ;; Scheduling for ZHAOXIN lujiazui processor.
>>  
>> -;; Modeling automatons for decoders, execution pipes and AGU pipes.
>> -(define_automaton "lujiazui_decoder,lujiazui_core,lujiazui_agu")
>> +;; Modeling automatons for decoders, execution pipes, AGU pipes, and divider.
>> +(define_automaton "lujiazui_decoder,lujiazui_core,lujiazui_agu,lujiazui_div")
>>  
>>  ;; The rules for the decoder are simple:
>>  ;;  - an instruction with 1 uop can be decoded by any of the three
>> @@ -55,6 +55,8 @@ (define_reservation "lua_decoder01" "lua_decoder0|lua_decoder1")
>>  (define_cpu_unit "lua_p0,lua_p1,lua_p2,lua_p3" "lujiazui_core")
>>  (define_cpu_unit "lua_p4,lua_p5" "lujiazui_agu")
>>  
>> +(define_cpu_unit "lua_div" "lujiazui_div")
>> +
>>  (define_reservation "lua_p03" "lua_p0|lua_p3")
>>  (define_reservation "lua_p12" "lua_p1|lua_p2")
>>  (define_reservation "lua_p1p2" "lua_p1+lua_p2")
>> @@ -229,56 +231,56 @@ (define_insn_reservation "lua_idiv_qi" 21
>>(and (eq_attr "memory" "none")
>> (and (eq_attr "mode" "QI")
>>  (eq_attr "type" "idiv"
>> - "lua_decoder0,lua_p0p1p2p3*21")
>> + "lua_decoder0,lua_p0p1p2p3,lua_di

Re: [PATCH] i386: correct division modeling in lujiazui.md

2022-12-29 Thread Mayshao-oc
>Ping. If there are any questions or concerns about the patch, please let me
>know: I'm interested in continuing this cleanup at least for older AMD models.
>
Hi Alexander:
According to our SPEC CPU 2017 benchmark results, the patch looks good on 
lujiazui. 
BR 
Mayshao
>I noticed I had an extra line in my Changelog:
>
>>  (lua_sseicvt_si): Ditto.
>
>It got there accidentally and I will drop it.
>
>Alexander
>
>On Fri, 9 Dec 2022, Alexander Monakov wrote:
>
>> Model the divider in Lujiazui processors as a separate automaton to 
>> significantly reduce the overall model size. This should also result 
>> in improved accuracy, as pipe 0 should be able to accept new 
>> instructions while the divider is occupied.
>> 
>> It is unclear why integer divisions are modeled as if pipes 0-3 are 
>> all occupied. I've opted to keep a single-cycle reservation of all 
>> four pipes together, so GCC should continue trying to pack 
>> instructions around a division accordingly.
>> 
>> Currently top three symbols in insn-automata.o are:
>> 
>> 106102 r lujiazui_core_check
>> 106102 r lujiazui_core_transitions
>> 196123 r lujiazui_core_min_issue_delay
>> 
>> This patch shrinks all lujiazui tables to:
>> 
>> 3 r lujiazui_decoder_min_issue_delay
>> 20 r lujiazui_decoder_transitions
>> 32 r lujiazui_agu_min_issue_delay
>> 126 r lujiazui_agu_transitions
>> 304 r lujiazui_div_base
>> 352 r lujiazui_div_check
>> 352 r lujiazui_div_transitions
>> 1152 r lujiazui_core_min_issue_delay
>> 1592 r lujiazui_agu_translate
>> 1592 r lujiazui_core_translate
>> 1592 r lujiazui_decoder_translate
>> 1592 r lujiazui_div_translate
>> 3952 r lujiazui_div_min_issue_delay
>> 9216 r lujiazui_core_transitions
>> 
>> This continues the work on reducing i386 insn-automata.o size started 
>> with similar fixes for division and multiplication instructions in 
>> znver.md [1][2]. I plan to submit corresponding fixes for 
>> b[td]ver[123].md as well.
>> 
>> [1] https://inbox.sourceware.org/gcc-patches/23c795d6-403c-5927-e610-f0f1215f5...@ispras.ru/T/#m36e069d43d07d768d4842a779e26b4a0915cc543
>> [2] https://inbox.sourceware.org/gcc-patches/20221101162637.14238-1-amonako...@ispras.ru/
>> 
>> gcc/ChangeLog:
>> 
>>  PR target/87832
>>  * config/i386/lujiazui.md (lujiazui_div): New automaton.
>>  (lua_div): New unit.
>>  (lua_idiv_qi): Correct unit in the reservation.
>>  (lua_idiv_qi_load): Ditto.
>>  (lua_idiv_hi): Ditto.
>>  (lua_idiv_hi_load): Ditto.
>>  (lua_idiv_si): Ditto.
>>  (lua_idiv_si_load): Ditto.
>>  (lua_idiv_di): Ditto.
>>  (lua_idiv_di_load): Ditto.
>>  (lua_fdiv_SF): Ditto.
>>  (lua_fdiv_SF_load): Ditto.
>>  (lua_fdiv_DF): Ditto.
>>  (lua_fdiv_DF_load): Ditto.
>>  (lua_fdiv_XF): Ditto.
>>  (lua_fdiv_XF_load): Ditto.
>>  (lua_ssediv_SF): Ditto.
>>  (lua_ssediv_load_SF): Ditto.
>>  (lua_ssediv_V4SF): Ditto.
>>  (lua_ssediv_load_V4SF): Ditto.
>>  (lua_ssediv_V8SF): Ditto.
>>  (lua_ssediv_load_V8SF): Ditto.
>>  (lua_ssediv_SD): Ditto.
>>  (lua_ssediv_load_SD): Ditto.
>>  (lua_ssediv_V2DF): Ditto.
>>  (lua_ssediv_load_V2DF): Ditto.
>>  (lua_ssediv_V4DF): Ditto.
>>  (lua_ssediv_load_V4DF): Ditto.
>>  (lua_sseicvt_si): Ditto.
>> ---
>>  gcc/config/i386/lujiazui.md | 58 
>> +++--
>>  1 file changed, 30 insertions(+), 28 deletions(-)
>> 
>> diff --git a/gcc/config/i386/lujiazui.md b/gcc/config/i386/lujiazui.md 
>> index 9046c09f2..58a230c70 100644
>> --- a/gcc/config/i386/lujiazui.md
>> +++ b/gcc/config/i386/lujiazui.md
>> @@ -19,8 +19,8 @@
>>  
>>  ;; Scheduling for ZHAOXIN lujiazui processor.
>>  
>> -;; Modeling automatons for decoders, execution pipes and AGU pipes.
>> -(define_automaton "lujiazui_decoder,lujiazui_core,lujiazui_agu")
>> +;; Modeling automatons for decoders, execution pipes, AGU pipes, and divider.
>> +(define_automaton "lujiazui_decoder,lujiazui_core,lujiazui_agu,lujiazui_div")
>>  
>>  ;; The rules for the decoder are simple:
>>  ;;  - an instruction with 1 uop can be decoded by any of the three
>> @@ -55,6 +55,8 @@ (define_reservation "lua_decoder01" "lua_decoder0|lua_decoder1")
>>  (define_cpu_unit "lua_p0,lua_p1,lua_p2,lua_p3" "lujiazui_core")
>>  (define_cpu_unit "lua_p4,lua_p5" "lujiazui_agu")
>>  
>> +(define_cpu_unit "lua_div" "lujiazui_div")
>> +
>>  (define_reservation "lua_p03" "lua_p0|lua_p3")
>>  (define_reservation "lua_p12" "lua_p1|lua_p2")
>>  (define_reservation "lua_p1p2" "lua_p1+lua_p2")
>> @@ -229,56 +231,56 @@ (define_insn_reservation "lua_idiv_qi" 21
>>(and (eq_attr "memory" "none")
>> (and (eq_attr "mode" "QI")
>>  (eq_attr "type" "idiv"
>> - "lua_decoder0,lua_p0p1p2p3*21")
>> + "lua_decoder0,lua_p0p1p2p3,lua_div*21")
>>  
>>  (define_insn_reservation "lua_idiv_qi_load" 25
>>

[PATCH] [libatomic]: Handle AVX+CX16 ZHAOXIN like intel for 16b atomic [PR104688]

2024-07-11 Thread MayShao-oc
From: mayshao 

Hi all:
We replied in PR104688 that ZHAOXIN guarantees that a 16-byte VMOVDQA on a 
16-byte aligned address is atomic if the memory type of the address is WB, so 
there is no need to clear bit_AVX on ZHAOXIN CPUs.

Bootstrapped /regtested X86_64.

Ok for trunk?
BR
Mayshao

libatomic/ChangeLog:

PR target/104688
* config/x86/init.c (__libat_feat1_init): Don't clear
bit_AVX on ZHAOXIN CPUs.
---
 libatomic/config/x86/init.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/libatomic/config/x86/init.c b/libatomic/config/x86/init.c
index a75be3f175c..3740c88a936 100644
--- a/libatomic/config/x86/init.c
+++ b/libatomic/config/x86/init.c
@@ -34,20 +34,6 @@ __libat_feat1_init (void)
   unsigned int eax, ebx, ecx, edx;
   FEAT1_REGISTER = 0;
   __get_cpuid (1, &eax, &ebx, &ecx, &edx);
-#ifdef __x86_64__
-  if ((FEAT1_REGISTER & (bit_AVX | bit_CMPXCHG16B))
-  == (bit_AVX | bit_CMPXCHG16B))
-{
-  /* Intel SDM guarantees that 16-byte VMOVDQA on 16-byte aligned address
-is atomic, and AMD is going to do something similar soon.
-We don't have a guarantee from vendors of other CPUs with AVX,
-like Zhaoxin and VIA.  */
-  unsigned int ecx2 = 0;
-  __get_cpuid (0, &eax, &ebx, &ecx2, &edx);
-  if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx)
-   FEAT1_REGISTER &= ~bit_AVX;
-}
-#endif
   /* See the load in load_feat1.  */
   __atomic_store_n (&__libat_feat1, FEAT1_REGISTER, __ATOMIC_RELAXED);
   return FEAT1_REGISTER;
-- 
2.27.0



[PATCH v2] [libatomic]: Handle AVX+CX16 ZHAOXIN like intel for 16b atomic [PR104688]

2024-07-18 Thread MayShao-oc
From: mayshao 

Hi Jakub:

Thanks for your review. We should just amend this to handle Zhaoxin.

Bootstrapped /regtested X86_64.

Ok for trunk?
BR
Mayshao

libatomic/ChangeLog:

PR target/104688
* config/x86/init.c (__libat_feat1_init): Don't clear
bit_AVX on ZHAOXIN CPUs.
---
 libatomic/config/x86/init.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/libatomic/config/x86/init.c b/libatomic/config/x86/init.c
index a75be3f175c..0d6864909bb 100644
--- a/libatomic/config/x86/init.c
+++ b/libatomic/config/x86/init.c
@@ -39,12 +39,15 @@ __libat_feat1_init (void)
   == (bit_AVX | bit_CMPXCHG16B))
 {
   /* Intel SDM guarantees that 16-byte VMOVDQA on 16-byte aligned address
-is atomic, and AMD is going to do something similar soon.
-We don't have a guarantee from vendors of other CPUs with AVX,
-like Zhaoxin and VIA.  */
-  unsigned int ecx2 = 0;
+is atomic, and AMD is going to do something similar soon. Zhaoxin also
+guarantees this. We don't have a guarantee from vendors of other CPUs 
+with AVX,like VIA.  */
+  unsigned int ecx2 = 0, family = 0;
+  family = (eax >> 8) & 0x0f;
   __get_cpuid (0, &eax, &ebx, &ecx2, &edx);
-  if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx)
+  if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx
+  && !(ecx2 == signature_CENTAUR_ecx && family > 0x6)
+  && ecx2 != signature_SHANGHAI_ecx)
FEAT1_REGISTER &= ~bit_AVX;
 }
 #endif
-- 
2.27.0
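
The vendor test in the v2 patch above can be read as a single predicate, sketched below as standalone C. The helper names are hypothetical and the signature values are rebuilt locally from the vendor strings instead of using the `signature_*_ecx` macros from cpuid.h; the vendor strings themselves ("GenuineIntel", "AuthenticAMD", "CentaurHauls", "  Shanghai  ") are an assumption of this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: pack four ASCII characters as CPUID leaf 0
   returns them in a register (first character in the low byte).  */
uint32_t pack4(const char *s)
{
  return (uint32_t)(unsigned char)s[0]
         | ((uint32_t)(unsigned char)s[1] << 8)
         | ((uint32_t)(unsigned char)s[2] << 16)
         | ((uint32_t)(unsigned char)s[3] << 24);
}

/* Family field as extracted in the patch: bits 11:8 of CPUID leaf 1 EAX.  */
unsigned cpuid_base_family(uint32_t leaf1_eax)
{
  return (leaf1_eax >> 8) & 0x0f;
}

/* Returns 1 when the patch would leave bit_AVX set, i.e. the vendor is
   treated as guaranteeing atomic 16-byte aligned VMOVDQA; 0 otherwise.
   vendor_ecx is ECX from CPUID leaf 0 (last four vendor characters).  */
int keeps_avx_for_16b_atomics(uint32_t vendor_ecx, unsigned family)
{
  if (vendor_ecx == pack4("ntel"))                  /* GenuineIntel */
    return 1;
  if (vendor_ecx == pack4("cAMD"))                  /* AuthenticAMD */
    return 1;
  if (vendor_ecx == pack4("auls") && family > 0x6)  /* CentaurHauls, family 7+ */
    return 1;
  if (vendor_ecx == pack4("ai  "))                  /* "  Shanghai  " */
    return 1;
  return 0;
}
```

Note that the patch computes the family from the leaf 1 EAX value before the second `__get_cpuid` call overwrites `eax`, which is why the sketch takes the family as a separate argument.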



Re: [PATCH] invoke.texi: Clarify -march=lujiazui

2024-05-23 Thread mayshao-oc

Hi Jakub:

  I think the modified lujiazui description matches what actually 
happens, thanks.


BR
Mayshao




Hi!

Yesterday I was searching which exact CPUs are affected by the PR114576
wrong-code issue and went from the PTA_* bitmasks in GCC, so arrived
at the goldmont, goldmont-plus, tremont and lujiazui CPUs (as -march=
cases which do enable -maes and don't enable -mavx).
But when double-checking that against the invoke.texi documentation,
that was true for the first 3, but lujiazui said it supported AVX.
I was really confused by that, until I found the
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604407.html
explanation.  So, seems the CPUs do have AVX and F16C but -march=lujiazui
doesn't enable those and even actively attempts to filter those out from
the announced CPUID features, in glibc as well as e.g. in libgcc.

Thus, I think we should document what actually happens, otherwise
users could assume that
gcc -march=lujiazui predefines __AVX__ and __F16C__, which it doesn't.
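
A quick way to see what a given -march value predefines is to test the macros directly. The sketch below is only an illustration with hypothetical function names; it reports whether the translation unit was compiled with `__AVX__` and `__F16C__` defined.

```c
/* Returns 1 if this file was compiled with AVX enabled (for example
   via -mavx or an -march value that implies it), 0 otherwise.  */
int compiled_with_avx(void)
{
#ifdef __AVX__
  return 1;
#else
  return 0;
#endif
}

/* Same check for F16C.  */
int compiled_with_f16c(void)
{
#ifdef __F16C__
  return 1;
#else
  return 0;
#endif
}
```

Per the explanation above, both functions would be expected to return 0 in a file built with -march=lujiazui and no additional -mavx or -mf16c.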

Tested on x86_64, ok for trunk?

2024-04-11  Jakub Jelinek  

 * doc/invoke.texi (lujiazui): Clarify that while the CPUs do support
 AVX and F16C, -march=lujiazui actually doesn't enable those.

--- gcc/doc/invoke.texi.jj  2024-04-11 09:26:01.156865894 +0200
+++ gcc/doc/invoke.texi 2024-04-11 10:47:53.457582922 +0200
@@ -34696,8 +34696,10 @@ instruction set support.

  @item lujiazui
  ZHAOXIN lujiazui CPU with x86-64, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1,
-SSE4.2, AVX, POPCNT, AES, PCLMUL, RDRND, XSAVE, XSAVEOPT, FSGSBASE, CX16,
-ABM, BMI, BMI2, F16C, FXSR, RDSEED instruction set support.
+SSE4.2, POPCNT, AES, PCLMUL, RDRND, XSAVE, XSAVEOPT, FSGSBASE, CX16,
+ABM, BMI, BMI2, FXSR, RDSEED instruction set support.  While the CPUs
+do support AVX and F16C, these aren't enabled by @code{-march=lujiazui}
+for performance reasons.

  @item yongfeng
  ZHAOXIN yongfeng CPU with x86-64, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1,

 Jakub



Re: [PATCH] [x86_64]: Zhaoxin shijidadao enablement

2024-06-18 Thread mayshao-oc




On 5/28/24 14:15, Uros Bizjak wrote:




On Mon, May 27, 2024 at 10:33 AM MayShao  wrote:


From: mayshao 

Hi all:
 This patch enables -march/-mtune=shijidadao; costs and tunings are set
according to the characteristics of the processor.

 Bootstrapped/regtested on x86_64.

 Ok for trunk?


OK.

Thanks,
Uros.


Thanks for your review, please help me commit.

BR
Mayshao




BR
Mayshao
gcc/ChangeLog:

 * common/config/i386/cpuinfo.h (get_zhaoxin_cpu): Recognize shijidadao.
 * common/config/i386/i386-common.cc: Add shijidadao.
 * common/config/i386/i386-cpuinfo.h (enum processor_subtypes):
 Add ZHAOXIN_FAM7H_SHIJIDADAO.
 * config.gcc: Add shijidadao.
 * config/i386/driver-i386.cc (host_detect_local_cpu):
 Let -march=native recognize shijidadao processors.
 * config/i386/i386-c.cc (ix86_target_macros_internal): Add shijidadao.
 * config/i386/i386-options.cc (m_ZHAOXIN): Add m_SHIJIDADAO.
 (m_SHIJIDADAO): New definition.
 * config/i386/i386.h (enum processor_type): Add PROCESSOR_SHIJIDADAO.
 * config/i386/x86-tune-costs.h (struct processor_costs):
 Add shijidadao_cost.
 * config/i386/x86-tune-sched.cc (ix86_issue_rate): Add shijidadao.
 (ix86_adjust_cost): Ditto.
 * config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS):
 Add m_SHIJIDADAO.
 (X86_TUNE_USE_GATHER_4PARTS): Ditto.
 (X86_TUNE_USE_GATHER_8PARTS): Ditto.
 (X86_TUNE_AVOID_128FMA_CHAINS): Ditto.
 * doc/extend.texi: Add details about shijidadao.
 * doc/invoke.texi: Ditto.

gcc/testsuite/ChangeLog:

 * g++.target/i386/mv32.C: Handle new -march.
 * gcc.target/i386/funcspec-56.inc: Ditto.
---
  gcc/common/config/i386/cpuinfo.h              |   8 +-
  gcc/common/config/i386/i386-common.cc         |   8 +-
  gcc/common/config/i386/i386-cpuinfo.h         |   1 +
  gcc/config.gcc                                |  14 ++-
  gcc/config/i386/driver-i386.cc                |  11 +-
  gcc/config/i386/i386-c.cc                     |   7 ++
  gcc/config/i386/i386-options.cc               |   4 +-
  gcc/config/i386/i386.h                        |   1 +
  gcc/config/i386/x86-tune-costs.h              | 116 ++
  gcc/config/i386/x86-tune-sched.cc             |   2 +
  gcc/config/i386/x86-tune.def                  |   8 +-
  gcc/doc/extend.texi                           |   3 +
  gcc/doc/invoke.texi                           |   6 +
  gcc/testsuite/g++.target/i386/mv32.C          |   6 +
  gcc/testsuite/gcc.target/i386/funcspec-56.inc |   2 +
  15 files changed, 183 insertions(+), 14 deletions(-)

diff --git a/gcc/common/config/i386/cpuinfo.h b/gcc/common/config/i386/cpuinfo.h
index 4610bf6d6a4..936039725ab 100644
--- a/gcc/common/config/i386/cpuinfo.h
+++ b/gcc/common/config/i386/cpuinfo.h
@@ -667,12 +667,18 @@ get_zhaoxin_cpu (struct __processor_model *cpu_model,
   reset_cpu_feature (cpu_model, cpu_features2, FEATURE_F16C);
   cpu_model->__cpu_subtype = ZHAOXIN_FAM7H_LUJIAZUI;
 }
- else if (model >= 0x5b)
+ else if (model == 0x5b)
 {
   cpu = "yongfeng";
   CHECK___builtin_cpu_is ("yongfeng");
   cpu_model->__cpu_subtype = ZHAOXIN_FAM7H_YONGFENG;
 }
+ else if (model >= 0x6b)
+   {
+ cpu = "shijidadao";
+ CHECK___builtin_cpu_is ("shijidadao");
+ cpu_model->__cpu_subtype = ZHAOXIN_FAM7H_SHIJIDADAO;
+   }
break;
  default:
break;
diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
index 895e5fa662d..eb3f94c529c 100644
--- a/gcc/common/config/i386/i386-common.cc
+++ b/gcc/common/config/i386/i386-common.cc
@@ -2066,6 +2066,7 @@ const char *const processor_names[] =
"intel",
"lujiazui",
"yongfeng",
+  "shijidadao",
"geode",
"k6",
"athlon",
@@ -2271,10 +2272,13 @@ const pta processor_alias_table[] =
| PTA_SSSE3 | PTA_SSE4_1 | PTA_FXSR, 0, P_NONE},
{"lujiazui", PROCESSOR_LUJIAZUI, CPU_LUJIAZUI,
 PTA_LUJIAZUI,
-   M_CPU_SUBTYPE (ZHAOXIN_FAM7H_LUJIAZUI), P_NONE},
+   M_CPU_SUBTYPE (ZHAOXIN_FAM7H_LUJIAZUI), P_PROC_BMI},
{"yongfeng", PROCESSOR_YONGFENG, CPU_YONGFENG,
 PTA_YONGFENG,
-   M_CPU_SUBTYPE (ZHAOXIN_FAM7H_YONGFENG), P_NONE},
+   M_CPU_SUBTYPE (ZHAOXIN_FAM7H_YONGFENG), P_PROC_AVX2},
+  {"shijidadao", PROCESSOR_SHIJIDADAO, CPU_YONGFENG,
+   PTA_YONGFENG,
+   M_CPU_SUBTYPE (ZHAOXIN_FAM7H_SHIJIDADAO), P_PROC_AVX2},
{"k8", PROCESSOR_K8, CPU_K8,
  PTA_64BIT | PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE
| PTA_SSE2 | PTA_NO_SAHF | PTA_FXSR, 0, P_NONE},
diff --git a/gcc/common/config/i386/i386-cpuinfo.h b/gcc/common/config/i386/i386-cpuinfo.h
index 9edad96d4fd..fa3b76f4931 100644
--- a/gcc/common/config/i386/i386-cpuinfo.h
+++ b/gcc/common/config/i386/i386-cpuinfo.h
@@ -104,6 +104,7 @@ enum processor_subtypes
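
To make the effect of the get_zhaoxin_cpu hunk above easier to see, here
is a rough sketch of the model dispatch after the change.  It is not the
GCC code: the function name is hypothetical, only the two branches visible
in the hunk are modelled, and the lujiazui branch is left out because its
condition is not shown there.

```c
#include <stddef.h>

/* Hypothetical stand-in for the family-7 model checks touched by the
   patch.  Returns the CPU name selected for MODEL, or NULL when neither
   of the two branches shown in the hunk matches.  */
static const char *
zhaoxin_fam7_name (unsigned int model)
{
  if (model == 0x5b)      /* Previously "model >= 0x5b".  */
    return "yongfeng";
  if (model >= 0x6b)      /* New branch added by the patch.  */
    return "shijidadao";
  return NULL;
}
```

One consequence worth noting: with the comparison changed from >= to ==,
models 0x5c through 0x6a no longer match either of these two branches.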

[PATCH] [x86_64] Add flag to control tight loops alignment opt

2024-11-04 Thread MayShao-oc
Hi all:
This patch adds a -malign-tight-loops flag to control
pass_align_tight_loops.
The motivation is that pass_align_tight_loops may cause a performance
regression in nested loops.

The example code is as follows:

#define ITER 20000
#define ITER_O 10

int i, j, k;
int array[ITER];

void loop()
{
  int i;
  for(k = 0; k < ITER_O; k++)
    for(j = 0; j < ITER; j++)
      for(i = 0; i < ITER; i++)
      {
        array[i] += j;
        array[i] += i;
        array[i] += 2*j;
        array[i] += 2*i;
      }
}

When I compile it with gcc -O1 loop.c, the output assembly is as follows.
It is not optimal because too many nops are inserted in the outer loop.

0000000000400540 <loop>:
  400540:   48 83 ec 08             sub    $0x8,%rsp
  400544:   bf 0a 00 00 00          mov    $0xa,%edi
  400549:   b9 00 00 00 00          mov    $0x0,%ecx
  40054e:   8d 34 09                lea    (%rcx,%rcx,1),%esi
  400551:   b8 00 00 00 00          mov    $0x0,%eax
  400556:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  40055d:   00 00 00 00
  400561:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  400568:   00 00 00 00
  40056c:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  400573:   00 00 00 00
  400577:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  40057e:   00 00
  400580:   89 ca                   mov    %ecx,%edx
  400582:   03 14 85 60 10 60 00    add    0x601060(,%rax,4),%edx
  400589:   01 c2                   add    %eax,%edx
  40058b:   01 f2                   add    %esi,%edx
  40058d:   8d 14 42                lea    (%rdx,%rax,2),%edx
  400590:   89 14 85 60 10 60 00    mov    %edx,0x601060(,%rax,4)
  400597:   48 83 c0 01             add    $0x1,%rax
  40059b:   48 3d 20 4e 00 00       cmp    $0x4e20,%rax
  4005a1:   75 dd                   jne    400580 <loop+0x40>
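
For reference, the amount of padding in the listing above can be worked
out from the addresses.  This is only a sketch under an assumption: the
mail does not state the alignment used, and 64 bytes is inferred here from
the inner loop starting at 0x400580.  The function name is made up for
illustration.

```c
#include <stdint.h>

/* Bytes of padding needed to move ADDR up to the next multiple of ALIGN.
   ALIGN must be a power of two.  Returns 0 if ADDR is already aligned.  */
static uint64_t
padding_to_alignment (uint64_t addr, uint64_t align)
{
  return (align - (addr & (align - 1))) & (align - 1);
}
```

With the first nop at 0x400556 and a 64-byte boundary this gives 42 bytes,
which matches the 11 + 11 + 11 + 9 bytes of nops in the listing.  Those
bytes sit after the "mov $0x0,%eax" that starts each run of the inner
loop, so they are executed on every iteration of the enclosing loop.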

   I benchmarked this program on an Intel Xeon and found that the
optimization may cause a 40% performance regression
(6.6B cycles vs. 9.3B cycles).
   So I propose adding a -malign-tight-loops flag to control the tight loop
alignment optimization. To avoid this regression, we could disable the
optimization by default.
   Bootstrapped on x86_64.
   Ok for trunk?

BR
Mayshao

gcc/ChangeLog:

* config/i386/i386-features.cc (ix86_align_tight_loops): New flag.
* config/i386/i386.opt (malign-tight-loops): New option.
* doc/invoke.texi (-malign-tight-loops): Document.
---
 gcc/config/i386/i386-features.cc | 4 +++-
 gcc/config/i386/i386.opt         | 4 ++++
 gcc/doc/invoke.texi              | 7 ++++++-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index e2e85212a4f..f9546e00b07 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -3620,7 +3620,9 @@ public:
   /* opt_pass methods: */
   bool gate (function *) final override
 {
-  return optimize && optimize_function_for_speed_p (cfun);
+  return ix86_align_tight_loops
+  && optimize
+  && optimize_function_for_speed_p (cfun);
 }
 
   unsigned int execute (function *) final override
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 64c295d344c..ec41de192bc 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1266,6 +1266,10 @@ mlam=
 Target RejectNegative Joined Enum(lam_type) Var(ix86_lam_type) Init(lam_none)
 -mlam=[none|u48|u57] Instrument meta data position in user data pointers.
 
+malign-tight-loops
+Target Var(ix86_align_tight_loops) Init(0) Optimization
+Enable align tight loops.
+
 Enum
 Name(lam_type) Type(enum lam_type) UnknownError(unknown lam type %qs)
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 07920e07b4d..9ec1e1f0095 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1510,7 +1510,7 @@ See RS/6000 and PowerPC Options.
 -mindirect-branch=@var{choice}  -mfunction-return=@var{choice}
 -mindirect-branch-register -mharden-sls=@var{choice}
 -mindirect-branch-cs-prefix -mneeded -mno-direct-extern-access
--munroll-only-small-loops -mlam=@var{choice}}
+-munroll-only-small-loops -mlam=@var{choice} -malign-tight-loops}
 
 @emph{x86 Windows Options}
 
@@ -36530,6 +36530,11 @@ LAM(linear-address masking) allows special bits in the pointer to be used
 for metadata. The default is @samp{none}. With @samp{u48}, pointer bits in
 positions 62:48 can be used for metadata; With @samp{u57}, pointer bits in
 positions 62:57 can be used for metadata.
+
+@opindex malign-tight-loops
+@opindex mno-align-tight-loops
+@item -malign-tight-loops
+Controls tight loop alignment optimization.
 @end table
 
 @node x86 Windows Options
-- 
2.27.0
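
As a reading aid, the gate condition from the i386-features.cc hunk above
can be written as a standalone predicate.  This is a sketch only: the
function and parameter names are hypothetical stand-ins for
ix86_align_tight_loops, the global optimize level, and the result of
optimize_function_for_speed_p (cfun).

```c
#include <stdbool.h>

/* Hypothetical model of the patched gate: the pass runs only when the
   new flag is set, optimization is enabled, and the function is being
   optimized for speed.  */
static bool
align_tight_loops_gate (int align_tight_loops_flag, int optimize_level,
                        bool optimizing_for_speed)
{
  return align_tight_loops_flag != 0
         && optimize_level > 0
         && optimizing_for_speed;
}
```

Since the option is declared with Init(0), the first argument is 0 unless
-malign-tight-loops is given, so with this patch the pass is off by
default at every optimization level.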



Re: [PATCH] [x86_64] Add microarchtecture tunable for pass_align_tight_loops

2024-11-06 Thread Mayshao-oc
> > On Thu, Nov 7, 2024 at 10:29 AM MayShao-oc  wrote:
> >
> > Hi all:
> >For zhaoxin, I find no improvement when enabling pass_align_tight_loops,
> > and see a performance drop in some cases.
> >This patch adds a new tunable to bypass pass_align_tight_loops on zhaoxin.
> >
> >Bootstrapped X86_64.
> >Ok for trunk?
> > BR
> > Mayshao
> > gcc/ChangeLog:
> >
> > * config/i386/i386-features.cc (TARGET_ALIGN_TIGHT_LOOPS):
> > Default to true for all processors except zhaoxin.
> > * config/i386/i386.h (TARGET_ALIGN_TIGHT_LOOPS): New macro.
> > * config/i386/x86-tune.def (X86_TUNE_ALIGN_TIGHT_LOOPS):
> > New tune.
> > ---
> >  gcc/config/i386/i386-features.cc | 4 +++-
> >  gcc/config/i386/i386.h           | 3 +++
> >  gcc/config/i386/x86-tune.def     | 4 ++++
> >  3 files changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
> > index e2e85212a4f..d9fd92964fe 100644
> > --- a/gcc/config/i386/i386-features.cc
> > +++ b/gcc/config/i386/i386-features.cc
> > @@ -3620,7 +3620,9 @@ public:
> >/* opt_pass methods: */
> >bool gate (function *) final override
> >  {
> > -  return optimize && optimize_function_for_speed_p (cfun);
> > +  return TARGET_ALIGN_TIGHT_LOOPS
> > +&& optimize
> > +&& optimize_function_for_speed_p (cfun);
> >  }
> >
> >unsigned int execute (function *) final override
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index 2dcd8803a08..7f9010246c2 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -466,6 +466,9 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
> >  #define TARGET_USE_RCR ix86_tune_features[X86_TUNE_USE_RCR]
> >  #define TARGET_SSE_MOVCC_USE_BLENDV \
> > ix86_tune_features[X86_TUNE_SSE_MOVCC_USE_BLENDV]
> > +#define TARGET_ALIGN_TIGHT_LOOPS \
> > +ix86_tune_features[X86_TUNE_ALIGN_TIGHT_LOOPS]
> > +
> >
> >  /* Feature tests against the various architecture variations.  */
> >  enum ix86_arch_indices {
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 6ebb2fd3414..bd4fa8b3eee 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -542,6 +542,10 @@ DEF_TUNE (X86_TUNE_V2DF_REDUCTION_PREFER_HADDPD,
> >  DEF_TUNE (X86_TUNE_SSE_MOVCC_USE_BLENDV,
> >   "sse_movcc_use_blendv", ~m_CORE_ATOM)
> >
> > +/* X86_TUNE_ALIGN_TIGHT_LOOPS: if false, tight loops are not aligned. */
> > +DEF_TUNE (X86_TUNE_ALIGN_TIGHT_LOOPS, "align_tight_loops",
> > +~(m_ZHAOXIN))
> Please also add m_CASCADELAKE and m_SKYLAKE_AVX512, i.e.
> ~(m_ZHAOXIN | m_CASCADELAKE | m_SKYLAKE_AVX512)
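> And could you put it under the section of
>
>  /*****************************************************************************/
> -/* Branch predictor tuning                                                   */
> +/* Branch predictor and The Front-end tuning                                 */
>  /*****************************************************************************/
> > +
> >  /*****************************************************************************/
> >  /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
> >  /*****************************************************************************/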
> > --
> > 2.27.0
> >
> 
> 
> --
> BR,
> Hongtao

Ok

BR
Mayshao


0001-x86_64-Add-microarchtecture-tunable-for-pass_align_v1.patch
Description: 0001-x86_64-Add-microarchtecture-tunable-for-pass_align_v1.patch


[PATCH] [x86_64] Add microarchtecture tunable for pass_align_tight_loops

2024-11-06 Thread MayShao-oc
Hi all:
   For zhaoxin, I find no improvement when enabling pass_align_tight_loops,
and see a performance drop in some cases.
   This patch adds a new tunable to bypass pass_align_tight_loops on zhaoxin.

   Bootstrapped X86_64.
   Ok for trunk?
BR
Mayshao
gcc/ChangeLog:

* config/i386/i386-features.cc (TARGET_ALIGN_TIGHT_LOOPS):
Default to true for all processors except zhaoxin.
* config/i386/i386.h (TARGET_ALIGN_TIGHT_LOOPS): New macro.
* config/i386/x86-tune.def (X86_TUNE_ALIGN_TIGHT_LOOPS):
New tune.
---
 gcc/config/i386/i386-features.cc | 4 +++-
 gcc/config/i386/i386.h           | 3 +++
 gcc/config/i386/x86-tune.def     | 4 ++++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index e2e85212a4f..d9fd92964fe 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -3620,7 +3620,9 @@ public:
   /* opt_pass methods: */
   bool gate (function *) final override
 {
-  return optimize && optimize_function_for_speed_p (cfun);
+  return TARGET_ALIGN_TIGHT_LOOPS
+&& optimize
+&& optimize_function_for_speed_p (cfun);
 }
 
   unsigned int execute (function *) final override
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 2dcd8803a08..7f9010246c2 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -466,6 +466,9 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 #define TARGET_USE_RCR ix86_tune_features[X86_TUNE_USE_RCR]
 #define TARGET_SSE_MOVCC_USE_BLENDV \
ix86_tune_features[X86_TUNE_SSE_MOVCC_USE_BLENDV]
+#define TARGET_ALIGN_TIGHT_LOOPS \
+ix86_tune_features[X86_TUNE_ALIGN_TIGHT_LOOPS]
+
 
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 6ebb2fd3414..bd4fa8b3eee 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -542,6 +542,10 @@ DEF_TUNE (X86_TUNE_V2DF_REDUCTION_PREFER_HADDPD,
 DEF_TUNE (X86_TUNE_SSE_MOVCC_USE_BLENDV,
  "sse_movcc_use_blendv", ~m_CORE_ATOM)
 
+/* X86_TUNE_ALIGN_TIGHT_LOOPS: if false, tight loops are not aligned. */
+DEF_TUNE (X86_TUNE_ALIGN_TIGHT_LOOPS, "align_tight_loops",
+~(m_ZHAOXIN))
+
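 /*****************************************************************************/
 /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
 /*****************************************************************************/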
-- 
2.27.0



Re: [PATCH] [x86_64] Add microarchtecture tunable for pass_align_tight_loops

2024-11-07 Thread Mayshao-oc
> > -----Original Message-----
> > From: Xi Ruoyao
> > Sent: Thursday, November 7, 2024 1:12 PM
> > To: Liu, Hongtao; Mayshao-oc; Hongtao Liu
> > Cc: gcc-patches@gcc.gnu.org; hubi...@ucw.cz; ubiz...@gmail.com;
> > richard.guent...@gmail.com; Tim Hu(WH-RD); Silvia Zhao(BJ-RD);
> > Louis Qi(BJ-RD); Cobe Chen(BJ-RD)
> > Subject: Re: [PATCH] [x86_64] Add microarchtecture tunable for
> > pass_align_tight_loops
> > On Thu, 2024-11-07 at 04:58 +0000, Liu, Hongtao wrote:
> > > > > > Hi all:
> > > > > > For zhaoxin, I find no improvement when enabling
> > > > > > pass_align_tight_loops, and see a performance drop in some cases.
> > > > > > This patch adds a new tunable to bypass
> > > > > > pass_align_tight_loops on zhaoxin.
> > > > > >
> > > > > > Bootstrapped X86_64.
> > > > > > Ok for trunk?
> > > LGTM.
> >
> > I'd suggest to add the reference to PR 117438 into the subject and 
> > ChangeLog.
> Yes, thanks.
Added PR 117438 to the subject and ChangeLog.
> >
> > --
> > Xi Ruoyao 
> > School of Aerospace Science and Technology, Xidian University
BR
Mayshao


0001-x86_64-Add-microarchtecture-tunable-for-pass_align_v2.patch
Description: 0001-x86_64-Add-microarchtecture-tunable-for-pass_align_v2.patch


Re: [PATCH] [x86_64] Add microarchtecture tunable for pass_align_tight_loops

2024-11-07 Thread Mayshao-oc
> On Fri, Nov 8, 2024 at 10:21 AM Mayshao-oc  wrote:
> > > > -----Original Message-----
> > > > From: Xi Ruoyao
> > > > Sent: Thursday, November 7, 2024 1:12 PM
> > > > To: Liu, Hongtao; Mayshao-oc; Hongtao Liu
> > > > Cc: gcc-patches@gcc.gnu.org; hubi...@ucw.cz; ubiz...@gmail.com;
> > > > richard.guent...@gmail.com; Tim Hu(WH-RD); Silvia Zhao(BJ-RD);
> > > > Louis Qi(BJ-RD); Cobe Chen(BJ-RD)
> > > > Subject: Re: [PATCH] [x86_64] Add microarchtecture tunable for
> > > > pass_align_tight_loops
> > > > On Thu, 2024-11-07 at 04:58 +0000, Liu, Hongtao wrote:
> > > > > > > > Hi all:
> > > > > > > > For zhaoxin, I find no improvement when enabling
> > > > > > > > pass_align_tight_loops, and see a performance drop in some cases.
> > > > > > > > This patch adds a new tunable to bypass
> > > > > > > > pass_align_tight_loops on zhaoxin.
> > > > > > > >
> > > > > > > > Bootstrapped X86_64.
> > > > > > > > Ok for trunk?
> > > > > LGTM.
> > > >
> > > > I'd suggest to add the reference to PR 117438 into the subject and 
> > > > ChangeLog.
> > > Yes, thanks.
> > Add PR 117438 into the subject and ChangeLog.
> PR target/117438
> Others LGTM.
Updated this in the ChangeLog.
I should have filed the PR in Bugzilla under the target category in the first place.
Thanks.
> > > >
> > > > --
> > > > Xi Ruoyao 
> > > > School of Aerospace Science and Technology, Xidian University
> > BR
> > Mayshao
> 
> 
> 
> --
> BR,
> Hongtao
BR
Mayshao


0001-x86_64-Add-microarchtecture-tunable-for-pass_align_v3.patch
Description: 0001-x86_64-Add-microarchtecture-tunable-for-pass_align_v3.patch


Re: [PATCH] [x86_64] Add microarchtecture tunable for pass_align_tight_loops

2024-11-19 Thread Mayshao-oc
> On Fri, Nov 8, 2024 at 10:21 AM Mayshao-oc  wrote:
> >
> > > > -----Original Message-----
> > > > From: Xi Ruoyao
> > > > Sent: Thursday, November 7, 2024 1:12 PM
> > > > To: Liu, Hongtao; Mayshao-oc; Hongtao Liu
> > > > Cc: gcc-patches@gcc.gnu.org; hubi...@ucw.cz; ubiz...@gmail.com;
> > > > richard.guent...@gmail.com; Tim Hu(WH-RD); Silvia Zhao(BJ-RD);
> > > > Louis Qi(BJ-RD); Cobe Chen(BJ-RD)
> > > > Subject: Re: [PATCH] [x86_64] Add microarchtecture tunable for
> > > > pass_align_tight_loops
> > > > On Thu, 2024-11-07 at 04:58 +0000, Liu, Hongtao wrote:
> > > > > > > > Hi all:
> > > > > > > > For zhaoxin, I find no improvement when enabling
> > > > > > > > pass_align_tight_loops, and see a performance drop in some cases.
> > > > > > > > This patch adds a new tunable to bypass
> > > > > > > > pass_align_tight_loops on zhaoxin.
> > > > > > > >
> > > > > > > > Bootstrapped X86_64.
> > > > > > > > Ok for trunk?
> > > > > LGTM.
> > > >
> > > > I'd suggest to add the reference to PR 117438 into the subject and 
> > > > ChangeLog.
> > > Yes, thanks.
> > Add PR 117438 into the subject and ChangeLog.
> PR target/117438
> Others LGTM.
> > > >
> > > > --
> > > > Xi Ruoyao 
> > > > School of Aerospace Science and Technology, Xidian University
> > BR
> > Mayshao
> 
> 
> 
> --
> BR,
> Hongtao

Hi Hongtao:

  It seems there are no further comments. Could you please help me commit this patch?

BR
Mayshao



0001-x86_64-Add-microarchtecture-tunable-for-pass_align_v3.patch
Description: 0001-x86_64-Add-microarchtecture-tunable-for-pass_align_v3.patch