https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110248

Kewen Lin <linkw at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |juzhe.zhong at rivai dot ai,
                   |                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
             Target|                            |powerpc*-linux-gnu

--- Comment #1 from Kewen Lin <linkw at gcc dot gnu.org> ---
Commit r14-1493 caused a 4% degradation on the SPEC2017 fp benchmark
503.bwaves_r with options -Ofast --param=vect-partial-vector-usage=2 on
Power10. As a follow-up to [1], I looked into it and confirmed that it has
nothing to do with the existing load density heuristics. The gap comes from
the hotspot mat_times_vec_; perf showed that the different iv choices lead to
more latency for the length and its further uses.

On further checking, I think it exposed one issue: currently we only check
whether an addressing mode is supported against the mode, without any other
information such as the gimple statement. Unfortunately, the IFNs
len_{load,store} are generated as lxvl/stxvl, which only support one
addressing mode: base register (+ length register, which isn't even an index
register). But when determining the group cost with a cand
(determine_group_iv_cost_address), ivopts is unable to consider this
characteristic, as the current valid_mem_ref_p (valid_mem_ref_p ->
memory_address_addr_space_p -> legitimate_address_p) only checks the mode,
address_space and constructed rtx. For V16QImode, the normal vector
load/store do support the addressing modes "base + offset (DQ-form)" and
"base + index" since power9, so ivopts considers it fine to use the
base + index addressing mode for the LEN_{load,store} uses, and the related
cost of adopting the scalar (no address object based) candidate with step 16
for those LEN_{load,store} uses is zero.

For example:

| Group 1:
|   Type: POINTER ARGUMENT ADDRESS
|   Use 1.0:
|     At stmt:   vect_434 = .LEN_LOAD (vectp_y.124_438, 64B, loop_len_436, 0);
|     At pos:    vectp_y.124_438
|     IV struct:
|       Type:    vector(2) real(kind=8) *
|       Base:    (vector(2) real(kind=8) *) vectp_y.125_195
|       Step:    32
|       Object:  (void *) vectp_y.125_195
|       Biv:     N
|       Overflowness wrto loop niter:    Overflow
|   Use 1.1:
|     At stmt:   .LEN_STORE (vectp_y.173_213, 64B, loop_len_436, vect_211, 0);
|     At pos:    vectp_y.173_213
|     IV struct:
|       Type:    vector(2) real(kind=8) *
|       Base:    (vector(2) real(kind=8) *) vectp_y.125_195
|       Step:    32
|       Object:  (void *) vectp_y.125_195
|       Biv:     N
|       Overflowness wrto loop niter:    Overflow

| Candidate 7:
|   Var befor: ivtmp.182
|   Var after: ivtmp.182
|   Incr POS: before exit test
|   IV struct:
|     Type:    sizetype
|     Base:    0
|     Step:    32
|     Biv:     N
|     Overflowness wrto loop niter:    No-overflow

Group 1:
  cand  cost    compl.  inv.expr.       inv.vars
  1     8       2       NIL;            1
  2     12      2       18;             NIL;
  3     8       2       NIL;            1
  4     12      2       19;             NIL;
  5     12      2       19;             NIL;
  6     0       2       NIL;            NIL;
  7     0       2       NIL;            1        ==> zero cost
  8     0       0       NIL;            NIL;
  9     0       0       NIL;            NIL;
  31    0       0       NIL;            NIL;

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620305.html
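
The mismatch described in the comment can be illustrated with a small toy
model. This is only a sketch and not GCC code: the names Use,
mode_only_valid, stmt_aware_valid and address_cost, the two lookup tables and
the fixup_cost value are all hypothetical and exist purely for illustration.
It shows how a validity check that sees only the mode reports a base + index
address as free for a LEN_LOAD use, while a check that also knows the
statement kind would not.

```python
from dataclasses import dataclass

# Toy model only: none of these names are GCC APIs.
# Address forms accepted for a mode by ordinary vector loads/stores.
SUPPORTED_BY_MODE = {"V16QI": {"base", "base+offset", "base+index"}}
# Extra restriction for length-controlled accesses (lxvl/stxvl style):
# only a plain base register is usable as the address.
SUPPORTED_BY_STMT = {"LEN_LOAD": {"base"}, "LEN_STORE": {"base"}}


@dataclass(frozen=True)
class Use:
    stmt_kind: str  # e.g. "LOAD", "LEN_LOAD", "LEN_STORE"
    mode: str       # e.g. "V16QI"


def mode_only_valid(use: Use, form: str) -> bool:
    """Validity check that looks at the mode alone."""
    return form in SUPPORTED_BY_MODE[use.mode]


def stmt_aware_valid(use: Use, form: str) -> bool:
    """Validity check that also applies the statement's restriction."""
    allowed = SUPPORTED_BY_MODE[use.mode]
    if use.stmt_kind in SUPPORTED_BY_STMT:
        allowed = allowed & SUPPORTED_BY_STMT[use.stmt_kind]
    return form in allowed


def address_cost(use: Use, form: str, valid, fixup_cost: int = 4) -> int:
    """Zero if the form is accepted, else an arbitrary placeholder cost
    standing in for the extra instructions needed to form the address."""
    return 0 if valid(use, form) else fixup_cost


len_load = Use("LEN_LOAD", "V16QI")
print(address_cost(len_load, "base+index", mode_only_valid))
print(address_cost(len_load, "base+index", stmt_aware_valid))
```

With the mode-only check the base + index form is costed at zero for the
LEN_LOAD use, which mirrors the "zero cost" row for candidate 7 above; the
statement-aware variant charges the placeholder cost instead.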