https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118125

Martin Jambor <jamborm at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2025-01-10
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Martin Jambor <jamborm at gcc dot gnu.org> ---
I am able to reproduce the issue on a Zen3 machine (with -flto -Ofast
-march=native): the runtime grows from 193s to 225s (i.e. those 16%) and goes
back if I disable the condition introduced by the commit causing this.

The perf profile of the fast version (functions taking more than 0.6% of
run time) is:

# Samples: 824K of event 'cycles:Pu'
# Event count (approx.): 748211409931
#
# Overhead       Samples  Shared Object        Symbol
# ........  ............  ...................  ......
#
    31.15%        255964  parest_r_peak.mine   [.] _ZNK12METomography5Slave8internal16SparseDirectSPEC5solveIdEEvRN6dealii6VectorIT_EE
    29.78%        244660  parest_r_peak.mine   [.] _ZNK6dealii9SparseILUIdE5vmultIdEEvRNS_6VectorIT_EERKS5_
     6.02%         49434  parest_r_peak.mine   [.] _ZNK6dealii12SparseMatrixIdE17precondition_SSORIdEEvRNS_6VectorIT_EERKS5_dRKSt6vectorIjSaIjEE
     4.90%         40240  parest_r_peak.mine   [.] _ZNK6dealii12SparseMatrixIdE5vmultINS_6VectorIdEES4_EEvRT_RKT0_
     3.36%         27587  parest_r_peak.mine   [.] _ZN6dealii8FESystemILi3ELi3EE10initializeEv.constprop.0
     2.52%         20717  parest_r_peak.mine   [.] _ZNK6dealii15SparsityPatternclEjj
     2.38%         19500  parest_r_peak.mine   [.] _ZN12METomography5Slave5SlaveILi3EE12GlobalMatrix15assemble_matrixERKN6dealii18TriaActiveIteratorINS4_15DoFCellAccessorINS4_10DoFHandlerILi3ELi3EEEEEEERNS0_8internal13AssemblerDataILi3EEE
     1.20%          9875  libc.so.6            [.] __memset_avx2_unaligned_erms
     1.01%          8294  parest_r_peak.mine   [.] _ZNK6dealii16ConstraintMatrix8condenseIdNS_11BlockVectorIdEEEEvRNS_17BlockSparseMatrixIT_EERT0_
     0.74%          6146  parest_r_peak.mine   [.] _ZNSt8_Rb_treeIjjSt9_IdentityIjESt4lessIjESaIjEE16_M_insert_uniqueERKj
     0.62%          5077  parest_r_peak.mine   [.] _ZN12METomography13ForwardSolver35block_build_matrix_and_rhs_threadedILi3EEEvRKN6dealii10DoFHandlerIXT_EXT_EEESt4pairIjjES7_IPKNS2_10QuadratureIXT_EEEPKNS9_IXmiT_Li1EEEEERNS2_17BlockSparseMatrixIdEERNS2_11BlockVectorIdEERNS2_7Threads16DummyThreadMutexERKSt7complexIdERKS7_IPKNS2_8FunctionIXT_EEESX_ERSW_
     0.61%          5060  libstdc++.so.6.0.34  [.] _ZSt18_Rb_tree_incrementPSt18_Rb_tree_node_base
     0.60%          4966  libc.so.6            [.] _int_malloc

whereas profiling the slow version gives:

# Samples: 951K of event 'cycles:Pu'
# Event count (approx.): 864364832562
#
# Overhead       Samples  Shared Object        Symbol
# ........  ............  ...................  ......
#
    40.00%        379558  parest_r_peak.mine   [.] _ZNK12METomography5Slave8internal16SparseDirectSPEC5solveIdEEvRN6dealii6VectorIT_EE
    26.09%        247593  parest_r_peak.mine   [.] _ZNK6dealii9SparseILUIdE5vmultIdEEvRNS_6VectorIT_EERKS5_
     5.19%         49279  parest_r_peak.mine   [.] _ZNK6dealii12SparseMatrixIdE17precondition_SSORIdEEvRNS_6VectorIT_EERKS5_dRKSt6vectorIjSaIjEE
     4.25%         40281  parest_r_peak.mine   [.] _ZNK6dealii12SparseMatrixIdE5vmultINS_6VectorIdEES4_EEvRT_RKT0_
     2.91%         27561  parest_r_peak.mine   [.] _ZN6dealii8FESystemILi3ELi3EE10initializeEv.constprop.0
     2.18%         20658  parest_r_peak.mine   [.] _ZNK6dealii15SparsityPatternclEjj
     2.03%         19244  parest_r_peak.mine   [.] _ZN12METomography5Slave5SlaveILi3EE12GlobalMatrix15assemble_matrixERKN6dealii18TriaActiveIteratorINS4_15DoFCellAccessorINS4_10DoFHandlerILi3ELi3EEEEEEERNS0_8internal13AssemblerDataILi3EEE
     1.06%         10100  libc.so.6            [.] __memset_avx2_unaligned_erms
     0.89%          8375  parest_r_peak.mine   [.] _ZNK6dealii16ConstraintMatrix8condenseIdNS_11BlockVectorIdEEEEvRNS_17BlockSparseMatrixIT_EERT0_
     0.64%          6125  parest_r_peak.mine   [.] _ZNSt8_Rb_treeIjjSt9_IdentityIjESt4lessIjESaIjEE16_M_insert_uniqueERKj

The interesting thing is that I believe the change introduced in
r15-6110-g92e0e0f8177530 can only affect inliner decisions (both heuristics
and things like it redirecting various calls to __builtin_unreachable, which
apparently is happening more often now), but the inlining decisions for the
hottest function with the big sample increase are the same in both cases.

I'll try to pin down where exactly the value range propagation leads to a
slowdown.
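
As a quick cross-check of the figures quoted above, the following is a minimal sketch that recomputes the relative growth in runtime, in the approximate cycle count, and in the samples attributed to the hottest symbol (SparseDirectSPEC::solve). The dictionary keys and the helper name pct_increase are illustrative only; the numbers themselves are copied from the two perf reports in this comment.

```python
# Figures copied from the fast and slow perf reports in this comment.
# Key names are hypothetical labels chosen for this sketch.
fast = {"runtime_s": 193, "events": 748211409931, "solve_samples": 255964}
slow = {"runtime_s": 225, "events": 864364832562, "solve_samples": 379558}


def pct_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


runtime_growth = pct_increase(fast["runtime_s"], slow["runtime_s"])
event_growth = pct_increase(fast["events"], slow["events"])
solve_growth = pct_increase(fast["solve_samples"], slow["solve_samples"])

# Extra samples landing in the hottest symbol in the slow build.
extra_solve_samples = slow["solve_samples"] - fast["solve_samples"]

print(f"runtime growth:       {runtime_growth:.1f}%")
print(f"cycle count growth:   {event_growth:.1f}%")
print(f"solve sample growth:  {solve_growth:.1f}%")
print(f"extra solve samples:  {extra_solve_samples}")
```

The extra samples in solve can then be compared against the difference between the two rounded sample totals (824K and 951K) to see how much of the overall increase that one function accounts for.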