https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
Status|NEW |ASSIGNED
--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #14)
> Created attachment 57829 [details]
> smaller testcase
>
> Smaller testcase, shows the same compile-time issue at -O0. At -O1 it's a lot
> less bad but memory usage is better (8GB), so the slowness of the full
> testcase is likely memory bandwidth related.
>
> -O1 is then
>
> tree PTA    :  20.59 ( 21%)
> expand vars :   9.19 (  9%)
> expand      :  14.26 ( 15%)
The memory use goes into RTXen created during RTL expansion. The compile-time
part is add_scope_conflicts. One option there is to do what var-tracking does
and use rev_post_order_and_mark_dfs_back_seme, which avoids iterating for
non-loops and gives better cache locality.
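To illustrate why the visiting order matters, here is a minimal sketch (not GCC code) of a forward "union over predecessors" dataflow problem solved by repeated sweeps. The Graph type, the bitmask sets, and the names reverse_post_order, sweep and sweeps_to_fixpoint are all hypothetical; the real add_scope_conflicts works on basic blocks and bitmaps.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CFG: succs[b] lists the successors of block b.
using Graph = std::vector<std::vector<int>>;

// Reverse post-order of the blocks reachable from `entry` (iterative DFS).
std::vector<int> reverse_post_order(const Graph &succs, int entry) {
  std::vector<int> order;
  std::vector<char> visited(succs.size(), 0);
  std::vector<std::pair<int, std::size_t>> stack;  // (block, next successor)
  visited[entry] = 1;
  stack.emplace_back(entry, 0);
  while (!stack.empty()) {
    int b = stack.back().first;
    std::size_t i = stack.back().second;
    if (i < succs[b].size()) {
      stack.back().second = i + 1;
      int s = succs[b][i];
      if (!visited[s]) {
        visited[s] = 1;
        stack.emplace_back(s, 0);
      }
    } else {
      order.push_back(b);  // all successors handled: post-order position
      stack.pop_back();
    }
  }
  std::reverse(order.begin(), order.end());
  return order;
}

// One sweep in the given order, pushing each block's set into its
// successors; returns true if any set grew.
bool sweep(const Graph &succs, const std::vector<int> &order,
           std::vector<unsigned> &live) {
  bool changed = false;
  for (int b : order)
    for (int s : succs[b]) {
      unsigned merged = live[s] | live[b];
      if (merged != live[s]) {
        live[s] = merged;
        changed = true;
      }
    }
  return changed;
}

// Number of sweeps executed until one sweep changes nothing.
int sweeps_to_fixpoint(const Graph &succs, const std::vector<int> &order,
                       std::vector<unsigned> &live) {
  int n = 0;
  do {
    ++n;
  } while (sweep(succs, order, live));
  return n;
}
```

On an acyclic chain of four blocks, sweeping in reverse post-order needs one sweep to propagate plus one to confirm, while sweeping in the opposite order needs four. If the ordering pass also reports that there are no back edges, the confirming sweep can be skipped as well.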
We have half of the profile hits on ggc_internal_alloc and it's
   17 | d8:+- mov    %r14,%rax
      |    |  mov    (%r14),%r14
 1440 |    |  test   %r14,%r14
    4 |    |  je     530
      |    |  if (p->bytes == entry_size)
      | e7:|  cmp    0x10(%r14),%r12
65582 |    +--jne    d8
which is the linear walk
  /* Check the list of free pages for one we can use.  */
  for (pp = &G.free_pages, p = *pp; p; pp = &p->next, p = *pp)
    if (p->bytes == entry_size)
      break;
so we seem to have many free pages for some reason but the free pages
pool is global and not per order?!
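As a rough sketch of what a non-global free pool could look like (not the ggc-page implementation), free pages can be bucketed by size so that the lookup no longer walks unrelated pages. Page and FreePagePool are hypothetical names, and keying on the exact byte size is an assumption that mirrors the p->bytes == entry_size test above.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a page descriptor; only the size matters here.
struct Page {
  std::size_t bytes;
};

// Hypothetical free-page pool with one bucket per page size.
class FreePagePool {
public:
  // Return a page to the pool, filing it under its size.
  void release(Page *p) { buckets_[p->bytes].push_back(p); }

  // Take a free page of exactly `entry_size` bytes, or nullptr if none.
  // This looks at one bucket only instead of walking every free page.
  Page *acquire(std::size_t entry_size) {
    auto it = buckets_.find(entry_size);
    if (it == buckets_.end() || it->second.empty())
      return nullptr;
    Page *p = it->second.back();
    it->second.pop_back();
    return p;
  }

private:
  std::unordered_map<std::size_t, std::vector<Page *>> buckets_;
};
```

With this layout a request for a rare size does not pay for the many free pages of other sizes, which is the cost the jne loop above shows.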
Samples: 299K of event 'cycles', Event count (approx.): 338413178083
Overhead  Samples  Command  Shared Object  Symbol
  23.16%    67756  cc1plus  cc1plus        [.] ggc_internal_alloc
   6.98%    21637  cc1plus  cc1plus        [.] bitmap_tree_splay
   6.89%    20413  cc1plus  cc1plus        [.] bitmap_ior_into
   4.05%    11989  cc1plus  cc1plus        [.] bitmap_elt_ior
   3.16%     9840  cc1plus  cc1plus        [.] mergesort<sort_ctx>
   2.90%     8860  cc1plus  cc1plus        [.] bitmap_set_bit
   2.76%     8281  cc1plus  cc1plus        [.] get_ref_base_and_extent
   1.37%     4071  cc1plus  cc1plus        [.] stmt_may_clobber_ref_p_1
   1.32%     4095  cc1plus  cc1plus        [.] dominated_by_p
   1.16%     3597  cc1plus  cc1plus        [.] bitmap_tree_unlink_element
   1.06%     3128  cc1plus  cc1plus        [.] walk_aliased_vdefs_1
The bitmap_tree_splay hits are from compute_idf. Refactoring that some more,
also avoiding the duplicate processing and doing away with the bitmap
for the workset, might help a bit there (not using the tree view just moves
the cost to bitmap_set_bit, with no overall positive change).
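A generic sketch of a bitmap-free workset that also avoids duplicate processing is shown here; it is not the compute_idf code, and the Worklist class and its members are hypothetical. It assumes items are dense integer ids such as basic-block indices.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical worklist over dense integer ids in the range [0, n).
class Worklist {
public:
  explicit Worklist(std::size_t n) : queued_(n, 0) {}

  // Queue `id` unless it was queued before; returns true if newly queued.
  // The flag is never cleared, so each id is processed at most once.
  bool push(int id) {
    if (queued_[id])
      return false;
    queued_[id] = 1;
    stack_.push_back(id);
    return true;
  }

  bool empty() const { return stack_.empty(); }

  // Remove and return the most recently queued id.
  int pop() {
    int id = stack_.back();
    stack_.pop_back();
    return id;
  }

private:
  std::vector<char> queued_;  // one flag per id
  std::vector<int> stack_;    // pending ids, LIFO
};
```

Push and pop are constant-time array operations here, at the cost of a flag array sized to the number of ids.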
I will look into the above things more (but not the RA slowness at -O0).