On Fri, May 9, 2014 at 8:18 AM, Jeff Law <l...@redhat.com> wrote: > On 05/07/14 14:52, Richard Sandiford wrote: >> >> I noticed for_each_rtx showing up in profiles and thought I'd have a go >> at using worklist-based iterators instead. So far I have three: >> >> FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx >> FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx >> FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx * >> >> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement. > > Yea, for_each_rtx is a bit of a workhorse, so I'm not surprised it's showing > up on some profiles. > > > > >> I've locally replaced all for_each_rtx calls in the generic code with >> these iterators and they make things reproducably faster. The speed-up >> on full --enable-checking=release ./cc1 and ./cc1plus times is only about >> 1%, >> but maybe that's enough to justify the churn. > > I'd consider that enough to justify the churn, especially if it's an > improvement we're seeing consistently rather than just something that's > helping pathological cases. > > > >> >> Implementation-wise, the main observation is that most subrtxes are part >> of a single contiguous sequence of "e" fields. E.g. when compiling an >> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the >> subrtxes of 7,636,542 rtxes. Of those: > > Yea, makes sense. One could possibly carry this a step further and realize > that we could specialize the walkers. ie, rather than indexing the rtx_code > to get the rtx format, then iterate over that to find the 'e' fields, if we > have an PLUS, then damn it, we ought to know how to walk pretty directly. > > >> >> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields, >> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and >> no "E" fields, and >> (C) 43,532 (00.6%) are more complicated. >> >> (A) is really a special case of (B) in which the block has zero length. >> Those are the only two cases that really need to be handled inline. >> The implementation does this by having a mapping from an rtx code to the >> bounds of its "e" sequence, in the form of a start index and count. > > Right. So my thought above us just extending this a bit and realizing that > we don't need an index/count for most RTXs. That may (or may not) be worth > the extra pain. Your call whether or not to investigate that further. > > > >> >> Does this look like it's worth pursuing? If so, is it OK to start >> submitting now or should it wait until 4.9.1 is out? If it is OK, >> I plan to convert the ports too and get rid of for_each_rtx. > > I think it's definitely worth pursuing. I suspect the release managers > would definitely want to wait for 4.9.1 :-)
It's not a too big change so no need to wait. Thanks, Richard. > jeff >