On Thu, May 08, 2014 at 07:25:50AM +0100, Richard Sandiford wrote: > Trevor Saunders <[email protected]> writes: > > On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote: > >> I noticed for_each_rtx showing up in profiles and thought I'd have a go > >> at using worklist-based iterators instead. So far I have three: > >> > >> FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx > >> FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx > >> FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx * > >> > >> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement. > >> > >> I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because > >> most walks really don't modify the structure. I think we should > >> encourage const_rtxes to be used whereever possible. E.g. it might > >> make it easier to have non-GC storage for temporary rtxes in future. > >> > >> I've locally replaced all for_each_rtx calls in the generic code with > >> these iterators and they make things reproducably faster. The speed-up > >> on full --enable-checking=release ./cc1 and ./cc1plus times is only about > >> 1%, > >> but maybe that's enough to justify the churn. > > > > seems pretty nice, and it seems like it'll make code a little more > > readable too :) > > > >> Implementation-wise, the main observation is that most subrtxes are part > >> of a single contiguous sequence of "e" fields. E.g. when compiling an > >> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the > >> subrtxes of 7,636,542 rtxes. Of those: > >> > >> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields, > >> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and > >> no "E" fields, and > >> (C) 43,532 (00.6%) are more complicated. > >> > >> (A) is really a special case of (B) in which the block has zero length. > >> Those are the only two cases that really need to be handled inline. > >> The implementation does this by having a mapping from an rtx code to the > >> bounds of its "e" sequence, in the form of a start index and count. > >> > >> Out of (C), the vast majority (43,509) are PARALLELs. However, as you'd > >> probably expect, bloating the inline code with that case made things > >> slower rather than faster. > >> > >> The vast majority (in fact all in the combine.ii run above) of iterations > >> can be done with a 16-element stack worklist. We obviously still need a > >> heap fallback for the pathological cases though. > >> > >> I spent a bit of time trying different iterator implementations and > >> seeing which produced the best code. Specific results from that were: > >> > >> - The storage used for the worklist is separate from the iterator, > >> in order to avoid capturing iterator fields. > >> > >> - Although the natural type of the storage would be auto_vec <..., 16>, > >> that produced some overhead compared with a separate stack array and heap > >> vector pointer. With the heap vector pointer, the only overhead is an > >> assignment in the constructor and an "if (x) release (x)"-style sequence > >> in the destructor. I think the extra complication over auto_vec is worth > >> it because in this case the heap version is so very rarely needed. > > > > hm, where does the overhead come from exactly? it seems like if its > > faster to use vec<T, va_heap, vl_embedd> *foo; we should fix something > > about vectors since this isn't the only place it could matter. does it > > matter if you use vec<T, va_heap, vl_embedd> * or vec<T> ? the second > > is basically just a wrapper around the former I'd expect has no effect. > > I'm not saying you're doing the wrong thing here, but if we can make > > generic vectors faster we probably should ;) or is the issue the > > __builtin_expect()s you can add? > > Part of the problem is that by having an array in the vec itself, > the other fields effectively have their address taken too. > So m_alloc, m_num and m_using_auto_storage need to be set up > and maintained on the stack, even though we're almost sure that they > will never be used.
ok
> >> - The maximum number of fields in (B)-type rtxes is 3. We get better
> >> code by making that explicit rather than having a general loop.
> >>
> >> - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
> >> check to test for that and for cases where the stack worklist is
> >> too small.
> >
> > can we use uint8_t?
>
> We don't really use that in GCC yet. I don't mind setting a precedent
> though :-)
>
> >> To give an example:
> >>
> >> /* Callback for for_each_rtx, that returns 1 upon encountering a VALUE
> >> whose UID is greater than the int uid that D points to. */
> >>
> >> static int
> >> refs_newer_value_cb (rtx *x, void *d)
> >> {
> >> if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d)
> >> return 1;
> >>
> >> return 0;
> >> }
> >>
> >> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
> >> that of V. */
> >>
> >> static bool
> >> refs_newer_value_p (rtx expr, rtx v)
> >> {
> >> int minuid = CSELIB_VAL_PTR (v)->uid;
> >>
> >> return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
> >> }
> >>
> >> becomes:
> >>
> >> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
> >> that of V. */
> >>
> >> static bool
> >> refs_newer_value_p (const_rtx expr, rtx v)
> >> {
> >> int minuid = CSELIB_VAL_PTR (v)->uid;
> >> subrtx_iterator::array_type array;
> >
> > some reason to not hide it in the macro?
>
> I wanted to make the macro a true for loop, with no extra baggage.
> AFAIK there's no way of declaring both the array and the iterator
> in the first part of the for loop without making them part of the
> same object (and thus capturing the iterator fields again).
> A "do ... while (0)" wrapper might be too surprising.
yeah, for some reason I thought you could do for (T x, U y; ..) but
you seem to be right.
> Also, keeping the array separate allows you to reuse it between
> iterations, so you only do the allocation and free once. E.g.:
fair.
Trev
>
> subrtx_iterator::array_type array;
> for (rtx insn = ...; insn; insn = NEXT_INSN (insn))
> if (INSN_P (insn))
> FOR_EACH_SUBRTX (...)
>
> >> +template <typename T>
> >> +inline generic_subrtx_iterator <T>::~generic_subrtx_iterator ()
> >> +{
> >> +}
> >
> > What's wrong with just letting the compiler generate that for you?
>
> Bah, thanks for catching that. It was left over from an earlier
> experiment in which the destructor did do some cleanup of the array.
>
> >> @@ -78,6 +82,93 @@ static int non_rtx_starting_operands[NUM_RTX_CODE];
> >> static unsigned int
> >> num_sign_bit_copies_in_rep[MAX_MODE_INT + 1][MAX_MODE_INT + 1];
> >>
> >> +/* Store X into index I of ARRAY. ARRAY is known to have at least I
> >> + elements. Return the new base of ARRAY. */
> >> +
> >> +template <typename T>
> >> +typename T::value_type *
> >> +generic_subrtx_iterator <T>::add_single_to_queue (array_type &array,
> >> + value_type *base,
> >> + size_t i, value_type x)
> >> +{
> >> + if (base == array.stack)
> >> + {
> >> + if (i < LOCAL_ELEMS)
> >> + {
> >> + base[i] = x;
> >> + return base;
> >> + }
> >
> > it seems like this first little bit might be worth inlining, but I guess
> > you tested and it wasn't.
>
> Yeah. The cases where an "e" or entire "E" fit within the stack buffer
> are already handled inline. So in practice we only get here if we know
> we have blown or are about to blow the stack buffer. The case it
> handles is where part of an "E" field fits within the buffer and part of
> it doesn't.
>
> One option would have been to force the heap buffer to be used when
> entering this function. But that would then make the simple check:
>
> if (m_end > LOCAL_ELEMS)
> m_base = m_array.heap->address ();
>
> unsafe. It seemed better to make add_single_to_queue general and
> handle both stack and heap pushes.
>
> Thanks,
> Richard
signature.asc
Description: Digital signature
