On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote:
> I noticed for_each_rtx showing up in profiles and thought I'd have a go
> at using worklist-based iterators instead.  So far I have three:
>
>   FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
>   FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
>   FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
>
> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
>
> I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because
> most walks really don't modify the structure.  I think we should
> encourage const_rtxes to be used wherever possible.  E.g. it might
> make it easier to have non-GC storage for temporary rtxes in future.
>
> I've locally replaced all for_each_rtx calls in the generic code with
> these iterators and they make things reproducibly faster.  The speed-up
> on full --enable-checking=release ./cc1 and ./cc1plus times is only
> about 1%, but maybe that's enough to justify the churn.
seems pretty nice, and it looks like it'll make the code a little more
readable too :)
> Implementation-wise, the main observation is that most subrtxes are part
> of a single contiguous sequence of "e" fields. E.g. when compiling an
> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
> subrtxes of 7,636,542 rtxes. Of those:
>
> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
> no "E" fields, and
> (C) 43,532 (00.6%) are more complicated.
>
> (A) is really a special case of (B) in which the block has zero length.
> Those are the only two cases that really need to be handled inline.
> The implementation does this by having a mapping from an rtx code to the
> bounds of its "e" sequence, in the form of a start index and count.
>
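just to check I'm reading this right: the table entries end up looking
roughly like the following (sketch only -- the real values of course come
from the formats in rtl.def):

  rtx_all_subrtx_bounds[SET].start = 0;
  rtx_all_subrtx_bounds[SET].count = 2;        /* "ee": XEXP 0 and XEXP 1 */
  rtx_all_subrtx_bounds[MEM].count = 1;        /* "e" */
  rtx_all_subrtx_bounds[CONST_INT].count = 0;  /* no "e"s: case (A) */
  rtx_all_subrtx_bounds[PARALLEL].count = UCHAR_MAX;  /* case (C) */
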
> Out of (C), the vast majority (43,509) are PARALLELs. However, as you'd
> probably expect, bloating the inline code with that case made things
> slower rather than faster.
>
> The vast majority (in fact all in the combine.ii run above) of iterations
> can be done with a 16-element stack worklist. We obviously still need a
> heap fallback for the pathological cases though.
>
> I spent a bit of time trying different iterator implementations and
> seeing which produced the best code. Specific results from that were:
>
> - The storage used for the worklist is separate from the iterator,
> in order to avoid capturing iterator fields.
>
> - Although the natural type of the storage would be auto_vec <..., 16>,
> that produced some overhead compared with a separate stack array and heap
> vector pointer. With the heap vector pointer, the only overhead is an
> assignment in the constructor and an "if (x) release (x)"-style sequence
> in the destructor. I think the extra complication over auto_vec is worth
> it because in this case the heap version is so very rarely needed.
hm, where does the overhead come from exactly? it seems like if it's
faster to use vec<T, va_heap, vl_embed> *foo; we should fix something
about vectors, since this isn't the only place it could matter. does it
matter whether you use vec<T, va_heap, vl_embed> * or vec<T>? the second
is basically just a wrapper around the former, so I'd expect it to have
no effect. I'm not saying you're doing the wrong thing here, but if we
can make generic vectors faster we probably should ;) or is the issue
the __builtin_expect ()s you can add?
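(for reference, the three storage shapes I'm comparing in my head are
roughly

  /* what the patch does: bare stack array plus a pointer to an
     embedded-layout heap vec that is only touched on overflow */
  value_type stack[16];
  vec <value_type, va_heap, vl_embed> *heap;

  /* the "natural" version */
  auto_vec <value_type, 16> storage;

  /* and a plain heap vec, which AIUI is itself just a wrapper around a
     vec <value_type, va_heap, vl_embed> pointer */
  vec <value_type> storage;

so I'd naively have expected the first and the last to compile to much
the same thing.)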
> - Several existing for_each_rtx callbacks have something like:
>
> if (GET_CODE (x) == CONST)
> return -1;
>
> or:
>
> if (CONSTANT_P (x))
> return -1;
>
> to avoid walking subrtxes of constants. That can be done without
> extra code checks and branches by having a separate code->bound
> mapping in which all constants are treated as leaf rtxes. This usage
> should be common enough to outweigh the cache penalty of two arrays.
>
> The choice between iterating over constants or not is given in the
> final parameter of the FOR_EACH_* iterator.
less repetition \O/
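(and if I follow, with NONCONST something like

  (const:DI (plus:DI (symbol_ref:DI ("foo")) (const_int 8)))

just looks like a leaf to the walk, so we never even see the PLUS, which
is exactly what those callbacks wanted)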
> - The maximum number of fields in (B)-type rtxes is 3. We get better
> code by making that explicit rather than having a general loop.
>
> - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
> check to test for that and for cases where the stack worklist is
> too small.
can we use uint8_t?
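(fwiw the UCHAR_MAX trick took me a second: with count == UCHAR_MAX the
inline test

  m_end + count <= LOCAL_ELEMS + 1

can never be true, so the one branch covers both "complicated code" and
"stack worklist too small" -- neat)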
> To give an example:
>
> /* Callback for for_each_rtx, that returns 1 upon encountering a VALUE
> whose UID is greater than the int uid that D points to. */
>
> static int
> refs_newer_value_cb (rtx *x, void *d)
> {
> if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d)
> return 1;
>
> return 0;
> }
>
> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
> that of V. */
>
> static bool
> refs_newer_value_p (rtx expr, rtx v)
> {
> int minuid = CSELIB_VAL_PTR (v)->uid;
>
> return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
> }
>
> becomes:
>
> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
> that of V. */
>
> static bool
> refs_newer_value_p (const_rtx expr, rtx v)
> {
> int minuid = CSELIB_VAL_PTR (v)->uid;
> subrtx_iterator::array_type array;
is there some reason not to hide it in the macro?
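e.g. something like this (untested, and only a sketch of what I mean --
the name and the helper loop variables are made up) would keep the array
out of callers, at the cost of two wrapper loops to scope the extra
declaration:

  #define FOR_EACH_SUBRTX_LOCAL(ITER, X, TYPE)                         \
    for (bool ITER##_once = true; ITER##_once;)                        \
      for (subrtx_iterator::array_type ITER##_array;                   \
           ITER##_once; ITER##_once = false)                           \
        for (subrtx_iterator ITER (ITER##_array, X, TYPE##_BOUNDS);    \
             !ITER.at_end (); ITER.next ())

though I guess taking ARRAY explicitly is what lets callers reuse the
same storage across many consecutive walks, as the comment above
FOR_EACH_SUBRTX says, so maybe that answers my own question.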
> FOR_EACH_SUBRTX (iter, array, expr, NONCONST)
> if (GET_CODE (*iter) == VALUE && CSELIB_VAL_PTR (*iter)->uid > minuid)
> return true;
> return false;
> }
>
> The iterator also allows subrtxes of a specific rtx to be skipped;
> this is the equivalent of returning -1 from a for_each_rtx callback.
> It also allows the current rtx to be replaced in the worklist by
> another. E.g.:
>
> static void
> mark_constants_in_pattern (rtx insn)
> {
> subrtx_iterator::array_type array;
> FOR_EACH_SUBRTX (iter, array, PATTERN (insn), ALL)
> {
> const_rtx x = *iter;
> if (GET_CODE (x) == SYMBOL_REF)
> {
> if (CONSTANT_POOL_ADDRESS_P (x))
> {
> struct constant_descriptor_rtx *desc = SYMBOL_REF_CONSTANT (x);
> if (desc->mark == 0)
> {
> desc->mark = 1;
> iter.substitute (desc->constant);
> }
> }
> else if (TREE_CONSTANT_POOL_ADDRESS_P (x))
> {
> tree decl = SYMBOL_REF_DECL (x);
> if (!TREE_ASM_WRITTEN (DECL_INITIAL (decl)))
> {
> n_deferred_constants--;
> output_constant_def_contents (CONST_CAST_RTX (x));
> }
> }
> }
> }
> }
>
> where the iter.substitute avoids a recursive walk. (The CONST_CAST_RTX
> is "temporary".)
>
> Does this look like it's worth pursuing? If so, is it OK to start
> submitting now or should it wait until 4.9.1 is out? If it is OK,
> I plan to convert the ports too and get rid of for_each_rtx.
>
> Thanks,
> Richard
>
>
> diff --git a/gcc/rtl-iter.h b/gcc/rtl-iter.h
> new file mode 100644
> index 0000000..1bc171d
> --- /dev/null
> +++ b/gcc/rtl-iter.h
> @@ -0,0 +1,293 @@
> +/* RTL iterators
> + Copyright (C) 2014 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3. If not see
> +<http://www.gnu.org/licenses/>. */
> +
> +/* This structure describes the subrtxes of an rtx as follows:
> +
> + - if the rtx has no subrtxes, START and COUNT are both 0.
> +
> + - if all the subrtxes of an rtx are stored in a contiguous block
> + of XEXPs ("e"s), START is the index of the first XEXP and COUNT
> + is the number of them.
> +
> + - otherwise START is arbitrary and COUNT is UCHAR_MAX. */
> +struct rtx_subrtx_bound_info {
> + unsigned char start;
> + unsigned char count;
> +};
> +extern rtx_subrtx_bound_info rtx_all_subrtx_bounds[];
> +extern rtx_subrtx_bound_info rtx_nonconst_subrtx_bounds[];
> +
> +/* Return true if CODE has no subrtxes. */
> +
> +static inline bool
> +leaf_code_p (enum rtx_code code)
> +{
> + return rtx_all_subrtx_bounds[code].count == 0;
> +}
> +
> +/* Used to iterate over subrtxes of an rtx. T abstracts the type of
> + access. */
> +template <typename T>
> +class generic_subrtx_iterator
> +{
> + static const size_t LOCAL_ELEMS = 16;
> + typedef typename T::value_type value_type;
> + typedef typename T::rtx_type rtx_type;
> + typedef typename T::rtunion_type rtunion_type;
> +
> +public:
> + struct array_type
> + {
> + array_type ();
> + ~array_type ();
> + value_type stack[LOCAL_ELEMS];
> + vec <value_type, va_heap, vl_embed> *heap;
> + };
> + generic_subrtx_iterator (array_type &, value_type,
> + const rtx_subrtx_bound_info *);
> + ~generic_subrtx_iterator ();
> +
> + value_type operator * () const;
> + bool at_end () const;
> + void next ();
> + void skip_subrtxes ();
> + void substitute (value_type);
> +
> +private:
> + /* The bounds to use for iterating over subrtxes. */
> + const rtx_subrtx_bound_info *m_bounds;
> +
> + /* The storage used for the worklist. */
> + array_type &m_array;
> +
> + /* The current rtx. */
> + value_type m_current;
> +
> + /* The base of the current worklist. */
> + value_type *m_base;
> +
> + /* The number of subrtxes in M_BASE. */
> + size_t m_end;
> +
> + /* The following booleans shouldn't end up in registers or memory
> + but just direct control flow. */
> +
> + /* True if the iteration is over. */
> + bool m_done;
> +
> + /* True if we should skip the subrtxes of M_CURRENT. */
> + bool m_skip;
> +
> + /* True if M_CURRENT has been replaced with a different rtx. */
> + bool m_substitute;
> +
> + static void free_array (array_type &);
> + static size_t add_subrtxes_to_queue (array_type &, value_type *, size_t,
> + rtx_type);
> + static value_type *add_single_to_queue (array_type &, value_type *, size_t,
> + value_type);
> +};
> +
> +template <typename T>
> +inline generic_subrtx_iterator <T>::array_type::array_type () : heap (0) {}
> +
> +template <typename T>
> +inline generic_subrtx_iterator <T>::array_type::~array_type ()
> +{
> + if (__builtin_expect (heap != 0, false))
> + free_array (*this);
> +}
> +
> +/* Iterate over X and its subrtxes, in arbitrary order. Use ARRAY to
> + store the worklist. We use an external array in order to avoid
> + capturing the fields of this structure when taking the address of
> + the array. Use BOUNDS to find the bounds of simple "e"-string codes. */
> +
> +template <typename T>
> +inline generic_subrtx_iterator <T>::
> +generic_subrtx_iterator (array_type &array, value_type x,
> + const rtx_subrtx_bound_info *bounds)
> + : m_bounds (bounds),
> + m_array (array),
> + m_current (x),
> + m_base (m_array.stack),
> + m_end (0),
> + m_done (false),
> + m_skip (false),
> + m_substitute (false)
> +{
> +}
> +
> +template <typename T>
> +inline generic_subrtx_iterator <T>::~generic_subrtx_iterator ()
> +{
> +}
What's wrong with just letting the compiler generate that for you?
> +/* Return the current subrtx. */
> +
> +template <typename T>
> +inline typename T::value_type
> +generic_subrtx_iterator <T>::operator * () const
> +{
> + return m_current;
> +}
> +
> +/* Return true if the iteration has finished. */
> +
> +template <typename T>
> +inline bool
> +generic_subrtx_iterator <T>::at_end () const
> +{
> + return m_done;
> +}
> +
> +/* Move on to the next subrtx. */
> +
> +template <typename T>
> +inline void
> +generic_subrtx_iterator <T>::next ()
> +{
> + if (m_substitute)
> + {
> + m_substitute = false;
> + m_skip = false;
> + return;
> + }
> + if (!m_skip)
> + {
> + /* Add the subrtxes of M_CURRENT. */
> + rtx_type x = T::get_rtx (m_current);
> + if (__builtin_expect (x != 0, true))
> + {
> + enum rtx_code code = GET_CODE (x);
> + ssize_t count = m_bounds[code].count;
> + if (count > 0)
> + {
> + /* Handle the simple case of a single "e" block that is known
> + to fit into the current array. */
> + if (__builtin_expect (m_end + count <= LOCAL_ELEMS + 1, true))
> + {
> + /* Set M_CURRENT to the first subrtx and queue the rest. */
> + ssize_t start = m_bounds[code].start;
> + rtunion_type *src = &x->u.fld[start];
> + if (__builtin_expect (count > 2, false))
> + m_base[m_end++] = T::get_value (src[2].rt_rtx);
> + if (count > 1)
> + m_base[m_end++] = T::get_value (src[1].rt_rtx);
> + m_current = T::get_value (src[0].rt_rtx);
> + return;
> + }
> + /* Handle cases which aren't simple "e" sequences or where
> + the sequence might overrun M_BASE. */
> + count = add_subrtxes_to_queue (m_array, m_base, m_end, x);
> + if (count > 0)
> + {
> + m_end += count;
> + if (m_end > LOCAL_ELEMS)
> + m_base = m_array.heap->address ();
> + m_current = m_base[--m_end];
> + return;
> + }
> + }
> + }
> + }
> + else
> + m_skip = false;
> + if (m_end == 0)
> + m_done = true;
> + else
> + m_current = m_base[--m_end];
> +}
> +
> +/* Skip the subrtxes of the current rtx. */
> +
> +template <typename T>
> +inline void
> +generic_subrtx_iterator <T>::skip_subrtxes ()
> +{
> + m_skip = true;
> +}
> +
> +/* Ignore the subrtxes of the current rtx and look at X instead. */
> +
> +template <typename T>
> +inline void
> +generic_subrtx_iterator <T>::substitute (value_type x)
> +{
> + m_substitute = true;
> + m_current = x;
> +}
> +
> +/* Iterators for const_rtx. */
> +struct const_rtx_accessor
> +{
> + typedef const_rtx value_type;
> + typedef const_rtx rtx_type;
> + typedef const rtunion rtunion_type;
> + static rtx_type get_rtx (value_type x) { return x; }
> + static value_type get_value (rtx_type x) { return x; }
> +};
> +typedef generic_subrtx_iterator <const_rtx_accessor> subrtx_iterator;
> +
> +/* Iterators for non-constant rtx. */
> +struct rtx_var_accessor
> +{
> + typedef rtx value_type;
> + typedef rtx rtx_type;
> + typedef rtunion rtunion_type;
> + static rtx_type get_rtx (value_type x) { return x; }
> + static value_type get_value (rtx_type x) { return x; }
> +};
> +typedef generic_subrtx_iterator <rtx_var_accessor> subrtx_var_iterator;
> +
> +/* Iterators for rtx *. */
> +struct rtx_ptr_accessor
> +{
> + typedef rtx *value_type;
> + typedef rtx rtx_type;
> + typedef rtunion rtunion_type;
> + static rtx_type get_rtx (value_type ptr) { return *ptr; }
> + static value_type get_value (rtx_type &x) { return &x; }
> +};
> +typedef generic_subrtx_iterator <rtx_ptr_accessor> subrtx_ptr_iterator;
> +
> +#define ALL_BOUNDS rtx_all_subrtx_bounds
> +#define NONCONST_BOUNDS rtx_nonconst_subrtx_bounds
> +
> +/* Use ITER to iterate over const_rtx X and its recursive subrtxes,
> + using subrtx_iterator::array ARRAY as the storage for the worklist.
> + ARRAY can be reused for multiple consecutive iterations but shouldn't
> + be shared by two concurrent iterations. TYPE is ALL if all subrtxes
> + are of interest or NONCONST if it is safe to ignore subrtxes of
> + constants. */
> +#define FOR_EACH_SUBRTX(ITER, ARRAY, X, TYPE) \
> + for (subrtx_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
> + ITER.next ())
> +
> +/* Like FOR_EACH_SUBRTX, but iterate over subrtxes of an rtx X. */
> +#define FOR_EACH_SUBRTX_VAR(ITER, ARRAY, X, TYPE) \
> + for (subrtx_var_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
> + ITER.next ())
> +
> +/* Like FOR_EACH_SUBRTX, but iterate over subrtx pointers of rtx pointer X.
> + For example, if X is &PATTERN (insn) and the pattern is a SET, iterate
> + over &PATTERN (insn), &SET_DEST (PATTERN (insn)), etc. */
> +#define FOR_EACH_SUBRTX_PTR(ITER, ARRAY, X, TYPE) \
> + for (subrtx_ptr_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \
> + ITER.next ())
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index f3471b1..df52788 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -37,6 +37,7 @@ along with GCC; see the file COPYING3. If not see
> #include "tree.h"
> #include "emit-rtl.h" /* FIXME: Can go away once crtl is moved to rtl.h. */
> #include "addresses.h"
> +#include "rtl-iter.h"
>
> /* Forward declarations */
> static void set_of_1 (rtx, const_rtx, void *);
> @@ -62,6 +63,9 @@ static unsigned int num_sign_bit_copies1 (const_rtx, enum machine_mode, const_rt
> -1 if a code has no such operand. */
> static int non_rtx_starting_operands[NUM_RTX_CODE];
>
> +rtx_subrtx_bound_info rtx_all_subrtx_bounds[NUM_RTX_CODE];
> +rtx_subrtx_bound_info rtx_nonconst_subrtx_bounds[NUM_RTX_CODE];
> +
> /* Truncation narrows the mode from SOURCE mode to DESTINATION mode.
> If TARGET_MODE_REP_EXTENDED (DESTINATION, DESTINATION_REP) is
> SIGN_EXTEND then while narrowing we also have to enforce the
> @@ -78,6 +82,93 @@ static int non_rtx_starting_operands[NUM_RTX_CODE];
> static unsigned int
> num_sign_bit_copies_in_rep[MAX_MODE_INT + 1][MAX_MODE_INT + 1];
>
> +/* Store X into index I of ARRAY. ARRAY is known to have at least I
> + elements. Return the new base of ARRAY. */
> +
> +template <typename T>
> +typename T::value_type *
> +generic_subrtx_iterator <T>::add_single_to_queue (array_type &array,
> + value_type *base,
> + size_t i, value_type x)
> +{
> + if (base == array.stack)
> + {
> + if (i < LOCAL_ELEMS)
> + {
> + base[i] = x;
> + return base;
> + }
it seems like this first little bit might be worth inlining, but I guess
you tested that and it wasn't worth it.
> + gcc_checking_assert (i == LOCAL_ELEMS);
> + vec_safe_grow (array.heap, i + 1);
> + base = array.heap->address ();
> + memcpy (base, array.stack, sizeof (array.stack));
that threw me for a sec until I realized it was just manually doing what
auto_vec does in this case.
> + base[LOCAL_ELEMS] = x;
> + return base;
> + }
> + unsigned int length = array.heap->length ();
> + if (length > i)
> + {
> + gcc_checking_assert (base == array.heap->address ());
> + base[i] = x;
> + return base;
> + }
> + else
> + {
> + gcc_checking_assert (i == length);
> + vec_safe_push (array.heap, x);
> + return array.heap->address ();
this chunk seems redundant with what vec_safe_push does. do you actually
need to pass in |i| at all?
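i.e. could the whole non-stack tail just be something like this
(untested, and assuming |i| always equals the heap length by the time we
get here)?

      gcc_checking_assert (i == array.heap->length ());
      vec_safe_push (array.heap, x);
      return array.heap->address ();

or is there a case where the heap vec is already longer than |i|?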
> + }
> +}
> +
> +/* Add the subrtxes of X to worklist ARRAY, starting at END. Return the
> + number of elements added to the worklist. */
> +
> +template <typename T>
> +size_t
> +generic_subrtx_iterator <T>::add_subrtxes_to_queue (array_type &array,
> + value_type *base,
> + size_t end, rtx_type x)
> +{
> + const char *format = GET_RTX_FORMAT (GET_CODE (x));
> + size_t orig_end = end;
> + for (int i = 0; format[i]; ++i)
> + if (format[i] == 'e')
> + {
> + value_type subx = T::get_value (x->u.fld[i].rt_rtx);
> + if (__builtin_expect (end < LOCAL_ELEMS, true))
> + base[end++] = subx;
> + else
> + base = add_single_to_queue (array, base, end++, subx);
> + }
> + else if (format[i] == 'E')
> + {
> + int length = GET_NUM_ELEM (x->u.fld[i].rt_rtvec);
> + rtx *vec = x->u.fld[i].rt_rtvec->elem;
> + if (__builtin_expect (end + length <= LOCAL_ELEMS, true))
> + for (int j = 0; j < length; j++)
> + base[end++] = T::get_value (vec[j]);
> + else
> + for (int j = 0; j < length; j++)
> + base = add_single_to_queue (array, base, end++,
> + T::get_value (vec[j]));
> + }
> + return end - orig_end;
> +}
> +
> +template <typename T>
> +void
> +generic_subrtx_iterator <T>::free_array (array_type &array)
> +{
> + vec_free (array.heap);
> +}
> +
> +template <typename T>
> +const size_t generic_subrtx_iterator <T>::LOCAL_ELEMS;
> +
> +template class generic_subrtx_iterator <const_rtx_accessor>;
> +template class generic_subrtx_iterator <rtx_var_accessor>;
> +template class generic_subrtx_iterator <rtx_ptr_accessor>;
> +
> /* Return 1 if the value of X is unstable
> (would be different at a different point in the program).
> The frame pointer, arg pointer, etc. are considered stable
> @@ -5324,8 +5415,42 @@ truncated_to_mode (enum machine_mode mode, const_rtx x)
> return false;
> }
>
> +/* Return true if RTX code CODE has a single sequence of zero or more
> + "e" operands and no rtvec operands. Initialize its rtx_all_subrtx_bounds
> + entry in that case. */
> +
> +static bool
> +setup_reg_subrtx_bounds (unsigned int code)
> +{
> + const char *format = GET_RTX_FORMAT ((enum rtx_code) code);
> + unsigned int i = 0;
> + for (; format[i] != 'e'; ++i)
> + {
> + if (!format[i])
> + /* No subrtxes. Leave start and count as 0. */
> + return true;
> + if (format[i] == 'E' || format[i] == 'V')
> + return false;
> + }
> +
> + /* Record the sequence of 'e's. */
> + rtx_all_subrtx_bounds[code].start = i;
> + do
> + ++i;
> + while (format[i] == 'e');
> + rtx_all_subrtx_bounds[code].count = i - rtx_all_subrtx_bounds[code].start;
> + /* rtl-iter.h relies on this. */
> + gcc_checking_assert (rtx_all_subrtx_bounds[code].count <= 3);
> +
> + for (; format[i]; ++i)
> + if (format[i] == 'E' || format[i] == 'V' || format[i] == 'e')
> + return false;
> +
> + return true;
> +}
> +
> /* Initialize non_rtx_starting_operands, which is used to speed up
> - for_each_rtx. */
> + for_each_rtx, and rtx_all_subrtx_bounds. */
> void
> init_rtlanal (void)
> {
> @@ -5335,6 +5460,10 @@ init_rtlanal (void)
> const char *format = GET_RTX_FORMAT (i);
> const char *first = strpbrk (format, "eEV");
> non_rtx_starting_operands[i] = first ? first - format : -1;
> + if (!setup_reg_subrtx_bounds (i))
> + rtx_all_subrtx_bounds[i].count = UCHAR_MAX;
> + if (GET_RTX_CLASS (i) != RTX_CONST_OBJ)
> + rtx_nonconst_subrtx_bounds[i] = rtx_all_subrtx_bounds[i];
> }
>
> init_num_sign_bit_copies_in_rep ();