On Fri, 27 Oct 2017, Jakub Jelinek wrote:

> Hi!
>
> The following patch attempts to improve store merging; for the time being it still only optimizes constant stores to adjacent memory.
>
> The biggest improvement is handling bitfields: it uses the get_bit_range helper to find the bounds of what may be modified when modifying a bitfield, and instead of requiring all the stores to be adjacent it now only requires that their bitregion_* regions are adjacent.  If get_bit_range fails (e.g. for non-C/C++ languages), it still rounds the boundaries down and up to whole bytes, as any kind of change within a byte affects the rest of it.  At the end, if there are any gaps in between the stored values, the old value is loaded from memory (I had to set TREE_NO_WARNING on it, so that uninit doesn't complain), masked with mask, ORed with the constant masked with the negation of mask, and stored; pretty much what the expansion emits.  As an incremental improvement, in cases where all the stored bitfields in one load/store set are adjacent, perhaps we could emit a BIT_INSERT_EXPR instead of doing the and/or.
>
> Another improvement is related to alignment handling.  Previously the code used get_object_alignment, which e.g. for the store_merging_11.c testcase has to return 8-bit alignment: the whole struct is 64-bit aligned, but the first store is 1 byte after that.  On targets that don't allow unaligned stores, or where they are slow, the old code would then emit just byte stores (many of them).  The patch uses get_object_alignment_1, so that we get both the maximum known alignment and the misalignment, and computes the alignment for every bitpos we try, such that for stores starting 1 byte after a 64-bit aligned boundary we get a 1 byte store, then a 2 byte, then a 4 byte and then an 8 byte store.
>
> Another improvement is for targets that allow unaligned stores: the new code performs a dry run of split_group and, if it determines that aligned stores need as many stores as unaligned ones or fewer, it prefers the aligned stores.  E.g. for the case in store_merging_11.c, where ptr is 64-bit aligned and we store 15 bytes, unaligned stores (and unpatched gcc) would choose to do an 8 byte, then 4 byte, then 2 byte and then 1 byte store.  Aligned stores (1, 2, 4 and 8 bytes in that order) are also 4 stores, so it is better to do those.
>
> The patch also attempts to reuse the original stores (well, just their lhs/rhs1) if we choose a split store that has a single original insn in it.  That way we don't lose ARRAY_REFs/COMPONENT_REFs etc. unnecessarily.  Furthermore, if there is an original store larger than the maximum we try (wordsize), e.g. an 8 byte long long store followed by a 1 byte store followed by another 1 byte store, on 32-bit targets we'd previously try to split it into a 4 byte, a 4 byte and a 2 byte store, figure out that this is 3 stores just like before, and give up.  With the patch, if we see a single original larger store at the bitpos we want, we just reuse that store, so in that case we get an 8 byte original store (lhs/rhs1) followed by a 2 byte store.
>
> find_constituent_stmts now avoids unnecessarily walking group->stores entries that are already known to end before the bitpos we ask for.  It also fixes comparisons that were off by one, because of which it often used to include more original stores than were really in the split store.
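To make the masked read-modify-write described above concrete, here is a minimal C sketch.  It is not the GIMPLE the pass emits; the val/mask names merely mirror the byte arrays that apply_stores builds, and the constants assume the bit layout of struct T from the new store_merging_10.c testcase (below) on a little-endian target, matching the andl/orl immediates in the x86_64 assembly quoted later.

  /* bar () in store_merging_10.c writes b1, b2 and b4 of struct T but not
     the 14-bit b3, so the merged group has a gap and a RMW is emitted.  */
  void
  bar_merged (unsigned int *word)
  {
    unsigned int mask = 0x7ffe0000u; /* bits preserved (b3, not written)         */
    unsigned int val = 0x00000001u;  /* constant being stored (b1=1, b2=0, b4=0) */

    unsigned int old = *word;              /* load; TREE_NO_WARNING is set on it */
    *word = (old & mask) | (val & ~mask);  /* keep old b3, insert the new bits   */
  }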
>
> Another change: output_merged_store used to emit the new stores into a sequence and, if it found out there were too many, released all the SSA names and failed.  That seems unnecessary to me, because we know how many split stores there are before entering the loop, so we can just fail at that point and only start emitting something once we have decided to do the replacement.
>
> I had to disable store merging in the g++.dg/pr71694.C testcase, but that is just because the testcase doesn't test what it should.  In my understanding, it wants to verify that the c.c store doesn't use a 32-bit RMW, because that would create a data race on c.d.  But it stores both c.c and c.d next to each other, so even though c.c's bitregion is the first 3 bytes and c.d's bitregion is the following byte, we touch bytes in both of the regions and thus a RMW cycle on the whole 32-bit word is fine: as c.d is written too, the new value is stored and the old value of the c.d byte is ignored.  What would be wrong (but store merging doesn't emit it) is what we emitted before, i.e. a 32-bit RMW that stored just c.c, followed by a c.d store.  Another thread could have stored value1 into c.d; we read the 32-bit word and modify it, while another thread stores value2 into c.d; then we write the 32-bit word back and thus reintroduce value1 into c.d, and another thread reads it and finds value1 instead of the expected value2; finally we store value3 into c.d.  So, an alternative to -fno-store-merging in the testcase would probably be separate functions, one storing to c.c and the other to c.d; then we can make sure neither store uses movl.  Though it probably still should only look at movl stores or loads; other moves are fine.
>
> The included store_merging_10.c improves on x86_64 from:
>   movzbl  (%rdi), %eax
> - andl    $-19, %eax
> + andl    $-32, %eax
>   orl     $13, %eax
>   movb    %al, (%rdi)
> in foo and
> - orb     $1, (%rdi)
>   movl    (%rdi), %eax
> - andl    $-131071, %eax
> + andl    $2147352576, %eax
> + orl     $1, %eax
>   movl    %eax, (%rdi)
> - shrl    $24, %eax
> - andl    $127, %eax
> - movb    %al, 3(%rdi)
> in bar.  foo is something combine.c managed to optimize too, but it couldn't handle bar.
> In store_merging_11.c on x86_64, bar is the same and foo changed:
> - movabsq $578437695752115969, %rax
> - movl    $0, 9(%rdi)
> - movb    $0, 15(%rdi)
> - movq    %rax, 1(%rdi)
> - xorl    %eax, %eax
> - movw    %ax, 13(%rdi)
> + movl    $23, %eax
> + movb    $1, 1(%rdi)
> + movl    $117835012, 4(%rdi)
> + movw    %ax, 2(%rdi)
> + movq    $8, 8(%rdi)
> which is not only shorter, but all the stores are aligned.
> On ppc64le in store_merging_10.c the difference is:
> - lwz 9,0(3)
> + lbz 9,0(3)
>   rlwinm 9,9,0,0,26
>   ori 9,9,0xd
> - stw 9,0(3)
> + stb 9,0(3)
> in foo and
>   lwz 9,0(3)
> + rlwinm 9,9,0,1,14
>   ori 9,9,0x1
> - rlwinm 9,9,0,31,14
> - rlwinm 9,9,0,1,31
>   stw 9,0(3)
> in bar, and in store_merging_11.c the difference is:
> - lis 8,0x807
> - li 9,0
> - ori 8,8,0x605
> - li 10,0
> - sldi 8,8,32
> - stw 9,9(3)
> - sth 9,13(3)
> - oris 8,8,0x400
> - stb 10,15(3)
> - ori 8,8,0x1701
> - mtvsrd 0,8
> - stfd 0,1(3)
> + lis 9,0x706
> + li 7,1
> + li 8,23
> + ori 9,9,0x504
> + li 10,8
> + stb 7,1(3)
> + sth 8,2(3)
> + stw 9,4(3)
> + std 10,8(3)
> in foo and no changes in bar.
>
> What the patch doesn't implement yet, but would also be possible in the allow_unaligned case, is that in store_merging_11.c, when we are storing 15 bytes, we could store 8 bytes at offset 1 and 8 bytes at offset 8 (i.e. create two overlapping stores, in this case one aligned and one unaligned).
>
> Bootstrapped/regtested on x86_64-linux, i686-linux and powerpc64le-linux, ok for trunk?
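To spell out the data-race argument around pr71694.C, here is a small C sketch.  The field layout below is only assumed from the description above (a 3-byte bit region for c followed by a separate byte for d); it is not copied from the actual testcase.

  struct S
  {
    unsigned int c : 24;  /* bit region: first 3 bytes */
    unsigned char d;      /* own 1-byte bit region     */
  };

  /* Storing both fields: a single 32-bit RMW covering c and d is fine,
     because d is overwritten anyway, so no stale d byte can be written
     back.  This is what the testcase actually does.  */
  void
  store_both (struct S *p)
  {
    p->c = 1;
    p->d = 2;
  }

  /* Storing only c: here a 32-bit RMW would be wrong.  Between our load
     and our store another thread may legitimately update p->d, and the
     wide store would overwrite it with the stale byte we loaded.  */
  void
  store_c_only (struct S *p)
  {
    p->c = 1;
  }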
Ok.

Thanks,
Richard.

> 2017-10-27  Jakub Jelinek  <ja...@redhat.com>
>
>     PR middle-end/22141
>     * gimple-ssa-store-merging.c: Include rtl.h and expr.h.
>     (struct store_immediate_info): Add bitregion_start and bitregion_end fields.
>     (store_immediate_info::store_immediate_info): Add brs and bre arguments and initialize bitregion_{start,end} from those.
>     (struct merged_store_group): Add bitregion_start, bitregion_end, align_base and mask fields.  Drop unnecessary struct keyword from struct store_immediate_info.  Add do_merge method.
>     (clear_bit_region_be): Use memset instead of loop storing zeros.
>     (merged_store_group::do_merge): New method.
>     (merged_store_group::merge_into): Use do_merge.  Allow gaps in between stores as long as the surrounding bitregions have no gaps.
>     (merged_store_group::merge_overlapping): Use do_merge.
>     (merged_store_group::apply_stores): Test that bitregion_{start,end} is byte aligned, rather than requiring that start and width are byte aligned.  Drop unnecessary struct keyword from struct store_immediate_info.  Allocate and populate also mask array.  Make start of the arrays relative to bitregion_start rather than start and size them according to bitregion_{end,start} difference.
>     (struct imm_store_chain_info): Drop unnecessary struct keyword from struct store_immediate_info.
>     (pass_store_merging::gate): Punt if BITS_PER_UNIT or CHAR_BIT is not 8.
>     (pass_store_merging::terminate_all_aliasing_chains): Drop unnecessary struct keyword from struct store_immediate_info.
>     (imm_store_chain_info::coalesce_immediate_stores): Allow gaps in between stores as long as the surrounding bitregions have no gaps.  Formatting fixes.
>     (struct split_store): Add orig non-static data member.
>     (split_store::split_store): Initialize orig to false.
>     (find_constituent_stmts): Return store_immediate_info *, non-NULL if there is exactly a single original stmt.  Change stmts argument to pointer from reference, if NULL, don't push anything to it.  Add first argument, use it to optimize skipping over orig stmts that are known to be before bitpos already.  Simplify.
>     (split_group): Return unsigned int count how many stores are or would be needed rather than a bool.  Add allow_unaligned argument.  Change split_stores argument from reference to pointer, if NULL, only do a dry run computing how many stores would be produced.  Rewritten algorithm to use both alignment and misalign if !allow_unaligned and handle bitfield stores with gaps.
>     (imm_store_chain_info::output_merged_store): Set start_byte_pos from bitregion_start instead of start.  Compute allow_unaligned here, if true, do 2 split_group dry runs to compute which one produces fewer stores and prefer aligned if equal.  Punt if new count is bigger or equal than original before emitting any statements, rather than during that.  Remove no longer needed new_ssa_names tracking.  Replace num_stmts with split_stores.length ().  Use 32-bit stack allocated entries in split_stores auto_vec.  Try to reuse original store lhs/rhs1 if possible.  Handle bitfields with gaps.
>     (pass_store_merging::execute): Ignore bitsize == 0 stores.  Compute bitregion_{start,end} for the stores and construct store_immediate_info with that.  Formatting fixes.
>
>     * gcc.dg/store_merging_10.c: New test.
>     * gcc.dg/store_merging_11.c: New test.
>     * gcc.dg/store_merging_12.c: New test.
>     * g++.dg/pr71694.C: Add -fno-store-merging to dg-options.
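As a worked illustration of the aligned splitting strategy and the dry-run comparison described earlier, here is a small standalone C sketch.  count_aligned_splits is a made-up helper, not GCC code; it only models the "widest naturally aligned power-of-two store that still fits" walk, reproducing the 1+2+4+8 byte split for the 15-bytes-at-offset-1 case from store_merging_11.c.

  #include <stdio.h>

  /* Simplified model: starting at byte offset POS from a sufficiently
     aligned base, pick at each step the widest power-of-two store (at
     most MAX_SIZE bytes) that fits in the remaining LEN bytes and is
     naturally aligned at POS.  */
  static unsigned int
  count_aligned_splits (unsigned long pos, unsigned long len,
                        unsigned long max_size)
  {
    unsigned int n = 0;
    while (len > 0)
      {
        unsigned long sz = max_size;
        /* Halve until the store fits and POS is a multiple of its size.  */
        while (sz > len || (pos & (sz - 1)) != 0)
          sz /= 2;
        pos += sz;
        len -= sz;
        n++;
      }
    return n;
  }

  int
  main (void)
  {
    /* store_merging_11.c-like case: 15 bytes starting 1 byte past an
       8-byte aligned base.  Aligned splitting gives 1+2+4+8, i.e. 4
       stores, the same count as the unaligned 8+4+2+1 split, so the
       patch prefers the aligned variant.  */
    printf ("%u\n", count_aligned_splits (1, 15, 8));
    return 0;
  }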
> > --- gcc/gimple-ssa-store-merging.c.jj 2017-10-27 14:16:26.074249585 +0200 > +++ gcc/gimple-ssa-store-merging.c 2017-10-27 18:10:18.212762773 +0200 > @@ -126,6 +126,8 @@ > #include "tree-eh.h" > #include "target.h" > #include "gimplify-me.h" > +#include "rtl.h" > +#include "expr.h" /* For get_bit_range. */ > #include "selftest.h" > > /* The maximum size (in bits) of the stores this pass should generate. */ > @@ -142,17 +144,24 @@ struct store_immediate_info > { > unsigned HOST_WIDE_INT bitsize; > unsigned HOST_WIDE_INT bitpos; > + unsigned HOST_WIDE_INT bitregion_start; > + /* This is one past the last bit of the bit region. */ > + unsigned HOST_WIDE_INT bitregion_end; > gimple *stmt; > unsigned int order; > store_immediate_info (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT, > + unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT, > gimple *, unsigned int); > }; > > store_immediate_info::store_immediate_info (unsigned HOST_WIDE_INT bs, > unsigned HOST_WIDE_INT bp, > + unsigned HOST_WIDE_INT brs, > + unsigned HOST_WIDE_INT bre, > gimple *st, > unsigned int ord) > - : bitsize (bs), bitpos (bp), stmt (st), order (ord) > + : bitsize (bs), bitpos (bp), bitregion_start (brs), bitregion_end (bre), > + stmt (st), order (ord) > { > } > > @@ -164,26 +173,32 @@ struct merged_store_group > { > unsigned HOST_WIDE_INT start; > unsigned HOST_WIDE_INT width; > - /* The size of the allocated memory for val. */ > + unsigned HOST_WIDE_INT bitregion_start; > + unsigned HOST_WIDE_INT bitregion_end; > + /* The size of the allocated memory for val and mask. */ > unsigned HOST_WIDE_INT buf_size; > + unsigned HOST_WIDE_INT align_base; > > unsigned int align; > unsigned int first_order; > unsigned int last_order; > > - auto_vec<struct store_immediate_info *> stores; > + auto_vec<store_immediate_info *> stores; > /* We record the first and last original statements in the sequence because > we'll need their vuse/vdef and replacement position. It's easier to > keep > track of them separately as 'stores' is reordered by apply_stores. */ > gimple *last_stmt; > gimple *first_stmt; > unsigned char *val; > + unsigned char *mask; > > merged_store_group (store_immediate_info *); > ~merged_store_group (); > void merge_into (store_immediate_info *); > void merge_overlapping (store_immediate_info *); > bool apply_stores (); > +private: > + void do_merge (store_immediate_info *); > }; > > /* Debug helper. Dump LEN elements of byte array PTR to FD in hex. */ > @@ -287,8 +302,7 @@ clear_bit_region_be (unsigned char *ptr, > && len > BITS_PER_UNIT) > { > unsigned int nbytes = len / BITS_PER_UNIT; > - for (unsigned int i = 0; i < nbytes; i++) > - ptr[i] = 0U; > + memset (ptr, 0, nbytes); > if (len % BITS_PER_UNIT != 0) > clear_bit_region_be (ptr + nbytes, BITS_PER_UNIT - 1, > len % BITS_PER_UNIT); > @@ -549,10 +563,16 @@ merged_store_group::merged_store_group ( > { > start = info->bitpos; > width = info->bitsize; > + bitregion_start = info->bitregion_start; > + bitregion_end = info->bitregion_end; > /* VAL has memory allocated for it in apply_stores once the group > width has been finalized. 
*/ > val = NULL; > - align = get_object_alignment (gimple_assign_lhs (info->stmt)); > + mask = NULL; > + unsigned HOST_WIDE_INT align_bitpos = 0; > + get_object_alignment_1 (gimple_assign_lhs (info->stmt), > + &align, &align_bitpos); > + align_base = start - align_bitpos; > stores.create (1); > stores.safe_push (info); > last_stmt = info->stmt; > @@ -568,18 +588,24 @@ merged_store_group::~merged_store_group > XDELETEVEC (val); > } > > -/* Merge a store recorded by INFO into this merged store. > - The store is not overlapping with the existing recorded > - stores. */ > - > +/* Helper method for merge_into and merge_overlapping to do > + the common part. */ > void > -merged_store_group::merge_into (store_immediate_info *info) > +merged_store_group::do_merge (store_immediate_info *info) > { > - unsigned HOST_WIDE_INT wid = info->bitsize; > - /* Make sure we're inserting in the position we think we're inserting. */ > - gcc_assert (info->bitpos == start + width); > + bitregion_start = MIN (bitregion_start, info->bitregion_start); > + bitregion_end = MAX (bitregion_end, info->bitregion_end); > + > + unsigned int this_align; > + unsigned HOST_WIDE_INT align_bitpos = 0; > + get_object_alignment_1 (gimple_assign_lhs (info->stmt), > + &this_align, &align_bitpos); > + if (this_align > align) > + { > + align = this_align; > + align_base = info->bitpos - align_bitpos; > + } > > - width += wid; > gimple *stmt = info->stmt; > stores.safe_push (info); > if (info->order > last_order) > @@ -594,6 +620,22 @@ merged_store_group::merge_into (store_im > } > } > > +/* Merge a store recorded by INFO into this merged store. > + The store is not overlapping with the existing recorded > + stores. */ > + > +void > +merged_store_group::merge_into (store_immediate_info *info) > +{ > + unsigned HOST_WIDE_INT wid = info->bitsize; > + /* Make sure we're inserting in the position we think we're inserting. */ > + gcc_assert (info->bitpos >= start + width > + && info->bitregion_start <= bitregion_end); > + > + width += wid; > + do_merge (info); > +} > + > /* Merge a store described by INFO into this merged store. > INFO overlaps in some way with the current store (i.e. it's not contiguous > which is handled by merged_store_group::merge_into). */ > @@ -601,23 +643,11 @@ merged_store_group::merge_into (store_im > void > merged_store_group::merge_overlapping (store_immediate_info *info) > { > - gimple *stmt = info->stmt; > - stores.safe_push (info); > - > /* If the store extends the size of the group, extend the width. */ > - if ((info->bitpos + info->bitsize) > (start + width)) > + if (info->bitpos + info->bitsize > start + width) > width += info->bitpos + info->bitsize - (start + width); > > - if (info->order > last_order) > - { > - last_order = info->order; > - last_stmt = stmt; > - } > - else if (info->order < first_order) > - { > - first_order = info->order; > - first_stmt = stmt; > - } > + do_merge (info); > } > > /* Go through all the recorded stores in this group in program order and > @@ -627,27 +657,28 @@ merged_store_group::merge_overlapping (s > bool > merged_store_group::apply_stores () > { > - /* The total width of the stores must add up to a whole number of bytes > - and start at a byte boundary. We don't support emitting bitfield > - references for now. Also, make sure we have more than one store > - in the group, otherwise we cannot merge anything. 
*/ > - if (width % BITS_PER_UNIT != 0 > - || start % BITS_PER_UNIT != 0 > + /* Make sure we have more than one store in the group, otherwise we cannot > + merge anything. */ > + if (bitregion_start % BITS_PER_UNIT != 0 > + || bitregion_end % BITS_PER_UNIT != 0 > || stores.length () == 1) > return false; > > stores.qsort (sort_by_order); > - struct store_immediate_info *info; > + store_immediate_info *info; > unsigned int i; > /* Create a buffer of a size that is 2 times the number of bytes we're > storing. That way native_encode_expr can write power-of-2-sized > chunks without overrunning. */ > - buf_size = 2 * (ROUND_UP (width, BITS_PER_UNIT) / BITS_PER_UNIT); > - val = XCNEWVEC (unsigned char, buf_size); > + buf_size = 2 * ((bitregion_end - bitregion_start) / BITS_PER_UNIT); > + val = XNEWVEC (unsigned char, 2 * buf_size); > + mask = val + buf_size; > + memset (val, 0, buf_size); > + memset (mask, ~0U, buf_size); > > FOR_EACH_VEC_ELT (stores, i, info) > { > - unsigned int pos_in_buffer = info->bitpos - start; > + unsigned int pos_in_buffer = info->bitpos - bitregion_start; > bool ret = encode_tree_to_bitpos (gimple_assign_rhs1 (info->stmt), > val, info->bitsize, > pos_in_buffer, buf_size); > @@ -668,6 +699,11 @@ merged_store_group::apply_stores () > } > if (!ret) > return false; > + unsigned char *m = mask + (pos_in_buffer / BITS_PER_UNIT); > + if (BYTES_BIG_ENDIAN) > + clear_bit_region_be (m, pos_in_buffer % BITS_PER_UNIT, info->bitsize); > + else > + clear_bit_region (m, pos_in_buffer % BITS_PER_UNIT, info->bitsize); > } > return true; > } > @@ -682,7 +718,7 @@ struct imm_store_chain_info > See pass_store_merging::m_stores_head for more rationale. */ > imm_store_chain_info *next, **pnxp; > tree base_addr; > - auto_vec<struct store_immediate_info *> m_store_info; > + auto_vec<store_immediate_info *> m_store_info; > auto_vec<merged_store_group *> m_merged_store_groups; > > imm_store_chain_info (imm_store_chain_info *&inspt, tree b_a) > @@ -730,11 +766,16 @@ public: > { > } > > - /* Pass not supported for PDP-endianness. */ > + /* Pass not supported for PDP-endianness, nor for insane hosts > + or target character sizes where native_{encode,interpret}_expr > + doesn't work properly. */ > virtual bool > gate (function *) > { > - return flag_store_merging && (WORDS_BIG_ENDIAN == BYTES_BIG_ENDIAN); > + return flag_store_merging > + && WORDS_BIG_ENDIAN == BYTES_BIG_ENDIAN > + && CHAR_BIT == 8 > + && BITS_PER_UNIT == 8; > } > > virtual unsigned int execute (function *); > @@ -811,7 +852,7 @@ pass_store_merging::terminate_all_aliasi > aliases with any of them. */ > else > { > - struct store_immediate_info *info; > + store_immediate_info *info; > unsigned int i; > FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info) > { > @@ -926,8 +967,9 @@ imm_store_chain_info::coalesce_immediate > } > > /* |---store 1---| <gap> |---store 2---|. > - Gap between stores. Start a new group. */ > - if (start != merged_store->start + merged_store->width) > + Gap between stores. Start a new group if there are any gaps > + between bitregions. */ > + if (info->bitregion_start > merged_store->bitregion_end) > { > /* Try to apply all the stores recorded for the group to determine > the bitpattern they write and discard it if that fails. > @@ -948,11 +990,11 @@ imm_store_chain_info::coalesce_immediate > merged_store->merge_into (info); > } > > - /* Record or discard the last store group. 
*/ > - if (!merged_store->apply_stores ()) > - delete merged_store; > - else > - m_merged_store_groups.safe_push (merged_store); > + /* Record or discard the last store group. */ > + if (!merged_store->apply_stores ()) > + delete merged_store; > + else > + m_merged_store_groups.safe_push (merged_store); > > gcc_assert (m_merged_store_groups.length () <= m_store_info.length ()); > bool success > @@ -961,8 +1003,8 @@ imm_store_chain_info::coalesce_immediate > > if (success && dump_file) > fprintf (dump_file, "Coalescing successful!\n" > - "Merged into %u stores\n", > - m_merged_store_groups.length ()); > + "Merged into %u stores\n", > + m_merged_store_groups.length ()); > > return success; > } > @@ -1016,6 +1058,8 @@ struct split_store > unsigned HOST_WIDE_INT size; > unsigned HOST_WIDE_INT align; > auto_vec<gimple *> orig_stmts; > + /* True if there is a single orig stmt covering the whole split store. */ > + bool orig; > split_store (unsigned HOST_WIDE_INT, unsigned HOST_WIDE_INT, > unsigned HOST_WIDE_INT); > }; > @@ -1025,100 +1069,198 @@ struct split_store > split_store::split_store (unsigned HOST_WIDE_INT bp, > unsigned HOST_WIDE_INT sz, > unsigned HOST_WIDE_INT al) > - : bytepos (bp), size (sz), align (al) > + : bytepos (bp), size (sz), align (al), orig (false) > { > orig_stmts.create (0); > } > > /* Record all statements corresponding to stores in GROUP that write to > the region starting at BITPOS and is of size BITSIZE. Record such > - statements in STMTS. The stores in GROUP must be sorted by > - bitposition. */ > + statements in STMTS if non-NULL. The stores in GROUP must be sorted by > + bitposition. Return INFO if there is exactly one original store > + in the range. */ > > -static void > +static store_immediate_info * > find_constituent_stmts (struct merged_store_group *group, > - auto_vec<gimple *> &stmts, > - unsigned HOST_WIDE_INT bitpos, > - unsigned HOST_WIDE_INT bitsize) > + vec<gimple *> *stmts, > + unsigned int *first, > + unsigned HOST_WIDE_INT bitpos, > + unsigned HOST_WIDE_INT bitsize) > { > - struct store_immediate_info *info; > + store_immediate_info *info, *ret = NULL; > unsigned int i; > + bool second = false; > + bool update_first = true; > unsigned HOST_WIDE_INT end = bitpos + bitsize; > - FOR_EACH_VEC_ELT (group->stores, i, info) > + for (i = *first; group->stores.iterate (i, &info); ++i) > { > unsigned HOST_WIDE_INT stmt_start = info->bitpos; > unsigned HOST_WIDE_INT stmt_end = stmt_start + info->bitsize; > - if (stmt_end < bitpos) > - continue; > + if (stmt_end <= bitpos) > + { > + /* BITPOS passed to this function never decreases from within the > + same split_group call, so optimize and don't scan info records > + which are known to end before or at BITPOS next time. > + Only do it if all stores before this one also pass this. */ > + if (update_first) > + *first = i + 1; > + continue; > + } > + else > + update_first = false; > + > /* The stores in GROUP are ordered by bitposition so if we're past > - the region for this group return early. */ > - if (stmt_start > end) > - return; > - > - if (IN_RANGE (stmt_start, bitpos, bitpos + bitsize) > - || IN_RANGE (stmt_end, bitpos, end) > - /* The statement writes a region that completely encloses the region > - that this group writes. Unlikely to occur but let's > - handle it. */ > - || IN_RANGE (bitpos, stmt_start, stmt_end)) > - stmts.safe_push (info->stmt); > + the region for this group return early. 
*/ > + if (stmt_start >= end) > + return ret; > + > + if (stmts) > + { > + stmts->safe_push (info->stmt); > + if (ret) > + { > + ret = NULL; > + second = true; > + } > + } > + else if (ret) > + return NULL; > + if (!second) > + ret = info; > } > + return ret; > } > > /* Split a merged store described by GROUP by populating the SPLIT_STORES > - vector with split_store structs describing the byte offset (from the > base), > - the bit size and alignment of each store as well as the original > statements > - involved in each such split group. > + vector (if non-NULL) with split_store structs describing the byte offset > + (from the base), the bit size and alignment of each store as well as the > + original statements involved in each such split group. > This is to separate the splitting strategy from the statement > building/emission/linking done in output_merged_store. > - At the moment just start with the widest possible size and keep emitting > - the widest we can until we have emitted all the bytes, halving the size > - when appropriate. */ > - > -static bool > -split_group (merged_store_group *group, > - auto_vec<struct split_store *> &split_stores) > + Return number of new stores. > + If SPLIT_STORES is NULL, it is just a dry run to count number of > + new stores. */ > + > +static unsigned int > +split_group (merged_store_group *group, bool allow_unaligned, > + vec<struct split_store *> *split_stores) > { > - unsigned HOST_WIDE_INT pos = group->start; > - unsigned HOST_WIDE_INT size = group->width; > + unsigned HOST_WIDE_INT pos = group->bitregion_start; > + unsigned HOST_WIDE_INT size = group->bitregion_end - pos; > unsigned HOST_WIDE_INT bytepos = pos / BITS_PER_UNIT; > - unsigned HOST_WIDE_INT align = group->align; > + unsigned HOST_WIDE_INT group_align = group->align; > + unsigned HOST_WIDE_INT align_base = group->align_base; > > - /* We don't handle partial bitfields for now. We shouldn't have > - reached this far. */ > gcc_assert ((size % BITS_PER_UNIT == 0) && (pos % BITS_PER_UNIT == 0)); > > - bool allow_unaligned > - = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_MERGING_ALLOW_UNALIGNED); > - > - unsigned int try_size = MAX_STORE_BITSIZE; > - while (try_size > size > - || (!allow_unaligned > - && try_size > align)) > - { > - try_size /= 2; > - if (try_size < BITS_PER_UNIT) > - return false; > - } > - > + unsigned int ret = 0, first = 0; > unsigned HOST_WIDE_INT try_pos = bytepos; > group->stores.qsort (sort_by_bitpos); > > while (size > 0) > { > - struct split_store *store = new split_store (try_pos, try_size, align); > + if ((allow_unaligned || group_align <= BITS_PER_UNIT) > + && group->mask[try_pos - bytepos] == (unsigned char) ~0U) > + { > + /* Skip padding bytes. 
*/ > + ++try_pos; > + size -= BITS_PER_UNIT; > + continue; > + } > + > unsigned HOST_WIDE_INT try_bitpos = try_pos * BITS_PER_UNIT; > - find_constituent_stmts (group, store->orig_stmts, try_bitpos, > try_size); > - split_stores.safe_push (store); > + unsigned int try_size = MAX_STORE_BITSIZE, nonmasked; > + unsigned HOST_WIDE_INT align_bitpos > + = (try_bitpos - align_base) & (group_align - 1); > + unsigned HOST_WIDE_INT align = group_align; > + if (align_bitpos) > + align = least_bit_hwi (align_bitpos); > + if (!allow_unaligned) > + try_size = MIN (try_size, align); > + store_immediate_info *info > + = find_constituent_stmts (group, NULL, &first, try_bitpos, try_size); > + if (info) > + { > + /* If there is just one original statement for the range, see if > + we can just reuse the original store which could be even larger > + than try_size. */ > + unsigned HOST_WIDE_INT stmt_end > + = ROUND_UP (info->bitpos + info->bitsize, BITS_PER_UNIT); > + info = find_constituent_stmts (group, NULL, &first, try_bitpos, > + stmt_end - try_bitpos); > + if (info && info->bitpos >= try_bitpos) > + { > + try_size = stmt_end - try_bitpos; > + goto found; > + } > + } > > - try_pos += try_size / BITS_PER_UNIT; > + /* Approximate store bitsize for the case when there are no padding > + bits. */ > + while (try_size > size) > + try_size /= 2; > + /* Now look for whole padding bytes at the end of that bitsize. */ > + for (nonmasked = try_size / BITS_PER_UNIT; nonmasked > 0; --nonmasked) > + if (group->mask[try_pos - bytepos + nonmasked - 1] > + != (unsigned char) ~0U) > + break; > + if (nonmasked == 0) > + { > + /* If entire try_size range is padding, skip it. */ > + try_pos += try_size / BITS_PER_UNIT; > + size -= try_size; > + continue; > + } > + /* Otherwise try to decrease try_size if second half, last 3 quarters > + etc. are padding. */ > + nonmasked *= BITS_PER_UNIT; > + while (nonmasked <= try_size / 2) > + try_size /= 2; > + if (!allow_unaligned && group_align > BITS_PER_UNIT) > + { > + /* Now look for whole padding bytes at the start of that bitsize. */ > + unsigned int try_bytesize = try_size / BITS_PER_UNIT, masked; > + for (masked = 0; masked < try_bytesize; ++masked) > + if (group->mask[try_pos - bytepos + masked] != (unsigned char) ~0U) > + break; > + masked *= BITS_PER_UNIT; > + gcc_assert (masked < try_size); > + if (masked >= try_size / 2) > + { > + while (masked >= try_size / 2) > + { > + try_size /= 2; > + try_pos += try_size / BITS_PER_UNIT; > + size -= try_size; > + masked -= try_size; > + } > + /* Need to recompute the alignment, so just retry at the new > + position. */ > + continue; > + } > + } > > + found: > + ++ret; > + > + if (split_stores) > + { > + struct split_store *store > + = new split_store (try_pos, try_size, align); > + info = find_constituent_stmts (group, &store->orig_stmts, > + &first, try_bitpos, try_size); > + if (info > + && info->bitpos >= try_bitpos > + && info->bitpos + info->bitsize <= try_bitpos + try_size) > + store->orig = true; > + split_stores->safe_push (store); > + } > + > + try_pos += try_size / BITS_PER_UNIT; > size -= try_size; > - align = try_size; > - while (size < try_size) > - try_size /= 2; > } > - return true; > + > + return ret; > } > > /* Given a merged store group GROUP output the widened version of it. 
> @@ -1132,31 +1274,50 @@ split_group (merged_store_group *group, > bool > imm_store_chain_info::output_merged_store (merged_store_group *group) > { > - unsigned HOST_WIDE_INT start_byte_pos = group->start / BITS_PER_UNIT; > + unsigned HOST_WIDE_INT start_byte_pos > + = group->bitregion_start / BITS_PER_UNIT; > > unsigned int orig_num_stmts = group->stores.length (); > if (orig_num_stmts < 2) > return false; > > - auto_vec<struct split_store *> split_stores; > + auto_vec<struct split_store *, 32> split_stores; > split_stores.create (0); > - if (!split_group (group, split_stores)) > - return false; > + bool allow_unaligned > + = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_MERGING_ALLOW_UNALIGNED); > + if (allow_unaligned) > + { > + /* If unaligned stores are allowed, see how many stores we'd emit > + for unaligned and how many stores we'd emit for aligned stores. > + Only use unaligned stores if it allows fewer stores than aligned. */ > + unsigned aligned_cnt = split_group (group, false, NULL); > + unsigned unaligned_cnt = split_group (group, true, NULL); > + if (aligned_cnt <= unaligned_cnt) > + allow_unaligned = false; > + } > + split_group (group, allow_unaligned, &split_stores); > + > + if (split_stores.length () >= orig_num_stmts) > + { > + /* We didn't manage to reduce the number of statements. Bail out. */ > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "Exceeded original number of stmts (%u)." > + " Not profitable to emit new sequence.\n", > + orig_num_stmts); > + } > + return false; > + } > > gimple_stmt_iterator last_gsi = gsi_for_stmt (group->last_stmt); > gimple_seq seq = NULL; > - unsigned int num_stmts = 0; > tree last_vdef, new_vuse; > last_vdef = gimple_vdef (group->last_stmt); > new_vuse = gimple_vuse (group->last_stmt); > > gimple *stmt = NULL; > - /* The new SSA names created. Keep track of them so that we can free them > - if we decide to not use the new sequence. */ > - auto_vec<tree> new_ssa_names; > split_store *split_store; > unsigned int i; > - bool fail = false; > > tree addr = force_gimple_operand_1 (unshare_expr (base_addr), &seq, > is_gimple_mem_ref_addr, NULL_TREE); > @@ -1165,48 +1326,76 @@ imm_store_chain_info::output_merged_stor > unsigned HOST_WIDE_INT try_size = split_store->size; > unsigned HOST_WIDE_INT try_pos = split_store->bytepos; > unsigned HOST_WIDE_INT align = split_store->align; > - tree offset_type = get_alias_type_for_stmts (split_store->orig_stmts); > - location_t loc = get_location_for_stmts (split_store->orig_stmts); > - > - tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED); > - int_type = build_aligned_type (int_type, align); > - tree dest = fold_build2 (MEM_REF, int_type, addr, > - build_int_cst (offset_type, try_pos)); > - > - tree src = native_interpret_expr (int_type, > - group->val + try_pos - start_byte_pos, > - group->buf_size); > + tree dest, src; > + location_t loc; > + if (split_store->orig) > + { > + /* If there is just a single constituent store which covers > + the whole area, just reuse the lhs and rhs. 
*/ > + dest = gimple_assign_lhs (split_store->orig_stmts[0]); > + src = gimple_assign_rhs1 (split_store->orig_stmts[0]); > + loc = gimple_location (split_store->orig_stmts[0]); > + } > + else > + { > + tree offset_type > + = get_alias_type_for_stmts (split_store->orig_stmts); > + loc = get_location_for_stmts (split_store->orig_stmts); > + > + tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED); > + int_type = build_aligned_type (int_type, align); > + dest = fold_build2 (MEM_REF, int_type, addr, > + build_int_cst (offset_type, try_pos)); > + src = native_interpret_expr (int_type, > + group->val + try_pos - start_byte_pos, > + group->buf_size); > + tree mask > + = native_interpret_expr (int_type, > + group->mask + try_pos - start_byte_pos, > + group->buf_size); > + if (!integer_zerop (mask)) > + { > + tree tem = make_ssa_name (int_type); > + tree load_src = unshare_expr (dest); > + /* The load might load some or all bits uninitialized, > + avoid -W*uninitialized warnings in that case. > + As optimization, it would be nice if all the bits are > + provably uninitialized (no stores at all yet or previous > + store a CLOBBER) we'd optimize away the load and replace > + it e.g. with 0. */ > + TREE_NO_WARNING (load_src) = 1; > + stmt = gimple_build_assign (tem, load_src); > + gimple_set_location (stmt, loc); > + gimple_set_vuse (stmt, new_vuse); > + gimple_seq_add_stmt_without_update (&seq, stmt); > + > + /* FIXME: If there is a single chunk of zero bits in mask, > + perhaps use BIT_INSERT_EXPR instead? */ > + stmt = gimple_build_assign (make_ssa_name (int_type), > + BIT_AND_EXPR, tem, mask); > + gimple_set_location (stmt, loc); > + gimple_seq_add_stmt_without_update (&seq, stmt); > + tem = gimple_assign_lhs (stmt); > + > + src = wide_int_to_tree (int_type, > + wi::bit_and_not (wi::to_wide (src), > + wi::to_wide (mask))); > + stmt = gimple_build_assign (make_ssa_name (int_type), > + BIT_IOR_EXPR, tem, src); > + gimple_set_location (stmt, loc); > + gimple_seq_add_stmt_without_update (&seq, stmt); > + src = gimple_assign_lhs (stmt); > + } > + } > > stmt = gimple_build_assign (dest, src); > gimple_set_location (stmt, loc); > gimple_set_vuse (stmt, new_vuse); > gimple_seq_add_stmt_without_update (&seq, stmt); > > - /* We didn't manage to reduce the number of statements. Bail out. */ > - if (++num_stmts == orig_num_stmts) > - { > - if (dump_file && (dump_flags & TDF_DETAILS)) > - { > - fprintf (dump_file, "Exceeded original number of stmts (%u)." > - " Not profitable to emit new sequence.\n", > - orig_num_stmts); > - } > - unsigned int ssa_count; > - tree ssa_name; > - /* Don't forget to cleanup the temporary SSA names. 
*/ > - FOR_EACH_VEC_ELT (new_ssa_names, ssa_count, ssa_name) > - release_ssa_name (ssa_name); > - > - fail = true; > - break; > - } > - > tree new_vdef; > if (i < split_stores.length () - 1) > - { > - new_vdef = make_ssa_name (gimple_vop (cfun), stmt); > - new_ssa_names.safe_push (new_vdef); > - } > + new_vdef = make_ssa_name (gimple_vop (cfun), stmt); > else > new_vdef = last_vdef; > > @@ -1218,15 +1407,12 @@ imm_store_chain_info::output_merged_stor > FOR_EACH_VEC_ELT (split_stores, i, split_store) > delete split_store; > > - if (fail) > - return false; > - > gcc_assert (seq); > if (dump_file) > { > fprintf (dump_file, > "New sequence of %u stmts to replace old one of %u stmts\n", > - num_stmts, orig_num_stmts); > + split_stores.length (), orig_num_stmts); > if (dump_flags & TDF_DETAILS) > print_gimple_seq (dump_file, seq, 0, TDF_VOPS | TDF_MEMSYMS); > } > @@ -1387,12 +1573,25 @@ pass_store_merging::execute (function *f > tree rhs = gimple_assign_rhs1 (stmt); > > HOST_WIDE_INT bitsize, bitpos; > + unsigned HOST_WIDE_INT bitregion_start = 0; > + unsigned HOST_WIDE_INT bitregion_end = 0; > machine_mode mode; > int unsignedp = 0, reversep = 0, volatilep = 0; > tree offset, base_addr; > base_addr > = get_inner_reference (lhs, &bitsize, &bitpos, &offset, &mode, > &unsignedp, &reversep, &volatilep); > + if (TREE_CODE (lhs) == COMPONENT_REF > + && DECL_BIT_FIELD_TYPE (TREE_OPERAND (lhs, 1))) > + { > + get_bit_range (&bitregion_start, &bitregion_end, lhs, > + &bitpos, &offset); > + if (bitregion_end) > + ++bitregion_end; > + } > + if (bitsize == 0) > + continue; > + > /* As a future enhancement we could handle stores with the same > base and offset. */ > bool invalid = reversep > @@ -1414,7 +1613,26 @@ pass_store_merging::execute (function *f > bit_off = byte_off << LOG2_BITS_PER_UNIT; > bit_off += bitpos; > if (!wi::neg_p (bit_off) && wi::fits_shwi_p (bit_off)) > - bitpos = bit_off.to_shwi (); > + { > + bitpos = bit_off.to_shwi (); > + if (bitregion_end) > + { > + bit_off = byte_off << LOG2_BITS_PER_UNIT; > + bit_off += bitregion_start; > + if (wi::fits_uhwi_p (bit_off)) > + { > + bitregion_start = bit_off.to_uhwi (); > + bit_off = byte_off << LOG2_BITS_PER_UNIT; > + bit_off += bitregion_end; > + if (wi::fits_uhwi_p (bit_off)) > + bitregion_end = bit_off.to_uhwi (); > + else > + bitregion_end = 0; > + } > + else > + bitregion_end = 0; > + } > + } > else > invalid = true; > base_addr = TREE_OPERAND (base_addr, 0); > @@ -1428,6 +1646,12 @@ pass_store_merging::execute (function *f > base_addr = build_fold_addr_expr (base_addr); > } > > + if (!bitregion_end) > + { > + bitregion_start = ROUND_DOWN (bitpos, BITS_PER_UNIT); > + bitregion_end = ROUND_UP (bitpos + bitsize, BITS_PER_UNIT); > + } > + > if (! 
invalid > && offset != NULL_TREE) > { > @@ -1457,9 +1681,11 @@ pass_store_merging::execute (function *f > store_immediate_info *info; > if (chain_info) > { > - info = new store_immediate_info ( > - bitsize, bitpos, stmt, > - (*chain_info)->m_store_info.length ()); > + unsigned int ord = (*chain_info)->m_store_info.length (); > + info = new store_immediate_info (bitsize, bitpos, > + bitregion_start, > + bitregion_end, > + stmt, ord); > if (dump_file && (dump_flags & TDF_DETAILS)) > { > fprintf (dump_file, > @@ -1488,6 +1714,8 @@ pass_store_merging::execute (function *f > struct imm_store_chain_info *new_chain > = new imm_store_chain_info (m_stores_head, base_addr); > info = new store_immediate_info (bitsize, bitpos, > + bitregion_start, > + bitregion_end, > stmt, 0); > new_chain->m_store_info.safe_push (info); > m_stores.put (base_addr, new_chain); > --- gcc/testsuite/gcc.dg/store_merging_10.c.jj 2017-10-27 > 14:52:29.724755656 +0200 > +++ gcc/testsuite/gcc.dg/store_merging_10.c 2017-10-27 14:52:29.724755656 > +0200 > @@ -0,0 +1,56 @@ > +/* { dg-do compile } */ > +/* { dg-require-effective-target store_merge } */ > +/* { dg-options "-O2 -fdump-tree-store-merging" } */ > + > +struct S { > + unsigned int b1:1; > + unsigned int b2:1; > + unsigned int b3:1; > + unsigned int b4:1; > + unsigned int b5:1; > + unsigned int b6:27; > +}; > + > +struct T { > + unsigned int b1:1; > + unsigned int b2:16; > + unsigned int b3:14; > + unsigned int b4:1; > +}; > + > +__attribute__((noipa)) void > +foo (struct S *x) > +{ > + x->b1 = 1; > + x->b2 = 0; > + x->b3 = 1; > + x->b4 = 1; > + x->b5 = 0; > +} > + > +__attribute__((noipa)) void > +bar (struct T *x) > +{ > + x->b1 = 1; > + x->b2 = 0; > + x->b4 = 0; > +} > + > +struct S s = { 0, 1, 0, 0, 1, 0x3a5f05a }; > +struct T t = { 0, 0xf5af, 0x3a5a, 1 }; > + > +int > +main () > +{ > + asm volatile ("" : : : "memory"); > + foo (&s); > + bar (&t); > + asm volatile ("" : : : "memory"); > + if (s.b1 != 1 || s.b2 != 0 || s.b3 != 1 || s.b4 != 1 || s.b5 != 0 || s.b6 > != 0x3a5f05a) > + __builtin_abort (); > + if (t.b1 != 1 || t.b2 != 0 || t.b3 != 0x3a5a || t.b4 != 0) > + __builtin_abort (); > + return 0; > +} > + > +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging" > } } */ > --- gcc/testsuite/gcc.dg/store_merging_11.c.jj 2017-10-27 > 14:52:29.725755644 +0200 > +++ gcc/testsuite/gcc.dg/store_merging_11.c 2017-10-27 14:52:29.725755644 > +0200 > @@ -0,0 +1,47 @@ > +/* { dg-do compile } */ > +/* { dg-require-effective-target store_merge } */ > +/* { dg-options "-O2 -fdump-tree-store-merging" } */ > + > +struct S { unsigned char b[2]; unsigned short c; unsigned char d[4]; > unsigned long e; }; > + > +__attribute__((noipa)) void > +foo (struct S *p) > +{ > + p->b[1] = 1; > + p->c = 23; > + p->d[0] = 4; > + p->d[1] = 5; > + p->d[2] = 6; > + p->d[3] = 7; > + p->e = 8; > +} > + > +__attribute__((noipa)) void > +bar (struct S *p) > +{ > + p->b[1] = 9; > + p->c = 112; > + p->d[0] = 10; > + p->d[1] = 11; > +} > + > +struct S s = { { 30, 31 }, 32, { 33, 34, 35, 36 }, 37 }; > + > +int > +main () > +{ > + asm volatile ("" : : : "memory"); > + foo (&s); > + asm volatile ("" : : : "memory"); > + if (s.b[0] != 30 || s.b[1] != 1 || s.c != 23 || s.d[0] != 4 || s.d[1] != 5 > + || s.d[2] != 6 || s.d[3] != 7 || s.e != 8) > + __builtin_abort (); > + bar (&s); > + asm volatile ("" : : : "memory"); > + if (s.b[0] != 30 || s.b[1] != 9 || s.c != 112 || s.d[0] != 10 || s.d[1] != > 11 > + || s.d[2] != 6 || s.d[3] != 7 || s.e != 8) > + __builtin_abort (); > + return 0; > +} 
> + > +/* { dg-final { scan-tree-dump-times "Merging successful" 2 "store-merging" > } } */ > --- gcc/testsuite/gcc.dg/store_merging_12.c.jj 2017-10-27 > 15:00:20.046976487 +0200 > +++ gcc/testsuite/gcc.dg/store_merging_12.c 2017-10-27 14:59:56.000000000 > +0200 > @@ -0,0 +1,11 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -Wall" } */ > + > +struct S { unsigned int b1:1, b2:1, b3:1, b4:1, b5:1, b6:27; }; > +void bar (struct S *); > +void foo (int x) > +{ > + struct S s; > + s.b2 = 1; s.b3 = 0; s.b4 = 1; s.b5 = 0; s.b1 = x; s.b6 = x; /* { > dg-bogus "is used uninitialized in this function" } */ > + bar (&s); > +} > --- gcc/testsuite/g++.dg/pr71694.C.jj 2016-12-16 11:24:32.000000000 +0100 > +++ gcc/testsuite/g++.dg/pr71694.C 2017-10-27 16:53:09.278596219 +0200 > @@ -1,5 +1,5 @@ > /* { dg-do compile } */ > -/* { dg-options "-O2" } */ > +/* { dg-options "-O2 -fno-store-merging" } */ > > struct B { > B() {} > > > Jakub > > -- Richard Biener <rguent...@suse.de> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)