On 30/04/14 03:52, bin.cheng wrote: > Hi, > This patch expands small memset calls into direct memory set instructions by > introducing "setmemsi" pattern. For processors without NEON support, it > expands memset using general store instruction. For example, strd for > 4-bytes aligned addresses. For processors with NEON support, it expands > memset using neon instructions like vstr and miscellaneous vst1.* > instructions for both aligned and unaligned cases. > > This patch depends on > http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html otherwise vst1.64 > will be generated for 32-bit aligned memory unit. > > There is also one leftover work of this patch: Since vst1.* instructions > only support post-increment addressing mode, the inlined memset for > unaligned neon cases should be like: > vmov.i32 q8, #... > vst1.8 {q8}, [r3]! > vst1.8 {q8}, [r3]! > vst1.8 {q8}, [r3]! > vst1.8 {q8}, [r3]
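For anyone following along, the source-level case this expansion targets is a memset with a constant length of at most 64 bytes and a constant fill value; with NEON available and the new string_ops_prefer_neon tuning flag set it should become a vmov/vst1 sequence like the one quoted above, otherwise the plain str/strh/strb paths are used. A minimal example of my own (not taken from the patch or its testsuite):

#include <string.h>

char buf[64];

void
clear_buf (void)
{
  /* Constant length <= 64 and constant fill value: a candidate for the
     inline vmov/vst1.8 expansion quoted above when compiling with -O2
     and NEON enabled (e.g. -mfpu=neon).  */
  memset (buf, 0xA5, sizeof buf);
}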
Other than for zero, I'd expect the vmov to be vmov.i8 to move an arbitrary byte value into all lanes in a vector. After that, if the alignment is known to be more than 8-bit, I'd expect the vst1 instructions (with the exception of the last store if the length is not a multiple of the alignment) to use vst1.<align> {reg}, [addr-reg :<align>]! Hence, for 16-bit aligned data, we want vst1.16 {q8}, [r3:16]! > But for now, gcc can't do this and below code is generated: > vmov.i32 q8, #... > vst1.8 {q8}, [r3] > add r2, r3, #16 > add r3, r2, #16 > vst1.8 {q8}, [r2] > vst1.8 {q8}, [r3] > add r2, r3, #16 > vst1.8 {q8}, [r2] > > I investigated this issue. The root cause lies in rtx cost returned by ARM > backend. Anyway, I think this is another issue and should be fixed in > separated patch. > > Bootstrap and reg-test on cortex-a15, with or without neon support. Is it > OK? > Some more comments inline. > Thanks, > bin > > > 2014-04-29 Bin Cheng <bin.ch...@arm.com> > > PR target/55701 > * config/arm/arm.md (setmem): New pattern. > * config/arm/arm-protos.h (struct tune_params): New field. > (arm_gen_setmem): New prototype. > * config/arm/arm.c (arm_slowmul_tune): Initialize new field. > (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto. > (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto. > (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto. > (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto. > (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto. > (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto. > (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto. > (arm_const_inline_cost): New function. > (arm_block_set_max_insns): New function. > (arm_block_set_straight_profit_p): New function. > (arm_block_set_vect_profit_p): New function. > (arm_block_set_unaligned_vect): New function. > (arm_block_set_aligned_vect): New function. > (arm_block_set_unaligned_straight): New function. > (arm_block_set_aligned_straight): New function. > (arm_block_set_vect, arm_gen_setmem): New functions. > > gcc/testsuite/ChangeLog > 2014-04-29 Bin Cheng <bin.ch...@arm.com> > > PR target/55701 > * gcc.target/arm/memset-inline-1.c: New test. > * gcc.target/arm/memset-inline-2.c: New test. > * gcc.target/arm/memset-inline-3.c: New test. > * gcc.target/arm/memset-inline-4.c: New test. > * gcc.target/arm/memset-inline-5.c: New test. > * gcc.target/arm/memset-inline-6.c: New test. > * gcc.target/arm/memset-inline-7.c: New test. > * gcc.target/arm/memset-inline-8.c: New test. > * gcc.target/arm/memset-inline-9.c: New test. > > > j1328-20140429.txt > > > Index: gcc/config/arm/arm.c > =================================================================== > --- gcc/config/arm/arm.c (revision 209852) > +++ gcc/config/arm/arm.c (working copy) > @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune = > true, /* Prefer constant > pool. */ > arm_default_branch_cost, > false, /* Prefer LDRD/STRD. */ > - {true, true}, /* Prefer non short > circuit. */ > - &arm_default_vec_cost, /* Vectorizer costs. */ > - false, /* Prefer Neon for 64-bits > bitops. */ > - false, false /* Prefer 32-bit encodings. > */ > + {true, true}, /* Prefer non short circuit. */ > + &arm_default_vec_cost, /* Vectorizer costs. */ > + false, /* Prefer Neon for 64-bits bitops. > */ > + false, false, /* Prefer 32-bit encodings. */ > + false /* Prefer Neon for stringops. */ > }; > Please make sure that all the white space before the comments is using TAB, not spaces. Similarly for the other tables. 
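Going back to the unaligned sequence discussed above: to make sure we agree on the intended store plan, here is a small stand-alone model of it (illustrative only, not the patch's code; the function name and output format are mine). It shows the full 16- or 8-byte misaligned stores followed by one final store shifted back so that it ends exactly at LENGTH, overlapping bytes that were already written.

#include <stdio.h>

/* Model of the store plan used by the unaligned vector path: full
   vector stores, then one overlapping tail store ending at LENGTH.  */
static void
unaligned_vect_plan (unsigned int length)
{
  unsigned int nelt = (length >= 16) ? 16 : 8;  /* V16QImode vs V8QImode.  */
  unsigned int i;

  for (i = 0; i + nelt <= length; i += nelt)
    printf ("vst1.8  %u bytes at offset %u\n", nelt, i);

  if (i < length)
    {
      /* 1..nelt-1 bytes left: emit one more vector store that ends at
         LENGTH; its first bytes overlap data already stored, which is
         harmless for memset.  */
      unsigned int tail = (length - i > 8) ? 16 : 8;
      printf ("vst1.8  %u bytes at offset %u (overlapping)\n",
              tail, length - tail);
    }
}

int
main (void)
{
  unaligned_vect_plan (25);  /* 16 bytes at offset 0, then 16 at offset 9.  */
  return 0;
}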
> @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val) > NULL_RTX, NULL_RTX, 0, 0)); > } > > +/* Cost of loading a SImode constant. */ > +static inline int > +arm_const_inline_cost (rtx val) > +{ > + return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val), > + NULL_RTX, NULL_RTX, 0, 0); > +} > + This could be used more widely if you passed the SET in as a parameter (there are cases in arm_new_rtx_cost that could use it, for example). Also, you want to enable sub-targets (only once you can't create new pseudos is that not safe), so the penultimate argument in the call to arm_gen_constant should be 1. > /* Return true if it is worthwhile to split a 64-bit constant into two > 32-bit operations. This is the case if optimizing for size, or > if we have load delay slots, or if one 32-bit part can be done with > @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx *comparison, rtx * op > > } > > +/* Maximum number of instructions to set block of memory. */ > +static int > +arm_block_set_max_insns (void) > +{ > + return (optimize_function_for_size_p (cfun) ? 4 : 8); > +} I think the non-size_p alternative should really be a parameter in the per-cpu costs table. > + > +/* Return TRUE if it's profitable to set block of memory for straight > + case. */ "Straight" is confusing here. Do you mean non-vectorized? If so, then non_vect might be clearer. The arguments should really be documented (see comment below about align, for example). > +static bool > +arm_block_set_straight_profit_p (rtx val, > + unsigned HOST_WIDE_INT length, > + unsigned HOST_WIDE_INT align, > + bool unaligned_p, bool use_strd_p) > +{ > + int num = 0; > + /* For leftovers in bytes of 0-7, we can set the memory block using > + strb/strh/str with minimum instruction number. */ > + int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3}; This should be marked const. > + > + if (unaligned_p) > + { > + num = arm_const_inline_cost (val); > + num += length / align + length % align; Isn't align in bits here, when you really want it in bytes? What if align > 4 bytes? > + } > + else if (use_strd_p) > + { > + num = arm_const_double_inline_cost (val); > + num += (length >> 3) + leftover[length & 7]; > + } > + else > + { > + num = arm_const_inline_cost (val); > + num += (length >> 2) + leftover[length & 3]; > + } > + > + /* We may be able to combine last pair STRH/STRB into a single STR > + by shifting one byte back. */ > + if (unaligned_access && length > 3 && (length & 3) == 3) > + num--; > + > + return (num <= arm_block_set_max_insns ()); > +} > + > +/* Return TRUE if it's profitable to set block of memory for vector case. */ > +static bool > +arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length, > + unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED, > + bool unaligned_p, enum machine_mode mode) I'm not sure what you mean by unaligned here. Again, documenting the arguments might help. > +{ > + int num; > + unsigned int nelt = GET_MODE_NUNITS (mode); > + > + /* Num of instruction loading constant value. */ Use either "Number" or, in this case, simply drop that bit and write: /* Instruction loading constant value. */ > + num = 1; > + /* Num of store instructions. */ Likewise. > + num += (length + nelt - 1) / nelt; > + /* Num of address adjusting instructions. */ Can't we work on the premise that the address adjusting instructions will be merged into the stores? I know you said that they currently do not, but that's not a problem that this bit of code should have to worry about. 
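To keep the two profitability checks straight, here is a rough stand-alone model (mine, not the patch's code; names are made up, the constant-load cost is simplified to a single instruction, and the budget of 8 corresponds to the non-size case of arm_block_set_max_insns) of the instruction counts being compared, assuming, as suggested above, that the address updates are folded into the stores:

#include <stdint.h>
#include <stdio.h>

/* The byte value replicated across a word: the constant the non-vector
   path has to materialise (doubled again into 64 bits for strd).  */
static uint32_t
replicate_byte (uint8_t value)
{
  uint32_t v = value;
  return v | (v << 8) | (v << 16) | (v << 24);  /* 0xA5 -> 0xA5A5A5A5 */
}

/* Rough count for the non-vector path: one insn to load the constant,
   one str (or strd) per word (or double word), plus 0-3 strh/strb for
   the 0-7 leftover bytes, as in the patch's leftover[] table.  */
static int
straight_insn_count (unsigned int length, int use_strd)
{
  static const int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};

  if (use_strd)
    return 1 + (length >> 3) + leftover[length & 7];
  return 1 + (length >> 2) + leftover[length & 3];
}

/* Rough count for the vector path: one vmov plus one vst1 per 8 or 16
   bytes, plus one extra overlapping store for any leftover bytes.  */
static int
vect_insn_count (unsigned int length)
{
  unsigned int nelt = (length >= 16) ? 16 : 8;
  return 1 + length / nelt + (length % nelt != 0);
}

int
main (void)
{
  unsigned int len = 30;
  printf ("value 0x%08x, straight %d, strd %d, vect %d (budget 8)\n",
          (unsigned int) replicate_byte (0xA5),
          straight_insn_count (len, 0), straight_insn_count (len, 1),
          vect_insn_count (len));
  return 0;
}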
> + if (unaligned_p) > + /* For unaligned case, it's one less than the store instructions. */ > + num += (length + nelt - 1) / nelt - 1; > + else if ((length & 3) != 0) > + /* For aligned case, it's one if bytes leftover can only be stored > + by mis-aligned store instruction. */ > + num++; > + > + /* Store the first 16 bytes using vst1:v16qi for the aligned case. */ > + if (!unaligned_p && mode == V16QImode) > + num--; > + > + return (num <= arm_block_set_max_insns ()); > +} > + > +/* Set a block of memory using vectorization instructions for the > + unaligned case. We fill the first LENGTH bytes of the memory > + area starting from DSTBASE with byte constant VALUE. ALIGN is > + the alignment requirement of memory. */ What does the return value mean? > +static bool > +arm_block_set_unaligned_vect (rtx dstbase, > + unsigned HOST_WIDE_INT length, > + unsigned HOST_WIDE_INT value, > + unsigned HOST_WIDE_INT align) > +{ > + unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode; Don't mix initialized declarations with uninitialized ones on the same line. You don't appear to use either I or J until their first use in the loop control below, so why initialize them here? > + rtx dst, mem; > + rtx val_elt, val_vec, reg; > + rtx rval[MAX_VECT_LEN]; > + rtx (*gen_func) (rtx, rtx); > + enum machine_mode mode; > + unsigned HOST_WIDE_INT v = value; > + > + gcc_assert ((align & 0x3) != 0); > + nelt_v8 = GET_MODE_NUNITS (V8QImode); > + nelt_v16 = GET_MODE_NUNITS (V16QImode); > + if (length >= nelt_v16) > + { > + mode = V16QImode; > + gen_func = gen_movmisalignv16qi; > + } > + else > + { > + mode = V8QImode; > + gen_func = gen_movmisalignv8qi; > + } > + nelt_mode = GET_MODE_NUNITS (mode); > + gcc_assert (length >= nelt_mode); > + /* Skip if it isn't profitable. */ > + if (!arm_block_set_vect_profit_p (length, align, true, mode)) > + return false; > + > + dst = copy_addr_to_reg (XEXP (dstbase, 0)); > + mem = adjust_automodify_address (dstbase, mode, dst, 0); > + > + v = sext_hwi (v, BITS_PER_WORD); > + val_elt = GEN_INT (v); > + for (; j < nelt_mode; j++) > + rval[j] = val_elt; Is this the first use of J? If so, initialize it here. > + > + reg = gen_reg_rtx (mode); > + val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval)); > + /* Emit instruction loading the constant value. */ > + emit_move_insn (reg, val_vec); > + > + /* Handle nelt_mode bytes in a vector. */ > + for (; (i + nelt_mode <= length); i += nelt_mode) Similarly for I. > + { > + emit_insn ((*gen_func) (mem, reg)); > + if (i + 2 * nelt_mode <= length) > + emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode))); > + } > + > + if (i + nelt_v8 <= length) > + gcc_assert (mode == V16QImode); Why not drop the if and write: gcc_assert ((i + nelt_v8) > length || mode == V16QImode); > + > + /* Handle (8, 16) bytes leftover. */ > + if (i + nelt_v8 < length) Your assertion above checked <=, but here you use <. Is that correct? > + { > + emit_insn (gen_add2_insn (dst, GEN_INT (length - i))); > + /* We are shifting bytes back, set the alignment accordingly. */ > + if ((length & 1) != 0 && align >= 2) > + set_mem_align (mem, BITS_PER_UNIT); > + > + emit_insn (gen_movmisalignv16qi (mem, reg)); > + } > + /* Handle (0, 8] bytes leftover. 
*/ > + else if (i < length && i + nelt_v8 >= length) > + { > + if (mode == V16QImode) > + { > + reg = gen_lowpart (V8QImode, reg); > + mem = adjust_automodify_address (dstbase, V8QImode, dst, 0); > + } > + emit_insn (gen_add2_insn (dst, GEN_INT ((length - i) > + + (nelt_mode - nelt_v8)))); > + /* We are shifting bytes back, set the alignment accordingly. */ > + if ((length & 1) != 0 && align >= 2) > + set_mem_align (mem, BITS_PER_UNIT); > + > + emit_insn (gen_movmisalignv8qi (mem, reg)); > + } > + > + return true; > +} > + > +/* Set a block of memory using vectorization instructions for the > + aligned case. We fill the first LENGTH bytes of the memory area > + starting from DSTBASE with byte constant VALUE. ALIGN is the > + alignment requirement of memory. */ See all the comments above for the unaligned case. > +static bool > +arm_block_set_aligned_vect (rtx dstbase, > + unsigned HOST_WIDE_INT length, > + unsigned HOST_WIDE_INT value, > + unsigned HOST_WIDE_INT align) > +{ > + unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode; > + rtx dst, addr, mem; > + rtx val_elt, val_vec, reg; > + rtx rval[MAX_VECT_LEN]; > + enum machine_mode mode; > + unsigned HOST_WIDE_INT v = value; > + > + gcc_assert ((align & 0x3) == 0); > + nelt_v8 = GET_MODE_NUNITS (V8QImode); > + nelt_v16 = GET_MODE_NUNITS (V16QImode); > + if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN) > + mode = V16QImode; > + else > + mode = V8QImode; > + > + nelt_mode = GET_MODE_NUNITS (mode); > + gcc_assert (length >= nelt_mode); > + /* Skip if it isn't profitable. */ > + if (!arm_block_set_vect_profit_p (length, align, false, mode)) > + return false; > + > + dst = copy_addr_to_reg (XEXP (dstbase, 0)); > + > + v = sext_hwi (v, BITS_PER_WORD); > + val_elt = GEN_INT (v); > + for (; j < nelt_mode; j++) > + rval[j] = val_elt; > + > + reg = gen_reg_rtx (mode); > + val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval)); > + /* Emit instruction loading the constant value. */ > + emit_move_insn (reg, val_vec); > + > + /* Handle first 16 bytes specially using vst1:v16qi instruction. */ > + if (mode == V16QImode) > + { > + mem = adjust_automodify_address (dstbase, mode, dst, 0); > + emit_insn (gen_movmisalignv16qi (mem, reg)); > + i += nelt_mode; > + /* Handle (8, 16) bytes leftover using vst1:v16qi again. */ > + if (i + nelt_v8 < length && i + nelt_v16 > length) > + { > + emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode))); > + mem = adjust_automodify_address (dstbase, mode, dst, 0); > + /* We are shifting bytes back, set the alignment accordingly. */ > + if ((length & 0x3) == 0) > + set_mem_align (mem, BITS_PER_UNIT * 4); > + else if ((length & 0x1) == 0) > + set_mem_align (mem, BITS_PER_UNIT * 2); > + else > + set_mem_align (mem, BITS_PER_UNIT); > + > + emit_insn (gen_movmisalignv16qi (mem, reg)); > + return true; > + } > + /* Fall through for bytes leftover. */ > + mode = V8QImode; > + nelt_mode = GET_MODE_NUNITS (mode); > + reg = gen_lowpart (V8QImode, reg); > + } > + > + /* Handle 8 bytes in a vector. */ > + for (; (i + nelt_mode <= length); i += nelt_mode) > + { > + addr = plus_constant (Pmode, dst, i); > + mem = adjust_automodify_address (dstbase, mode, addr, i); > + emit_move_insn (mem, reg); > + } > + > + /* Handle single word leftover by shifting 4 bytes back. We can > + use aligned access for this case. 
*/ > + if (i + UNITS_PER_WORD == length) > + { > + addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD); > + mem = adjust_automodify_address (dstbase, mode, > + addr, i - UNITS_PER_WORD); > + /* We are shifting 4 bytes back, set the alignment accordingly. */ > + if (align > UNITS_PER_WORD) > + set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD); > + > + emit_move_insn (mem, reg); > + } > + /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back. > + We have to use unaligned access for this case. */ > + else if (i < length) > + { > + emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode))); > + mem = adjust_automodify_address (dstbase, mode, dst, 0); > + /* We are shifting bytes back, set the alignment accordingly. */ > + if ((length & 1) == 0) > + set_mem_align (mem, BITS_PER_UNIT * 2); > + else > + set_mem_align (mem, BITS_PER_UNIT); > + > + emit_insn (gen_movmisalignv8qi (mem, reg)); > + } > + > + return true; > +} > + > +/* Set a block of memory using plain strh/strb instructions, only > + using instructions allowed by ALIGN on processor. We fill the > + first LENGTH bytes of the memory area starting from DSTBASE > + with byte constant VALUE. ALIGN is the alignment requirement > + of memory. */ > +static bool > +arm_block_set_unaligned_straight (rtx dstbase, > + unsigned HOST_WIDE_INT length, > + unsigned HOST_WIDE_INT value, > + unsigned HOST_WIDE_INT align) > +{ > + unsigned int i; > + rtx dst, addr, mem; > + rtx val_exp, val_reg, reg; > + enum machine_mode mode; > + HOST_WIDE_INT v = value; > + > + gcc_assert (align == 1 || align == 2); > + > + if (align == 2) > + v |= (value << BITS_PER_UNIT); > + > + v = sext_hwi (v, BITS_PER_WORD); > + val_exp = GEN_INT (v); > + /* Skip if it isn't profitable. */ > + if (!arm_block_set_straight_profit_p (val_exp, length, > + align, true, false)) > + return false; > + > + dst = copy_addr_to_reg (XEXP (dstbase, 0)); > + mode = (align == 2 ? HImode : QImode); > + val_reg = force_reg (SImode, val_exp); > + reg = gen_lowpart (mode, val_reg); > + > + for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE > (mode)) > + { > + addr = plus_constant (Pmode, dst, i); > + mem = adjust_automodify_address (dstbase, mode, addr, i); > + emit_move_insn (mem, reg); > + } > + > + /* Handle single byte leftover. */ > + if (i + 1 == length) > + { > + reg = gen_lowpart (QImode, val_reg); > + addr = plus_constant (Pmode, dst, i); > + mem = adjust_automodify_address (dstbase, QImode, addr, i); > + emit_move_insn (mem, reg); > + i++; > + } > + > + gcc_assert (i == length); > + return true; > +} > + > +/* Set a block of memory using plain strd/str/strh/strb instructions, > + to permit unaligned copies on processors which support unaligned > + semantics for those instructions. We fill the first LENGTH bytes > + of the memory area starting from DSTBASE with byte constant VALUE. > + ALIGN is the alignment requirement of memory. 
*/ > +static bool > +arm_block_set_aligned_straight (rtx dstbase, > + unsigned HOST_WIDE_INT length, > + unsigned HOST_WIDE_INT value, > + unsigned HOST_WIDE_INT align) > +{ > + unsigned int i = 0; > + rtx dst, addr, mem; > + rtx val_exp, val_reg, reg; > + unsigned HOST_WIDE_INT v; > + bool use_strd_p; > + > + use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0 > + && TARGET_LDRD && current_tune->prefer_ldrd_strd); > + > + v = (value | (value << 8) | (value << 16) | (value << 24)); > + if (length < UNITS_PER_WORD) > + v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT); > + > + if (use_strd_p) > + v |= (v << BITS_PER_WORD); > + else > + v = sext_hwi (v, BITS_PER_WORD); > + > + val_exp = GEN_INT (v); > + /* Skip if it isn't profitable. */ > + if (!arm_block_set_straight_profit_p (val_exp, length, > + align, false, use_strd_p)) > + { > + /* Try without strd. */ > + v = (v >> BITS_PER_WORD); > + v = sext_hwi (v, BITS_PER_WORD); > + val_exp = GEN_INT (v); > + use_strd_p = false; > + if (!arm_block_set_straight_profit_p (val_exp, length, > + align, false, use_strd_p)) > + return false; > + } > + > + dst = copy_addr_to_reg (XEXP (dstbase, 0)); > + /* Handle double words using strd if possible. */ > + if (use_strd_p) > + { > + val_reg = force_reg (DImode, val_exp); > + reg = val_reg; > + for (; (i + 8 <= length); i += 8) > + { > + addr = plus_constant (Pmode, dst, i); > + mem = adjust_automodify_address (dstbase, DImode, addr, i); > + emit_move_insn (mem, reg); > + } > + } > + else > + val_reg = force_reg (SImode, val_exp); > + > + /* Handle words. */ > + reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg); > + for (; (i + 4 <= length); i += 4) > + { > + addr = plus_constant (Pmode, dst, i); > + mem = adjust_automodify_address (dstbase, SImode, addr, i); > + if ((align & 3) == 0) > + emit_move_insn (mem, reg); > + else > + emit_insn (gen_unaligned_storesi (mem, reg)); > + } > + > + /* Merge last pair of STRH and STRB into a STR if possible. */ > + if (unaligned_access && i > 0 && (i + 3) == length) > + { > + addr = plus_constant (Pmode, dst, i - 1); > + mem = adjust_automodify_address (dstbase, SImode, addr, i - 1); > + /* We are shifting one byte back, set the alignment accordingly. */ > + if ((align & 1) == 0) > + set_mem_align (mem, BITS_PER_UNIT); > + > + /* Most likely this is an unaligned access, and we can't tell at > + compilation time. */ > + emit_insn (gen_unaligned_storesi (mem, reg)); > + return true; > + } > + > + /* Handle half word leftover. */ > + if (i + 2 <= length) > + { > + reg = gen_lowpart (HImode, val_reg); > + addr = plus_constant (Pmode, dst, i); > + mem = adjust_automodify_address (dstbase, HImode, addr, i); > + if ((align & 1) == 0) > + emit_move_insn (mem, reg); > + else > + emit_insn (gen_unaligned_storehi (mem, reg)); > + > + i += 2; > + } > + > + /* Handle single byte leftover. */ > + if (i + 1 == length) > + { > + reg = gen_lowpart (QImode, val_reg); > + addr = plus_constant (Pmode, dst, i); > + mem = adjust_automodify_address (dstbase, QImode, addr, i); > + emit_move_insn (mem, reg); > + } > + > + return true; > +} > + > +/* Set a block of memory using vectorization instructions for both > + aligned and unaligned cases. We fill the first LENGTH bytes of > + the memory area starting from DSTBASE with byte constant VALUE. > + ALIGN is the alignment requirement of memory. 
*/ > +static bool > +arm_block_set_vect (rtx dstbase, > + unsigned HOST_WIDE_INT length, > + unsigned HOST_WIDE_INT value, > + unsigned HOST_WIDE_INT align) > +{ > + /* Check whether we need to use unaligned store instruction. */ > + if (((align & 3) != 0 || (length & 3) != 0) > + /* Check whether unaligned store instruction is available. */ > + && (!unaligned_access || BYTES_BIG_ENDIAN)) > + return false; Huh! vst1.8 can work for unaligned accesses even when hw alignment checking is strict. > + > + if ((align & 3) == 0) > + return arm_block_set_aligned_vect (dstbase, length, value, align); > + else > + return arm_block_set_unaligned_vect (dstbase, length, value, align); > +} > + > +/* Expand string store operation. Firstly we try to do that by using > + vectorization instructions, then try with ARM unaligned access and > + double-word store if profitable. OPERANDS[0] is the destination, > + OPERANDS[1] is the number of bytes, operands[2] is the value to > + initialize the memory, OPERANDS[3] is the known alignment of the > + destination. */ > +bool > +arm_gen_setmem (rtx *operands) > +{ > + rtx dstbase = operands[0]; > + unsigned HOST_WIDE_INT length; > + unsigned HOST_WIDE_INT value; > + unsigned HOST_WIDE_INT align; > + > + if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1])) > + return false; > + > + length = UINTVAL (operands[1]); > + if (length > 64) > + return false; > + > + value = (UINTVAL (operands[2]) & 0xFF); > + align = UINTVAL (operands[3]); > + if (TARGET_NEON && length >= 8 > + && current_tune->string_ops_prefer_neon > + && arm_block_set_vect (dstbase, length, value, align)) > + return true; > + > + if (!unaligned_access && (align & 3) != 0) > + return arm_block_set_unaligned_straight (dstbase, length, value, align); > + > + return arm_block_set_aligned_straight (dstbase, length, value, align); > +} > + > /* Implement the TARGET_ASAN_SHADOW_OFFSET hook. */ > > static unsigned HOST_WIDE_INT > Index: gcc/config/arm/arm-protos.h > =================================================================== > --- gcc/config/arm/arm-protos.h (revision 209852) > +++ gcc/config/arm/arm-protos.h (working copy) > @@ -277,6 +277,8 @@ struct tune_params > /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags > would be set. */ > bool disparage_partial_flag_setting_t16_encodings; > + /* Prefer to inline string operations like memset by using Neon. */ > + bool string_ops_prefer_neon; > }; > > extern const struct tune_params *current_tune; > @@ -289,6 +291,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt > extern bool arm_validize_comparison (rtx *, rtx *, rtx *); > #endif /* RTX_CODE */ > > +extern bool arm_gen_setmem (rtx *); > extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel); > extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx > sel); > > Index: gcc/config/arm/arm.md > =================================================================== > --- gcc/config/arm/arm.md (revision 209852) > +++ gcc/config/arm/arm.md (working copy) > @@ -7555,6 +7555,20 @@ > }) > > > +(define_expand "setmemsi" > + [(match_operand:BLK 0 "general_operand" "") > + (match_operand:SI 1 "const_int_operand" "") > + (match_operand:SI 2 "const_int_operand" "") > + (match_operand:SI 3 "const_int_operand" "")] > + "TARGET_32BIT" > +{ > + if (arm_gen_setmem (operands)) > + DONE; > + > + FAIL; > +}) > + > + > ;; Move a block of memory if it is word aligned and MORE than 2 words long. 
> ;; We could let this apply for blocks of less than this, but it clobbers so > ;; many registers that there is then probably a better way. > Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c > =================================================================== > --- gcc/testsuite/gcc.target/arm/memset-inline-6.c (revision 0) > +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c (revision 0) Have you tested these when the compiler was configured with "--with-cpu=cortex-a9"?
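On the tests more generally, it would help if each of them both executed the memset and verified in the assembler output that no library call was emitted. Something along the lines of the sketch below is what I have in mind; this is purely illustrative and is not the contents of the actual memset-inline-*.c files, so the options, patterns and array sizes here are guesses on my part:

/* { dg-do run } */
/* { dg-options "-O2 -save-temps" } */

#include <string.h>
#include <stdlib.h>

char a[16];

int
main (void)
{
  int i;

  memset (a, 7, 15);
  for (i = 0; i < 15; i++)
    if (a[i] != 7)
      abort ();
  if (a[15] != 0)
    abort ();
  return 0;
}

/* The small constant-length memset above should have been expanded
   inline rather than emitted as a call.  */
/* { dg-final { scan-assembler-not "\tbl\tmemset" } } */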