Re: [PATCH][MIPS] Enable load-load/store-store bonding
Hi Mike,

Thanks for your comments. Please find my comments inlined.

- Thanks and regards,
Sameera D.

On Monday 11 May 2015 10:09 PM, Mike Stump wrote: On May 11, 2015, at 4:05 AM, sameera wrote:

+(define_insn "*join2_loadhi"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+	(any_extend:SI (match_operand:HI 1 "non_volatile_mem_operand" "m")))
+   (set (match_operand:SI 2 "register_operand" "=r")
+	(any_extend:SI (match_operand:HI 3 "non_volatile_mem_operand" "m")))]
+  "ENABLE_LD_ST_PAIRS && reload_completed"
+  {
+    /* The reg-renaming pass reuses the base register if it is dead after
+       the bonded loads.  Hardware does not bond those loads, even when
+       they are consecutive.  However, the order of the loads needs to be
+       checked for correctness.  */
+    if (!reg_overlap_mentioned_p (operands[0], operands[1]))
+      {
+	output_asm_insn ("lh\t%0,%1", operands);
+	output_asm_insn ("lh\t%2,%3", operands);
+      }
+    else
+      {
+	output_asm_insn ("lh\t%2,%3", operands);
+	output_asm_insn ("lh\t%0,%1", operands);
+      }
+
+    return "";
+  }
+  [(set_attr "move_type" "load")
+   (set_attr "insn_count" "2")])

However, unlike other architectures, we do not generate a single instruction for the bonded pair,

Actually, you do. The above is one instruction pattern. It doesn't matter much what it prints as or what the CPU thinks of it.

The pattern is single; however, the asm code will have multiple instructions generated for the pattern.

because of which it is difficult to check whether bonding is happening or not. Hence, an assembly file is generated with debug dumps, and the bonded loads/stores are identified by their pattern names.

Nothing wrong with that approach. Also, in the assembly, one can look for sequences of instructions if they want.

Load/store bonding is not just contiguous load/store instructions; the instructions also need to use the same base register, with offsets that differ by a specific amount. Hence, the way you suggested might not always be useful, so I am comparing the pattern name instead.
See gcc/testsuite/gcc.target/aarch64/fuse_adrp_add_1.c in the test suite, for example:

/* { dg-final { scan-assembler "adrp\tx.*, fixed_regs\n\tadd\tx.*, x.*fixed_regs" } } */

I am trying FUSION for MIPS as suggested by Mike, and testing its perf impact along with other MIPS-specific options.

I think you will discover it is virtually what you have now, and works better. The fusion can just peephole over greater distances; that's the only real difference.

Yes, in many cases I see clear improvement. However, it also tries to bring together loads/stores which were split intentionally by the msched-weight option, introduced for MIPS. I need to measure performance and do perf tuning (if needed) for that option before sending it for review.
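The emission-order rule in the *join2_loadhi template above can be sketched in plain C. This is only an illustration under stated assumptions: registers are modelled as plain numbers, and emit_in_program_order/dest1/base are hypothetical names, not GCC API.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the ordering decision in the *join2_loadhi output template:
   dest1 stands for operands[0] and base for the shared base register of
   the two memory operands (hypothetical stand-ins for REGNO values).
   If the first destination clobbers the base register, the second load
   must be emitted first, while the base is still live.  */
static bool
emit_in_program_order (int dest1, int base)
{
  /* Mirrors !reg_overlap_mentioned_p (operands[0], operands[1]).  */
  return dest1 != base;
}
```

When the first destination is the base register, the template swaps the two lh instructions; otherwise program order is kept.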
Re: [PATCH][MIPS] Enable load-load/store-store bonding
Gentle reminder!

- Thanks and regards,
Sameera D.

On Monday 30 March 2015 04:58 PM, sameera wrote:

Hi!

Sorry for the delay in sending this patch for review. Please find attached the updated patch.

In P5600, two consecutive loads/stores of the same type which access contiguous memory locations are bonded together by the instruction issue unit to dispatch a single load/store instruction which accesses both locations. This allows a 2X improvement in memory-intensive code. This optimization can be performed for LH, SH, LW, SW, LWC, SWC, LDC and SDC instructions.

This patch adds peephole2 patterns to identify such loads/stores and put them in parallel, so that the scheduler will not split them - thereby guaranteeing h/w level load/store bonding. The patch is tested with dejagnu for correctness, and tested on hardware for performance. Ok for trunk?

Changelog:

gcc/
	* config/mips/mips.md (JOIN_MODE): New mode iterator.
	(join2_load_Store): New pattern.
	(join2_loadhi): Likewise.
	(define_peehole2): Add peephole2 patterns to join 2 HI/SI/SF/DF-mode
	load-load and store-stores.
	* config/mips/mips.opt (mload-store-pairs): New option.
	(TARGET_LOAD_STORE_PAIRS): New macro.
	*config/mips/mips.h (ENABLE_LD_ST_PAIRS): Likewise.
	*config/mips/mips-protos.h (mips_load_store_bonding_p): New prototype.
	*config/mips/mips.c(mips_load_store_bonding_p): New function.

- Thanks and regards,
Sameera D.

On Tuesday 24 June 2014 04:12 PM, Sameera Deshpande wrote:

Hi Richard,

Thanks for the review. Please find attached the updated patch after your review comments.

Changelog:

gcc/
	* config/mips/mips.md (JOIN_MODE): New mode iterator.
	(join2_load_Store): New pattern.
	(join2_loadhi): Likewise.
	(define_peehole2): Add peephole2 patterns to join 2 HI/SI/SF/DF-mode
	load-load and store-stores.
	* config/mips/mips.opt (mload-store-pairs): New option.
	(TARGET_LOAD_STORE_PAIRS): New macro.
	*config/mips/mips.h (ENABLE_P5600_LD_ST_PAIRS): Likewise.
	*config/mips/mips-protos.h (mips_load_store_bonding_p): New prototype.
*config/mips/mips.c(mips_load_store_bonding_p): New function.

The change is tested with dejagnu with the additional options -mload-store-pairs and -mtune=p5600. The perf measurement is yet to finish.

We had an offline discussion based on your comment; there is an additional view on the same. Only the ISAs mips32r2, mips32r3 and mips32r5 support P5600; the remaining ISAs do not. For mips32r2 (24K) and mips32r3 (micromips), load-store pairing is implemented separately, and hence, as you suggested, the P5600 LD-ST bonding optimization should not be enabled for them. So, is it fine if I emit an error for any ISA other than mips32r2, mips32r3 and mips32r5 when P5600 is enabled, or should the compilation continue by emitting a warning and disabling P5600?

No, the point is that we have two separate concepts: ISA and optimisation target. -mipsN and -march=N control the ISA (which instructions are available) and -mtune=M controls optimisation decisions within the constraints of that N, such as scheduling and the cost of things like multiplication and division. E.g. you could have -mips2 -mtune=p5600 -mfix-24k: generate MIPS II-compatible code, optimise it for p5600, but make sure that 24k workarounds are used. The code would run correctly on any MIPS II-compatible processor without known errata and also on the 24k.

Ok, disabled the peephole pattern for fix-24k and micromips - to allow specific patterns to be matched.

+
+mld-st-pairing
+Target Report Var(TARGET_ENABLE_LD_ST_PAIRING)
+Enable load/store pairing

Other options are just "TARGET_" + the capitalised form of the option name, so I'd prefer TARGET_LD_ST_PAIRING instead. Although "ld" might be misleading since it's an abbreviation for "load" rather than the LD instruction. Maybe -mload-store-pairs, since plurals are more common than "-ing"? Not sure that's a great suggestion though.

Renamed the option and corresponding macro as suggested. Performance testing for this patch is not yet done.
If the patch proves beneficial in most of the testcases (which we believe it will on P5600), we will enable this optimization by default for P5600 - in which case this option can be removed.

OK. Sending the patch for comments before performance testing is fine, but I think it'd be better to commit the patch only after the testing is done, since otherwise the patch might need to be tweaked. I don't see any problem with keeping the option in case people want to experiment with it. I just think the patch should only go in once it can be enabled by default for p5600. I.e. the option would exist to turn off the pairing. Not having the option is fine too, of course.

Yes, after perf analysis, I will share the results across, and then depending upon the impact, the decision can be made - whether to make the option the default or not - and then the patch will be submitted.
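The bonding condition described in this thread - same base register, offsets differing by exactly the access size - can be sketched in plain C. This is a minimal illustration, not the GCC implementation; bondable_p and its arguments are hypothetical stand-ins for what mips_load_store_bonding_p derives from the RTL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Two memory accesses can be bonded when they use the same base
   register and their offsets differ by exactly the access size
   (e.g. 4 bytes for LW/SW, 2 for LH/SH).  Registers are modelled as
   plain register numbers.  */
static bool
bondable_p (int base1, long off1, int base2, long off2, long size)
{
  return base1 == base2 && labs (off1 - off2) == size;
}
```

For example, lw $t0,0($a0) and lw $t1,4($a0) qualify, while accesses 8 bytes apart or off different base registers do not.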
Re: [PATCH][MIPS] Enable load-load/store-store bonding
On Tuesday 21 April 2015 12:39 AM, Matthew Fortune wrote:

Sameera Deshpande writes: Gentle reminder!

Thanks Sameera. Just a couple of comments inline below and a question for Catherine at the end.

- Thanks and regards,
Sameera D.

On Monday 30 March 2015 04:58 PM, sameera wrote:

Hi!

Sorry for the delay in sending this patch for review. Please find attached the updated patch.

In P5600, two consecutive loads/stores of the same type which access contiguous memory locations are bonded together by the instruction issue unit to dispatch a single load/store instruction which accesses both locations. This allows a 2X improvement in memory-intensive code. This optimization can be performed for LH, SH, LW, SW, LWC, SWC, LDC and SDC instructions.

This patch adds peephole2 patterns to identify such loads/stores and put them in parallel, so that the scheduler will not split them - thereby guaranteeing h/w level load/store bonding. The patch is tested with dejagnu for correctness, and tested on hardware for performance. Ok for trunk?

Changelog:

gcc/
	* config/mips/mips.md (JOIN_MODE): New mode iterator.
	(join2_load_Store): New pattern.
	(join2_loadhi): Likewise.
	(define_peehole2): Add peephole2 patterns to join 2 HI/SI/SF/DF-mode
	load-load and store-stores.
	* config/mips/mips.opt (mload-store-pairs): New option.
	(TARGET_LOAD_STORE_PAIRS): New macro.
	*config/mips/mips.h (ENABLE_LD_ST_PAIRS): Likewise.
	*config/mips/mips-protos.h (mips_load_store_bonding_p): New prototype.
	*config/mips/mips.c(mips_load_store_bonding_p): New function.

I don't know if this has been corrupted by mail clients, but: a single space after '*' and a space before '('.
diff --git a/gcc/config/mips/mips-protos.h b/gcc/config/mips/mips-protos.h
index b48e04f..244eb8d 100644
--- a/gcc/config/mips/mips-protos.h
+++ b/gcc/config/mips/mips-protos.h
@@ -360,6 +360,7 @@ extern bool mips_epilogue_uses (unsigned int);
 extern void mips_final_prescan_insn (rtx_insn *, rtx *, int);
 extern int mips_trampoline_code_size (void);
 extern void mips_function_profiler (FILE *);
+extern bool mips_load_store_bonding_p (rtx *, machine_mode, bool);
 
 typedef rtx (*mulsidi3_gen_fn) (rtx, rtx, rtx);
 #ifdef RTX_CODE

diff --git a/gcc/config/mips/mips.c b/gcc/config/mips/mips.c
index 1733457..85f0591 100644
--- a/gcc/config/mips/mips.c
+++ b/gcc/config/mips/mips.c
@@ -18241,6 +18241,64 @@ umips_load_store_pair_p_1 (bool load_p, bool swap_p,
   return true;
 }
 
+bool
+mips_load_store_bonding_p (rtx *operands, enum machine_mode mode, bool load_p)

Remove enum from machine_mode.

+{
+  rtx reg1, reg2, mem1, mem2, base1, base2;
+  enum reg_class rc1, rc2;
+  HOST_WIDE_INT offset1, offset2;
+
+  if (load_p)
+    {
+      reg1 = operands[0];
+      reg2 = operands[2];
+      mem1 = operands[1];
+      mem2 = operands[3];
+    }
+  else
+    {
+      reg1 = operands[1];
+      reg2 = operands[3];
+      mem1 = operands[0];
+      mem2 = operands[2];
+    }
+
+  if (mips_address_insns (XEXP (mem1, 0), mode, false) == 0
+      || mips_address_insns (XEXP (mem2, 0), mode, false) == 0)
+    return false;
+
+  mips_split_plus (XEXP (mem1, 0), &base1, &offset1);
+  mips_split_plus (XEXP (mem2, 0), &base2, &offset2);
+
+  /* Base regs do not match.  */
+  if (!REG_P (base1) || !rtx_equal_p (base1, base2))
+    return false;
+
+  /* Either of the loads is clobbering base register.  */
+  if (load_p
+      && (REGNO (reg1) == REGNO (base1)
+	  || (REGNO (reg2) == REGNO (base1))))
+    return false;

Can you add a comment saying that this case does not get bonded by any known hardware, even though it could be valid to bond them if it is the second load that clobbers the base?

+  /* Loading in same registers.  */
+  if (load_p
+      && REGNO (reg1) == REGNO (reg2))
+    return false;
+
+  /* The loads/stores are not of same type.  */
+  rc1 = REGNO_REG_CLASS (REGNO (reg1));
+  rc2 = REGNO_REG_CLASS (REGNO (reg2));
+  if (rc1 != rc2
+      && !reg_class_subset_p (rc1, rc2)
+      && !reg_class_subset_p (rc2, rc1))
+    return false;
+
+  if (abs (offset1 - offset2) != GET_MODE_SIZE (mode))
+    return false;
+
+  return true;
+}
+
 /* OPERANDS describes the operands to a pair of SETs, in the order
    dest1, src1, dest2, src2.  Return true if the operands can be used
    in an LWP or SWP instruction; LOAD_P says which.  */

diff --git a/gcc/config/mips/mips.h b/gcc/config/mips/mips.h
index ec69ed5..1bd0dae 100644
--- a/gcc/config/mips/mips.h
+++ b/gcc/config/mips/mips.h
@@ -3147,3 +3147,7 @@ extern GTY(()) struct target_globals *mips16_globals;
 #define STANDARD_STARTFILE_PREFIX_1 "/lib64/"
 #define STANDARD_STARTFILE_PREFIX_2 "/usr/lib64/"
 #endif
+
+#define ENABLE_LD_ST_PAIRS \
+  (TARGET_LOAD_STORE_PAIRS && TUNE_P5600 \
+   && !TARGET_MICROMIPS && !TARGET_FIX_24K)

I've already forgotten why these e
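The load-only rejections in mips_load_store_bonding_p can be sketched in plain C. This is only an illustration: load_pair_rejected_p and its plain register-number arguments are hypothetical stand-ins for the REGNO comparisons in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* A load pair is refused when either destination is the shared base
   register -- the hardware does not bond that case, even where bonding
   the second load would be architecturally valid -- or when both loads
   write the same register.  */
static bool
load_pair_rejected_p (int reg1, int reg2, int base)
{
  return reg1 == base || reg2 == base || reg1 == reg2;
}
```

Stores skip these checks in the patch, since a store never clobbers its own base register.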
Re: [PATCH][MIPS] Enable load-load/store-store bonding
On Monday 11 May 2015 05:43 PM, Matthew Fortune wrote:

Hi Sameera,

Sameera Deshpande writes:

Changelog:

gcc/
	* config/mips/mips.md (JOIN_MODE): New mode iterator.
	(join2_load_Store): New pattern.
	(join2_loadhi): Likewise.
	(define_peehole2): Add peephole2 patterns to join 2 HI/SI/SF/DF-mode
	load-load and store-stores.
	* config/mips/mips.opt (mload-store-pairs): New option.
	(TARGET_LOAD_STORE_PAIRS): New macro.
	* config/mips/mips.h (ENABLE_LD_ST_PAIRS): Likewise.
	* config/mips/mips-protos.h (mips_load_store_bonding_p): New prototype.
	* config/mips/mips.c (mips_load_store_bonding_p): New function.

gcc/testsuite/
	* gcc.target/mips/p5600-bonding.c : New testcase to test bonding.

Just 'New file.' is fine for the changelog.

diff --git a/gcc/testsuite/gcc.target/mips/p5600-bonding.c b/gcc/testsuite/gcc.target/mips/p5600-bonding.c
new file mode 100644
index 000..122b9f8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/mips/p5600-bonding.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-dp -mtune=p5600 -mno-micromips -mno-mips16" } */
+/* { dg-skip-if "Bonding needs peephole optimization." { *-*-* } { "-O0" "-O1" } { "" } } */
+typedef int VINT32 __attribute__ ((vector_size((16))));
+
+void memory_operation_fun2_si(void * __restrict src, void * __restrict dest, int num)

Code style applies for testcases too: return type on the line above, space after the function name, line length.

+{
+VINT32 *vsrc = (VINT32 *)src;

Indentation.

+VINT32 *vdest = (VINT32 *)dest;
+int i;
+
+for (i = 0; i < num - 1; i+=2)
+{

Indentation.

+  vdest[i] = (vdest[i] + vsrc[i]);

Unnecessary brackets.

+  vdest[i + 1] = vdest[i + 1] + vsrc[i + 1];
+}
+}
+/* { dg-final { scan-assembler "join2_" } } */

OK with those changes.

Thanks,
Matthew

Hi Matthew,

Thanks for the comments. Please find attached the updated patch. I do not have permissions to apply the patch in GCC. Can you please submit the patch for me?

Changelog:

gcc/
	* config/mips/mips.md (JOIN_MODE): New mode iterator.
	(join2_load_Store): New pattern.
	(join2_loadhi): Likewise.
	(define_peehole2): Add peephole2 patterns to join 2 HI/SI/SF/DF-mode
	load-load and store-stores.
	* config/mips/mips.opt (mload-store-pairs): New option.
	(TARGET_LOAD_STORE_PAIRS): New macro.
	* config/mips/mips.h (ENABLE_LD_ST_PAIRS): Likewise.
	* config/mips/mips-protos.h (mips_load_store_bonding_p): New prototype.
	* config/mips/mips.c (mips_load_store_bonding_p): New function.

gcc/testsuite/
	* gcc.target/mips/p5600-bonding.c : New file.

diff --git a/gcc/config/mips/mips-protos.h b/gcc/config/mips/mips-protos.h
index b48e04f..244eb8d 100644
--- a/gcc/config/mips/mips-protos.h
+++ b/gcc/config/mips/mips-protos.h
@@ -360,6 +360,7 @@ extern bool mips_epilogue_uses (unsigned int);
 extern void mips_final_prescan_insn (rtx_insn *, rtx *, int);
 extern int mips_trampoline_code_size (void);
 extern void mips_function_profiler (FILE *);
+extern bool mips_load_store_bonding_p (rtx *, machine_mode, bool);
 
 typedef rtx (*mulsidi3_gen_fn) (rtx, rtx, rtx);
 #ifdef RTX_CODE

diff --git a/gcc/config/mips/mips.c b/gcc/config/mips/mips.c
index bf69850..4fc15c4 100644
--- a/gcc/config/mips/mips.c
+++ b/gcc/config/mips/mips.c
@@ -18241,6 +18241,66 @@ umips_load_store_pair_p_1 (bool load_p, bool swap_p,
   return true;
 }
 
+bool
+mips_load_store_bonding_p (rtx *operands, machine_mode mode, bool load_p)
+{
+  rtx reg1, reg2, mem1, mem2, base1, base2;
+  enum reg_class rc1, rc2;
+  HOST_WIDE_INT offset1, offset2;
+
+  if (load_p)
+    {
+      reg1 = operands[0];
+      reg2 = operands[2];
+      mem1 = operands[1];
+      mem2 = operands[3];
+    }
+  else
+    {
+      reg1 = operands[1];
+      reg2 = operands[3];
+      mem1 = operands[0];
+      mem2 = operands[2];
+    }
+
+  if (mips_address_insns (XEXP (mem1, 0), mode, false) == 0
+      || mips_address_insns (XEXP (mem2, 0), mode, false) == 0)
+    return false;
+
+  mips_split_plus (XEXP (mem1, 0), &base1, &offset1);
+  mips_split_plus (XEXP (mem2, 0), &base2, &offset2);
+
+  /* Base regs do not match.  */
+  if (!REG_P (base1) || !rtx_equal_p (base1, base2))
+    return false;
+
+  /* Either of the loads is clobbering base register.  It is legitimate to bond
+     loads if second load clobbers base register.  However, hardware does not
+     support such bonding.  */
+  if (load_p
+      && (REGNO (reg1) == REGNO (base1)
+	  || (REGNO (reg2) == REGNO (base1))))
+    return false;
+
+  /* Loading in same registers.  */
+  if (load_p
+      && REGNO (reg1) == REGNO (reg2))
+    return false;
+
+  /* The loads/stores are not of same type.  */
+  rc1 = REGNO_REG_CLASS (REGNO (reg1));
+  rc2 = REGNO_REG_CLASS
Re: [PATCH][MIPS] Enable load-load/store-store bonding
Hi!

Sorry for the delay in sending this patch for review. Please find attached the updated patch.

In P5600, two consecutive loads/stores of the same type which access contiguous memory locations are bonded together by the instruction issue unit to dispatch a single load/store instruction which accesses both locations. This allows a 2X improvement in memory-intensive code. This optimization can be performed for LH, SH, LW, SW, LWC, SWC, LDC and SDC instructions.

This patch adds peephole2 patterns to identify such loads/stores and put them in parallel, so that the scheduler will not split them - thereby guaranteeing h/w level load/store bonding. The patch is tested with dejagnu for correctness, and tested on hardware for performance. Ok for trunk?

Changelog:

gcc/
	* config/mips/mips.md (JOIN_MODE): New mode iterator.
	(join2_load_Store): New pattern.
	(join2_loadhi): Likewise.
	(define_peehole2): Add peephole2 patterns to join 2 HI/SI/SF/DF-mode
	load-load and store-stores.
	* config/mips/mips.opt (mload-store-pairs): New option.
	(TARGET_LOAD_STORE_PAIRS): New macro.
	*config/mips/mips.h (ENABLE_LD_ST_PAIRS): Likewise.
	*config/mips/mips-protos.h (mips_load_store_bonding_p): New prototype.
	*config/mips/mips.c(mips_load_store_bonding_p): New function.

- Thanks and regards,
Sameera D.

On Tuesday 24 June 2014 04:12 PM, Sameera Deshpande wrote:

Hi Richard,

Thanks for the review. Please find attached the updated patch after your review comments.

Changelog:

gcc/
	* config/mips/mips.md (JOIN_MODE): New mode iterator.
	(join2_load_Store): New pattern.
	(join2_loadhi): Likewise.
	(define_peehole2): Add peephole2 patterns to join 2 HI/SI/SF/DF-mode
	load-load and store-stores.
	* config/mips/mips.opt (mload-store-pairs): New option.
	(TARGET_LOAD_STORE_PAIRS): New macro.
	*config/mips/mips.h (ENABLE_P5600_LD_ST_PAIRS): Likewise.
	*config/mips/mips-protos.h (mips_load_store_bonding_p): New prototype.
	*config/mips/mips.c(mips_load_store_bonding_p): New function.
The change is tested with dejagnu with the additional options -mload-store-pairs and -mtune=p5600. The perf measurement is yet to finish.

We had an offline discussion based on your comment; there is an additional view on the same. Only the ISAs mips32r2, mips32r3 and mips32r5 support P5600; the remaining ISAs do not. For mips32r2 (24K) and mips32r3 (micromips), load-store pairing is implemented separately, and hence, as you suggested, the P5600 LD-ST bonding optimization should not be enabled for them. So, is it fine if I emit an error for any ISA other than mips32r2, mips32r3 and mips32r5 when P5600 is enabled, or should the compilation continue by emitting a warning and disabling P5600?

No, the point is that we have two separate concepts: ISA and optimisation target. -mipsN and -march=N control the ISA (which instructions are available) and -mtune=M controls optimisation decisions within the constraints of that N, such as scheduling and the cost of things like multiplication and division. E.g. you could have -mips2 -mtune=p5600 -mfix-24k: generate MIPS II-compatible code, optimise it for p5600, but make sure that 24k workarounds are used. The code would run correctly on any MIPS II-compatible processor without known errata and also on the 24k.

Ok, disabled the peephole pattern for fix-24k and micromips - to allow specific patterns to be matched.

+
+mld-st-pairing
+Target Report Var(TARGET_ENABLE_LD_ST_PAIRING)
+Enable load/store pairing

Other options are just "TARGET_" + the capitalised form of the option name, so I'd prefer TARGET_LD_ST_PAIRING instead. Although "ld" might be misleading since it's an abbreviation for "load" rather than the LD instruction. Maybe -mload-store-pairs, since plurals are more common than "-ing"? Not sure that's a great suggestion though.

Renamed the option and corresponding macro as suggested. Performance testing for this patch is not yet done.
If the patch proves beneficial in most of the testcases (which we believe it will on P5600), we will enable this optimization by default for P5600 - in which case this option can be removed.

OK. Sending the patch for comments before performance testing is fine, but I think it'd be better to commit the patch only after the testing is done, since otherwise the patch might need to be tweaked. I don't see any problem with keeping the option in case people want to experiment with it. I just think the patch should only go in once it can be enabled by default for p5600. I.e. the option would exist to turn off the pairing. Not having the option is fine too, of course.

Yes, after perf analysis, I will share the results across, and then depending upon the impact, the decision can be made - whether to make the option the default or not - and then the patch will be submitted. We should al
Re: [Aarch64] Fix conditional branches with target far away.
On Mon 9 Apr, 2018, 2:06 PM Sameera Deshpande wrote:
> Hi Richard,
>
> I do not see the said patch applied in ToT yet. When do you expect it
> to be available in ToT?
>
> - Thanks and regards,
> Sameera D.
>
> On 30 March 2018 at 17:01, Sameera Deshpande wrote:
> > Hi Richard,
> >
> > The testcase is working with the patch you suggested, thanks for
> > pointing that out.
> >
> > On 30 March 2018 at 16:54, Sameera Deshpande wrote:
> >> On 30 March 2018 at 16:39, Richard Sandiford wrote:
> >>>> Hi Sudakshina,
> >>>>
> >>>> Thanks for pointing that out. Updated the conditions for attribute
> >>>> length to take care of the boundary conditions for the offset range.
> >>>>
> >>>> Please find attached the updated patch.
> >>>>
> >>>> I have tested it with the gcc testsuite and the failing testcase.
> >>>> Ok for trunk?
> >>>>
> >>>> On 22 March 2018 at 19:06, Sudakshina Das wrote:
> >>>>> Hi Sameera
> >>>>>
> >>>>> On 22/03/18 02:07, Sameera Deshpande wrote:
> >>>>>>
> >>>>>> Hi Sudakshina,
> >>>>>>
> >>>>>> As per the ARMv8 ARM, for the offset range (-1048576, 1048572), the
> >>>>>> far branch instruction offset is inclusive of both the offsets.
> >>>>>> Hence, I am using '<= || >=' and not '< || >=' as it was in the
> >>>>>> previous implementation.
> >>>>>
> >>>>> I have to admit earlier I was only looking at the patch mechanically
> >>>>> and found a difference with the previous implementation in the offset
> >>>>> comparison. After you pointed out, I looked up the ARMv8 ARM and I
> >>>>> have a couple of doubts:
> >>>>>
> >>>>> 1. My understanding is that any offset in [-1048576, 1048572], both
> >>>>> inclusive, qualifies as an 'in range' offset. However, the code for
> >>>>> both attribute length and far_branch has been using
> >>>>> [-1048576, 1048572), that is, (>= && <). If far_branch was
> >>>>> incorrectly calculated, then maybe the length calculations with
> >>>>> similar magic numbers should also be corrected? Of course, I am not
> >>>>> an expert in this and maybe this was a conscious decision, so I
> >>>>> would ask Ramana to maybe clarify if he remembers.
> >>>>>
> >>>>> 2. Now to come back to your patch: if my understanding is correct,
> >>>>> I think a far_branch would be anything outside of this range, that
> >>>>> is, (offset < -1048576 || offset > 1048572) - anything that can not
> >>>>> be represented in the 21-bit range.
> >>>>>
> >>>>> Thanks
> >>>>> Sudi
> >>>
> >>> [...]
> >>>
> >>>> @@ -466,14 +459,9 @@
> >>>>    [(set_attr "type" "branch")
> >>>>     (set (attr "length")
> >>>> 	(if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int -1048576))
> >>>> -			   (lt (minus (match_dup 2) (pc)) (const_int 1048572)))
> >>>> +			   (le (minus (match_dup 2) (pc)) (const_int 1048572)))
> >>>> 		      (const_int 4)
> >>>> -		      (const_int 8)))
> >>>
> >>> Sorry for not replying earlier, but I think the use of "lt" rather than
> >>> "le" in the current length attribute is deliberate. Distances measured
> >>> from (pc) in "length" are a bit special, in that backward distances are
> >>> measured from the start of the instruction and forward distances are
> >>> measured from the end of the instruction:
> >>>
> >>>   /* The address of the current insn.  We implement this actually as the
> >>>      address of the current insn for backward branches, but the last
> >>>      address of the next insn for forward branches, and both with
> >>>      adjustments that account for the worst-case possible stretching of
> >>>      intervening alignments between this insn and its destination.  */
> >>>
> >>> This avoids the c
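The inclusive range being debated above can be sketched in plain C. This is an illustration under stated assumptions: an AArch64 B.cond offset is a signed 19-bit immediate scaled by 4, so the representable byte offsets are -1048576 .. 1048572 inclusive; cond_branch_in_range_p is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* In-range test for a conditional branch: signed 19-bit immediate
   times 4 gives [-2^20, 2^20 - 4] = [-1048576, 1048572], both ends
   inclusive.  A "far branch" is anything outside this range.  */
static bool
cond_branch_in_range_p (long offset)
{
  return offset >= -1048576 && offset <= 1048572;
}
```

Using `<` instead of `<=` here would misclassify the single offset 1048572 as a far branch, which is the off-by-one the thread discusses.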
[AARCH64] Neon vld1_*_x3, vst1_*_x2 and vst1_*_x3 intrinsics
Hi!

Please find attached the patch implementing the vld1_*_x3, vst1_*_x2 and vst1_*_x3 intrinsics as defined by the Neon intrinsics document. Ok for trunk?

- Thanks and regards,
Sameera D.

gcc/ChangeLog:

2017-11-14  Sameera Deshpande

	* config/aarch64/aarch64-simd-builtins.def (ld1x3): New.
	(st1x2): Likewise.
	(st1x3): Likewise.
	* config/aarch64/aarch64-simd.md (aarch64_ld1x3<VALLDIF:mode>): New pattern.
	(aarch64_ld1_x3_<VALLDIF:mode>): Likewise.
	(aarch64_st1x2<VALLDIF:mode>): Likewise.
	(aarch64_st1_x2_<VALLDIF:mode>): Likewise.
	(aarch64_st1x3<VALLDIF:mode>): Likewise.
	(aarch64_st1_x3_<VALLDIF:mode>): Likewise.
	* config/aarch64/arm_neon.h (vld1_u8_x3): New function.
	(vld1_s8_x3): Likewise.
	(vld1_u16_x3): Likewise.
	(vld1_s16_x3): Likewise.
	(vld1_u32_x3): Likewise.
	(vld1_s32_x3): Likewise.
	(vld1_u64_x3): Likewise.
	(vld1_s64_x3): Likewise.
	(vld1_f16_x3): Likewise.
	(vld1_f32_x3): Likewise.
	(vld1_f64_x3): Likewise.
	(vld1_p8_x3): Likewise.
	(vld1_p16_x3): Likewise.
	(vld1_p64_x3): Likewise.
	(vld1q_u8_x3): Likewise.
	(vld1q_s8_x3): Likewise.
	(vld1q_u16_x3): Likewise.
	(vld1q_s16_x3): Likewise.
	(vld1q_u32_x3): Likewise.
	(vld1q_s32_x3): Likewise.
	(vld1q_u64_x3): Likewise.
	(vld1q_s64_x3): Likewise.
	(vld1q_f16_x3): Likewise.
	(vld1q_f32_x3): Likewise.
	(vld1q_f64_x3): Likewise.
	(vld1q_p8_x3): Likewise.
	(vld1q_p16_x3): Likewise.
	(vld1q_p64_x3): Likewise.
	(vst1_s64_x2): Likewise.
	(vst1_u64_x2): Likewise.
	(vst1_f64_x2): Likewise.
	(vst1_s8_x2): Likewise.
	(vst1_p8_x2): Likewise.
	(vst1_s16_x2): Likewise.
	(vst1_p16_x2): Likewise.
	(vst1_s32_x2): Likewise.
	(vst1_u8_x2): Likewise.
	(vst1_u16_x2): Likewise.
	(vst1_u32_x2): Likewise.
	(vst1_f16_x2): Likewise.
	(vst1_f32_x2): Likewise.
	(vst1_p64_x2): Likewise.
	(vst1q_s8_x2): Likewise.
	(vst1q_p8_x2): Likewise.
	(vst1q_s16_x2): Likewise.
	(vst1q_p16_x2): Likewise.
	(vst1q_s32_x2): Likewise.
	(vst1q_s64_x2): Likewise.
	(vst1q_u8_x2): Likewise.
	(vst1q_u16_x2): Likewise.
	(vst1q_u32_x2): Likewise.
	(vst1q_u64_x2): Likewise.
	(vst1q_f16_x2): Likewise.
	(vst1q_f32_x2): Likewise.
	(vst1q_f64_x2): Likewise.
	(vst1q_p64_x2): Likewise.
	(vst1_s64_x3): Likewise.
	(vst1_u64_x3): Likewise.
	(vst1_f64_x3): Likewise.
	(vst1_s8_x3): Likewise.
	(vst1_p8_x3): Likewise.
	(vst1_s16_x3): Likewise.
	(vst1_p16_x3): Likewise.
	(vst1_s32_x3): Likewise.
	(vst1_u8_x3): Likewise.
	(vst1_u16_x3): Likewise.
	(vst1_u32_x3): Likewise.
	(vst1_f16_x3): Likewise.
	(vst1_f32_x3): Likewise.
	(vst1_p64_x3): Likewise.
	(vst1q_s8_x3): Likewise.
	(vst1q_p8_x3): Likewise.
	(vst1q_s16_x3): Likewise.
	(vst1q_p16_x3): Likewise.
	(vst1q_s32_x3): Likewise.
	(vst1q_s64_x3): Likewise.
	(vst1q_u8_x3): Likewise.
	(vst1q_u16_x3): Likewise.
	(vst1q_u32_x3): Likewise.
	(vst1q_u64_x3): Likewise.
	(vst1q_f16_x3): Likewise.
	(vst1q_f32_x3): Likewise.
	(vst1q_f64_x3): Likewise.
	(vst1q_p64_x3): Likewise.

diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
index 52d01342372..fa623e90017 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -441,6 +441,15 @@
   BUILTIN_VALL_F16 (STORE1, st1, 0)
   VAR1 (STORE1P, st1, 0, v2di)
 
+  /* Implemented by aarch64_ld1x3<VALLDIF:mode>.  */
+  BUILTIN_VALLDIF (LOADSTRUCT, ld1x3, 0)
+
+  /* Implemented by aarch64_st1x2<VALLDIF:mode>.  */
+  BUILTIN_VALLDIF (STORESTRUCT, st1x2, 0)
+
+  /* Implemented by aarch64_st1x3<VALLDIF:mode>.  */
+  BUILTIN_VALLDIF (STORESTRUCT, st1x3, 0)
+
   /* Implemented by fma4.  */
   BUILTIN_VHSDF (TERNOP, fma, 4)
   VAR1 (TERNOP, fma, 4, hf)

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 4fd34c18f95..852bcf0c16a 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -5038,6 +5038,70 @@
 }
 })
 
+(define_expand "aarch64_ld1x3<VALLDIF:mode>"
+  [(match_operand:CI 0 "register_operand" "=w")
+   (match_operand:DI 1 "register_operand" "r")
+   (unspec:VALLDIF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+  "TARGET_SIMD"
+{
+  rtx mem = gen_rtx_MEM (CImode, operands[1]);
+  emit_insn (gen_aarch64_ld1_x3_<VALLDIF:mode> (operands[0], mem));
+  DONE;
+})
+
+(define_insn "aarch64_ld1_x3_<VALLDIF:mode>"
+  [(set (match_operand:CI 0 "register_operand" "=w")
+	(unspec:CI
+	  [(match_operand:CI 1 "aarch64_simd_struct_operand" "Utv")
+	   (unspec:VALLDIF [(const_int 3)] UNSPEC_VS
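The semantics the new ld1x3 builtin provides can be modelled in portable C. This is a sketch, not the arm_neon.h implementation: u8x8x3_model and model_vld1_u8_x3 are hypothetical names standing in for uint8x8x3_t and vld1_u8_x3, which load three 8-byte D registers from 24 consecutive bytes.

```c
#include <assert.h>
#include <string.h>

/* Three 8-lane byte vectors, mirroring the layout of uint8x8x3_t.  */
typedef struct { unsigned char val[3][8]; } u8x8x3_model;

/* Load val[0] from p[0..7], val[1] from p[8..15], val[2] from p[16..23],
   i.e. a single contiguous 24-byte load split across three vectors.  */
static u8x8x3_model
model_vld1_u8_x3 (const unsigned char *p)
{
  u8x8x3_model r;
  memcpy (r.val, p, 24);
  return r;
}
```

On real hardware the whole structure is filled by one LD1 {Vt.8B - Vt3.8B} instruction rather than three separate loads.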
[AARCH64] Add support of ARMv8.4 in saphira for Qualcomm server part
Hi!

Please find attached the patch to add support of ARMv8.4 in saphira for the Qualcomm server part. Tested on aarch64, without any regressions. Ok for trunk?

--
- Thanks and regards,
Sameera D.

diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def
index 33b96ca2861..e64d8314fa9 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -86,10 +86,10 @@ AARCH64_CORE("thunderx2t99", thunderx2t99, thunderx2t99, 8_1A, AARCH64_FL_FOR
 AARCH64_CORE("cortex-a55", cortexa55, cortexa53, 8_2A, AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, cortexa53, 0x41, 0xd05, -1)
 AARCH64_CORE("cortex-a75", cortexa75, cortexa57, 8_2A, AARCH64_FL_FOR_ARCH8_2 | AARCH64_FL_F16 | AARCH64_FL_RCPC | AARCH64_FL_DOTPROD, cortexa73, 0x41, 0xd0a, -1)
 
-/* ARMv8.3-A Architecture Processors.  */
+/* ARMv8.4-A Architecture Processors.  */
 
 /* Qualcomm ('Q') cores.  */
-AARCH64_CORE("saphira", saphira, falkor, 8_3A, AARCH64_FL_FOR_ARCH8_3 | AARCH64_FL_CRYPTO | AARCH64_FL_RCPC, saphira, 0x51, 0xC01, -1)
+AARCH64_CORE("saphira", saphira, falkor, 8_4A, AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_CRYPTO | AARCH64_FL_RCPC, saphira, 0x51, 0xC01, -1)
 
 /* ARMv8-A big.LITTLE implementations.  */
Re: [AARCH64] Add support of ARMv8.4 in saphira for Qualcomm server part
On Tue 29 May, 2018, 9:19 PM Siddhesh Poyarekar, <siddhesh.poyare...@linaro.org> wrote:
> On 29 May 2018 at 21:17, James Greenhalgh wrote:
> > On Tue, May 29, 2018 at 05:01:42AM -0500, Sameera Deshpande wrote:
> >> Hi!
> >>
> >> Please find attached the patch to add support of ARMv8.4 in saphira
> >> for Qualcomm server part. Tested on aarch64, without any regressions.
> >>
> >> Ok for trunk?
> >
> > I'm trusting that this is the right thing to do for this core. As Siddhesh
> > contributed the original patch, I'd like him to also sign off on this
> > modification.
> >
> > OK for trunk with Siddhesh's ack.
>
> LGTM too.
>
> Thanks,
> Siddhesh

Thanks James and Siddhesh.

- Sameera
[AARCH64]Bug in fix for branch offsets over 1 MiB?
Hi!

I am seeing multiple assembler errors with the error message "Error: conditional branch out of range" for customer code. The root cause of the bug is that conditional branches are generated whose branch target ends up being too far away to be encoded in the instruction. It appears that there was an attempt to fix this issue in the change below:

commit 050af05b9761f1979f11c151519e7244d5becd7c
Author: thopre01
Date: Thu Aug 27 10:08:54 2015 +

2015-08-27  Ramana Radhakrishnan  <ramana.radhakrish...@arm.com>
	    Andre Vieira  <andre.simoesdiasvie...@arm.com>

gcc/
	* config/aarch64/aarch64.md (*condjump): Handle functions > 1 MiB.
	(*cb1): Likewise.
	(*tb1): Likewise.
	(*cb1): Likewise.
	* config/aarch64/iterators.md (inv_cb): New code attribute.
	(inv_tb): Likewise.
	* config/aarch64/aarch64.c (aarch64_gen_far_branch): New.
	* config/aarch64/aarch64-protos.h (aarch64_gen_far_branch): New.

gcc/testsuite/
	* gcc.target/aarch64/long_branch_1.c: New test.

However, as per the GCC internals documentation, only the special attribute "length" should use PC and match_dup while defining an attribute. I verified this by looking at the code in the final pass, and realised that get_attribute_length does not map directly to the functions generated from the definition of attribute length in RTL patterns; instead the lengths are computed in shorten_branches, with insn_current_length as an intermediate function. The far_branch attribute, defined similarly to attribute length, expects the same values to be returned by (minus (match_dup 2) (pc)), which is incorrect.

I am looking at the TARGET_MACHINE_DEPENDENT_REORG macro instead, like a few other architectures, to emit far branches. Is that approach acceptable?

PS: I am waiting for the customer's approval for attaching the test case.

--
- Thanks and regards,
Sameera D.
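The distance convention that makes far_branch differ from the "length" attribute can be sketched in plain C. This is only an illustration under stated assumptions: branch_distance and its arguments are hypothetical, and the worst-case alignment-stretch adjustments that shorten_branches also applies are omitted.

```c
#include <assert.h>

/* Convention used for (pc) in the "length" attribute: a backward
   distance is measured from the start of the branch insn, a forward
   distance from the end of it.  A plain (label - pc) expression, as the
   far_branch attribute assumed, does not reproduce these values.  */
static long
branch_distance (long insn_addr, long insn_len, long label_addr)
{
  if (label_addr >= insn_addr)
    return label_addr - (insn_addr + insn_len);  /* forward: from insn end */
  return label_addr - insn_addr;                 /* backward: from insn start */
}
```

The asymmetry means the same textual expression yields a distance that is insn_len bytes shorter for forward branches, which is one reason a second attribute cannot simply reuse the length machinery.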
Re: [AARCH64]Bug in fix for branch offsets over 1 MiB?
On 30-Jan-2018 2:37 AM, "Richard Sandiford" wrote: Sameera Deshpande writes: > Hi! > > I am seeing multiple assembler errors with error message "Error: > conditional branch out of range" for customer code. > > The root cause of the bug is that conditional branches are generated > whose branch target ends up being too far away to be encoded in the > instruction. It appears that there was an attempt to fix this issue > in the below change: > > commit 050af05b9761f1979f11c151519e7244d5becd7c > Author: thopre01 > Date: Thu Aug 27 10:08:54 2015 + > > 2015-08-27 Ramana Radhakrishnan > <[ramana.radhakrish...@arm.com|mailto:ramana.radhakrish...@arm.com]> > Andre Vieira > <[andre.simoesdiasvie...@arm.com|mailto:andre.simoesdiasvie...@arm.com]> > > gcc/ > * config/aarch64/[aarch64.md|http://aarch64.md/] (*condjump): > Handle functions > 1 MiB. > (*cb1): Likewise. > (*tb1): Likewise. > (*cb1): Likewise. > * config/aarch64/[iterators.md|http://iterators.md/] (inv_cb): > New code attribute. > (inv_tb): Likewise. > * config/aarch64/aarch64.c (aarch64_gen_far_branch): New. > * config/aarch64/aarch64-protos.h (aarch64_gen_far_branch): New. > > gcc/testsuite/ > * gcc.target/aarch64/long_branch_1.c: New test. > > However, as per GCC Internal documentation, only special attribute > "length" should use PC and match_dup while defining an attribute. I > verified by looking at code in final pass, and realised that > get_attribute_length does not map directly to the functions generated > from the definition of attribute length in RTL patterns, but computes > the lengths in shorten_branches and uses insn_current_length as > intermediate function. > > The far_branch attribute defined similar to attribute length expects > same values to be returned by (minus (match_dup 2) (pc)) which is > incorrect. > > I am looking at TARGET_MACHINE_DEPENDENT_REORG macro instead like few > other architectures, to emit far branches. > > Is that approach acceptable? I don't think we need to go that far. 
The INSN_ADDRESSES should be correct when outputting the instructions, so does it work if we use those instead of get_attr_far_branch? Thanks, Richard > PS: I am waiting for customer's approval for attaching the test case. Hi Richard, Thanks for your reply. I will try using INSN_ADDRESSES and will get back to you. - Thanks and regards, Sameera D.
Re: [AARCH64]Bug in fix for branch offsets over 1 MiB?
On 30 January 2018 at 09:28, Sameera Deshpande wrote: > On 30-Jan-2018 2:37 AM, "Richard Sandiford" > wrote: > > Sameera Deshpande writes: >> Hi! >> >> I am seeing multiple assembler errors with error message "Error: >> conditional branch out of range" for customer code. >> >> The root cause of the bug is that conditional branches are generated >> whose branch target ends up being too far away to be encoded in the >> instruction. It appears that there was an attempt to fix this issue >> in the below change: >> >> commit 050af05b9761f1979f11c151519e7244d5becd7c >> Author: thopre01 >> Date: Thu Aug 27 10:08:54 2015 + >> >> 2015-08-27 Ramana Radhakrishnan >> <[ramana.radhakrish...@arm.com|mailto:ramana.radhakrish...@arm.com]> >> Andre Vieira >> <[andre.simoesdiasvie...@arm.com|mailto:andre.simoesdiasvie...@arm.com]> >> >> gcc/ >> * config/aarch64/[aarch64.md|http://aarch64.md/] (*condjump): >> Handle functions > 1 MiB. >> (*cb1): Likewise. >> (*tb1): Likewise. >> (*cb1): Likewise. >> * config/aarch64/[iterators.md|http://iterators.md/] (inv_cb): >> New code attribute. >> (inv_tb): Likewise. >> * config/aarch64/aarch64.c (aarch64_gen_far_branch): New. >> * config/aarch64/aarch64-protos.h (aarch64_gen_far_branch): New. >> >> gcc/testsuite/ >> * gcc.target/aarch64/long_branch_1.c: New test. >> >> However, as per GCC Internal documentation, only special attribute >> "length" should use PC and match_dup while defining an attribute. I >> verified by looking at code in final pass, and realised that >> get_attribute_length does not map directly to the functions generated >> from the definition of attribute length in RTL patterns, but computes >> the lengths in shorten_branches and uses insn_current_length as >> intermediate function. >> >> The far_branch attribute defined similar to attribute length expects >> same values to be returned by (minus (match_dup 2) (pc)) which is >> incorrect. 
>> >> I am looking at TARGET_MACHINE_DEPENDENT_REORG macro instead like few >> other architectures, to emit far branches. >> >> Is that approach acceptable? > > I don't think we need to go that far. The INSN_ADDRESSES should be > correct when outputting the instructions, so does it work if we use > those instead of get_attr_far_branch? > > Thanks, > Richard > >> PS: I am waiting for customer's approval for attaching the test case. > > > Hi Richard, > > Thanks for your reply. I will try using INSN_ADDRESSES and will get back to > you. > > - Thanks and regards, > Sameera D. > Hi Richard, I verified that it works. Thanks a lot! Will do the testing, and update the patch. -- - Thanks and regards, Sameera D.
[Aarch64] Fix conditional branches with target far away.
Hi! Please find attached the patch to fix the bug in branches with offsets over 1 MiB. There has been an attempt to fix this issue in commit 050af05b9761f1979f11c151519e7244d5becd7c. However, the far_branch attribute defined in the above patch used insn_length, which computes an incorrect offset. Hence, I eliminated the attribute completely and computed the offset from insn_addresses instead. Ok for trunk?

gcc/ChangeLog

2018-02-13  Sameera Deshpande
	* config/aarch64/aarch64.md (far_branch): Remove attribute. Eliminate
	all the dependencies on the attribute from RTL patterns.

-- - Thanks and regards, Sameera D.

Index: gcc/config/aarch64/aarch64.md
===
--- gcc/config/aarch64/aarch64.md (revision 257620)
+++ gcc/config/aarch64/aarch64.md (working copy)
@@ -244,13 +244,6 @@
 	 (const_string "no")
 	] (const_string "yes")))
 
-;; Attribute that specifies whether we are dealing with a branch to a
-;; label that is far away, i.e. further away than the maximum/minimum
-;; representable in a signed 21-bits number.
-;; 0 :=: no
-;; 1 :=: yes
-(define_attr "far_branch" "" (const_int 0))
-
 ;; Strictly for compatibility with AArch32 in pipeline models, since AArch64 has
 ;; no predicated insns.
 (define_attr "predicated" "yes,no" (const_string "no"))
@@ -448,12 +441,7 @@
 	(if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int -1048576))
 			   (lt (minus (match_dup 2) (pc)) (const_int 1048572)))
 		      (const_int 4)
-		      (const_int 8)))
-   (set (attr "far_branch")
-	(if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int -1048576))
-			   (lt (minus (match_dup 2) (pc)) (const_int 1048572)))
-		      (const_int 0)
-		      (const_int 1)))]
+		      (const_int 8)))]
 )
 
 ;; For a 24-bit immediate CST we can optimize the compare for equality
@@ -670,12 +658,7 @@
 	(if_then_else (and (ge (minus (match_dup 1) (pc)) (const_int -1048576))
 			   (lt (minus (match_dup 1) (pc)) (const_int 1048572)))
 		      (const_int 4)
-		      (const_int 8)))
-   (set (attr "far_branch")
-	(if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int -1048576))
-			   (lt (minus (match_dup 2) (pc)) (const_int 1048572)))
-		      (const_int 0)
-		      (const_int 1)))]
+		      (const_int 8)))]
 )
 
 (define_insn "*tb1"
@@ -692,7 +675,11 @@
 {
   if (get_attr_length (insn) == 8)
     {
-      if (get_attr_far_branch (insn) == 1)
+      long long int offset;
+      offset = INSN_ADDRESSES (INSN_UID (XEXP (operands[2], 0)))
+	       - INSN_ADDRESSES (INSN_UID (insn));
+
+      if (offset <= -1048576 || offset >= 1048572)
 	return aarch64_gen_far_branch (operands, 2, "Ltb",
 				       "\\t%0, %1, ");
       else
@@ -709,12 +696,7 @@
 	(if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int -32768))
 			   (lt (minus (match_dup 2) (pc)) (const_int 32764)))
 		      (const_int 4)
-		      (const_int 8)))
-   (set (attr "far_branch")
-	(if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int -1048576))
-			   (lt (minus (match_dup 2) (pc)) (const_int 1048572)))
-		      (const_int 0)
-		      (const_int 1)))]
+		      (const_int 8)))]
 )
@@ -727,8 +709,12 @@
   ""
 {
   if (get_attr_length (insn) == 8)
-    {
-      if (get_attr_far_branch (insn) == 1)
+    {
+      long long int offset;
+      offset = INSN_ADDRESSES (INSN_UID (XEXP (operands[1], 0)))
+	       - INSN_ADDRESSES (INSN_UID (insn));
+
+      if (offset <= -1048576 || offset >= 1048572)
 	return aarch64_gen_far_branch (operands, 1, "Ltb",
 				       "\\t%0, , ");
       else
@@ -740,7 +726,7 @@
 	  output_asm_insn (buf, operands);
 	  return "\t%l1";
 	}
-    }
+    }
   else
     return "\t%0, , %l1";
 }
@@ -749,12 +735,7 @@
 	(if_then_else (and (ge (minus (match_dup 1) (pc)) (const_int -32768))
 			   (lt (minus (match_dup 1) (pc)) (const_int 32764)))
 		      (const_int 4)
-		      (const_int 8)))
-   (set (attr "far_branch")
-	(if_then_else (and (ge (minus (match_dup 1) (pc)) (const_int -1048576))
-			   (lt (minus (match_dup 1) (pc)) (const_int 1048572)))
-		      (const_int 0)
-		      (const_int 1)))]
+		      (const_int 8)))]
 )

;; ---
Re: [Aarch64] Fix conditional branches with target far away.
On 14 February 2018 at 14:00, Sameera Deshpande wrote: > Hi! > > Please find attached the patch to fix bug in branches with offsets over 1MiB. > There has been an attempt to fix this issue in commit > 050af05b9761f1979f11c151519e7244d5becd7c > > However, the far_branch attribute defined in above patch used > insn_length - which computes incorrect offset. Hence, eliminated the > attribute completely, and computed the offset from insn_addresses > instead. > > Ok for trunk? > > gcc/Changelog > > 2018-02-13 Sameera Deshpande > * config/aarch64/aarch64.md (far_branch): Remove attribute. Eliminate > all the dependencies on the attribute from RTL patterns. > > -- > - Thanks and regards, > Sameera D. Gentle reminder! -- - Thanks and regards, Sameera D.
Re: [Aarch64] Fix conditional branches with target far away.
On 27 February 2018 at 18:25, Ramana Radhakrishnan wrote: > On Wed, Feb 14, 2018 at 8:30 AM, Sameera Deshpande > wrote: >> Hi! >> >> Please find attached the patch to fix bug in branches with offsets over 1MiB. >> There has been an attempt to fix this issue in commit >> 050af05b9761f1979f11c151519e7244d5becd7c >> >> However, the far_branch attribute defined in above patch used >> insn_length - which computes incorrect offset. Hence, eliminated the >> attribute completely, and computed the offset from insn_addresses >> instead. >> >> Ok for trunk? >> >> gcc/Changelog >> >> 2018-02-13 Sameera Deshpande >> * config/aarch64/aarch64.md (far_branch): Remove attribute. Eliminate >> all the dependencies on the attribute from RTL patterns. >> > > I'm not a maintainer but this looks good to me modulo notes about how > this was tested. What would be nice is a testcase for the testsuite as > well as ensuring that the patch has been bootstrapped and regression > tested. AFAIR, the original patch was put in because match.pd failed > when bootstrap in another context. > > > regards > Ramana > >> -- >> - Thanks and regards, >> Sameera D. The patch is tested with GCC testsuite and bootstrapping successfully. Also tested for spec benchmark. -- - Thanks and regards, Sameera D.
Re: [Aarch64] Fix conditional branches with target far away.
Ping! On 28 February 2018 at 16:18, Sameera Deshpande wrote: > On 27 February 2018 at 18:25, Ramana Radhakrishnan > wrote: >> On Wed, Feb 14, 2018 at 8:30 AM, Sameera Deshpande >> wrote: >>> Hi! >>> >>> Please find attached the patch to fix bug in branches with offsets over >>> 1MiB. >>> There has been an attempt to fix this issue in commit >>> 050af05b9761f1979f11c151519e7244d5becd7c >>> >>> However, the far_branch attribute defined in above patch used >>> insn_length - which computes incorrect offset. Hence, eliminated the >>> attribute completely, and computed the offset from insn_addresses >>> instead. >>> >>> Ok for trunk? >>> >>> gcc/Changelog >>> >>> 2018-02-13 Sameera Deshpande >>> * config/aarch64/aarch64.md (far_branch): Remove attribute. >>> Eliminate >>> all the dependencies on the attribute from RTL patterns. >>> >> >> I'm not a maintainer but this looks good to me modulo notes about how >> this was tested. What would be nice is a testcase for the testsuite as >> well as ensuring that the patch has been bootstrapped and regression >> tested. AFAIR, the original patch was put in because match.pd failed >> when bootstrap in another context. >> >> >> regards >> Ramana >> >>> -- >>> - Thanks and regards, >>> Sameera D. > > The patch is tested with GCC testsuite and bootstrapping successfully. > Also tested for spec benchmark. > > -- > - Thanks and regards, > Sameera D. -- - Thanks and regards, Sameera D.
Re: [Aarch64] Fix conditional branches with target far away.
Hi Sudakshina, As per the ARMv8 ARM, the offset range (-1048576, 1048572) for the far-branch check is inclusive of both end offsets. Hence, I am using <= || >= and not < || >= as in the previous implementation. On 16 March 2018 at 00:51, Sudakshina Das wrote: > On 15/03/18 15:27, Sameera Deshpande wrote: >> >> Ping! >> >> On 28 February 2018 at 16:18, Sameera Deshpande >> wrote: >>> >>> On 27 February 2018 at 18:25, Ramana Radhakrishnan >>> wrote: >>>> >>>> On Wed, Feb 14, 2018 at 8:30 AM, Sameera Deshpande >>>> wrote: >>>>> >>>>> Hi! >>>>> >>>>> Please find attached the patch to fix bug in branches with offsets over >>>>> 1MiB. >>>>> There has been an attempt to fix this issue in commit >>>>> 050af05b9761f1979f11c151519e7244d5becd7c >>>>> >>>>> However, the far_branch attribute defined in above patch used >>>>> insn_length - which computes incorrect offset. Hence, eliminated the >>>>> attribute completely, and computed the offset from insn_addresses >>>>> instead. >>>>> >>>>> Ok for trunk? >>>>> >>>>> gcc/Changelog >>>>> >>>>> 2018-02-13 Sameera Deshpande >>>>> * config/aarch64/aarch64.md (far_branch): Remove attribute. >>>>> Eliminate >>>>> all the dependencies on the attribute from RTL patterns. >>>>> >>>> >>>> I'm not a maintainer but this looks good to me modulo notes about how >>>> this was tested. What would be nice is a testcase for the testsuite as >>>> well as ensuring that the patch has been bootstrapped and regression >>>> tested. AFAIR, the original patch was put in because match.pd failed >>>> when bootstrap in another context. >>>> >>>> >>>> regards >>>> Ramana >>>> >>>>> -- >>>>> - Thanks and regards, >>>>>Sameera D. >>> >>> >>> The patch is tested with GCC testsuite and bootstrapping successfully. >>> Also tested for spec benchmark. >>> > > I am not a maintainer either. I noticed that the range check you do for > the offset has a (<= || >=). The "far_branch" however did (< || >=) for a > positive value.
Was that also part of the incorrect offset calculation? > > @@ -692,7 +675,11 @@ > { > if (get_attr_length (insn) == 8) > { > - if (get_attr_far_branch (insn) == 1) > + long long int offset; > + offset = INSN_ADDRESSES (INSN_UID (XEXP (operands[2], 0))) > + - INSN_ADDRESSES (INSN_UID (insn)); > + > + if (offset <= -1048576 || offset >= 1048572) >return aarch64_gen_far_branch (operands, 2, "Ltb", > "\\t%0, %1, "); > else > @@ -709,12 +696,7 @@ > (if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int > -32768)) > (lt (minus (match_dup 2) (pc)) (const_int > 32764))) >(const_int 4) > - (const_int 8))) > - (set (attr "far_branch") > - (if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int > -1048576)) > - (lt (minus (match_dup 2) (pc)) (const_int > 1048572))) > - (const_int 0) > - (const_int 1)))] > + (const_int 8)))] > > ) > > Thanks > Sudi > >>> -- >>> - Thanks and regards, >>>Sameera D. >> >> >> >> > -- - Thanks and regards, Sameera D.
Re: [Aarch64] Fix conditional branches with target far away.
Hi Sudakshina, Thanks for pointing that out. Updated the conditions for attribute length to take care of boundary conditions for offset range. Please find attached the updated patch. I have tested it for gcc testsuite and the failing testcase. Ok for trunk? On 22 March 2018 at 19:06, Sudakshina Das wrote: > Hi Sameera > > On 22/03/18 02:07, Sameera Deshpande wrote: >> >> Hi Sudakshina, >> >> As per the ARMv8 ARM, for the offset range (-1048576 ,1048572), the >> far branch instruction offset is inclusive of both the offsets. Hence, >> I am using <=||=> and not <||>= as it was in previous implementation. > > > I have to admit earlier I was only looking at the patch mechanically and > found a difference with the previous implementation in offset comparison. > After you pointed out, I looked up the ARMv8 ARM and I have a couple of > doubts: > > 1. My understanding is that any offset in [-1048576 ,1048572] both inclusive > qualifies as an 'in range' offset. However, the code for both attribute > length and far_branch has been using [-1048576 ,1048572), that is, ( >= && < > ). If the far_branch was incorrectly calculated, then maybe the length > calculations with similar magic numbers should also be corrected? Of course, > I am not an expert in this and maybe this was a conscience decision so I > would ask Ramana to maybe clarify if he remembers. > > 2. Now to come back to your patch, if my understanding is correct, I think a > far_branch would be anything outside of this range, that is, > (offset < -1048576 || offset > 1048572), anything that can not be > represented in the 21-bit range. > > Thanks > Sudi > > >> >> On 16 March 2018 at 00:51, Sudakshina Das wrote: >>> >>> On 15/03/18 15:27, Sameera Deshpande wrote: >>>> >>>> >>>> Ping! 
>>>> >>>> On 28 February 2018 at 16:18, Sameera Deshpande >>>> wrote: >>>>> >>>>> >>>>> On 27 February 2018 at 18:25, Ramana Radhakrishnan >>>>> wrote: >>>>>> >>>>>> >>>>>> On Wed, Feb 14, 2018 at 8:30 AM, Sameera Deshpande >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> Hi! >>>>>>> >>>>>>> Please find attached the patch to fix bug in branches with offsets >>>>>>> over >>>>>>> 1MiB. >>>>>>> There has been an attempt to fix this issue in commit >>>>>>> 050af05b9761f1979f11c151519e7244d5becd7c >>>>>>> >>>>>>> However, the far_branch attribute defined in above patch used >>>>>>> insn_length - which computes incorrect offset. Hence, eliminated the >>>>>>> attribute completely, and computed the offset from insn_addresses >>>>>>> instead. >>>>>>> >>>>>>> Ok for trunk? >>>>>>> >>>>>>> gcc/Changelog >>>>>>> >>>>>>> 2018-02-13 Sameera Deshpande >>>>>>> * config/aarch64/aarch64.md (far_branch): Remove attribute. >>>>>>> Eliminate >>>>>>> all the dependencies on the attribute from RTL patterns. >>>>>>> >>>>>> >>>>>> I'm not a maintainer but this looks good to me modulo notes about how >>>>>> this was tested. What would be nice is a testcase for the testsuite as >>>>>> well as ensuring that the patch has been bootstrapped and regression >>>>>> tested. AFAIR, the original patch was put in because match.pd failed >>>>>> when bootstrap in another context. >>>>>> >>>>>> >>>>>> regards >>>>>> Ramana >>>>>> >>>>>>> -- >>>>>>> - Thanks and regards, >>>>>>> Sameera D. >>>>> >>>>> >>>>> >>>>> The patch is tested with GCC testsuite and bootstrapping successfully. >>>>> Also tested for spec benchmark. >>>>> >>> >>> I am not a maintainer either. I noticed that the range check you do for >>> the offset has a (<= || >=). The "far_branch" however did (< || >=) for a >>> positive value. Was that also part of the incorrect offset calculation? >>> >>> @@ -692,7 +675,11 @@ >&
Re: [Aarch64] Fix conditional branches with target far away.
Hi Sudakshina, That testcase cannot be added as of now, as it needs approval from the client. On Thu 29 Mar, 2018, 9:01 PM Sudakshina Das, wrote: > Hi Sameera > > On 29/03/18 11:44, Sameera Deshpande wrote: > > Hi Sudakshina, > > > > Thanks for pointing that out. Updated the conditions for attribute > > length to take care of boundary conditions for offset range. > > > > Please find attached the updated patch. > > > > I have tested it for gcc testsuite and the failing testcase. Ok for > trunk? > > Thank you so much for fixing the length as well along with you patch. > You mention a failing testcase? Maybe it would be helpful to add that > to the patch for the gcc testsuite. > > Sudi > > > > > On 22 March 2018 at 19:06, Sudakshina Das wrote: > >> Hi Sameera > >> > >> On 22/03/18 02:07, Sameera Deshpande wrote: > >>> > >>> Hi Sudakshina, > >>> > >>> As per the ARMv8 ARM, for the offset range (-1048576 ,1048572), the > >>> far branch instruction offset is inclusive of both the offsets. Hence, > >>> I am using <=||=> and not <||>= as it was in previous implementation. > >> > >> > >> I have to admit earlier I was only looking at the patch mechanically and > >> found a difference with the previous implementation in offset > comparison. > >> After you pointed out, I looked up the ARMv8 ARM and I have a couple of > >> doubts: > >> > >> 1. My understanding is that any offset in [-1048576 ,1048572] both > inclusive > >> qualifies as an 'in range' offset. However, the code for both attribute > >> length and far_branch has been using [-1048576 ,1048572), that is, ( >= > && < > >> ). If the far_branch was incorrectly calculated, then maybe the length > >> calculations with similar magic numbers should also be corrected? Of > course, > >> I am not an expert in this and maybe this was a conscience decision so I > >> would ask Ramana to maybe clarify if he remembers. > >> > >> 2.
Now to come back to your patch, if my understanding is correct, I > think a > >> far_branch would be anything outside of this range, that is, > >> (offset < -1048576 || offset > 1048572), anything that can not be > >> represented in the 21-bit range. > >> > >> Thanks > >> Sudi > >> > >> > >>> > >>> On 16 March 2018 at 00:51, Sudakshina Das wrote: > >>>> > >>>> On 15/03/18 15:27, Sameera Deshpande wrote: > >>>>> > >>>>> > >>>>> Ping! > >>>>> > >>>>> On 28 February 2018 at 16:18, Sameera Deshpande > >>>>> wrote: > >>>>>> > >>>>>> > >>>>>> On 27 February 2018 at 18:25, Ramana Radhakrishnan > >>>>>> wrote: > >>>>>>> > >>>>>>> > >>>>>>> On Wed, Feb 14, 2018 at 8:30 AM, Sameera Deshpande > >>>>>>> wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> Hi! > >>>>>>>> > >>>>>>>> Please find attached the patch to fix bug in branches with offsets > >>>>>>>> over > >>>>>>>> 1MiB. > >>>>>>>> There has been an attempt to fix this issue in commit > >>>>>>>> 050af05b9761f1979f11c151519e7244d5becd7c > >>>>>>>> > >>>>>>>> However, the far_branch attribute defined in above patch used > >>>>>>>> insn_length - which computes incorrect offset. Hence, eliminated > the > >>>>>>>> attribute completely, and computed the offset from insn_addresses > >>>>>>>> instead. > >>>>>>>> > >>>>>>>> Ok for trunk? > >>>>>>>> > >>>>>>>> gcc/Changelog > >>>>>>>> > >>>>>>>> 2018-02-13 Sameera Deshpande > >>>>>>>>* config/aarch64/aarch64.md (far_branch): Remove > attribute. > >>>>>>>> Eliminate > >>>>>>>>all the dependencies on the attribute from RTL > patterns. > >>>>>>>> > >>>>>>> > >>>&
Re: [Aarch64] Fix conditional branches with target far away.
On 30 March 2018 at 16:39, Richard Sandiford wrote: >> Hi Sudakshina, >> >> Thanks for pointing that out. Updated the conditions for attribute >> length to take care of boundary conditions for offset range. >> >> Please find attached the updated patch. >> >> I have tested it for gcc testsuite and the failing testcase. Ok for trunk? >> >> On 22 March 2018 at 19:06, Sudakshina Das wrote: >>> Hi Sameera >>> >>> On 22/03/18 02:07, Sameera Deshpande wrote: >>>> >>>> Hi Sudakshina, >>>> >>>> As per the ARMv8 ARM, for the offset range (-1048576 ,1048572), the >>>> far branch instruction offset is inclusive of both the offsets. Hence, >>>> I am using <=||=> and not <||>= as it was in previous implementation. >>> >>> >>> I have to admit earlier I was only looking at the patch mechanically and >>> found a difference with the previous implementation in offset comparison. >>> After you pointed out, I looked up the ARMv8 ARM and I have a couple of >>> doubts: >>> >>> 1. My understanding is that any offset in [-1048576 ,1048572] both inclusive >>> qualifies as an 'in range' offset. However, the code for both attribute >>> length and far_branch has been using [-1048576 ,1048572), that is, ( >= && < >>> ). If the far_branch was incorrectly calculated, then maybe the length >>> calculations with similar magic numbers should also be corrected? Of course, >>> I am not an expert in this and maybe this was a conscience decision so I >>> would ask Ramana to maybe clarify if he remembers. >>> >>> 2. Now to come back to your patch, if my understanding is correct, I think a >>> far_branch would be anything outside of this range, that is, >>> (offset < -1048576 || offset > 1048572), anything that can not be >>> represented in the 21-bit range. >>> >>> Thanks >>> Sudi > > [...] 
> >> @@ -466,14 +459,9 @@ >>[(set_attr "type" "branch") >> (set (attr "length") >> (if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int -1048576)) >> -(lt (minus (match_dup 2) (pc)) (const_int 1048572))) >> +(le (minus (match_dup 2) (pc)) (const_int 1048572))) >> (const_int 4) >> - (const_int 8))) > > Sorry for not replying earlier, but I think the use of "lt" rather than > "le" in the current length attribute is deliberate. Distances measured > from (pc) in "length" are a bit special in that backward distances are > measured from the start of the instruction and forward distances are > measured from the end of the instruction: > > /* The address of the current insn. We implement this actually as the > address of the current insn for backward branches, but the last > address of the next insn for forward branches, and both with > adjustments that account for the worst-case possible stretching of > intervening alignments between this insn and its destination. */ > > This avoids the chicken-and-egg situation of the length of the branch > depending on the forward distance and the forward distance depending > on the length of the branch. > > In contrast, this code: > >> @@ -712,7 +695,11 @@ >>{ >> if (get_attr_length (insn) == 8) >>{ >> - if (get_attr_far_branch (insn) == 1) >> + long long int offset; >> + offset = INSN_ADDRESSES (INSN_UID (XEXP (operands[2], 0))) >> + - INSN_ADDRESSES (INSN_UID (insn)); >> + >> + if (offset < -1048576 || offset > 1048572) >> return aarch64_gen_far_branch (operands, 2, "Ltb", >>"\\t%0, %1, "); >> else > > is reading the final computed addresses, so the code is right to use > the real instruction range. (FWIW I agree with Kyrill that using > IN_RANGE with hex constants would be clearer.) > > That said... a possible problem comes from situations like: > >address length insn >..c 8 A > ..align to 8 bytes... >..8B >..c 4 C > ..align to 16 bytes... 
>..0D, branch to B > > when D is at the maximum extent of the branch range and when GCC's length > for A is only a conservative estimate. If the length of A turns out to > be 4 rather than 8 at assembly time, the align
Re: [Aarch64] Fix conditional branches with target far away.
Hi Richard, The testcase is working with the patch you suggested, thanks for pointing that out. On 30 March 2018 at 16:54, Sameera Deshpande wrote: > On 30 March 2018 at 16:39, Richard Sandiford > wrote: >>> Hi Sudakshina, >>> >>> Thanks for pointing that out. Updated the conditions for attribute >>> length to take care of boundary conditions for offset range. >>> >>> Please find attached the updated patch. >>> >>> I have tested it for gcc testsuite and the failing testcase. Ok for trunk? >>> >>> On 22 March 2018 at 19:06, Sudakshina Das wrote: >>>> Hi Sameera >>>> >>>> On 22/03/18 02:07, Sameera Deshpande wrote: >>>>> >>>>> Hi Sudakshina, >>>>> >>>>> As per the ARMv8 ARM, for the offset range (-1048576 ,1048572), the >>>>> far branch instruction offset is inclusive of both the offsets. Hence, >>>>> I am using <=||=> and not <||>= as it was in previous implementation. >>>> >>>> >>>> I have to admit earlier I was only looking at the patch mechanically and >>>> found a difference with the previous implementation in offset comparison. >>>> After you pointed out, I looked up the ARMv8 ARM and I have a couple of >>>> doubts: >>>> >>>> 1. My understanding is that any offset in [-1048576 ,1048572] both >>>> inclusive >>>> qualifies as an 'in range' offset. However, the code for both attribute >>>> length and far_branch has been using [-1048576 ,1048572), that is, ( >= && >>>> < >>>> ). If the far_branch was incorrectly calculated, then maybe the length >>>> calculations with similar magic numbers should also be corrected? Of >>>> course, >>>> I am not an expert in this and maybe this was a conscience decision so I >>>> would ask Ramana to maybe clarify if he remembers. >>>> >>>> 2. Now to come back to your patch, if my understanding is correct, I think >>>> a >>>> far_branch would be anything outside of this range, that is, >>>> (offset < -1048576 || offset > 1048572), anything that can not be >>>> represented in the 21-bit range. >>>> >>>> Thanks >>>> Sudi >> >> [...] 
>> >>> @@ -466,14 +459,9 @@ >>>[(set_attr "type" "branch") >>> (set (attr "length") >>> (if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int >>> -1048576)) >>> -(lt (minus (match_dup 2) (pc)) (const_int >>> 1048572))) >>> +(le (minus (match_dup 2) (pc)) (const_int >>> 1048572))) >>> (const_int 4) >>> - (const_int 8))) >> >> Sorry for not replying earlier, but I think the use of "lt" rather than >> "le" in the current length attribute is deliberate. Distances measured >> from (pc) in "length" are a bit special in that backward distances are >> measured from the start of the instruction and forward distances are >> measured from the end of the instruction: >> >> /* The address of the current insn. We implement this actually as the >> address of the current insn for backward branches, but the last >> address of the next insn for forward branches, and both with >> adjustments that account for the worst-case possible stretching of >> intervening alignments between this insn and its destination. */ >> >> This avoids the chicken-and-egg situation of the length of the branch >> depending on the forward distance and the forward distance depending >> on the length of the branch. >> >> In contrast, this code: >> >>> @@ -712,7 +695,11 @@ >>>{ >>> if (get_attr_length (insn) == 8) >>>{ >>> - if (get_attr_far_branch (insn) == 1) >>> + long long int offset; >>> + offset = INSN_ADDRESSES (INSN_UID (XEXP (operands[2], 0))) >>> + - INSN_ADDRESSES (INSN_UID (insn)); >>> + >>> + if (offset < -1048576 || offset > 1048572) >>> return aarch64_gen_far_branch (operands, 2, "Ltb", >>>"\\t%0, %1, "); >>> else >> >> is reading the final computed addresses, so the code is right to use >> the real instruction
Re: [AARCH64] Neon vld1_*_x3, vst1_*_x2 and vst1_*_x3 intrinsics
Hi Christophe, Please find attached the updated patch with testcases. Ok for trunk? - Thanks and regards, Sameera D. 2017-12-14 22:17 GMT+05:30 Christophe Lyon : > 2017-12-14 9:29 GMT+01:00 Sameera Deshpande : >> Hi! >> >> Please find attached the patch implementing vld1_*_x3, vst1_*_x2 and >> vst1_*_x3 intrinsics as defined by Neon document. >> >> Ok for trunk? >> >> - Thanks and regards, >> Sameera D. >> >> gcc/Changelog: >> >> 2017-11-14 Sameera Deshpande >> >> >> * config/aarch64/aarch64-simd-builtins.def (ld1x3): New. >> (st1x2): Likewise. >> (st1x3): Likewise. >> * config/aarch64/aarch64-simd.md >> (aarch64_ld1x3): New pattern. >> (aarch64_ld1_x3_): Likewise >> (aarch64_st1x2): Likewise >> (aarch64_st1_x2_): Likewise >> (aarch64_st1x3): Likewise >> (aarch64_st1_x3_): Likewise >> * config/aarch64/arm_neon.h (vld1_u8_x3): New function. >> (vld1_s8_x3): Likewise. >> (vld1_u16_x3): Likewise. >> (vld1_s16_x3): Likewise. >> (vld1_u32_x3): Likewise. >> (vld1_s32_x3): Likewise. >> (vld1_u64_x3): Likewise. >> (vld1_s64_x3): Likewise. >> (vld1_fp16_x3): Likewise. >> (vld1_f32_x3): Likewise. >> (vld1_f64_x3): Likewise. >> (vld1_p8_x3): Likewise. >> (vld1_p16_x3): Likewise. >> (vld1_p64_x3): Likewise. >> (vld1q_u8_x3): Likewise. >> (vld1q_s8_x3): Likewise. >> (vld1q_u16_x3): Likewise. >> (vld1q_s16_x3): Likewise. >> (vld1q_u32_x3): Likewise. >> (vld1q_s32_x3): Likewise. >> (vld1q_u64_x3): Likewise. >> (vld1q_s64_x3): Likewise. >> (vld1q_f16_x3): Likewise. >> (vld1q_f32_x3): Likewise. >> (vld1q_f64_x3): Likewise. >> (vld1q_p8_x3): Likewise. >> (vld1q_p16_x3): Likewise. >> (vld1q_p64_x3): Likewise. >> (vst1_s64_x2): Likewise. >> (vst1_u64_x2): Likewise. >> (vst1_f64_x2): Likewise. >> (vst1_s8_x2): Likewise. >> (vst1_p8_x2): Likewise. >> (vst1_s16_x2): Likewise. >> (vst1_p16_x2): Likewise. >> (vst1_s32_x2): Likewise. >> (vst1_u8_x2): Likewise. >> (vst1_u16_x2): Likewise. >> (vst1_u32_x2): Likewise. >> (vst1_f16_x2): Likewise. >> (vst1_f32_x2): Likewise. 
>> (vst1_p64_x2): Likewise. >> (vst1q_s8_x2): Likewise. >> (vst1q_p8_x2): Likewise. >> (vst1q_s16_x2): Likewise. >> (vst1q_p16_x2): Likewise. >> (vst1q_s32_x2): Likewise. >> (vst1q_s64_x2): Likewise. >> (vst1q_u8_x2): Likewise. >> (vst1q_u16_x2): Likewise. >> (vst1q_u32_x2): Likewise. >> (vst1q_u64_x2): Likewise. >> (vst1q_f16_x2): Likewise. >> (vst1q_f32_x2): Likewise. >> (vst1q_f64_x2): Likewise. >> (vst1q_p64_x2): Likewise. >> (vst1_s64_x3): Likewise. >> (vst1_u64_x3): Likewise. >> (vst1_f64_x3): Likewise. >> (vst1_s8_x3): Likewise. >> (vst1_p8_x3): Likewise. >> (vst1_s16_x3): Likewise. >> (vst1_p16_x3): Likewise. >> (vst1_s32_x3): Likewise. >> (vst1_u8_x3): Likewise. >> (vst1_u16_x3): Likewise. >> (vst1_u32_x3): Likewise. >> (vst1_f16_x3): Likewise. >> (vst1_f32_x3): Likewise. >> (vst1_p64_x3): Likewise. >> (vst1q_s8_x3): Likewise. >> (vst1q_p8_x3): Likewise. >> (vst1q_s16_x3): Likewise. >> (vst1q_p16_x3): Likewise. >> (vst1q_s32_x3): Likewise. >> (vst1q_s64_x3): Likewise. >> (vst1q_u8_x3): Likewise. >> (vst1q_u16_x3): Likewise. >> (vst1q_u32_x3): Likewise. >> (vst1q_u64_x3): Likewise. >> (vst1q_f16_x3): Likewise. >> (vst1q_f32_x3): Likewise. >> (vst1q_f64_x3): Likewise. >> (vst1q_p64_x3): Likewise. > > Hi, > I'm not a maintainer, but I suspect you should add some tests. > > Christophe -- - Thanks and regards, Sameera D. diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def index b383f24..2fd072a 100644 --- a/gcc/config/aarch64/aarch64-simd-builtins.def +++ b/gcc/config/aarch64/aarc
Re: [Aarch64] Fix conditional branches with target far away.
Hi Richard, I do not see the said patch applied in ToT yet. When do you expect it to be available in ToT? - Thanks and regards, Sameera D. On 30 March 2018 at 17:01, Sameera Deshpande wrote: > Hi Richard, > > The testcase is working with the patch you suggested, thanks for > pointing that out. > > On 30 March 2018 at 16:54, Sameera Deshpande > wrote: >> On 30 March 2018 at 16:39, Richard Sandiford >> wrote: >>>> Hi Sudakshina, >>>> >>>> Thanks for pointing that out. Updated the conditions for attribute >>>> length to take care of boundary conditions for offset range. >>>> >>>> Please find attached the updated patch. >>>> >>>> I have tested it for gcc testsuite and the failing testcase. Ok for trunk? >>>> >>>> On 22 March 2018 at 19:06, Sudakshina Das wrote: >>>>> Hi Sameera >>>>> >>>>> On 22/03/18 02:07, Sameera Deshpande wrote: >>>>>> >>>>>> Hi Sudakshina, >>>>>> >>>>>> As per the ARMv8 ARM, for the offset range (-1048576 ,1048572), the >>>>>> far branch instruction offset is inclusive of both the offsets. Hence, >>>>>> I am using <=||=> and not <||>= as it was in previous implementation. >>>>> >>>>> >>>>> I have to admit earlier I was only looking at the patch mechanically and >>>>> found a difference with the previous implementation in offset comparison. >>>>> After you pointed out, I looked up the ARMv8 ARM and I have a couple of >>>>> doubts: >>>>> >>>>> 1. My understanding is that any offset in [-1048576 ,1048572] both >>>>> inclusive >>>>> qualifies as an 'in range' offset. However, the code for both attribute >>>>> length and far_branch has been using [-1048576 ,1048572), that is, ( >= >>>>> && < >>>>> ). If the far_branch was incorrectly calculated, then maybe the length >>>>> calculations with similar magic numbers should also be corrected? Of >>>>> course, >>>>> I am not an expert in this and maybe this was a conscience decision so I >>>>> would ask Ramana to maybe clarify if he remembers. >>>>> >>>>> 2. 
Now to come back to your patch, if my understanding is correct, I >>>>> think a >>>>> far_branch would be anything outside of this range, that is, >>>>> (offset < -1048576 || offset > 1048572), anything that can not be >>>>> represented in the 21-bit range. >>>>> >>>>> Thanks >>>>> Sudi >>> >>> [...] >>> >>>> @@ -466,14 +459,9 @@ >>>>[(set_attr "type" "branch") >>>> (set (attr "length") >>>> (if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int >>>> -1048576)) >>>> -(lt (minus (match_dup 2) (pc)) (const_int >>>> 1048572))) >>>> +(le (minus (match_dup 2) (pc)) (const_int >>>> 1048572))) >>>> (const_int 4) >>>> - (const_int 8))) >>> >>> Sorry for not replying earlier, but I think the use of "lt" rather than >>> "le" in the current length attribute is deliberate. Distances measured >>> from (pc) in "length" are a bit special in that backward distances are >>> measured from the start of the instruction and forward distances are >>> measured from the end of the instruction: >>> >>> /* The address of the current insn. We implement this actually as the >>> address of the current insn for backward branches, but the last >>> address of the next insn for forward branches, and both with >>> adjustments that account for the worst-case possible stretching of >>> intervening alignments between this insn and its destination. */ >>> >>> This avoids the chicken-and-egg situation of the length of the branch >>> depending on the forward distance and the forward distance depending >>> on the length of the branch. >>> >>> In contrast, this code: >>> >>>> @@ -712,7 +695,11 @@ >>>>{ >>>> if (get_attr_length (insn) == 8) >>>>{ >>>> - if (get_attr_far_branch (insn) =
Re: [AARCH64] Neon vld1_*_x3, vst1_*_x2 and vst1_*_x3 intrinsics
On 7 April 2018 at 01:25, Christophe Lyon wrote: > Hi, > > 2018-04-06 12:15 GMT+02:00 Sameera Deshpande : >> Hi Christophe, >> >> Please find attached the updated patch with testcases. >> >> Ok for trunk? > > Thanks for the update. > > Since the new intrinsics are only available on aarch64, you want to > prevent the tests from running on arm. > Indeed gcc.target/aarch64/advsimd-intrinsics/ is shared between the two > targets. > There are several examples on how to do that in that directory. > > I have also noticed that the tests fail at execution on aarch64_be. > > I didn't look at the patch in details. > > Christophe > > >> >> - Thanks and regards, >> Sameera D. >> >> 2017-12-14 22:17 GMT+05:30 Christophe Lyon : >>> 2017-12-14 9:29 GMT+01:00 Sameera Deshpande : >>>> Hi! >>>> >>>> Please find attached the patch implementing vld1_*_x3, vst1_*_x2 and >>>> vst1_*_x3 intrinsics as defined by Neon document. >>>> >>>> Ok for trunk? >>>> >>>> - Thanks and regards, >>>> Sameera D. >>>> >>>> gcc/Changelog: >>>> >>>> 2017-11-14 Sameera Deshpande >>>> >>>> >>>> * config/aarch64/aarch64-simd-builtins.def (ld1x3): New. >>>> (st1x2): Likewise. >>>> (st1x3): Likewise. >>>> * config/aarch64/aarch64-simd.md >>>> (aarch64_ld1x3): New pattern. >>>> (aarch64_ld1_x3_): Likewise >>>> (aarch64_st1x2): Likewise >>>> (aarch64_st1_x2_): Likewise >>>> (aarch64_st1x3): Likewise >>>> (aarch64_st1_x3_): Likewise >>>> * config/aarch64/arm_neon.h (vld1_u8_x3): New function. >>>> (vld1_s8_x3): Likewise. >>>> (vld1_u16_x3): Likewise. >>>> (vld1_s16_x3): Likewise. >>>> (vld1_u32_x3): Likewise. >>>> (vld1_s32_x3): Likewise. >>>> (vld1_u64_x3): Likewise. >>>> (vld1_s64_x3): Likewise. >>>> (vld1_fp16_x3): Likewise. >>>> (vld1_f32_x3): Likewise. >>>> (vld1_f64_x3): Likewise. >>>> (vld1_p8_x3): Likewise. >>>> (vld1_p16_x3): Likewise. >>>> (vld1_p64_x3): Likewise. >>>> (vld1q_u8_x3): Likewise. >>>> (vld1q_s8_x3): Likewise. >>>> (vld1q_u16_x3): Likewise. >>>> (vld1q_s16_x3): Likewise. 
>>>> (vld1q_u32_x3): Likewise. >>>> (vld1q_s32_x3): Likewise. >>>> (vld1q_u64_x3): Likewise. >>>> (vld1q_s64_x3): Likewise. >>>> (vld1q_f16_x3): Likewise. >>>> (vld1q_f32_x3): Likewise. >>>> (vld1q_f64_x3): Likewise. >>>> (vld1q_p8_x3): Likewise. >>>> (vld1q_p16_x3): Likewise. >>>> (vld1q_p64_x3): Likewise. >>>> (vst1_s64_x2): Likewise. >>>> (vst1_u64_x2): Likewise. >>>> (vst1_f64_x2): >>>> Likewise.patchurl=http://people.linaro.org/~christophe.lyon/armv8_2-fp16-scalar-2.patch3 > patchname=armv8_2-fp16-scalar-2.patch3 > refrev=259064 > email_to=christophe.l...@linaro.org > >>>> (vst1_s8_x2): Likewise. >>>> (vst1_p8_x2): Likewise. >>>> (vst1_s16_x2): Likewise. >>>> (vst1_p16_x2): Likewise. >>>> (vst1_s32_x2): Likewise. >>>> (vst1_u8_x2): Likewise. >>>> (vst1_u16_x2): Likewise. >>>> (vst1_u32_x2): Likewise. >>>> (vst1_f16_x2): Likewise. >>>> (vst1_f32_x2): Likewise. >>>> (vst1_p64_x2): Likewise. >>>> (vst1q_s8_x2): Likewise. >>>> (vst1q_p8_x2): Likewise. >>>> (vst1q_s16_x2): Likewise. >>>> (vst1q_p16_x2): Likewise. >>>> (vst1q_s32_x2): Likewise. >>>> (vst1q_s64_x2): Likewise. >>>> (vst1q_u8_x2): Likewise. >>>> (vst1q_u16_x2): Likewise. >>>> (vst1q_u32_x2): Likewise. >>>> (vst1q_u64_x2): Likewise. >>>> (vst1q_f16_x2): Likewise. >>>> (vst1q_f32_x2): Likewise. >>>> (vst1q_f64_x2): Likewise. >>>> (vst1q_p64_x2): Likewise. >>>>
Re: [AARCH64] Neon vld1_*_x3, vst1_*_x2 and vst1_*_x3 intrinsics
On 10 April 2018 at 20:07, Sudakshina Das wrote: > Hi Sameera > > > On 10/04/18 11:20, Sameera Deshpande wrote: >> >> On 7 April 2018 at 01:25, Christophe Lyon >> wrote: >>> >>> Hi, >>> >>> 2018-04-06 12:15 GMT+02:00 Sameera Deshpande >>> : >>>> >>>> Hi Christophe, >>>> >>>> Please find attached the updated patch with testcases. >>>> >>>> Ok for trunk? >>> >>> >>> Thanks for the update. >>> >>> Since the new intrinsics are only available on aarch64, you want to >>> prevent the tests from running on arm. >>> Indeed gcc.target/aarch64/advsimd-intrinsics/ is shared between the two >>> targets. >>> There are several examples on how to do that in that directory. >>> >>> I have also noticed that the tests fail at execution on aarch64_be. >>> >>> I didn't look at the patch in details. >>> >>> Christophe >>> >>> >>>> >>>> - Thanks and regards, >>>>Sameera D. >>>> >>>> 2017-12-14 22:17 GMT+05:30 Christophe Lyon : >>>>> >>>>> 2017-12-14 9:29 GMT+01:00 Sameera Deshpande >>>>> : >>>>>> >>>>>> Hi! >>>>>> >>>>>> Please find attached the patch implementing vld1_*_x3, vst1_*_x2 and >>>>>> vst1_*_x3 intrinsics as defined by Neon document. >>>>>> >>>>>> Ok for trunk? >>>>>> >>>>>> - Thanks and regards, >>>>>>Sameera D. >>>>>> >>>>>> gcc/Changelog: >>>>>> >>>>>> 2017-11-14 Sameera Deshpande >>>>>> >>>>>> >>>>>> * config/aarch64/aarch64-simd-builtins.def (ld1x3): New. >>>>>> (st1x2): Likewise. >>>>>> (st1x3): Likewise. >>>>>> * config/aarch64/aarch64-simd.md >>>>>> (aarch64_ld1x3): New pattern. >>>>>> (aarch64_ld1_x3_): Likewise >>>>>> (aarch64_st1x2): Likewise >>>>>> (aarch64_st1_x2_): Likewise >>>>>> (aarch64_st1x3): Likewise >>>>>> (aarch64_st1_x3_): Likewise >>>>>> * config/aarch64/arm_neon.h (vld1_u8_x3): New function. >>>>>> (vld1_s8_x3): Likewise. >>>>>> (vld1_u16_x3): Likewise. >>>>>> (vld1_s16_x3): Likewise. >>>>>> (vld1_u32_x3): Likewise. >>>>>> (vld1_s32_x3): Likewise. >>>>>> (vld1_u64_x3): Likewise. >>>>>> (vld1_s64_x3): Likewise. >>>>>> (vld1_fp16_x3): Likewise. 
>>>>>> (vld1_f32_x3): Likewise. >>>>>> (vld1_f64_x3): Likewise. >>>>>> (vld1_p8_x3): Likewise. >>>>>> (vld1_p16_x3): Likewise. >>>>>> (vld1_p64_x3): Likewise. >>>>>> (vld1q_u8_x3): Likewise. >>>>>> (vld1q_s8_x3): Likewise. >>>>>> (vld1q_u16_x3): Likewise. >>>>>> (vld1q_s16_x3): Likewise. >>>>>> (vld1q_u32_x3): Likewise. >>>>>> (vld1q_s32_x3): Likewise. >>>>>> (vld1q_u64_x3): Likewise. >>>>>> (vld1q_s64_x3): Likewise. >>>>>> (vld1q_f16_x3): Likewise. >>>>>> (vld1q_f32_x3): Likewise. >>>>>> (vld1q_f64_x3): Likewise. >>>>>> (vld1q_p8_x3): Likewise. >>>>>> (vld1q_p16_x3): Likewise. >>>>>> (vld1q_p64_x3): Likewise. >>>>>> (vst1_s64_x2): Likewise. >>>>>> (vst1_u64_x2): Likewise. >>>>>> (vst1_f64_x2): >>>>>> Likewise.patchurl=http://people.linaro.org/~christophe.lyon/armv8_2-fp16-scalar-2.patch3 >>> >>> patchname=armv8_2-fp16-scalar-2.patch3 >>> refrev=259064 >>> email_to=christophe.l...@linaro.org >>> >>>>>> (vst1_s8_x2): Likewise. >>>>>> (vst1_p8_x2): Likewise. >>>>>> (vst1_s16_x2): Likewise. >>>>>> (vst1_p16_x2): Lik
Re: [AARCH64] Neon vld1_*_x3, vst1_*_x2 and vst1_*_x3 intrinsics
On 11 April 2018 at 15:53, Sudakshina Das wrote: > Hi Sameera > > > On 11/04/18 09:04, Sameera Deshpande wrote: >> >> On 10 April 2018 at 20:07, Sudakshina Das wrote: >>> >>> Hi Sameera >>> >>> >>> On 10/04/18 11:20, Sameera Deshpande wrote: >>>> >>>> >>>> On 7 April 2018 at 01:25, Christophe Lyon >>>> wrote: >>>>> >>>>> >>>>> Hi, >>>>> >>>>> 2018-04-06 12:15 GMT+02:00 Sameera Deshpande >>>>> : >>>>>> >>>>>> >>>>>> Hi Christophe, >>>>>> >>>>>> Please find attached the updated patch with testcases. >>>>>> >>>>>> Ok for trunk? >>>>> >>>>> >>>>> >>>>> Thanks for the update. >>>>> >>>>> Since the new intrinsics are only available on aarch64, you want to >>>>> prevent the tests from running on arm. >>>>> Indeed gcc.target/aarch64/advsimd-intrinsics/ is shared between the two >>>>> targets. >>>>> There are several examples on how to do that in that directory. >>>>> >>>>> I have also noticed that the tests fail at execution on aarch64_be. >>>>> >>>>> I didn't look at the patch in details. >>>>> >>>>> Christophe >>>>> >>>>> >>>>>> >>>>>> - Thanks and regards, >>>>>> Sameera D. >>>>>> >>>>>> 2017-12-14 22:17 GMT+05:30 Christophe Lyon >>>>>> : >>>>>>> >>>>>>> >>>>>>> 2017-12-14 9:29 GMT+01:00 Sameera Deshpande >>>>>>> : >>>>>>>> >>>>>>>> >>>>>>>> Hi! >>>>>>>> >>>>>>>> Please find attached the patch implementing vld1_*_x3, vst1_*_x2 and >>>>>>>> vst1_*_x3 intrinsics as defined by Neon document. >>>>>>>> >>>>>>>> Ok for trunk? >>>>>>>> >>>>>>>> - Thanks and regards, >>>>>>>> Sameera D. >>>>>>>> >>>>>>>> gcc/Changelog: >>>>>>>> >>>>>>>> 2017-11-14 Sameera Deshpande >>>>>>>> >>>>>>>> >>>>>>>> * config/aarch64/aarch64-simd-builtins.def (ld1x3): New. >>>>>>>> (st1x2): Likewise. >>>>>>>> (st1x3): Likewise. >>>>>>>> * config/aarch64/aarch64-simd.md >>>>>>>> (aarch64_ld1x3): New pattern. 
>>>>>>>> (aarch64_ld1_x3_): Likewise >>>>>>>> (aarch64_st1x2): Likewise >>>>>>>> (aarch64_st1_x2_): Likewise >>>>>>>> (aarch64_st1x3): Likewise >>>>>>>> (aarch64_st1_x3_): Likewise >>>>>>>> * config/aarch64/arm_neon.h (vld1_u8_x3): New function. >>>>>>>> (vld1_s8_x3): Likewise. >>>>>>>> (vld1_u16_x3): Likewise. >>>>>>>> (vld1_s16_x3): Likewise. >>>>>>>> (vld1_u32_x3): Likewise. >>>>>>>> (vld1_s32_x3): Likewise. >>>>>>>> (vld1_u64_x3): Likewise. >>>>>>>> (vld1_s64_x3): Likewise. >>>>>>>> (vld1_fp16_x3): Likewise. >>>>>>>> (vld1_f32_x3): Likewise. >>>>>>>> (vld1_f64_x3): Likewise. >>>>>>>> (vld1_p8_x3): Likewise. >>>>>>>> (vld1_p16_x3): Likewise. >>>>>>>> (vld1_p64_x3): Likewise. >>>>>>>> (vld1q_u8_x3): Likewise. >>>>>>>> (vld1q_s8_x3): Likewise. >>>>>>>> (vld1q_u16_x3): Likewise. >>>>>>>> (vld1q_s16_x3): Likewise. >>>>>>>> (vld1q_u32_x3): Likewise. >>>>>>>> (vld1q_s32_x3): Likewise. >>>>>>>> (vld1q_u64_x3): Likewis
Re: [AARCH64] Neon vld1_*_x3, vst1_*_x2 and vst1_*_x3 intrinsics
On Fri 13 Apr, 2018, 8:04 PM James Greenhalgh, wrote: > On Fri, Apr 06, 2018 at 08:55:47PM +0100, Christophe Lyon wrote: > > Hi, > > > > 2018-04-06 12:15 GMT+02:00 Sameera Deshpande < > sameera.deshpa...@linaro.org>: > > > Hi Christophe, > > > > > > Please find attached the updated patch with testcases. > > > > > > Ok for trunk? > > > > Thanks for the update. > > > > Since the new intrinsics are only available on aarch64, you want to > > prevent the tests from running on arm. > > Indeed gcc.target/aarch64/advsimd-intrinsics/ is shared between the two > targets. > > There are several examples on how to do that in that directory. > > > > I have also noticed that the tests fail at execution on aarch64_be. > > I think this is important to fix. We don't want the big-endian target to > have > failing implementations of the Neon intrinsics. What is the nature of the > failure? > > From what I can see, nothing in the patch prevents using these intrinsics > on big-endian, so either the intrinsics behaviour is wrong (we have a wrong > code bug), or the testcase expected behaviour is wrong. > > I don't think disabling the test for big-endian is the right fix. We should > either fix the intrinsics, or fix the testcase. > > Thanks, > James > > Hi James, As the tests assume the little endian order of elements while checking the results, the tests are failing for big endian targets. So, the failures are not because of intrinsic implementations, but because of the testcase. - Thanks and regards, Sameera D.
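The endianness issue described above can be illustrated without any Neon intrinsics at all. The sketch below (illustrative only, not part of the patch or testsuite) shows the classic host-order probe: any result check that inspects element storage through a byte-order-dependent view bakes in this distinction, whereas round-tripping through vst1 into an array of the element type, as discussed later in the thread, does not.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one.  A Neon
   testcase that compares raw bytes of a vector against values laid out
   in little-endian order implicitly depends on this predicate being 1,
   and therefore fails on aarch64_be even when the intrinsic is correct.  */
int host_is_little_endian (void)
{
  uint32_t word = 1;
  uint8_t first_byte;
  memcpy (&first_byte, &word, 1);  /* read the lowest-addressed byte */
  return first_byte == 1;
}
```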
[Patch, regrename] Fix PR87330 : ICE in scan_rtx_reg, at regrename.c
Hi! Please find attached the patch fixing PR87330: ICE in scan_rtx_reg, at regrename.c:1097. The regrename pass does not rename registers appearing in notes, so REG_DEAD notes kept the previous register names, which caused conflicting liveness information to be generated for the tag-collision pass. It is better to do this in regrename_do_replace rather than in regrename_analyze, because the note information does not really contribute to the regrename analysis and hence need not be added to the def-use chains that are computed. regrename_do_replace is where the decision to finally rename the register is made, so that is where the note can be updated with the new register name. Other notes need not be changed, as they do not hold renamed register information. Ok for trunk? Changelog: 2018-10-09 Sameera Deshpande diff --git a/gcc/regrename.c b/gcc/regrename.c index 8424093..a3446a2 100644 --- a/gcc/regrename.c +++ b/gcc/regrename.c @@ -970,6 +970,7 @@ regrename_do_replace (struct du_head *head, int reg) unsigned int regno = ORIGINAL_REGNO (*chain->loc); struct reg_attrs *attr = REG_ATTRS (*chain->loc); int reg_ptr = REG_POINTER (*chain->loc); + rtx note; if (DEBUG_INSN_P (chain->insn) && REGNO (*chain->loc) != base_regno) validate_change (chain->insn, &(INSN_VAR_LOCATION_LOC (chain->insn)), @@ -986,6 +987,11 @@ regrename_do_replace (struct du_head *head, int reg) last_reg = *chain->loc; } validate_change (chain->insn, chain->loc, last_repl, true); + note = find_regno_note (chain->insn, REG_DEAD, base_regno); + if (note != 0) + { + validate_change (chain->insn, &XEXP (note, 0), last_repl, true); + } } }
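The bug class fixed by the patch can be modelled in a few lines. This is a toy model with hypothetical names, not GCC's real RTL structures: the point is that a rename which rewrites an insn's operands but skips the attached REG_DEAD note leaves the note referring to the old register, so a later liveness consumer sees contradictory information.

```c
/* Toy stand-in for an insn: one written register plus an optional
   "register dies here" note (-1 means no note).  */
struct toy_insn {
  int dest;        /* register written by the insn */
  int dead_note;   /* register recorded as dying here, or -1 */
};

/* Rename OLD_REG to NEW_REG, mirroring the fix: the note is updated
   alongside the operand instead of being left stale.  */
void rename_reg (struct toy_insn *insn, int old_reg, int new_reg)
{
  if (insn->dest == old_reg)
    insn->dest = new_reg;
  if (insn->dead_note == old_reg)   /* the step the pass was missing */
    insn->dead_note = new_reg;
}
```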
Re: [AARCH64] Neon vld1_*_x3, vst1_*_x2 and vst1_*_x3 intrinsics
On 13 April 2018 at 20:21, James Greenhalgh wrote: > On Fri, Apr 13, 2018 at 03:39:32PM +0100, Sameera Deshpande wrote: >> On Fri 13 Apr, 2018, 8:04 PM James Greenhalgh, >> mailto:james.greenha...@arm.com>> wrote: >> On Fri, Apr 06, 2018 at 08:55:47PM +0100, Christophe Lyon wrote: >> > Hi, >> > >> > 2018-04-06 12:15 GMT+02:00 Sameera Deshpande >> > mailto:sameera.deshpa...@linaro.org>>: >> > > Hi Christophe, >> > > >> > > Please find attached the updated patch with testcases. >> > > >> > > Ok for trunk? >> > >> > Thanks for the update. >> > >> > Since the new intrinsics are only available on aarch64, you want to >> > prevent the tests from running on arm. >> > Indeed gcc.target/aarch64/advsimd-intrinsics/ is shared between the two >> > targets. >> > There are several examples on how to do that in that directory. >> > >> > I have also noticed that the tests fail at execution on aarch64_be. >> >> I think this is important to fix. We don't want the big-endian target to have >> failing implementations of the Neon intrinsics. What is the nature of the >> failure? >> >> From what I can see, nothing in the patch prevents using these intrinsics >> on big-endian, so either the intrinsics behaviour is wrong (we have a wrong >> code bug), or the testcase expected behaviour is wrong. >> >> I don't think disabling the test for big-endian is the right fix. We should >> either fix the intrinsics, or fix the testcase. >> >> Thanks, >> James >> >> Hi James, >> >> As the tests assume the little endian order of elements while checking the >> results, the tests are failing for big endian targets. So, the failures are >> not because of intrinsic implementations, but because of the testcase. > > The testcase is a little hard to follow through the macros, but why would > this be the case? > > ld1 is deterministic on big and little endian for which elements will be > loaded from memory, as is st1. 
> > My expectation would be that: > > int __attribute__ ((noinline)) > test_vld_u16_x3 () > { > uint16_t data[3 * 3]; > uint16_t temp[3 * 3]; > uint16x4x3_t vectors; > int i,j; > for (i = 0; i < 3 * 3; i++) > data [i] = (uint16_t) 3*i; > asm volatile ("" : : : "memory"); > vectors = vld1_u16_x3 (data); > vst1_u16 (temp, vectors.val[0]); > vst1_u16 (&temp[3], vectors.val[1]); > vst1_u16 (&temp[3 * 2], vectors.val[2]); > asm volatile ("" : : : "memory"); > for (j = 0; j < 3 * 3; j++) > if (temp[j] != data[j]) > return 1; > return 0; > } > > would work equally well for big- or little-endian. > > I think this is more likely to be an intrinsics implementation bug. > > Thanks, > James > Hi James, Please find attached the updated patch, which now passes for little as well as big endian. Ok for trunk? -- - Thanks and regards, Sameera D. gcc/Changelog: 2018-05-01 Sameera Deshpande * config/aarch64/aarch64-simd-builtins.def (ld1x3): New. (st1x2): Likewise. (st1x3): Likewise. * config/aarch64/aarch64-simd.md (aarch64_ld1x3): New pattern. (aarch64_ld1_x3_): Likewise (aarch64_st1x2): Likewise (aarch64_st1_x2_): Likewise (aarch64_st1x3): Likewise (aarch64_st1_x3_): Likewise * config/aarch64/arm_neon.h (vld1_u8_x3): New function. (vld1_s8_x3): Likewise. (vld1_u16_x3): Likewise. (vld1_s16_x3): Likewise. (vld1_u32_x3): Likewise. (vld1_s32_x3): Likewise. (vld1_u64_x3): Likewise. (vld1_s64_x3): Likewise. (vld1_f16_x3): Likewise. (vld1_f32_x3): Likewise. (vld1_f64_x3): Likewise. (vld1_p8_x3): Likewise. (vld1_p16_x3): Likewise. (vld1_p64_x3): Likewise. (vld1q_u8_x3): Likewise. (vld1q_s8_x3): Likewise. (vld1q_u16_x3): Likewise. (vld1q_s16_x3): Likewise. (vld1q_u32_x3): Likewise. (vld1q_s32_x3): Likewise. (vld1q_u64_x3): Likewise. (vld1q_s64_x3): Likewise. (vld1q_f16_x3): Likewise. (vld1q_f32_x3): Likewise. (vld1q_f64_x3): Likewise. (vld1q_p8_x3): Likewise. (vld1q_p16_x3): Likewise. (vld1q_p64_x3): Likewise. (vst1_s64_x2): Likewise. (vst1_u64_x2): Likewise. (vst1_f64_x2): Likewise. 
(vst1_s8_x2): Likewise. (vst1_p8_x2): Likewise. (vst1_s16_x2): Likewise. (vst1_p16_x2): Likewise. (vst1_s32
Re: [AARCH64] Neon vld1_*_x3, vst1_*_x2 and vst1_*_x3 intrinsics
On 1 May 2018 at 05:05, Sameera Deshpande wrote: > On 13 April 2018 at 20:21, James Greenhalgh wrote: >> On Fri, Apr 13, 2018 at 03:39:32PM +0100, Sameera Deshpande wrote: >>> On Fri 13 Apr, 2018, 8:04 PM James Greenhalgh, >>> mailto:james.greenha...@arm.com>> wrote: >>> On Fri, Apr 06, 2018 at 08:55:47PM +0100, Christophe Lyon wrote: >>> > Hi, >>> > >>> > 2018-04-06 12:15 GMT+02:00 Sameera Deshpande >>> > mailto:sameera.deshpa...@linaro.org>>: >>> > > Hi Christophe, >>> > > >>> > > Please find attached the updated patch with testcases. >>> > > >>> > > Ok for trunk? >>> > >>> > Thanks for the update. >>> > >>> > Since the new intrinsics are only available on aarch64, you want to >>> > prevent the tests from running on arm. >>> > Indeed gcc.target/aarch64/advsimd-intrinsics/ is shared between the two >>> > targets. >>> > There are several examples on how to do that in that directory. >>> > >>> > I have also noticed that the tests fail at execution on aarch64_be. >>> >>> I think this is important to fix. We don't want the big-endian target to >>> have >>> failing implementations of the Neon intrinsics. What is the nature of the >>> failure? >>> >>> From what I can see, nothing in the patch prevents using these intrinsics >>> on big-endian, so either the intrinsics behaviour is wrong (we have a wrong >>> code bug), or the testcase expected behaviour is wrong. >>> >>> I don't think disabling the test for big-endian is the right fix. We should >>> either fix the intrinsics, or fix the testcase. >>> >>> Thanks, >>> James >>> >>> Hi James, >>> >>> As the tests assume the little endian order of elements while checking the >>> results, the tests are failing for big endian targets. So, the failures are >>> not because of intrinsic implementations, but because of the testcase. >> >> The testcase is a little hard to follow through the macros, but why would >> this be the case? 
>> >> ld1 is deterministic on big and little endian for which elements will be >> loaded from memory, as is st1. >> >> My expectation would be that: >> >> int __attribute__ ((noinline)) >> test_vld_u16_x3 () >> { >> uint16_t data[3 * 3]; >> uint16_t temp[3 * 3]; >> uint16x4x3_t vectors; >> int i,j; >> for (i = 0; i < 3 * 3; i++) >> data [i] = (uint16_t) 3*i; >> asm volatile ("" : : : "memory"); >> vectors = vld1_u16_x3 (data); >> vst1_u16 (temp, vectors.val[0]); >> vst1_u16 (&temp[3], vectors.val[1]); >> vst1_u16 (&temp[3 * 2], vectors.val[2]); >> asm volatile ("" : : : "memory"); >> for (j = 0; j < 3 * 3; j++) >> if (temp[j] != data[j]) >> return 1; >> return 0; >> } >> >> would work equally well for big- or little-endian. >> >> I think this is more likely to be an intrinsics implementation bug. >> >> Thanks, >> James >> > > Hi James, > > Please find attached the updated patch, which now passes for little as > well as big endian. > Ok for trunk? > > -- > - Thanks and regards, > Sameera D. > > gcc/Changelog: > > 2018-05-01 Sameera Deshpande > > > * config/aarch64/aarch64-simd-builtins.def (ld1x3): New. > (st1x2): Likewise. > (st1x3): Likewise. > * config/aarch64/aarch64-simd.md > (aarch64_ld1x3): New pattern. > (aarch64_ld1_x3_): Likewise > (aarch64_st1x2): Likewise > (aarch64_st1_x2_): Likewise > (aarch64_st1x3): Likewise > (aarch64_st1_x3_): Likewise > * config/aarch64/arm_neon.h (vld1_u8_x3): New function. > (vld1_s8_x3): Likewise. > (vld1_u16_x3): Likewise. > (vld1_s16_x3): Likewise. > (vld1_u32_x3): Likewise. > (vld1_s32_x3): Likewise. > (vld1_u64_x3): Likewise. > (vld1_s64_x3): Likewise. > (vld1_f16_x3): Likewise. > (vld1_f32_x3): Likewise. > (vld1_f64_x3): Likewise. > (vld1_p8_x3): Likewise. > (vld1_p16_x3): Likewise. > (vld1_p64_x3): Likewise. > (vld1q_u8_x3): Likewise. > (vld1q_s8_x3): Likewise. > (vld1q_u16_x3): Likewise. > (vld1q_
Re: [AARCH64] Neon vld1_*_x3, vst1_*_x2 and vst1_*_x3 intrinsics
On Tue 22 May, 2018, 9:26 PM James Greenhalgh, wrote: > On Mon, Apr 30, 2018 at 06:35:11PM -0500, Sameera Deshpande wrote: > > On 13 April 2018 at 20:21, James Greenhalgh > wrote: > > > On Fri, Apr 13, 2018 at 03:39:32PM +0100, Sameera Deshpande wrote: > > >> On Fri 13 Apr, 2018, 8:04 PM James Greenhalgh, < > james.greenha...@arm.com<mailto:james.greenha...@arm.com>> wrote: > > >> On Fri, Apr 06, 2018 at 08:55:47PM +0100, Christophe Lyon wrote: > > >> > Hi, > > >> > > > >> > 2018-04-06 12:15 GMT+02:00 Sameera Deshpande < > sameera.deshpa...@linaro.org<mailto:sameera.deshpa...@linaro.org>>: > > >> > > Hi Christophe, > > >> > > > > >> > > Please find attached the updated patch with testcases. > > >> > > > > >> > > Ok for trunk? > > >> > > > >> > Thanks for the update. > > >> > > > >> > Since the new intrinsics are only available on aarch64, you want to > > >> > prevent the tests from running on arm. > > >> > Indeed gcc.target/aarch64/advsimd-intrinsics/ is shared between the > two targets. > > >> > There are several examples on how to do that in that directory. > > >> > > > >> > I have also noticed that the tests fail at execution on aarch64_be. > > >> > > >> I think this is important to fix. We don't want the big-endian target > to have > > >> failing implementations of the Neon intrinsics. What is the nature of > the > > >> failure? > > >> > > >> From what I can see, nothing in the patch prevents using these > intrinsics > > >> on big-endian, so either the intrinsics behaviour is wrong (we have a > wrong > > >> code bug), or the testcase expected behaviour is wrong. > > >> > > >> I don't think disabling the test for big-endian is the right fix. We > should > > >> either fix the intrinsics, or fix the testcase. > > >> > > >> Thanks, > > >> James > > >> > > >> Hi James, > > >> > > >> As the tests assume the little endian order of elements while > checking the > > >> results, the tests are failing for big endian targets. 
So, the > failures are > > >> not because of intrinsic implementations, but because of the testcase. > > > > > > The testcase is a little hard to follow through the macros, but why > would > > > this be the case? > > > > > > ld1 is deterministic on big and little endian for which elements will > be > > > loaded from memory, as is st1. > > > > > > My expectation would be that: > > > > > > int __attribute__ ((noinline)) > > > test_vld_u16_x3 () > > > { > > > uint16_t data[3 * 3]; > > > uint16_t temp[3 * 3]; > > > uint16x4x3_t vectors; > > > int i,j; > > > for (i = 0; i < 3 * 3; i++) > > > data [i] = (uint16_t) 3*i; > > > asm volatile ("" : : : "memory"); > > > vectors = vld1_u16_x3 (data); > > > vst1_u16 (temp, vectors.val[0]); > > > vst1_u16 (&temp[3], vectors.val[1]); > > > vst1_u16 (&temp[3 * 2], vectors.val[2]); > > > asm volatile ("" : : : "memory"); > > > for (j = 0; j < 3 * 3; j++) > > > if (temp[j] != data[j]) > > > return 1; > > > return 0; > > > } > > > > > > would work equally well for big- or little-endian. > > > > > > I think this is more likely to be an intrinsics implementation bug. > > > > > > Thanks, > > > James > > > > > > > Hi James, > > > > Please find attached the updated patch, which now passes for little as > > well as big endian. > > Ok for trunk? > > > OK. > > Thanks, > James > > > > > -- > > - Thanks and regards, > > Sameera D. > > > > gcc/Changelog: > > > > 2018-05-01 Sameera Deshpande > > > > > > * config/aarch64/aarch64-simd-builtins.def (ld1x3): New. > > (st1x2): Likewise. > > (st1x3): Likewise. > > * config/aarch64/aarch64-simd.md > > (aarch64_ld1x3): New pattern. > > (aarch64_ld1_x3_): Likewise > > (aarch64_st1x2): Likewise > > (aarch64_st1_x2_): Likewise &
[PATCH][MIPS] Enable load-load/store-store bonding
Hi Richard, Please find attached the patch implementing the load-load/store-store bonding supported by the P5600. In the P5600, two consecutive loads/stores of the same type which access contiguous memory locations are bonded together by the instruction issue unit to dispatch a single load/store instruction which accesses both locations. This allows up to 2X improvement in memory-intensive code. This optimization can be performed for the LH, SH, LW, SW, LWC, SWC, LDC and SDC instructions. This patch adds peephole2 patterns to identify such loads/stores and put them in a parallel, so that the scheduler will not split them - thereby guaranteeing h/w-level load/store bonding. The patch is tested with dejagnu for correctness. Local testing on hardware for performance is currently going on. Ok for trunk? Changelog: gcc/ * config/mips/mips.md (JOINLDST1): New mode iterator. (insn_type): New mode attribute. (reg): Update mode attribute. (join2_load_store): New pattern. (join2_loadhi): Likewise. (join2_storehi): Likewise. (define_peephole2): Add peephole2 patterns to join 2 HI/SI/SF/DF-mode load-loads and store-stores. * config/mips/mips.opt (mld-st-pairing): New option. * config/mips/mips.c (mips_option_override): New exception. * config/mips/mips.h (ENABLE_LD_ST_PAIRING): New macro. - Thanks and regards, Sameera D. load-store-pairing.patch Description: load-store-pairing.patch
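For context, a hypothetical C fragment of the kind the peephole2 targets: two SImode loads from contiguous offsets off the same base register. After reload these become candidates for fusing into one parallel pattern, which keeps them adjacent so the P5600 issue unit can dispatch them as a bonded pair. The assembly comments are illustrative of the intent, not compiler output.

```c
#include <stdint.h>

/* Two word loads at offsets 0 and 4 from the same base pointer:
   exactly the "same base register, contiguous offsets" shape that
   load-load bonding requires.  */
int32_t sum_pair (const int32_t *p)
{
  int32_t lo = p[0];   /* e.g. lw $t0, 0($a0) */
  int32_t hi = p[1];   /* e.g. lw $t1, 4($a0) */
  return lo + hi;
}
```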
RE: [PATCH][MIPS] Enable load-load/store-store bonding
Hi Richard, Thanks for your comments. I am working on the review comments, and will share the reworked patch soon. However, here is a clarification of some of the issues raised. > > + if (TARGET_FIX_24K && TUNE_P5600) > > +error ("unsupported combination: %s", "-mtune=p5600 -mfix-24k"); > > + > >/* Save the base compression state and process flags as though we > > were generating uncompressed code. */ > >mips_base_compression_flags = TARGET_COMPRESSION; > > Although it's a bit of an odd combination, we need to accept -mfix-24k -mtune=p5600 and continue to implement the 24k workarounds. > The idea is that a distributor can build for a common base architecture, add -mfix- options for processors that might run the code, and add -mtune= for > the processor that's most of interest optimisation-wise. > > We should just make the pairing of stores conditional on !TARGET_FIX_24K. We had an offline discussion based on your comment, and there is an additional consideration. Only the mips32r2, mips32r3 and mips32r5 ISAs support the P5600; the remaining ISAs do not. For mips32r2 (24K) and mips32r3 (microMIPS), load-store pairing is implemented separately, and hence, as you suggested, the P5600 load-store bonding optimization should not be enabled for them. So, is it fine if I emit an error for any ISA other than mips32r2, mips32r3 and mips32r5 when the P5600 is enabled, or should the compilation continue by emitting a warning and disabling the P5600? Also, the optimization will be enabled only if !TARGET_FIX_24K && !TARGET_MICROMIPS, as suggested by you.
Performance testing for this patch is not yet done. If the patch proves beneficial in most of the testcases (which we believe it will on P5600), we will enable this optimization by default for P5600 - in which case this option can be removed. > > Are QImodes not paired in the same way? If so, it'd be worth adding a > comment above the define_mode_iterator saying that QI is deliberately > excluded. The P5600 datasheet mentions bonding of loads/stores in HI, SI, SF and DF modes only. Hence, QImode is excluded. I will add the comment on the iterator. - Thanks and regards, Sameera D.
RE: [PATCH][MIPS] Enable load-load/store-store bonding
dingly. In order to allow d/f for both register classes, the pattern join2_load_store was altered a bit, which eliminated this mode iterator. > > Outer (parallel ...)s are redundant in a define_insn. Removed. > > It would be better to add the mips_load_store_insns for each operand > rather than multiplying one of them by 2. Or see the next bit for an > alternative. Using the alternative method as you suggested, so this change is not needed. > Please instead add HI to the define_mode_iterator so that we can use the > same peephole and define_insn. Added HI to the mode iterator to eliminate the join2_storehi pattern and the corresponding peephole2. As arithmetic operations on HImode are not supported, we generate zero- or sign-extended loads in such cases. To handle that case, the join2_loadhi pattern is kept. - Thanks and regards, Sameera D. load-store-pairing.patch
[AArch64] Add Saphira pipeline description.
Hi! Please find attached the patch to add a pipeline description for the Qualcomm Saphira core. It is tested with a bootstrap and make check, with no regressions. Ok for trunk? gcc/ Changelog: 2018-10-26 Sameera Deshpande * config/aarch64/aarch64-cores.def (saphira): Use saphira pipeline. * config/aarch64/aarch64.md: Include saphira.md * config/aarch64/saphira.md: New file for pipeline description. -- - Thanks and regards, Sameera D. diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 3d876b8..8e4c646 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -90,7 +90,7 @@ AARCH64_CORE("cortex-a76", cortexa76, cortexa57, 8_2A, AARCH64_FL_FOR_ARCH8_2 /* ARMv8.4-A Architecture Processors. */ /* Qualcomm ('Q') cores. */ -AARCH64_CORE("saphira", saphira,falkor,8_4A, AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_CRYPTO | AARCH64_FL_RCPC, saphira, 0x51, 0xC01, -1) +AARCH64_CORE("saphira", saphira,saphira,8_4A, AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_CRYPTO | AARCH64_FL_RCPC, saphira, 0x51, 0xC01, -1) /* ARMv8-A big.LITTLE implementations. */ diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index a014a01..f951354 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -298,6 +298,7 @@ (include "../arm/cortex-a57.md") (include "../arm/exynos-m1.md") (include "falkor.md") +(include "saphira.md") (include "thunderx.md") (include "../arm/xgene1.md") (include "thunderx2t99.md") diff --git a/gcc/config/aarch64/saphira.md b/gcc/config/aarch64/saphira.md new file mode 100644 index 000..bbf1c5c --- /dev/null +++ b/gcc/config/aarch64/saphira.md @@ -0,0 +1,583 @@ +;; Saphira pipeline description +;; Copyright (C) 2017-2018 Free Software Foundation, Inc. +;; +;; This file is part of GCC. 
+;; +;; GCC is free software; you can redistribute it and/or modify it +;; under the terms of the GNU General Public License as published by +;; the Free Software Foundation; either version 3, or (at your option) +;; any later version. +;; +;; GCC is distributed in the hope that it will be useful, but +;; WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +;; General Public License for more details. +;; +;; You should have received a copy of the GNU General Public License +;; along with GCC; see the file COPYING3. If not see +;; <http://www.gnu.org/licenses/>. + +(define_automaton "saphira") + +;; Complex int instructions (e.g. multiply and divide) execute in the X +;; pipeline. Simple int instructions execute in the X, Y, Z and B pipelines. + +(define_cpu_unit "saphira_x" "saphira") +(define_cpu_unit "saphira_y" "saphira") + +;; Branches execute in the Z or B pipeline or in one of the int pipelines depending +;; on how complex it is. Simple int insns (like movz) can also execute here. + +(define_cpu_unit "saphira_z" "saphira") +(define_cpu_unit "saphira_b" "saphira") + +;; Vector and FP insns execute in the VX and VY pipelines. + +(define_automaton "saphira_vfp") + +(define_cpu_unit "saphira_vx" "saphira_vfp") +(define_cpu_unit "saphira_vy" "saphira_vfp") + +;; Loads execute in the LD pipeline. +;; Stores execute in the ST pipeline, for address, data, and +;; vector data. + +(define_automaton "saphira_mem") + +(define_cpu_unit "saphira_ld" "saphira_mem") +(define_cpu_unit "saphira_st" "saphira_mem") + +;; The GTOV and VTOG pipelines are for general to vector reg moves, and vice +;; versa. + +(define_cpu_unit "saphira_gtov" "saphira") +(define_cpu_unit "saphira_vtog" "saphira") + +;; Common reservation combinations. 
+ +(define_reservation "saphira_vxvy" "saphira_vx|saphira_vy") +(define_reservation "saphira_zb" "saphira_z|saphira_b") +(define_reservation "saphira_xyzb" "saphira_x|saphira_y|saphira_z|saphira_b") + +;; SIMD Floating-Point Instructions + +(define_insn_reservation "saphira_afp_1_vxvy" 1 + (and (eq_attr "tune" "saphira") + (eq_attr "type" "neon_fp_neg_s,neon_fp_neg_d,neon_fp_abs_s,neon_fp_abs_d,neon_fp_neg_s_q,neon_fp_neg_d_q,neon_fp_abs_s_q,neon_fp_abs_d_q")) + "saphira_vxvy") + +(define_insn_reservation "saphira_afp_2_vxvy" 2 + (and (eq_attr "tune" "saphira") + (eq_attr "type" "neon_fp_minmax_s,neon_fp_minmax_d,neon_fp_reduc_minmax_s,neon_fp_reduc_minmax_d,neon_fp_compare_s,neon_fp_compare_d,neon_fp_round_s,neon_fp_round_d,neon_fp_minmax_s_q,ne
Re: [Patch, regrename] Fix PR87330 : ICE in scan_rtx_reg, at regrename.c
On Tue, 9 Oct 2018 at 04:08, Eric Botcazou wrote: > > > Other notes need not be changed, as they don't hold renamed register > > information. > > > > Ok for trunk? > > No, REG_DEAD & REG_UNUSED note must be recomputed by passes consuming them. > > > 2018-10-09 Sameera Deshpande > > > * gcc/regrename.c (regrename_do_replace): Add condition to alter > > regname if note has same register marked dead in notes. > > No gcc/ prefix in gcc/ChangeLog. > > -- > Eric Botcazou Hi Eric, Thanks for your comments. Please find attached the updated patch, which invokes the data-flow note problem so that the REG_DEAD and REG_UNUSED notes are recomputed. As this change is made in a falkor-specific file, adding James and Richard for review. Ok for trunk? Changelog: 2018-10-30 Sameera Deshpande * config/aarch64/falkor-tag-collision-avoidance.c (execute_tag_collision_avoidance): Invoke df_note_add_problem to recompute REG_DEAD and REG_UNUSED notes before analysis. diff --git a/gcc/config/aarch64/falkor-tag-collision-avoidance.c b/gcc/config/aarch64/falkor-tag-collision-avoidance.c index fb6568f..4ca9d66 100644 --- a/gcc/config/aarch64/falkor-tag-collision-avoidance.c +++ b/gcc/config/aarch64/falkor-tag-collision-avoidance.c @@ -805,6 +805,7 @@ execute_tag_collision_avoidance () df_set_flags (DF_RD_PRUNE_DEAD_DEFS); df_chain_add_problem (DF_UD_CHAIN); df_compute_regs_ever_live (true); + df_note_add_problem (); df_analyze (); df_set_flags (DF_DEFER_INSN_RESCAN);
Re: [AArch64] Add Saphira pipeline description.
On Fri, 26 Oct 2018 at 13:33, Sameera Deshpande wrote: > > Hi! > > Please find attached the patch to add a pipeline description for the > Qualcomm Saphira core. It is tested with a bootstrap and make check, > with no regressions. > > Ok for trunk? > > gcc/ > Changelog: > > 2018-10-26 Sameera Deshpande > > * config/aarch64/aarch64-cores.def (saphira): Use saphira pipeline. > * config/aarch64/aarch64.md: Include saphira.md > * config/aarch64/saphira.md: New file for pipeline description. > > -- > - Thanks and regards, > Sameera D. Hi! Please find attached updated patch. Bootstrap and make check passed without regression. Ok for trunk? -- - Thanks and regards, Sameera D. diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def index 3d876b8..8e4c646 100644 --- a/gcc/config/aarch64/aarch64-cores.def +++ b/gcc/config/aarch64/aarch64-cores.def @@ -90,7 +90,7 @@ AARCH64_CORE("cortex-a76", cortexa76, cortexa57, 8_2A, AARCH64_FL_FOR_ARCH8_2 /* ARMv8.4-A Architecture Processors. */ /* Qualcomm ('Q') cores. */ -AARCH64_CORE("saphira", saphira,falkor,8_4A, AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_CRYPTO | AARCH64_FL_RCPC, saphira, 0x51, 0xC01, -1) +AARCH64_CORE("saphira", saphira,saphira,8_4A, AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_CRYPTO | AARCH64_FL_RCPC, saphira, 0x51, 0xC01, -1) /* ARMv8-A big.LITTLE implementations. 
*/ diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index a014a01..f951354 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -298,6 +298,7 @@ (include "../arm/cortex-a57.md") (include "../arm/exynos-m1.md") (include "falkor.md") +(include "saphira.md") (include "thunderx.md") (include "../arm/xgene1.md") (include "thunderx2t99.md") diff --git a/gcc/config/aarch64/saphira.md b/gcc/config/aarch64/saphira.md new file mode 100644 index 000..bbf1c5c --- /dev/null +++ b/gcc/config/aarch64/saphira.md @@ -0,0 +1,583 @@ +;; Saphira pipeline description +;; Copyright (C) 2017-2018 Free Software Foundation, Inc. +;; +;; This file is part of GCC. +;; +;; GCC is free software; you can redistribute it and/or modify it +;; under the terms of the GNU General Public License as published by +;; the Free Software Foundation; either version 3, or (at your option) +;; any later version. +;; +;; GCC is distributed in the hope that it will be useful, but +;; WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +;; General Public License for more details. +;; +;; You should have received a copy of the GNU General Public License +;; along with GCC; see the file COPYING3. If not see +;; <http://www.gnu.org/licenses/>. + +(define_automaton "saphira") + +;; Complex int instructions (e.g. multiply and divide) execute in the X +;; pipeline. Simple int instructions execute in the X, Y, Z and B pipelines. + +(define_cpu_unit "saphira_x" "saphira") +(define_cpu_unit "saphira_y" "saphira") + +;; Branches execute in the Z or B pipeline or in one of the int pipelines depending +;; on how complex it is. Simple int insns (like movz) can also execute here. + +(define_cpu_unit "saphira_z" "saphira") +(define_cpu_unit "saphira_b" "saphira") + +;; Vector and FP insns execute in the VX and VY pipelines. 
+ +(define_automaton "saphira_vfp") + +(define_cpu_unit "saphira_vx" "saphira_vfp") +(define_cpu_unit "saphira_vy" "saphira_vfp") + +;; Loads execute in the LD pipeline. +;; Stores execute in the ST pipeline, for address, data, and +;; vector data. + +(define_automaton "saphira_mem") + +(define_cpu_unit "saphira_ld" "saphira_mem") +(define_cpu_unit "saphira_st" "saphira_mem") + +;; The GTOV and VTOG pipelines are for general to vector reg moves, and vice +;; versa. + +(define_cpu_unit "saphira_gtov" "saphira") +(define_cpu_unit "saphira_vtog" "saphira") + +;; Common reservation combinations. + +(define_reservation "saphira_vxvy" "saphira_vx|saphira_vy") +(define_reservation "saphira_zb" "saphira_z|saphira_b") +(define_reservation "saphira_xyzb" "saphira_x|saphira_y|saphira_z|saphira_b") + +;; SIMD Floating-Point Instructions + +(define_insn_reservation "saphira_afp_1_vxvy" 1 + (and (eq_attr "tune" "saphira") + (eq_attr "type" "neon_fp_neg_s,neon_fp_neg_d,neon_fp_abs_s,neon_fp_abs_d,neon_fp_neg_s_q,neon_fp_neg_d_q,neon_fp_abs_s_q,neon_fp_abs_d_q")) + "saphira_vxvy") + +(define_insn_reservatio
Re: [Patch, regrename] Fix PR87330 : ICE in scan_rtx_reg, at regrename.c
On Tue, 30 Oct 2018 at 16:16, Richard Earnshaw (lists) wrote: > > On 30/10/2018 10:09, Sameera Deshpande wrote: > > On Tue, 9 Oct 2018 at 04:08, Eric Botcazou wrote: > >> > >>> Other notes need not be changed, as they don't hold renamed register > >>> information. > >>> > >>> Ok for trunk? > >> > >> No, REG_DEAD & REG_UNUSED note must be recomputed by passes consuming them. > >> > >>> 2018-10-09 Sameera Deshpande >>> > >>> * gcc/regrename.c (regrename_do_replace): Add condition to alter > >>> regname if note has same register marked dead in notes. > >> > >> No gcc/ prefix in gcc/ChangeLog. > >> > >> -- > >> Eric Botcazou > > > > Hi Eric, > > > > Thanks for your comments. > > > > Please find attached updated patch invoking data flow for updating the > > REG_DEAD and REG_UNUSED notes. > > > > As this change is made in falkor specific file, adding James and > > Richard for review. > > > > Ok for trunk? > > > > Changelog: > > > > 2018-10-30 Sameera Deshpande > > > * gcc/config/aarch64/falkor-tag-collision-avoidance.c > > (execute_tag_collision_avoidance): Invoke df_note_add_problem to > > recompute REG_DEAD and REG_UNUSED notes before analysis. > > > > 'Call df_note_add_problem.' is enough. > > OK with that change. > > R. > > > > > bug87330.patch > > > > diff --git a/gcc/config/aarch64/falkor-tag-collision-avoidance.c > > b/gcc/config/aarch64/falkor-tag-collision-avoidance.c > > index fb6568f..4ca9d66 100644 > > --- a/gcc/config/aarch64/falkor-tag-collision-avoidance.c > > +++ b/gcc/config/aarch64/falkor-tag-collision-avoidance.c > > @@ -805,6 +805,7 @@ execute_tag_collision_avoidance () > >df_set_flags (DF_RD_PRUNE_DEAD_DEFS); > >df_chain_add_problem (DF_UD_CHAIN); > >df_compute_regs_ever_live (true); > > + df_note_add_problem (); > >df_analyze (); > >df_set_flags (DF_DEFER_INSN_RESCAN); > > > > > Thanks Richard! Patch committed at revision 265618. -- - Thanks and regards, Sameera D.
Re: [AArch64] Add Saphira pipeline description.
On Wed, 31 Oct 2018 at 00:37, James Greenhalgh wrote: > > On Tue, Oct 30, 2018 at 05:12:58AM -0500, Sameera Deshpande wrote: > > On Fri, 26 Oct 2018 at 13:33, Sameera Deshpande > > wrote: > > > > > > Hi! > > > > > > Please find attached the patch to add a pipeline description for the > > > Qualcomm Saphira core. It is tested with a bootstrap and make check, > > > with no regressions. > > > > > > Ok for trunk? > > OK. > Thanks James, will commit the change. > I wonder if there's anything we can do to improve maintainability in these > cases where two pipeline models have considerable overlaps. > I agree that there is a need for some mechanism to maintain architectures which have many commonalities. However, Saphira and Falkor are too different to share much, and with further performance tuning for Saphira the differences will become more prominent. I will commit this patch as is for Saphira, and will look at possibly factoring the Saphira and Falkor pipelines into commonalities and differences once the tuning for Saphira is done. > Thanks, > James > > > > > > > gcc/ > > > Changelog: > > > > > > 2018-10-26 Sameera Deshpande > > > > > > * config/aarch64/aarch64-cores.def (saphira): Use saphira pipeline. > > > * config/aarch64/aarch64.md: Include saphira.md > > > * config/aarch64/saphira.md: New file for pipeline description. > > > > > > -- > > > - Thanks and regards, > > > Sameera D. > > > > Hi! > > > > Please find attached updated patch. > > Bootstrap and make check passed without regression. Ok for trunk? > > > > -- > > - Thanks and regards, > > Sameera D. > > > diff --git a/gcc/config/aarch64/aarch64-cores.def > > b/gcc/config/aarch64/aarch64-cores.def > > index 3d876b8..8e4c646 100644 > > --- a/gcc/config/aarch64/aarch64-cores.def > > +++ b/gcc/config/aarch64/aarch64-cores.def > > @@ -90,7 +90,7 @@ AARCH64_CORE("cortex-a76", cortexa76, cortexa57, 8_2A, > > AARCH64_FL_FOR_ARCH8_2 > > /* ARMv8.4-A Architecture Processors.
*/ > > > > /* Qualcomm ('Q') cores. */ > > -AARCH64_CORE("saphira", saphira,falkor,8_4A, > > AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_CRYPTO | AARCH64_FL_RCPC, saphira, > > 0x51, 0xC01, -1) > > +AARCH64_CORE("saphira", saphira,saphira,8_4A, > > AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_CRYPTO | AARCH64_FL_RCPC, saphira, > > 0x51, 0xC01, -1) > > > > /* ARMv8-A big.LITTLE implementations. */ > > > > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md > > index a014a01..f951354 100644 > > --- a/gcc/config/aarch64/aarch64.md > > +++ b/gcc/config/aarch64/aarch64.md > > @@ -298,6 +298,7 @@ > > (include "../arm/cortex-a57.md") > > (include "../arm/exynos-m1.md") > > (include "falkor.md") > > +(include "saphira.md") > > (include "thunderx.md") > > (include "../arm/xgene1.md") > > (include "thunderx2t99.md") > > diff --git a/gcc/config/aarch64/saphira.md b/gcc/config/aarch64/saphira.md > > new file mode 100644 > > index 000..bbf1c5c > > --- /dev/null > > +++ b/gcc/config/aarch64/saphira.md > > @@ -0,0 +1,583 @@ > > +;; Saphira pipeline description > > +;; Copyright (C) 2017-2018 Free Software Foundation, Inc. > > +;; > > +;; This file is part of GCC. > > +;; > > +;; GCC is free software; you can redistribute it and/or modify it > > +;; under the terms of the GNU General Public License as published by > > +;; the Free Software Foundation; either version 3, or (at your option) > > +;; any later version. > > +;; > > +;; GCC is distributed in the hope that it will be useful, but > > +;; WITHOUT ANY WARRANTY; without even the implied warranty of > > +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > +;; General Public License for more details. > > +;; > > +;; You should have received a copy of the GNU General Public License > > +;; along with GCC; see the file COPYING3. If not see > > +;; <http://www.gnu.org/licenses/>. > > + > > +(define_automaton "saphira") > > + > > +;; Complex int instructions (e.g. 
multiply and divide) execute in the X > > +;; pipeline. Simple int instructions execute in the X, Y, Z and B > > pipelines. > > + > > +(define_cpu_unit "saphira_x" "saphira"
[unified-autovect: Patch 1b/N] Instruction tile and grammar creation.
ase see if this looks correct, or do I need additional information to successfully generate pattern matcher automatically? Also, can you please comment on usability or scalability of this approach across all the architectures or point me to appropriate people in the group with whom I can discuss target specific vectorization issues? - Thanks and regards, Sameera D.Index: gcc/Makefile.in === --- gcc/Makefile.in (revision 246613) +++ gcc/Makefile.in (working copy) @@ -1067,7 +1067,12 @@ build/print-rtl.o build/hash-table.o BUILD_MD = build/read-md.o BUILD_ERRORS = build/errors.o +BUILD_UNITED = build/vec.o build/hash-table.o build/errors.o \ + build/ggc-none.o \ + build/tree-vect-unified-common.o build/tree-vect-unified-opts.o +build/tree-vect-unified-common.o : tree-vect-unified-common.c gtype-desc.h insn-codes.h +build/tree-vect-unified-opts.o : tree-vect-unified-opts.c gtype-desc.h insn-codes.h # Specify the directories to be searched for header files. # Both . and srcdir are used, in that order, # so that *config.h will be found in the compilation @@ -2207,7 +2212,7 @@ insn-emit.c insn-recog.c insn-extract.c insn-output.c insn-peep.c \ insn-attr.h insn-attr-common.h insn-attrtab.c insn-dfatab.c \ insn-latencytab.c insn-preds.c gimple-match.c generic-match.c \ - insn-target-def.h + insn-target-def.h insn-vect-inst-tiles.h # Dependencies for the md file. 
The first time through, we just assume # the md file itself and the generated dependency file (in order to get @@ -2234,7 +2239,8 @@ insn-extract.c insn-output.c \ insn-peep.c insn-recog.c -simple_generated_h = $(simple_rtl_generated_h) insn-constants.h +simple_generated_h = $(simple_rtl_generated_h) insn-constants.h \ + insn-vect-inst-tiles.h simple_generated_c = $(simple_rtl_generated_c) insn-enums.c @@ -2602,6 +2608,8 @@ $(GENSUPPORT_H) build/rtl.o: rtl.c $(BCONFIG_H) coretypes.h $(GTM_H) $(SYSTEM_H) \ $(RTL_H) $(GGC_H) errors.h +build/tree.o: tree.c $(BCONFIG_H) coretypes.h $(GTM_H) $(SYSTEM_H) \ + $(RTL_H) $(GGC_H) errors.h build/vec.o : vec.c $(BCONFIG_H) $(SYSTEM_H) coretypes.h $(VEC_H) \ $(GGC_H) toplev.h $(DIAGNOSTIC_CORE_H) build/hash-table.o : hash-table.c $(BCONFIG_H) $(SYSTEM_H) coretypes.h \ @@ -2655,6 +2663,9 @@ coretypes.h $(GTM_H) $(RTL_BASE_H) errors.h $(READ_MD_H) $(GENSUPPORT_H) \ $(HASH_TABLE_H) target-insns.def build/gengenrtl.o : gengenrtl.c $(BCONFIG_H) $(SYSTEM_H) rtl.def +build/genvect-inst-tiles.o : genvect-inst-tiles.c $(RTL_BASE_H) $(BCONFIG_H)\ + $(SYSTEM_H) coretypes.h $(GTM_H) errors.h tree-vect-unified.h \ + tree-vect-unified-opts.o tree-vect-unified-common.o # The gengtype generator program is special: Two versions are built. # One is for the build machine, and one is for the host to allow @@ -2732,8 +2743,11 @@ genprogerr = $(genprogmd) genrtl modes gtype hooks cfn-macros $(genprogerr:%=build/gen%$(build_exeext)): $(BUILD_ERRORS) +genprogunited = vect-inst-tiles +$(genprogunited:%=build/gen%$(build_exeext)): $(BUILD_UNITED) + # Remaining build programs. -genprog = $(genprogerr) check checksum condmd match +genprog = $(genprogerr) $(genprogunited) check checksum condmd match # These programs need libs over and above what they get from the above list. 
build/genautomata$(build_exeext) : BUILD_LIBS += -lm Index: gcc/config/mips/mips.h === --- gcc/config/mips/mips.h (revision 246613) +++ gcc/config/mips/mips.h (working copy) @@ -3468,4 +3468,37 @@ (TARGET_LOAD_STORE_PAIRS && (TUNE_P5600 || TUNE_I6400) \ && !TARGET_MICROMIPS && !TARGET_FIX_24K) +#define TARGET_VEC_PERM_CONST_ORDER \ +{ \ + {2, 2, 2, (int[2]){0,2}, 1, "PCKEV.D", "RRR", NULL, NULL}, \ + {2, 2, 2, (int[2]){1,3}, 1, "PCKOD.D", "RRR", NULL, NULL}, \ +\ + {2, 4, 4, (int[4]){0,4,2,6}, 1, "ILVEV.W", "RRR", NULL, NULL}, \ + {2, 4, 4, (int[4]){1,5,3,7}, 1, "ILVOD.W", "RRR", NULL, NULL}, \ + {2, 4, 4, (int[4]){0,2,4,6}, 1, "PCKEV.W", "RRR", NULL, NULL}, \ + {2, 4, 4, (int[4]){1,3,5,7}, 1, "PCKOD.W", "RRR", NULL, NULL}, \ + {2, 4, 4, (int[4]){2,6,3,7}, 1, "ILVL.W", "RRR", NULL, NULL}, \ + {2, 4, 4, (int[4]){0,4,1,5}, 1, "ILVR.W", "RRR", NULL, NULL}, \ +\ + {2, 8, 8, (int[8]){0,8,2,10,4,12,6,14}, 1, "ILVEV.H", "RRR", NULL, NULL}, \ + {2, 8, 8, (int[8]){1,9,3,11,5,13,7,15}, 1, "ILVOD.H", "RRR", NULL, NULL}, \ + {2, 8, 8, (int[8]){0,2,4,6,8,10,12,14}, 1, "PCKEV.H", "RRR", NULL, NULL}, \ + {2, 8, 8, (int[8]){1,3,5,7,9,11,13,15}, 1, "PCKOD.H", "RRR", NULL, NULL}, \ + {2, 8, 8, (int[8]){0,8,1,9,2,10,3,11}, 1, "ILVR.H", "RRR", NULL, NULL}, \ + {2
[unified-autovect: Patch 2/N] Implementation of k-arity promotion/reduction
Hi Richard, Sorry for the delayed patch submission. I was on maternity leave, so could not post earlier. Here is the previous mail for your reference: https://gcc.gnu.org/ml/gcc/2016-06/msg00043.html Please find attached the patch for stage 2: implementation of k-arity promotion/reduction in the series "Improving effectiveness and generality of autovectorization using unified representation". The permute nodes within the primitive reorder tree (PRT) generated from the input program can have any arity, depending upon the stride of the accesses. However, the target may not have instructions supporting all arities. Hence, we need to promote or reduce the arity of the PRT to enable successful tree tiling. In classic autovectorization, if the vectorization stride > 2, arity reduction is performed by generating cascaded extract and interleave instructions as described in "Auto-vectorization of Interleaved Data for SIMD" by D. Nuzman, I. Rosen and A. Zaks. Moreover, to enable SLP across the loop, "Loop-aware SLP in GCC" by D. Nuzman, I. Rosen and A. Zaks unrolls the loop until stride = vector size. The k-arity reduction/promotion algorithm makes use of modulo arithmetic to generate a PRT of the desired arity for both of the above-mentioned cases. A single ILV node of arity k can be reduced into cascaded ILV nodes - a single node of arity m whose children are nodes of arity k/m - such that the ith child of the original ILV node becomes the floor(i/m)th child of the (i%m)th child of the new parent. A single EXTR node with k parts and selector i can be reduced into cascaded EXTR nodes such that the parent EXTR node has m parts and selection i/(k/m), applied on a child EXTR node with k/m parts and selection i%(k/m). Similarly, loop unrolling to reach a desired arity m can be represented as arity promotion from k to m. A single ILV node of arity k can be promoted to a single ILV node of arity m by adding an extraction with m/k parts and selection i/k on the (i%k)th child of the original tree as the ith child of the new ILV node.
To enable loop-aware SLP, we first promote the arity of the input PRT to the maximum vector size permissible on the architecture. This can have an impact on vector code size, though performance will be the same. However, it is necessary in order to allow variable vector sizes, as with SVE in contrast to NEON. Later we apply the arity promotion/reduction algorithm on the output tree to get a tree with the desired arity. For now, we support target arity = 2, as most architectures have support for that. However, the code can be extended to support additional arities as well. I have tested the code with handwritten testcases for correctness. Do you spot any problem in the logic or arithmetic that I am performing for reduction/promotion? If not, I will push this patch to the branch that we have created - unified-autovect. - Thanks and regards, Sameera D. Index: gcc/Makefile.in === --- gcc/Makefile.in (revision 243687) +++ gcc/Makefile.in (working copy) @@ -1529,6 +1529,7 @@ tree-vect-slp.o \ tree-vectorizer.o \ tree-vect-unified.o \ + tree-vect-unified-opts.o \ tree-vrp.o \ tree.o \ valtrack.o \ Index: gcc/tree-vect-data-refs.c === --- gcc/tree-vect-data-refs.c (revision 238158) +++ gcc/tree-vect-data-refs.c (working copy) @@ -136,16 +136,9 @@ return scalar_type; } - -/* Insert DDR into LOOP_VINFO list of ddrs that may alias and need to be - tested at run-time. Return TRUE if DDR was successfully inserted. - Return false if versioning is not supported. */ - -static bool -vect_mark_for_runtime_alias_test (ddr_p ddr, loop_vec_info loop_vinfo) +bool +vect_mark_for_runtime_alias_test_1 (ddr_p ddr, loop *loop) { - struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); - if ((unsigned) PARAM_VALUE (PARAM_VECT_MAX_VERSION_FOR_ALIAS_CHECKS) == 0) return false; @@ -189,11 +182,28 @@ return false; } - LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo).safe_push (ddr); return true; }
+ Return false if versioning is not supported. */ + +static bool +vect_mark_for_runtime_alias_test (ddr_p ddr, loop_vec_info loop_vinfo) +{ + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); + bool is_alias; + + is_alias = vect_mark_for_runtime_alias_test_1 (ddr, loop); + if (is_alias) +LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo).safe_push (ddr); + return is_alias; +} + + /* Function vect_analyze_data_ref_dependence. Return TRUE if there (might) exist a dependence between a memory-reference Index: gcc/tree-vect-unified-opts.c === --- gcc/tree-vect-unified-opts.c (revision 0) +++ gcc/tree-vect-unified-opts.c (working copy) @@ -0,0 +1,391 @@ +/* lOOP Vectorization using unified representation +the terms of the GNU General Public
[PATCH, MIPS] Calling convention differs depending on the presence of MSA
Hi Matthew, Please find attached the patch to fix the calling convention issue, where the argument and result passing conventions differed between the MSA and non-MSA variants. The implementation of TARGET_RETURN_IN_MEMORY is altered to prevent V4SF values from being returned in registers. Ok for trunk? - Thanks and regards, Sameera D. Changelog: gcc/ * config/mips/mips.c (mips_return_in_memory): Restrict V4SFmode from being returned in registers. gcc/testsuite/ * gcc.target/mips/msa-fp-cc.c: New testcase.
RE: [PATCH, MIPS] Calling convention differs depending on the presence of MSA
Hi Matthew, Please find attached the updated patch as per our offline discussion. I have disabled return in registers for all vector float types, and updated the test case accordingly. Ok for trunk? - Thanks and regards, Sameera D. From: Sameera Deshpande Sent: 08 February 2017 14:10:52 To: Matthew Fortune Cc: gcc-patches@gcc.gnu.org Subject: [PATCH, MIPS] Calling convention differs depending on the presence of MSA Hi Matthew, Please find attached the patch to fix the calling convention issue, where argument and result passing convention differed for MSA and non-MSA variants. The implementation of TARGET_RETURN_IN_MEMORY is altered to block V4SF to be returned in registers. Ok for trunk? - Thanks and regards, Sameera D. Changelog: gcc/ * config/mips/mips.c (mips_return_in_memory) : Restrict V4SFmode to be returned in registers. gcc/testsuite/ * gcc.target/mips/msa-fp-cc.c : New testcase. fix_calling_convention.patch
[wwwdocs] Add branch description for new branch unified-autovect
Hi! I have created a new branch unified-autovect based on ToT. Please find attached the patch adding information about the new branch "unified-autovect" to the documentation. Is it ok to commit? - Thanks and regards, Sameera D. unified-autovec-doc.patch
Re: [RFA/ARM][Patch 01/02]: Thumb2 epilogue in RTL
Hi Ramana, Please find attached the reworked patch. The patch is tested with check-gcc, check-gdb and bootstrap with no regression. On Fri, 2011-10-21 at 13:43 +0100, Ramana Radhakrishnan wrote: > Why are you differentiating on stack_only ? Does it really matter ? > The pop_multi* patterns generate a pop instruction, hence I wanted to be sure that the base register is the stack pointer. I can remove the stack_only option by (1) modifying the pattern to match SP as the base register explicitly, or (2) generating an ldm%(ia%) instruction for non-SP base registers. I chose the second option. > Hmmm isn't this true of only LDM's in Thumb state ? Though it could be argued > that this patch is only T2 epilogues. > Yes, it's true. But for a single register we want to match the LDR pattern and not any of the ldm* or pop_multi* patterns. So, I am barring LDM for a single register here. > >+strcpy (pattern, \"fldmfdd\\t\"); > >+strcat (pattern, > >+reg_names[REGNO (SET_DEST (XVECEXP (operands[0], 0, > >0)))]); > >+strcat (pattern, \"!, {\"); > >+strcat (pattern, table[(REGNO (XEXP (XVECEXP (operands[0], 0, 1), 0)) > >+ - FIRST_VFP_REGNUM) / 2].name); > > Can't you reuse names from arm.h and avoid the table here ? > The array REGISTER_NAMES in aout.h uses S0, S2, ... names for double registers. Is there any way to use OVERLAPPING_REGISTER_NAMES? If that can be done, I can eliminate the table here. Updated ChangeLog entry: 2011-09-28 Ian Bolton Sameera Deshpande * config/arm/arm-protos.h (load_multiple_operation_p): New declaration. (thumb2_expand_epilogue): Likewise. (thumb2_output_return): Likewise. (thumb2_expand_return): Likewise. (thumb_unexpanded_epilogue): Rename to... (thumb1_unexpanded_epilogue): ...this. * config/arm/arm.c (load_multiple_operation_p): New function. (thumb2_emit_multi_reg_pop): Likewise. (thumb2_emit_vfp_multi_reg_pop): Likewise. (thumb2_expand_return): Likewise. (thumb2_expand_epilogue): Likewise. (thumb2_output_return): Likewise. (thumb_unexpanded_epilogue): Rename to...
( thumb1_unexpanded_epilogue): ...this * config/arm/arm.md (pop_multiple_with_stack_update): New pattern. (pop_multiple_with_stack_update_and_return): Likewise. (thumb2_ldr_with_return): Likewise. (vfp_point_pop_multiple_with_stack_update): Likewise. (return): Update condition and code for pattern. (arm_return): Likewise. (epilogue_insns): Likewise. * config/arm/predicates.md (load_multiple_operation): Update predicate. (load_multiple_operation_return): New predicate. (load_multiple_operation_fp): Likewise. * config/arm/thumb2.md (thumb2_return): Remove. (thumb2_rtl_epilogue_return): New pattern. - Thanks and regards, Sameera D.diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 23a29c6..2c38883 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -65,6 +65,7 @@ extern int thumb1_legitimate_address_p (enum machine_mode, rtx, int); extern int arm_const_double_rtx (rtx); extern int neg_const_double_rtx_ok_for_fpa (rtx); extern int vfp3_const_double_rtx (rtx); +extern bool load_multiple_operation_p (rtx, bool, enum machine_mode, bool); extern int neon_immediate_valid_for_move (rtx, enum machine_mode, rtx *, int *); extern int neon_immediate_valid_for_logic (rtx, enum machine_mode, int, rtx *, int *); @@ -176,10 +177,13 @@ extern int arm_float_words_big_endian (void); /* Thumb functions. 
*/ extern void arm_init_expanders (void); -extern const char *thumb_unexpanded_epilogue (void); +extern const char *thumb1_unexpanded_epilogue (void); extern void thumb1_expand_prologue (void); extern void thumb1_expand_epilogue (void); extern const char *thumb1_output_interwork (void); +extern void thumb2_expand_epilogue (void); +extern void thumb2_output_return (rtx); +extern void thumb2_expand_return (void); #ifdef TREE_CODE extern int is_called_in_ARM_mode (tree); #endif diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index e07c8c3..ec87892 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -8906,6 +8906,137 @@ neon_valid_immediate (rtx op, enum machine_mode mode, int inverse, #undef CHECK } +/* Return true if OP is a valid load multiple operation for MODE mode. + CONSECUTIVE is true if the registers in the operation must form + a consecutive sequence in the register bank. STACK_ONLY is true + if the base register must be the stack pointer. RETURN_PC is true + if value is to be loaded in PC. */ +bool +load_multiple_operation_p (rtx op, bool consecutive, enum machine_mode mode, + bool return_pc) +{ + HOST_WIDE_INT count = XVE
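The predicate declared above (its body is truncated here; a fuller copy appears later in the thread) lifts the consecutive-destination-register restriction for integer loads, but keeps it where CONSECUTIVE is requested, e.g. for DFmode, since fldmdd can only load consecutive registers. As an illustration only — using plain int register numbers in place of RTL REGNOs, so this is not the GCC code itself — the consecutiveness test amounts to:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the consecutiveness check in load_multiple_operation_p:
   the i-th destination register must be the first destination register
   plus REGS_PER_VAL * i (REGS_PER_VAL is 2 for DFmode, 1 otherwise).  */
static bool
dests_are_consecutive (const int *dest_regnos, int count, int regs_per_val)
{
  for (int i = 1; i < count; i++)
    if (dest_regnos[i] != dest_regnos[0] + regs_per_val * i)
      return false;
  return true;
}
```

For integer pops the check is simply skipped, which is what allows `pop {r4, r6, r7}`-style register lists in the new Thumb-2 epilogue patterns.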
Re: [RFA/ARM][Patch 02/05]: LDRD generation instead of POP in A15 Thumb2 epilogue.
> > > I don't believe REG_FRAME_RELATED_EXPR does the right thing for > anything besides prologues. You need to emit REG_CFA_RESTORE > for the pop inside an epilogue. Richard, here is the updated patch that uses REG_CFA_RESTORE instead of REG_FRAME_RELATED_EXPR. The patch is tested with check-gcc, check-gdb and bootstrap with no regression. Ok for trunk? - Thanks and regards, Sameera diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 37113f5..e71ead5 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -203,6 +203,7 @@ extern void thumb_reload_in_hi (rtx *); extern void thumb_set_return_address (rtx, rtx); extern const char *thumb1_output_casesi (rtx *); extern const char *thumb2_output_casesi (rtx *); +extern bool bad_reg_pair_for_thumb_ldrd_strd (rtx, rtx); #endif /* Defined in pe.c. */ diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 429b644..05c9368 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -15706,6 +15706,151 @@ arm_emit_vfp_multi_reg_pop (int first_reg, int num_regs, rtx base_reg) REG_NOTES (par) = dwarf; } +bool +bad_reg_pair_for_thumb_ldrd_strd (rtx src1, rtx src2) +{ + return (GET_CODE (src1) != REG + || GET_CODE (src2) != REG + || (REGNO (src1) == PC_REGNUM) + || (REGNO (src1) == SP_REGNUM) + || (REGNO (src1) == REGNO (src2)) + || (REGNO (src2) == PC_REGNUM) + || (REGNO (src2) == SP_REGNUM)); +} + +/* Generate and emit a pattern that will be recognized as LDRD pattern. If even + number of registers are being popped, multiple LDRD patterns are created for + all register pairs. If odd number of registers are popped, last register is + loaded by using LDR pattern.
*/ +static bool +thumb2_emit_ldrd_pop (unsigned long saved_regs_mask, bool really_return) +{ + int num_regs = 0; + int i, j; + rtx par = NULL_RTX; + rtx dwarf = NULL_RTX; + rtx tmp, reg, tmp1; + + for (i = 0; i <= LAST_ARM_REGNUM; i++) +if (saved_regs_mask & (1 << i)) + num_regs++; + + gcc_assert (num_regs && num_regs <= 16); + gcc_assert (really_return || ((saved_regs_mask & (1 << PC_REGNUM)) == 0)); + + /* We cannot generate ldrd for PC. Hence, reduce the count if PC is + to be popped. So, if num_regs is even, now it will become odd, + and we can generate pop with PC. If num_regs is odd, it will be + even now, and ldr with return can be generated for PC. */ + if (really_return && (saved_regs_mask & (1 << PC_REGNUM))) +num_regs--; + + /* Var j iterates over all the registers to gather all the registers in + saved_regs_mask. Var i gives index of saved registers in stack frame. + A PARALLEL RTX of register-pair is created here, so that pattern for + LDRD can be matched. As PC is always last register to be popped, and + we have already decremented num_regs if PC, we don't have to worry + about PC in this loop. */ + for (i = 0, j = 0; i < (num_regs - (num_regs % 2)); j++) +if (saved_regs_mask & (1 << j)) + { +gcc_assert (j != SP_REGNUM); + +/* Create RTX for memory load. */ +reg = gen_rtx_REG (SImode, j); +tmp = gen_rtx_SET (SImode, + reg, + gen_frame_mem (SImode, + plus_constant (stack_pointer_rtx, 4 * i))); +RTX_FRAME_RELATED_P (tmp) = 1; + +if (i % 2 == 0) + { +/* When saved-register index (i) is even, the RTX to be emitted is + yet to be created. Hence create it first. The LDRD pattern we + are generating is : + [ (SET (reg_t0) (MEM (PLUS (SP) (NUM + (SET (reg_t1) (MEM (PLUS (SP) (NUM + 4 ] + where target registers need not be consecutive. */ +par = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (2)); +dwarf = NULL_RTX; + } + +/* ith register is added in PARALLEL RTX. 
If i is even, the reg_i is + added as 0th element and if i is odd, reg_i is added as 1st element + of LDRD pattern shown above. */ +XVECEXP (par, 0, (i % 2)) = tmp; +dwarf = alloc_reg_note (REG_CFA_RESTORE, reg, dwarf); + +if ((i % 2) == 1) + { +/* When saved-register index (i) is odd, RTXs for both the registers + to be loaded are generated in above given LDRD pattern, and the + pattern can be emitted now. */ +par = emit_insn (par); +REG_NOTES (par) = dwarf; + } + +i++; + } + + /* If the number of registers pushed is odd AND really_return is false OR + number of registers are even AND really_return is true, last register is + popped using LDR. It can be PC as well. Hence, adjust the stack first and + then LDR with post increment. */ + + /* Increment the stack pointer, based on there being + num_regs 4-byte re
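The PC bookkeeping in the comment above ("if num_regs is even, now it will become odd, and we can generate pop with PC...") is easier to see as plain arithmetic on the register mask. The following is a hedged sketch, not the GCC code: it assumes ARM register numbering with PC_REGNUM = 15 and reduces the function's control flow to two pure computations.

```c
#include <assert.h>
#include <stdbool.h>

#define PC_REGNUM 15

enum final_op { FINAL_NONE, FINAL_LDR, FINAL_POP_WITH_PC, FINAL_LDR_PC };

/* Registers eligible for LDRD pairing: PC is excluded, since LDRD
   cannot load it.  */
static int
pairable_reg_count (unsigned long mask, bool really_return)
{
  int num_regs = 0;
  for (int i = 0; i <= 15; i++)
    if (mask & (1ul << i))
      num_regs++;
  if (really_return && (mask & (1ul << PC_REGNUM)))
    num_regs--;
  return num_regs;
}

/* Number of LDRD instructions emitted for the pop.  */
static int
ldrd_pair_count (unsigned long mask, bool really_return)
{
  return pairable_reg_count (mask, really_return) / 2;
}

/* What handles the leftover after all LDRD pairs: nothing, a lone LDR,
   a pop {reg, pc}, or an "ldr pc, [sp], #4" with return — mirroring the
   comment in thumb2_emit_ldrd_pop.  */
static enum final_op
ldrd_final_op (unsigned long mask, bool really_return)
{
  int n = pairable_reg_count (mask, really_return);
  bool pc_popped = really_return && (mask & (1ul << PC_REGNUM));
  if (pc_popped)
    return (n % 2) ? FINAL_POP_WITH_PC : FINAL_LDR_PC;
  return (n % 2) ? FINAL_LDR : FINAL_NONE;
}
```

For example, popping {r4, r5, r6, pc} with a return leaves three pairable registers, so one LDRD covers r4/r5 and r6 is popped together with PC.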
Re: [RFA/ARM][Patch 03/05]: STRD generation instead of PUSH in A15 Thumb2 prologue.
Hi Ramana, Please find attached reworked patch. The patch is tested with check-gcc, check-gdb and bootstrap with no regression. Ok? - Thanks and regards, Sameera D.diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 05c9368..334a25f 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -15438,6 +15438,125 @@ arm_output_function_epilogue (FILE *file ATTRIBUTE_UNUSED, } } +/* Generate and emit a pattern that will be recognized as STRD pattern. If even + number of registers are being pushed, multiple STRD patterns are created for + all register pairs. If odd number of registers are pushed, emit a + combination of STRDs and STR for the prologue saves. */ +static void +thumb2_emit_strd_push (unsigned long saved_regs_mask) +{ + int num_regs = 0; + int i, j; + rtx par = NULL_RTX; + rtx insn = NULL_RTX; + rtx dwarf = NULL_RTX; + rtx tmp, reg, tmp1; + + for (i = 0; i <= LAST_ARM_REGNUM; i++) +if (saved_regs_mask & (1 << i)) + num_regs++; + + gcc_assert (num_regs && num_regs <= 16); + + /* Pre-decrement the stack pointer, based on there being num_regs 4-byte + registers to push. */ + tmp = gen_rtx_SET (VOIDmode, + stack_pointer_rtx, + plus_constant (stack_pointer_rtx, -4 * num_regs)); + RTX_FRAME_RELATED_P (tmp) = 1; + insn = emit_insn (tmp); + + /* Create sequence for DWARF info. */ + dwarf = gen_rtx_SEQUENCE (VOIDmode, rtvec_alloc (num_regs + 1)); + + /* RTLs cannot be shared, hence create new copy for dwarf. */ + tmp1 = gen_rtx_SET (VOIDmode, + stack_pointer_rtx, + plus_constant (stack_pointer_rtx, -4 * num_regs)); + RTX_FRAME_RELATED_P (tmp1) = 1; + XVECEXP (dwarf, 0, 0) = tmp1; + + /* Var j iterates over all the registers to gather all the registers in + saved_regs_mask. Var i gives index of register R_j in stack frame. + A PARALLEL RTX of register-pair is created here, so that pattern for + STRD can be matched. If num_regs is odd, 1st register will be pushed + using STR and remaining registers will be pushed with STRD in pairs. 
+ If num_regs is even, all registers are pushed with STRD in pairs. + Hence, skip first element for odd num_regs. */ + for (i = num_regs - 1, j = LAST_ARM_REGNUM; i >= (num_regs % 2); j--) +if (saved_regs_mask & (1 << j)) + { +gcc_assert (j != SP_REGNUM); +gcc_assert (j != PC_REGNUM); + +/* Create RTX for store. New RTX is created for dwarf as + they are not sharable. */ +reg = gen_rtx_REG (SImode, j); +tmp = gen_rtx_SET (SImode, + gen_frame_mem + (SImode, +plus_constant (stack_pointer_rtx, 4 * i)), + reg); + +tmp1 = gen_rtx_SET (SImode, + gen_frame_mem + (SImode, +plus_constant (stack_pointer_rtx, 4 * i)), + reg); +RTX_FRAME_RELATED_P (tmp) = 1; +RTX_FRAME_RELATED_P (tmp1) = 1; + +if (((i - (num_regs % 2)) % 2) == 1) + /* When (i - (num_regs % 2)) is odd, the RTX to be emitted is yet to + be created. Hence create it first. The STRD pattern we are + generating is : + [ (SET (MEM (PLUS (SP) (NUM))) (reg_t1)) + (SET (MEM (PLUS (SP) (NUM + 4))) (reg_t2)) ] + were target registers need not be consecutive. */ + par = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (2)); + +/* Register R_j is added in PARALLEL RTX. If (i - (num_regs % 2)) is + even, the reg_j is added as 0th element and if it is odd, reg_i is + added as 1st element of STRD pattern shown above. */ +XVECEXP (par, 0, ((i - (num_regs % 2)) % 2)) = tmp; +XVECEXP (dwarf, 0, (i + 1)) = tmp1; + +if (((i - (num_regs % 2)) % 2) == 0) + /* When (i - (num_regs % 2)) is even, RTXs for both the registers + to be loaded are generated in above given STRD pattern, and the + pattern can be emitted now. */ + emit_insn (par); + +i--; + } + + if ((num_regs % 2) == 1) +{ + /* If odd number of registers are pushed, generate STR pattern to store + lone register. 
*/ + for (; (saved_regs_mask & (1 << j)) == 0; j--); + + tmp1 = gen_frame_mem (SImode, plus_constant (stack_pointer_rtx, 4 * i)); + reg = gen_rtx_REG (SImode, j); + tmp = gen_rtx_SET (SImode, tmp1, reg); + RTX_FRAME_RELATED_P (tmp) = 1; + + emit_insn (tmp); + + tmp1 = gen_rtx_SET (SImode, + gen_frame_mem + (SImode, + plus_constant (stack_pointer_rtx, 4 * i)), + reg); + RTX_FRAME_RELATED_P (tmp1) = 1; + XVECEXP (dwarf, 0,
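The descending i/j loop above, together with the lone-register STR at the end, amounts to a simple layout rule: after SP is pre-decremented by 4 * num_regs, saved register J lands at offset 4 times J's ascending rank among the saved registers. A minimal sketch of that offset computation (plain bitmask arithmetic, not GCC internals):

```c
#include <assert.h>

/* Offset of saved register REGNO from the new SP after
   "SP -= 4 * num_regs": 4 bytes for each lower-numbered register that
   is also saved.  Illustrates the layout thumb2_emit_strd_push builds
   with its descending loop; it is not the GCC code.  */
static int
strd_push_offset (unsigned long saved_regs_mask, int regno)
{
  int rank = 0;
  for (int j = 0; j < regno; j++)
    if (saved_regs_mask & (1ul << j))
      rank++;
  return 4 * rank;
}
```

With an odd register count the lowest saved register is exactly the one left at offset 0, which is where the lone STR in the code above stores it.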
Re: [RFA/ARM][Patch 01/02]: Thumb2 epilogue in RTL
On Mon, 2011-11-07 at 09:56 +, Paul Brook wrote: > > The array REGISTER_NAMES in aout.h use S0, S2, ... names for double > > registers. Is there any way to use OVERLAPPING_REGISTER_NAMES? If that > > can be done, I can eliminate the table here. > > You should be using %P. > Paul, Thanks for your comment. Please find attached reworked patch. The patch is tested with check-gcc without regression. - Thanks and regards, Sameera D. diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 23a29c6..2c38883 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -65,6 +65,7 @@ extern int thumb1_legitimate_address_p (enum machine_mode, rtx, int); extern int arm_const_double_rtx (rtx); extern int neg_const_double_rtx_ok_for_fpa (rtx); extern int vfp3_const_double_rtx (rtx); +extern bool load_multiple_operation_p (rtx, bool, enum machine_mode, bool); extern int neon_immediate_valid_for_move (rtx, enum machine_mode, rtx *, int *); extern int neon_immediate_valid_for_logic (rtx, enum machine_mode, int, rtx *, int *); @@ -176,10 +177,13 @@ extern int arm_float_words_big_endian (void); /* Thumb functions. */ extern void arm_init_expanders (void); -extern const char *thumb_unexpanded_epilogue (void); +extern const char *thumb1_unexpanded_epilogue (void); extern void thumb1_expand_prologue (void); extern void thumb1_expand_epilogue (void); extern const char *thumb1_output_interwork (void); +extern void thumb2_expand_epilogue (void); +extern void thumb2_output_return (rtx); +extern void thumb2_expand_return (void); #ifdef TREE_CODE extern int is_called_in_ARM_mode (tree); #endif diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index e07c8c3..ec87892 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -8906,6 +8906,137 @@ neon_valid_immediate (rtx op, enum machine_mode mode, int inverse, #undef CHECK } +/* Return true if OP is a valid load multiple operation for MODE mode. 
+ CONSECUTIVE is true if the registers in the operation must form + a consecutive sequence in the register bank. STACK_ONLY is true + if the base register must be the stack pointer. RETURN_PC is true + if value is to be loaded in PC. */ +bool +load_multiple_operation_p (rtx op, bool consecutive, enum machine_mode mode, + bool return_pc) +{ + HOST_WIDE_INT count = XVECLEN (op, 0); + unsigned dest_regno, first_dest_regno; + rtx src_addr; + HOST_WIDE_INT i = 1, base = 0; + HOST_WIDE_INT offset = 0; + rtx elt; + bool addr_reg_loaded = false; + bool update = false; + int reg_increment, regs_per_val; + int offset_adj; + + /* If DFmode, we must be asking for consecutive, + since fldmdd can only do consecutive regs. */ + gcc_assert ((mode != DFmode) || consecutive); + + /* Set up the increments and the regs per val based on the mode. */ + reg_increment = GET_MODE_SIZE (mode); + regs_per_val = mode == DFmode ? 2 : 1; + offset_adj = return_pc ? 1 : 0; + + if (count <= 1 + || GET_CODE (XVECEXP (op, 0, offset_adj)) != SET + || !REG_P (SET_DEST (XVECEXP (op, 0, offset_adj +return false; + + /* Check to see if this might be a write-back. */ + if (GET_CODE (SET_SRC (elt = XVECEXP (op, 0, offset_adj))) == PLUS) +{ + i++; + base = 1; + update = true; + + /* The offset adjustment should be same as number of registers being + popped * size of single register. */ + if (!REG_P (SET_DEST (elt)) + || !REG_P (XEXP (SET_SRC (elt), 0)) + || !CONST_INT_P (XEXP (SET_SRC (elt), 1)) + || INTVAL (XEXP (SET_SRC (elt), 1)) != + ((count - 1 - offset_adj) * reg_increment)) +return false; +} + + i = i + offset_adj; + base = base + offset_adj; + /* Perform a quick check so we don't blow up below. */ + if (GET_CODE (XVECEXP (op, 0, i - 1)) != SET + || !REG_P (SET_DEST (XVECEXP (op, 0, i - 1))) + || !MEM_P (SET_SRC (XVECEXP (op, 0, i - 1 +return false; + + /* If only one reg being loaded, success depends on the type: + FLDMDD can do just one reg, LDM must do at least two. 
*/ + if (count <= i) +return mode == DFmode ? true : false; + + first_dest_regno = REGNO (SET_DEST (XVECEXP (op, 0, i - 1))); + dest_regno = first_dest_regno; + + src_addr = XEXP (SET_SRC (XVECEXP (op, 0, i - 1)), 0); + + if (GET_CODE (src_addr) == PLUS) +{ + if (!CONST_INT_P (XEXP (src_addr, 1))) +return false; + offset = INTVAL (XEXP (src_addr, 1)); + src_addr = XEXP (src_addr, 0); +} + + if (!REG_P (src_addr)) +return false; + + /* The pattern we are trying to match here is: + [(SET (R_d0) (MEM (PLUS (src_addr) (offset + (SET (R_d1) (MEM (PLUS (src_addr) (offset + + : + : + (SET (R_dn) (MEM (PLUS (src_addr) (offset + n * + ] + Where, + 1. If offset is 0, first insn should be (SET (R_d0) (M
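The write-back check earlier in this function requires the base-register adjustment to match the pop size exactly: one element of the PARALLEL is the SP update itself and, for a return pattern, one more is the (return), so only count - 1 - offset_adj elements are actual register loads. A sketch of that arithmetic (an illustration under those definitions, not the GCC code):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirror of the offset check in load_multiple_operation_p:
   COUNT is the PARALLEL length, OFFSET_ADJ is 1 when a (return) element
   leads the vector, REG_INCREMENT is GET_MODE_SIZE of the loaded mode.  */
static bool
valid_writeback_offset (int count, int offset_adj, int reg_increment,
                        long adjustment)
{
  return adjustment == (long) (count - 1 - offset_adj) * reg_increment;
}
```

So a `pop {r4-r7}` with write-back (one SP update plus four SImode loads) must adjust SP by 16, and adding a (return) element to the vector leaves the required adjustment unchanged.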
Re: [RFA/ARM][Patch 04/05]: STRD generation instead of PUSH in A15 ARM prologue.
On Fri, 2011-10-21 at 13:45 +0100, Ramana Radhakrishnan wrote: > >+arm_emit_strd_push (unsigned long saved_regs_mask) > > How different is this from the thumb2 version you sent out in Patch 03/05 ? > Thumb-2 STRD can handle non-consecutive registers; ARM STRD cannot. Because of this, in ARM mode we accumulate non-consecutive registers and emit an STM instruction. For consecutive registers, STRD is generated. > >@@ -15958,7 +16081,8 @@ arm_get_frame_offsets (void) > > use 32-bit push/pop instructions. */ > > if (! any_sibcall_uses_r3 () > > && arm_size_return_regs () <= 12 > >- && (offsets->saved_regs_mask & (1 << 3)) == 0) > >+ && (offsets->saved_regs_mask & (1 << 3)) == 0 > >+ && (TARGET_THUMB2 || !current_tune->prefer_ldrd_strd)) > > Not sure I completely follow this change yet. > If the stack is not aligned, we need to adjust the stack in the prologue. Here, instead of adjusting the stack, we push register R3 onto the stack, so that no additional ADD instruction is needed for stack adjustment. This works fine when we generate multi-reg load/store instructions. However, when we generate STRD in ARM mode, non-consecutive registers are stored using STR/STM instructions. As the pair register of R3 (R2) is never pushed onto the stack, we always end up generating a STR instruction to push R3, which is more expensive than doing ADD SP, SP, #4 for stack adjustment. E.g. if we are pushing {R4, R5, R6} and the stack is not aligned, we push {R3, R4, R5, R6}, so the instructions generated are: STR R6, [sp, #4] STRD R4, R5, [sp, #12] STR R3, [sp, #16] However, if another caller-saved register is pushed instead of R3, we push {R4, R5, R6, R7} to generate STRD R6, R7, [sp, #8] STRD R4, R5, [sp, #16] If no caller-saved register is available, we generate an ADD instruction, which is still better than generating a STR. > > Hmmm the question remains if we want to put these into ldmstm.md since > it was theoretically > auto-generated from ldmstm.ml.
If this has to be marked to be separate > then I'd like > to regenerate ldmstm.md from ldmstm.ml and differentiate between the > bits that can be auto-generated > and the bits that have been added since. > The current patterns are quite different from the patterns generated using arm-ldmstm.ml. I will submit an updated arm-ldmstm.ml file generating ldrd/strd patterns as a new patch. Is that fine? The patch is tested with check-gcc, check-gdb and bootstrap. I see a regression in gcc: FAIL: gcc.c-torture/execute/vector-compare-1.c compilation, -O3 -fomit-frame-pointer -funroll-loops with error message /tmp/ccC13odV.s: Assembler messages: /tmp/ccC13odV.s:544: Error: co-processor offset out of range This seems to be an uncovered latent bug, and I am looking into it. - Thanks and regards, Sameera D. diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index e71ead5..ccf05c7 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -163,6 +163,7 @@ extern const char *arm_output_memory_barrier (rtx *); extern const char *arm_output_sync_insn (rtx, rtx *); extern unsigned int arm_sync_loop_insns (rtx , rtx *); extern int arm_attr_length_push_multi(rtx, rtx); +extern bool bad_reg_pair_for_arm_ldrd_strd (rtx, rtx); #if defined TREE_CODE extern void arm_init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree); diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 334a25f..deee78b 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -93,6 +93,7 @@ static bool arm_assemble_integer (rtx, unsigned int, int); static void arm_print_operand (FILE *, rtx, int); static void arm_print_operand_address (FILE *, rtx); static bool arm_print_operand_punct_valid_p (unsigned char code); +static rtx emit_multi_reg_push (unsigned long); static const char *fp_const_from_val (REAL_VALUE_TYPE *); static arm_cc get_arm_condition_code (rtx); static HOST_WIDE_INT int_log2 (HOST_WIDE_INT); @@ -15438,6 +15439,117 @@ arm_output_function_epilogue (FILE
*file ATTRIBUTE_UNUSED, } } +/* STRD in ARM mode needs consecutive registers to be stored. This function + keeps accumulating non-consecutive registers until first consecutive register + pair is found. It then generates multi register PUSH for all accumulated + registers, and then generates STRD with write-back for consecutive register + pair. This process is repeated until all the registers are stored on stack. + multi register PUSH takes care of lone registers as well. */ +static void +arm_emit_strd_push (unsigned long saved_regs_mask) +{ + int num_regs = 0; + int i, j; + rtx par = NULL_RTX; + rtx dwarf = NULL_RTX; + rtx insn = NULL_RTX; + rtx tmp, tmp1; + unsigned long regs_to_be_pushed_mask; + + for (i = 0; i <= LAST_ARM_REGNUM; i++) +
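The trade-off discussed in this thread can be made concrete: in ARM state STRD needs an even-numbered first register whose odd successor is also being saved, so classifying the push mask reproduces the instruction counts from the {R3..R6} vs {R4..R7} example above. This is a counting sketch only; the real arm_emit_strd_push batches the leftover registers into STM/PUSH instructions rather than individual STRs.

```c
#include <assert.h>

/* Count even-aligned consecutive pairs in a push mask — the pairs
   ARM-state STRD can handle.  (Illustration, not GCC code.)  */
static int
arm_strd_pairs (unsigned long mask)
{
  int pairs = 0;
  for (int j = 0; j <= 15; j++)
    if ((mask & (1ul << j)) && (j % 2) == 0 && (mask & (1ul << (j + 1))))
      {
        pairs++;
        j++;                    /* consume the pair */
      }
  return pairs;
}

/* Store instructions needed if every leftover register costs one STR:
   each STRD pair merges two stores into one.  */
static int
arm_store_insns (unsigned long mask)
{
  int num_regs = 0;
  for (int j = 0; j <= 15; j++)
    if (mask & (1ul << j))
      num_regs++;
  return num_regs - arm_strd_pairs (mask);
}
```

Pushing {r3, r4, r5, r6} yields one STRD (r4/r5) plus two STRs — three stores — while {r4, r5, r6, r7} needs only two STRDs, which is exactly why padding with a register other than R3 (or just adjusting SP) wins when LDRD/STRD is preferred.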
Re: [RFA/ARM][Patch 05/05]: LDRD generation instead of POP in A15 ARM epilogue.
On Fri, 2011-10-21 at 13:45 +0100, Ramana Radhakrishnan wrote: > change that. Other than that this patch looks OK and please watch out > for stylistic issues from the previous patch. Ramana, please find attached reworked patch. The patch is tested with check-gcc, check-gdb and bootstrap with no regression. - Thanks and regards, Sameera D.diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index deee78b..4a86749 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -15960,6 +15960,135 @@ bad_reg_pair_for_thumb_ldrd_strd (rtx src1, rtx src2) || (REGNO (src2) == SP_REGNUM)); } +/* LDRD in ARM mode needs consecutive registers to be stored. This function + keeps accumulating non-consecutive registers until first consecutive register + pair is found. It then generates multi-reg POP for all accumulated + registers, and then generates LDRD with write-back for consecutive register + pair. This process is repeated until all the registers are loaded from + stack. multi register POP takes care of lone registers as well. However, + LDRD cannot be generated for PC, as results are unpredictable. Hence, if PC + is in SAVED_REGS_MASK, generate multi-reg POP with RETURN or LDR with RETURN + depending upon number of registers in REGS_TO_BE_POPPED_MASK. */ +static void +arm_emit_ldrd_pop (unsigned long saved_regs_mask, bool really_return) +{ + int num_regs = 0; + int i, j; + rtx par = NULL_RTX; + rtx insn = NULL_RTX; + rtx dwarf = NULL_RTX; + rtx tmp; + unsigned long regs_to_be_popped_mask = 0; + bool pc_in_list = false; + + for (i = 0; i <= LAST_ARM_REGNUM; i++) +if (saved_regs_mask & (1 << i)) + num_regs++; + + gcc_assert (num_regs && num_regs <= 16); + + for (i = 0, j = 0; i < num_regs; j++) +if (saved_regs_mask & (1 << j)) + { +i++; +if ((j % 2) == 0 +&& (saved_regs_mask & (1 << (j + 1))) +&& (j + 1) != SP_REGNUM +&& (j + 1) != PC_REGNUM +&& regs_to_be_popped_mask) + { +/* Current register and next register form register pair for which + LDRD can be generated. 
Generate POP for accumulated registers + and reset regs_to_be_popped_mask. SP should be handled here as + the results are unpredictable if register being stored is same + as index register (in this case, SP). PC is always the last + register being popped. Hence, we don't have to worry about PC + here. */ +arm_emit_multi_reg_pop (regs_to_be_popped_mask, pc_in_list); +pc_in_list = false; +regs_to_be_popped_mask = 0; +continue; + } + +if (j == PC_REGNUM) + { +gcc_assert (really_return); +pc_in_list = 1; + } + +regs_to_be_popped_mask |= (1 << j); + +if ((j % 2) == 1 +&& (saved_regs_mask & (1 << (j - 1))) +&& j != SP_REGNUM +&& j != PC_REGNUM) + { + /* Generate a LDRD for register pair R_, R_. The pattern +generated here is +[(SET SP, (PLUS SP, 8)) + (SET R_, (MEM SP)) + (SET R_, (MEM (PLUS SP, 4)))]. */ + par = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (3)); + + tmp = gen_rtx_SET (VOIDmode, +stack_pointer_rtx, +plus_constant (stack_pointer_rtx, 8)); + RTX_FRAME_RELATED_P (tmp) = 1; + XVECEXP (par, 0, 0) = tmp; + + tmp = gen_rtx_SET (SImode, +gen_rtx_REG (SImode, j - 1), +gen_frame_mem (SImode, stack_pointer_rtx)); + RTX_FRAME_RELATED_P (tmp) = 1; + XVECEXP (par, 0, 1) = tmp; + dwarf = alloc_reg_note (REG_CFA_RESTORE, + gen_rtx_REG (SImode, j - 1), + dwarf); + + tmp = gen_rtx_SET (SImode, + gen_rtx_REG (SImode, j), + gen_frame_mem (SImode, + plus_constant (stack_pointer_rtx, 4))); + RTX_FRAME_RELATED_P (tmp) = 1; + XVECEXP (par, 0, 2) = tmp; + dwarf = alloc_reg_note (REG_CFA_RESTORE, + gen_rtx_REG (SImode, j), + dwarf); + + insn = emit_insn (par); + REG_NOTES (insn) = dwarf; + pc_in_list = false; + regs_to_be_popped_mask = 0; + dwarf = NULL_RTX; + } + } + + if (regs_to_be_popped_mask) +{ + /* single PC pop can happen here. Take care of that. */ + if
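The control flow above can be summarized: scan the mask from r0 upward, flush any accumulated non-pairable registers as one multi-reg POP whenever an LDRD-eligible pair (an even-numbered register followed by its saved successor, the successor being neither SP nor PC) is reached, and flush once more at the end. A counting sketch under those assumptions (SP_REGNUM = 13 and PC_REGNUM = 15 in the ARM port; this illustrates the grouping rule, not the GCC code):

```c
#include <assert.h>

#define SP_REGNUM 13
#define PC_REGNUM 15

/* LDRD-eligible pairs in an ARM-state pop mask.  */
static int
arm_ldrd_pairs (unsigned long mask)
{
  int pairs = 0;
  for (int j = 0; j <= 15; j++)
    if ((mask & (1ul << j)) && (j % 2) == 0
        && (mask & (1ul << (j + 1)))
        && j + 1 != SP_REGNUM && j + 1 != PC_REGNUM)
      {
        pairs++;
        j++;                    /* consume the pair */
      }
  return pairs;
}

/* Number of multi-reg POPs for the leftovers: non-pairable registers
   accumulate and are flushed when an LDRD pair interrupts the run,
   plus one final flush if any remain.  */
static int
arm_pop_batches (unsigned long mask)
{
  int batches = 0;
  unsigned long acc = 0;
  for (int j = 0; j <= 15; j++)
    {
      if (!(mask & (1ul << j)))
        continue;
      if ((j % 2) == 0 && (mask & (1ul << (j + 1)))
          && j + 1 != SP_REGNUM && j + 1 != PC_REGNUM)
        {
          if (acc)
            {
              batches++;        /* flush accumulated regs as one POP */
              acc = 0;
            }
          j++;                  /* pair handled by LDRD */
        }
      else
        acc |= 1ul << j;        /* accumulate for a later POP */
    }
  if (acc)
    batches++;
  return batches;
}
```

For instance, popping {r0, r2, r4, r5} emits one POP for r0/r2 followed by one LDRD for r4/r5, while {r4, r5, r6, r7} needs two LDRDs and no POP at all.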
Re: [RFA/ARM][Patch 01/02]: Thumb2 epilogue in RTL
Hi Richard, thanks for your comments. -- > + if (GET_CODE (SET_SRC (elt = XVECEXP (op, 0, offset_adj))) == PLUS) > > It's generally best not to use assignments within conditionals unless > there is a strong reason otherwise (that normally implies something like > being deep within a condition test where you only want to update the > variable if some pre-conditions are true and that can't be easily > factored out). > > + != (unsigned int) (first_dest_regno + regs_per_val * > (i - base > > Line length (split the line just before the '+' operator. > > + /* now show EVERY reg that will be restored, using a SET for each. */ > > Capital letter at start of sentence. Why is EVERY in caps? > > + saved_regs_mask = offsets->saved_regs_mask; > + for (i = 0, num_regs = 0; i <= LAST_ARM_REGNUM; i++) > > blank line before the for loop. > > + /* It's illegal to do a pop for only one reg, so generate an ldr. */ > > GCC coding standards suggest avoiding the use of 'illegal'. Suggest > changing that to 'Pop can only be used for more than one reg; so...' > > +reg_names[REGNO (XEXP (XVECEXP (operands[0], 0, 2), > 0))]); > + > +/* Skip over the first two elements and the one we just generated. > */ > +for (i = 3; i < (num_saves); i++) > + { > +strcat (pattern, \", %|\"); > > +strcat (pattern, > > +reg_names[REGNO (XEXP (XVECEXP (operands[0], 0, i), > 0))]); > + } > + > +strcat (pattern, \"}\"); > +output_asm_insn (pattern, operands); > + > > +return \"\"; > + } > + " > > + [(set_attr "type" "load4")] > > There's a lot of trailing white space here. Please remove. Removed white spaces in reworked patch http://gcc.gnu.org/ml/gcc-patches/2011-11/msg01009.html > > +(define_insn "*thumb2_ldr_with_return" > + [(return) > + (set (reg:SI PC_REGNUM) > +(mem:SI (post_inc:SI (match_operand:SI 0 "s_register_operand" > "k"] > + "TARGET_THUMB2" > + "ldr%?\t%|pc, [%0], #4" > + [(set_attr "type" "load1") > + (set_attr "predicable" "yes")] > +) > + > > This pattern doesn't seem to be used. 
What's its purpose? This pattern is generated from the following code in thumb2_expand_return: + if (num_regs == 1) +{ + rtx par = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (2)); + rtx reg = gen_rtx_REG (SImode, PC_REGNUM); + rtx addr = gen_rtx_MEM (SImode, + gen_rtx_POST_INC (SImode, + stack_pointer_rtx)); + set_mem_alias_set (addr, get_frame_alias_set ()); + XVECEXP (par, 0, 0) = ret_rtx; + XVECEXP (par, 0, 1) = gen_rtx_SET (SImode, reg, addr); + RTX_FRAME_RELATED_P (par) = 1; + emit_jump_insn (par); +} > > +static const struct { const char *const name; } table[] + = { {\"d0\"}, {\"d1\"}, {\"d2\"}, {\"d3\"}, > > I'm not keen on having this table. Generally the register names should > be configurable depending on the assembler flavour and this patch > defeats that. Is there any way to rewrite this code so that it can use > the standard operand methods for generating register names? The updated patch, which eliminates this table, was resent after comments from Ramana and Paul: http://gcc.gnu.org/ml/gcc-patches/2011-11/msg01009.html I will take care of the other formatting issues and resend the patch. > > In summary, this is mostly OK, apart from the last two items. > > R. - Thanks and regards, Sameera D.
Re: [RFA/ARM][Patch 01/02]: Thumb2 epilogue in RTL
On Thu, 2011-11-10 at 13:44 +, Richard Earnshaw wrote: > On 28/09/11 17:15, Sameera Deshpande wrote: > > Hi! > > > > This patch generates Thumb2 epilogues in RTL form. > > > > The work involves defining new functions, predicates and patterns along with > > few changes in existing code: > > * The load_multiple_operation predicate was found to be too restrictive for > > integer loads as it required consecutive destination regs, so this > > restriction was lifted. > > * Variations of load_multiple_operation were required to handle cases > >- where SP must be the base register > >- where FP values were being loaded (which do require consecutive > > destination registers) > >- where PC can be in register-list (which requires return pattern along > > with register loads). > > Hence, the common code was factored out into a new function in arm.c and > > parameterised to show > >- whether consecutive destination regs are needed > >- the data type being loaded > >- whether the base register has to be SP > >- whether PC is in register-list > > > > The patch is tested with arm-eabi with no regressions. > > > > ChangeLog: > > > > 2011-09-28 Ian Bolton > > Sameera Deshpande > > > >* config/arm/arm-protos.h (load_multiple_operation_p): New > > declaration. > > (thumb2_expand_epilogue): Likewise. > > (thumb2_output_return): Likewise > > (thumb2_expand_return): Likewise. > > (thumb_unexpanded_epilogue): Rename to... > > (thumb1_unexpanded_epilogue): ...this > >* config/arm/arm.c (load_multiple_operation_p): New function. > > (thumb2_emit_multi_reg_pop): Likewise. > > (thumb2_emit_vfp_multi_reg_pop): Likewise. > > (thumb2_expand_return): Likewise. > > (thumb2_expand_epilogue): Likewise. > > (thumb2_output_return): Likewise > > (thumb_unexpanded_epilogue): Rename to... > > ( thumb1_unexpanded_epilogue): ...this > >* config/arm/arm.md (pop_multiple_with_stack_update): New pattern. > > (pop_multiple_with_stack_update_and_return): Likewise. > > (thumb2_ldr_with_return): Likewise. 
> > (floating_point_pop_multiple_with_stack_update): Likewise. > > (return): Update condition and code for pattern. > > (arm_return): Likewise. > > (epilogue_insns): Likewise. > >* config/arm/predicates.md (load_multiple_operation): Update > > predicate. > > (load_multiple_operation_stack_and_return): New predicate. > > (load_multiple_operation_stack): Likewise. > > (load_multiple_operation_stack_fp): Likewise. > >* config/arm/thumb2.md (thumb2_return): Remove. > > (thumb2_rtl_epilogue_return): New pattern. > > > > > > - Thanks and regards, > > Sameera D. > > > > > > thumb2_rtl_epilogue_complete-27Sept.patch > > > > + if (GET_CODE (SET_SRC (elt = XVECEXP (op, 0, offset_adj))) == PLUS) > > It's generally best not to use assignments within conditionals unless > there is a strong reason otherwise (that normally implies something like > being deep within a condition test where you only want to update the > variable if some pre-conditions are true and that can't be easily > factored out). > > + != (unsigned int) (first_dest_regno + regs_per_val * > (i - base > > Line length (split the line just before the '+' operator. > > + /* now show EVERY reg that will be restored, using a SET for each. */ > > Capital letter at start of sentence. Why is EVERY in caps? > > + saved_regs_mask = offsets->saved_regs_mask; > + for (i = 0, num_regs = 0; i <= LAST_ARM_REGNUM; i++) > > blank line before the for loop. > > + /* It's illegal to do a pop for only one reg, so generate an ldr. */ > > GCC coding standards suggest avoiding the use of 'illegal'. Suggest > changing that to 'Pop can only be used for more than one reg; so...' > > +reg_names[REGNO (XEXP (XVECEXP (operands[0], 0, 2), > 0))]); > + > +/* Skip over the first two elements and the one we just generated. > */ > +for (i = 3; i < (num_saves); i++) > + { > +strcat (pattern, \", %|\"); > > +strcat (pattern, > > +reg_names[REGNO (XEXP (XVECEXP (operands[0], 0, i), > 0))]); > + } > + > +strc
[RFA/ARM][Patch 01/02]: Thumb2 epilogue in RTL
Hi! This patch generates Thumb2 epilogues in RTL form. The work involves defining new functions, predicates and patterns along with few changes in existing code: * The load_multiple_operation predicate was found to be too restrictive for integer loads as it required consecutive destination regs, so this restriction was lifted. * Variations of load_multiple_operation were required to handle cases - where SP must be the base register - where FP values were being loaded (which do require consecutive destination registers) - where PC can be in register-list (which requires return pattern along with register loads). Hence, the common code was factored out into a new function in arm.c and parameterised to show - whether consecutive destination regs are needed - the data type being loaded - whether the base register has to be SP - whether PC is in register-list The patch is tested with arm-eabi with no regressions. ChangeLog: 2011-09-28 Ian Bolton Sameera Deshpande * config/arm/arm-protos.h (load_multiple_operation_p): New declaration. (thumb2_expand_epilogue): Likewise. (thumb2_output_return): Likewise (thumb2_expand_return): Likewise. (thumb_unexpanded_epilogue): Rename to... (thumb1_unexpanded_epilogue): ...this * config/arm/arm.c (load_multiple_operation_p): New function. (thumb2_emit_multi_reg_pop): Likewise. (thumb2_emit_vfp_multi_reg_pop): Likewise. (thumb2_expand_return): Likewise. (thumb2_expand_epilogue): Likewise. (thumb2_output_return): Likewise (thumb_unexpanded_epilogue): Rename to... ( thumb1_unexpanded_epilogue): ...this * config/arm/arm.md (pop_multiple_with_stack_update): New pattern. (pop_multiple_with_stack_update_and_return): Likewise. (thumb2_ldr_with_return): Likewise. (floating_point_pop_multiple_with_stack_update): Likewise. (return): Update condition and code for pattern. (arm_return): Likewise. (epilogue_insns): Likewise. * config/arm/predicates.md (load_multiple_operation): Update predicate. 
        (load_multiple_operation_stack_and_return): New predicate.
        (load_multiple_operation_stack): Likewise.
        (load_multiple_operation_stack_fp): Likewise.
        * config/arm/thumb2.md (thumb2_return): Remove.
        (thumb2_rtl_epilogue_return): New pattern.

- Thanks and regards,
  Sameera D.

thumb2_rtl_epilogue_complete-27Sept.patch
Description: Binary data
Ping! Re: [RFA/ARM][Patch 01/02]: Thumb2 epilogue in RTL
Ping! http://gcc.gnu.org/ml/gcc-patches/2011-09/msg01854.html On Wed, 2011-09-28 at 17:15 +0100, Sameera Deshpande wrote: > Hi! > > This patch generates Thumb2 epilogues in RTL form. > > The work involves defining new functions, predicates and patterns along with > few changes in existing code: > * The load_multiple_operation predicate was found to be too restrictive for > integer loads as it required consecutive destination regs, so this > restriction was lifted. > * Variations of load_multiple_operation were required to handle cases >- where SP must be the base register >- where FP values were being loaded (which do require consecutive > destination registers) >- where PC can be in register-list (which requires return pattern along > with register loads). > Hence, the common code was factored out into a new function in arm.c and > parameterised to show >- whether consecutive destination regs are needed >- the data type being loaded >- whether the base register has to be SP >- whether PC is in register-list > > The patch is tested with arm-eabi with no regressions. > > ChangeLog: > > 2011-09-28 Ian Bolton > Sameera Deshpande > >* config/arm/arm-protos.h (load_multiple_operation_p): New > declaration. > (thumb2_expand_epilogue): Likewise. > (thumb2_output_return): Likewise > (thumb2_expand_return): Likewise. > (thumb_unexpanded_epilogue): Rename to... > (thumb1_unexpanded_epilogue): ...this >* config/arm/arm.c (load_multiple_operation_p): New function. > (thumb2_emit_multi_reg_pop): Likewise. > (thumb2_emit_vfp_multi_reg_pop): Likewise. > (thumb2_expand_return): Likewise. > (thumb2_expand_epilogue): Likewise. > (thumb2_output_return): Likewise > (thumb_unexpanded_epilogue): Rename to... > ( thumb1_unexpanded_epilogue): ...this >* config/arm/arm.md (pop_multiple_with_stack_update): New pattern. > (pop_multiple_with_stack_update_and_return): Likewise. > (thumb2_ldr_with_return): Likewise. > (floating_point_pop_multiple_with_stack_update): Likewise. 
> (return): Update condition and code for pattern. > (arm_return): Likewise. > (epilogue_insns): Likewise. >* config/arm/predicates.md (load_multiple_operation): Update > predicate. > (load_multiple_operation_stack_and_return): New predicate. > (load_multiple_operation_stack): Likewise. > (load_multiple_operation_stack_fp): Likewise. >* config/arm/thumb2.md (thumb2_return): Remove. > (thumb2_rtl_epilogue_return): New pattern. > > > - Thanks and regards, > Sameera D. --
Ping! Re: [RFA/ARM][Patch 02/02]: ARM epilogues in RTL
Ping! http://gcc.gnu.org/ml/gcc-patches/2011-09/msg01855.html On Wed, 2011-09-28 at 17:15 +0100, Sameera Deshpande wrote: > Hi! > > This patch generates ARM epilogue in RTL form. > > The work defines new functions and reuses most of the static functions and > patterns defined in the previous patch (Thumb2 epilogues in RTL) with minor > changes to handle mode specific details. > Hence, this patch depends completely on previous patch. > > It is tested with arm-eabi with no regression. > > ChangeLog: > > 2011-09-28 Sameera Deshpande > > >* config/arm/arm-protos.h (arm_expand_epilogue): New declarations. > (arm_expand_return): Likewise. > (thumb2_expand_epilogue): Add new boolean argument. >* config/arm/arm.c (print_multi_reg): Remove. > (vfp_output_fldmd): Likewise. > > (arm_output_epilogue): Likewise. > (output_return_instruction): Update the function. > (thumb2_emit_multi_reg_pop): Rename to... > (arm_emit_multi_reg_pop): ...this > (thumb2_emit_vfp_multi_reg_pop): Rename to... > (arm_emit_vfp_multi_reg_pop): ...this > (arm_emit_vfp_multi_reg_pop): Add new argument base_reg. > (arm_expand_return): New function. > > (arm_expand_epilogue): Likewise. > (thumb2_expand_epilogue): Add new argument is_sibling. >* config/arm/arm.md (pop_multiple_with_stack_update): Update > condition and code for pattern. > (arm_return): Likewise. > (pop_multiple_with_stack_update_and_return): Likewise. > (floating_point_pop_multiple_with_stack_update): Likewise. > (thumb2_ldr_with_return): Rename to... > (ldr_with_return): ...this > (ldr_with_return): Update condition. > (cond_return): Remove. > (cond_return_inverted): Likewise. > (return): Update code. > (epilogue): Likewise. > (sibcall_epilogue): Likewise. > (epilogue_insns): Update condition and code. > > > - Thanks and regards, > Sameera D. --
[RFA/ARM][Patch 00/05]: Introduction - Generate LDRD/STRD in prologue/epilogue instead of PUSH/POP.
This series of 5 patches generates LDRD/STRD instead of POP/PUSH in the epilogue/prologue for the ARM and Thumb-2 modes of A15.

Patch [1/5] introduces a new field in the tune which can be used to indicate whether LDRD/STRD are preferred over POP/PUSH by the specific core.

Patches [2-5/5] use this field to determine if LDRD/STRD can be generated instead of PUSH/POP in ARM and Thumb-2 mode.

Patch [2/5] generates LDRD instead of POP for the Thumb-2 epilogue in A15. This patch depends on patch [1/5].

Patch [3/5] generates STRD instead of PUSH for the Thumb-2 prologue in A15. This patch depends on the variables, functions and patterns defined in [1/5] and [2/5].

Patch [4/5] generates STRD instead of PUSH for the ARM prologue in A15. This patch depends on [1/5].

Patch [5/5] generates LDRD instead of POP for the ARM epilogue in A15. This patch depends on the variables, functions and patterns defined in [1/5] and [4/5].

All these patches depend upon the Thumb2/ARM RTL epilogue patches http://gcc.gnu.org/ml/gcc-patches/2011-09/msg01854.html, http://gcc.gnu.org/ml/gcc-patches/2011-09/msg01855.html submitted for review.

All these patches are applied in the given order and tested with check-gcc, check-gdb and bootstrap without regression.

In ARM mode, a significant performance improvement can be seen on some parts of a popular embedded consumer benchmark (~26%). However, in most of the cases, not much effect is seen on performance (~3% improvement).

In Thumb-2 mode, the performance improvement observed on the same parts of the benchmark is ~11% (2.5% improvement).
--
[RFA/ARM][Patch 01/05]: Create tune for Cortex-A15.
Hi! This patch adds new field in tune_params to indicate if LDRD/STRD are preferred over PUSH/POP in prologue/epilogue of specific core. It also creates new tune for cortex-A15 and updates tunes for other cores to set new field to default value. Changelog entry for Patch to create tune for cortex-a15: 2011-10-11 Sameera Deshpande * config/arm/arm-cores.def (cortex_a15): Update. * config/arm/arm-protos.h (struct tune_params): Add new field... (arm_gen_ldrd_strd): ... this. * config/arm/arm.c (arm_slowmul_tune): Add arm_gen_ldrd_strd field settings. (arm_fastmul_tune): Likewise. (arm_strongarm_tune): Likewise. (arm_xscale_tune): Likewise. (arm_9e_tune): Likewise. (arm_v6t2_tune): Likewise. (arm_cortex_tune): Likewise. (arm_cortex_a5_tune): Likewise. (arm_cortex_a9_tune): Likewise. (arm_fa726te_tune): Likewise. (arm_cortex_a15_tune): New variable. -- On Tue, 2011-10-11 at 10:08 +0100, Sameera Deshpande wrote: > This series of 5 patches generate LDRD/STRD instead of POP/PUSH in > epilogue/prologue for ARM and Thumb-2 mode of A15. > > Patch [1/5] introduces new field in tune which can be used to indicate > whether LDRD/STRD are preferred over POP/PUSH by the specific core. > > Patches [2-5/5] use this field to determine if LDRD/STRD can be > generated instead of PUSH/POP in ARM and Thumb-2 mode. > > Patch [2/5] generates LDRD instead of POP for Thumb-2 epilogue in A15. > This patch depends on patch [1/5]. > > Patch [3/5] generates STRD instead of PUSH for Thumb-2 prologue in A15. > This patch depends for variables, functions and patterns defined in > [1/5] and [2/5]. > > Patch [4/5] generates STRD instead of PUSH for ARM prologue in A15. This > patch depends on [1/5]. > > Patch [5/5] generates LDRD instead of POP for ARM epilogue in A15. This > patch depends for variables, functions and patterns defined in [1/5] and > [4/5]. 
> > All these patches depend upon the Thumb2/ARM RTL epilogue patches > http://gcc.gnu.org/ml/gcc-patches/2011-09/msg01854.html, > http://gcc.gnu.org/ml/gcc-patches/2011-09/msg01855.html submitted for > review. > > All these patches are applied in given order and tested with check-gcc, > check-gdb and bootstrap without regression. > > In case of ARM mode, significant performance improvement can be seen on > some parts of a popular embedded consumer benchmark (~26%). > However, in most of the cases, not much effect is seen on performance. > (~ 3% improvement) > > In case of thumb2, the performance improvement observed on same parts > the benchmark is ~11% (2.5% improvement). > diff --git a/gcc/config/arm/arm-cores.def b/gcc/config/arm/arm-cores.def index 742b5e8..1b42713 100644 --- a/gcc/config/arm/arm-cores.def +++ b/gcc/config/arm/arm-cores.def @@ -128,7 +128,7 @@ ARM_CORE("generic-armv7-a", genericv7a, 7A, FL_LDSCHED, cortex) ARM_CORE("cortex-a5", cortexa5, 7A, FL_LDSCHED, cortex_a5) ARM_CORE("cortex-a8", cortexa8, 7A, FL_LDSCHED, cortex) ARM_CORE("cortex-a9", cortexa9, 7A, FL_LDSCHED, cortex_a9) -ARM_CORE("cortex-a15", cortexa15, 7A, FL_LDSCHED | FL_THUMB_DIV | FL_ARM_DIV, cortex) +ARM_CORE("cortex-a15", cortexa15, 7A, FL_LDSCHED | FL_THUMB_DIV | FL_ARM_DIV, cortex_a15) ARM_CORE("cortex-r4", cortexr4, 7R, FL_LDSCHED, cortex) ARM_CORE("cortex-r4f", cortexr4f, 7R, FL_LDSCHED, cortex) ARM_CORE("cortex-r5", cortexr5, 7R, FL_LDSCHED | FL_ARM_DIV, cortex) diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index f69bc42..c6b8f71 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -243,6 +243,9 @@ struct tune_params int l1_cache_line_size; bool prefer_constant_pool; int (*branch_cost) (bool, bool); + /* This flag indicates if STRD/LDRD instructions are preferred + over PUSH/POP in epilogue/prologue. 
*/ + bool prefer_ldrd_strd; }; extern const struct tune_params *current_tune; diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 6c09267..d709375 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -850,7 +850,8 @@ const struct tune_params arm_slowmul_tune = 5, /* Max cond insns. */ ARM_PREFETCH_NOT_BENEFICIAL, true, /* Prefer constant pool. */ - arm_default_branch_cost + arm_default_branch_cost, + false /* Prefer LDRD/STRD. */ }; const struct tune_params arm_fastmul_tune = @@ -861,7 +862,8 @@ const struct tune_params arm_fastmul_tune = 5, /* Max cond insns. */ ARM_PREFETCH_NOT_BENEFICIAL, true, /* Prefer constant pool. */ - arm_default_bran
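The new tune_params field can be pictured with a toy model. The struct, core entries, and function below are simplified stand-ins of my own, not GCC's real tune_params definitions — they only show how a per-core boolean like prefer_ldrd_strd gates a code-generation choice (the later patches additionally fall back to the original PUSH/POP when optimizing for size):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_tune
{
  bool prefer_ldrd_strd;  /* models the field added by this patch */
};

/* Most cores default to false; the new Cortex-A15 tune sets it.  */
static const struct toy_tune toy_default_tune    = { false };
static const struct toy_tune toy_cortex_a15_tune = { true };

/* Emit LDRD/STRD prologue/epilogue sequences only when the current
   tune asks for them and we are not optimizing for size.  */
static bool
toy_use_ldrd_strd (const struct toy_tune *tune, bool optimize_size)
{
  return tune->prefer_ldrd_strd && !optimize_size;
}
```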
[RFA/ARM][Patch 02/05]: LDRD generation instead of POP in A15 Thumb2 epilogue.
Hi! This patch generates LDRD instead of POP for Thumb2 epilogue in A15. For optimize_size, original epilogue is generated for A15. The work involves defining new functions, predicates and patterns. As LDRD cannot be generated for PC, if PC is in register-list, LDRD is generated for all other registers in the list which can form register pair. Then LDR with return is generated if PC is the only register left to be popped, otherwise POP with return is generated. The patch is tested with check-gcc, check-gdb and bootstrap with no regression. Changelog entry for Patch to emit LDRD for thumb2 epilogue in A15: 2011-10-11 Sameera Deshpande * config/arm/arm-protos.h (bad_reg_pair_for_thumb_ldrd_strd): New declaration. * config/arm/arm.c (bad_reg_pair_for_thumb_ldrd_strd): New helper function. (thumb2_emit_ldrd_pop): New static function. (thumb2_expand_epilogue): Update functions. * config/arm/constraints.md (Pz): New constraint. * config/arm/ldmstm.md (thumb2_ldrd_base): New pattern. (thumb2_ldrd): Likewise. * config/arm/predicates.md (ldrd_immediate_operand): New predicate. -- diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index c6b8f71..06a67b5 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -202,6 +202,7 @@ extern void thumb_reload_in_hi (rtx *); extern void thumb_set_return_address (rtx, rtx); extern const char *thumb1_output_casesi (rtx *); extern const char *thumb2_output_casesi (rtx *); +extern bool bad_reg_pair_for_thumb_ldrd_strd (rtx, rtx); #endif /* Defined in pe.c. 
*/ diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index d709375..3eba510 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -15410,6 +15410,155 @@ arm_emit_vfp_multi_reg_pop (int first_reg, int num_regs, rtx base_reg) par = emit_insn (par); add_reg_note (par, REG_FRAME_RELATED_EXPR, dwarf); } +bool +bad_reg_pair_for_thumb_ldrd_strd (rtx src1, rtx src2) +{ + return (GET_CODE (src1) != REG + || GET_CODE (src2) != REG + || (REGNO (src1) == PC_REGNUM) + || (REGNO (src1) == SP_REGNUM) + || (REGNO (src1) == REGNO (src2)) + || (REGNO (src2) == PC_REGNUM) + || (REGNO (src2) == SP_REGNUM)); +} + +/* Generate and emit a pattern that will be recognized as LDRD pattern. If even + number of registers are being popped, multiple LDRD patterns are created for + all register pairs. If odd number of registers are popped, last register is + loaded by using LDR pattern. */ +static bool +thumb2_emit_ldrd_pop (unsigned long saved_regs_mask, bool really_return) +{ + int num_regs = 0; + int i, j; + rtx par = NULL_RTX; + rtx dwarf = NULL_RTX; + rtx tmp, reg, tmp1; + + for (i = 0; i <= LAST_ARM_REGNUM; i++) +if (saved_regs_mask & (1 << i)) + num_regs++; + + gcc_assert (num_regs && num_regs <= 16); + gcc_assert (really_return || ((saved_regs_mask & (1 << PC_REGNUM)) == 0)); + + if (really_return && (saved_regs_mask & (1 << PC_REGNUM))) +/* We cannot generate ldrd for PC. Hence, reduce the count if PC is + to be popped. So, if num_regs is even, now it will become odd, + and we can generate pop with PC. If num_regs is odd, it will be + even now, and ldr with return can be generated for PC. */ +num_regs--; + + for (i = 0, j = 0; i < (num_regs - (num_regs % 2)); j++) +/* Var j iterates over all the registers to gather all the registers in + saved_regs_mask. Var i gives index of saved registers in stack frame. + A PARALLEL RTX of register-pair is created here, so that pattern for + LDRD can be matched. 
As PC is always last register to be popped, and + we have already decremented num_regs if PC, we don't have to worry + about PC in this loop. */ +if (saved_regs_mask & (1 << j)) + { +gcc_assert (j != SP_REGNUM); + +/* Create RTX for memory load. New RTX is created for dwarf as + they are not sharable. */ +reg = gen_rtx_REG (SImode, j); +tmp = gen_rtx_SET (SImode, + reg, + gen_frame_mem (SImode, + plus_constant (stack_pointer_rtx, 4 * i))); + +tmp1 = gen_rtx_SET (SImode, + reg, + gen_frame_mem (SImode, + plus_constant (stack_pointer_rtx, 4 * i))); +RTX_FRAME_RELATED_P (tmp) = 1; +RTX_FRAME_RELATED_P (tmp1) = 1; + +if (i % 2 == 0) + { +/* When saved-register index (i) is even, the RTX to be emitted is + yet to be cre
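The pair/leftover accounting in thumb2_emit_ldrd_pop above can be modelled in a few lines of plain C. This is a simplified sketch with my own names, not the patch's code: LDRD cannot load PC, so PC is set aside first, the remaining registers are paired, and anything left over needs a different instruction (a lone LDR, or PC via POP-with-return / LDR-with-return):

```c
#include <assert.h>

#define TOY_PC_REGNUM 15

/* Return the number of LDRD pairs a pop mask yields; *non_ldrd
   receives the count of registers that cannot go through LDRD
   (a leftover single register, plus PC if present).  */
static int
toy_ldrd_pop_pairs (unsigned long saved_regs_mask, int *non_ldrd)
{
  int num_regs = 0;
  for (int i = 0; i <= 15; i++)
    if (saved_regs_mask & (1UL << i))
      num_regs++;

  int pc = (saved_regs_mask >> TOY_PC_REGNUM) & 1;
  num_regs -= pc;               /* PC is handled outside the LDRDs */

  *non_ldrd = (num_regs % 2) + pc;
  return num_regs / 2;          /* one LDRD per remaining pair */
}
```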
[RFA/ARM][Patch 03/05]: STRD generation instead of PUSH in A15 Thumb2 prologue.
Hi! This patch generates STRD instruction instead of PUSH in thumb2 mode for A15. For optimize_size, original prologue is generated for A15. The work involves defining new functions, predicates and patterns. The patch is tested with check-gcc, check-gdb and bootstrap with no regression. Changelog entries for the patch for STRD generation for a15-thumb2: 2011-10-11 Sameera Deshpande * config/arm/arm.c (thumb2_emit_strd_push): New static function. (arm_expand_prologue): Update. * config/arm/ldmstm.md (thumb2_strd): New pattern. (thumb2_strd_base): Likewise. -- diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 3eba510..fd8c31d 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -15095,6 +15095,125 @@ arm_output_function_epilogue (FILE *file ATTRIBUTE_UNUSED, } } +/* Generate and emit a pattern that will be recognized as STRD pattern. If even + number of registers are being pushed, multiple STRD patterns are created for + all register pairs. If odd number of registers are pushed, first register is + stored by using STR pattern. */ +static void +thumb2_emit_strd_push (unsigned long saved_regs_mask) +{ + int num_regs = 0; + int i, j; + rtx par = NULL_RTX; + rtx insn = NULL_RTX; + rtx dwarf = NULL_RTX; + rtx tmp, reg, tmp1; + + for (i = 0; i <= LAST_ARM_REGNUM; i++) +if (saved_regs_mask & (1 << i)) + num_regs++; + + gcc_assert (num_regs && num_regs <= 16); + + /* Pre-decrement the stack pointer, based on there being num_regs 4-byte + registers to push. */ + tmp = gen_rtx_SET (VOIDmode, + stack_pointer_rtx, + plus_constant (stack_pointer_rtx, -4 * num_regs)); + RTX_FRAME_RELATED_P (tmp) = 1; + insn = emit_insn (tmp); + + /* Create sequence for DWARF info. */ + dwarf = gen_rtx_SEQUENCE (VOIDmode, rtvec_alloc (num_regs + 1)); + + /* RTLs cannot be shared, hence create new copy for dwarf. 
*/ + tmp1 = gen_rtx_SET (VOIDmode, + stack_pointer_rtx, + plus_constant (stack_pointer_rtx, -4 * num_regs)); + RTX_FRAME_RELATED_P (tmp1) = 1; + XVECEXP (dwarf, 0, 0) = tmp1; + + for (i = num_regs - 1, j = LAST_ARM_REGNUM; i >= (num_regs % 2); j--) +/* Var j iterates over all the registers to gather all the registers in + saved_regs_mask. Var i gives index of register R_j in stack frame. + A PARALLEL RTX of register-pair is created here, so that pattern for + STRD can be matched. If num_regs is odd, 1st register will be pushed + using STR and remaining registers will be pushed with STRD in pairs. + If num_regs is even, all registers are pushed with STRD in pairs. + Hence, skip first element for odd num_regs. */ +if (saved_regs_mask & (1 << j)) + { +gcc_assert (j != SP_REGNUM); +gcc_assert (j != PC_REGNUM); + +/* Create RTX for store. New RTX is created for dwarf as + they are not sharable. */ +reg = gen_rtx_REG (SImode, j); +tmp = gen_rtx_SET (SImode, + gen_frame_mem + (SImode, +plus_constant (stack_pointer_rtx, 4 * i)), + reg); + +tmp1 = gen_rtx_SET (SImode, + gen_frame_mem + (SImode, +plus_constant (stack_pointer_rtx, 4 * i)), + reg); +RTX_FRAME_RELATED_P (tmp) = 1; +RTX_FRAME_RELATED_P (tmp1) = 1; + +if (((i - (num_regs % 2)) % 2) == 1) + /* When (i - (num_regs % 2)) is odd, the RTX to be emitted is yet to + be created. Hence create it first. The STRD pattern we are + generating is : + [ (SET (MEM (PLUS (SP) (NUM))) (reg_t1)) + (SET (MEM (PLUS (SP) (NUM + 4))) (reg_t2)) ] + were target registers need not be consecutive. */ + par = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (2)); + +/* Register R_j is added in PARALLEL RTX. If (i - (num_regs % 2)) is + even, the reg_j is added as 0th element and if it is odd, reg_i is + added as 1st element of STRD pattern shown above. 
*/ +XVECEXP (par, 0, ((i - (num_regs % 2)) % 2)) = tmp; +XVECEXP (dwarf, 0, (i + 1)) = tmp1; + +if (((i - (num_regs % 2)) % 2) == 0) + /* When (i - (num_regs % 2)) is even, RTXs for both the registers + to be loaded are generated in above given STRD pattern, and the + pattern can be emitted now. */ + emit_insn (par); + +i--; + } + + if ((num_regs % 2) == 1) +{ + /* If odd number of registers are pushed, generate STR pattern to store +
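The bookkeeping in thumb2_emit_strd_push above reduces to a small amount of arithmetic, sketched here as a toy model (my own helper names, offsets and counts only — not the RTL generation itself): SP is pre-decremented once by 4 * num_regs, an odd register count sends the first register out with a plain STR, the rest leave in STRD pairs, and the i-th saved register (lowest address first) lands at SP + 4 * i:

```c
#include <assert.h>

/* Returns the number of STRD pairs; *str_singles gets the count of
   lone registers stored with STR (0 or 1), *sp_adjust the single
   stack-pointer pre-decrement.  */
static int
toy_strd_push_plan (int num_regs, int *str_singles, int *sp_adjust)
{
  *sp_adjust = -4 * num_regs;   /* one SP update for the whole block */
  *str_singles = num_regs % 2;  /* odd count: first reg uses STR */
  return num_regs / 2;          /* remaining regs go out as STRD pairs */
}

/* Offset of the i-th saved register after the SP adjustment.  */
static int
toy_push_offset (int i)
{
  return 4 * i;
}
```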
[RFA/ARM][Patch 04/05]: STRD generation instead of PUSH in A15 ARM prologue.
Hi!

This patch generates STRD instead of PUSH in the prologue for A15 ARM mode. For optimize_size, the original prologue is generated for A15.

The work involves defining new functions, predicates and patterns, along with minor changes in existing code:
* STRD in ARM mode needs consecutive registers to be stored. The performance of the compiler degrades greatly if R3 is pushed for stack alignment, as that generates a single STR for pushing R3. Instead, using a SUB instruction to do the stack adjustment is more efficient. Hence, the condition in arm_get_frame_offsets () is changed to disable push-in-R3 if prefer_ldrd_strd in ARM mode.

In this patch we keep accumulating non-consecutive registers until a register-pair to be pushed is found. Then, we first PUSH all the accumulated registers, followed by STRD with pre-stack update for the register-pair. We repeat this until all the registers in the register-list are PUSHed.

The patch is tested with check-gcc, check-gdb and bootstrap with no regression.

Changelog entry for the patch to emit STRD for the ARM prologue in A15:

2011-10-11  Sameera Deshpande

        * config/arm/arm-protos.h (bad_reg_pair_for_arm_ldrd_strd): New declaration.
        * config/arm/arm.c (arm_emit_strd_push): New static function.
        (bad_reg_pair_for_arm_ldrd_strd): New helper function.
        (arm_expand_prologue): Update.
        (arm_get_frame_offsets): Update.
        * config/arm/ldmstm.md (arm_strd_base): New pattern.
-- diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 06a67b5..d5287ad 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -162,6 +162,7 @@ extern const char *arm_output_memory_barrier (rtx *); extern const char *arm_output_sync_insn (rtx, rtx *); extern unsigned int arm_sync_loop_insns (rtx , rtx *); extern int arm_attr_length_push_multi(rtx, rtx); +extern bool bad_reg_pair_for_arm_ldrd_strd (rtx, rtx); #if defined TREE_CODE extern void arm_init_cumulative_args (CUMULATIVE_ARGS *, tree, rtx, tree); diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index fd8c31d..08fa0d5 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -93,6 +93,7 @@ static bool arm_assemble_integer (rtx, unsigned int, int); static void arm_print_operand (FILE *, rtx, int); static void arm_print_operand_address (FILE *, rtx); static bool arm_print_operand_punct_valid_p (unsigned char code); +static rtx emit_multi_reg_push (unsigned long); static const char *fp_const_from_val (REAL_VALUE_TYPE *); static arm_cc get_arm_condition_code (rtx); static HOST_WIDE_INT int_log2 (HOST_WIDE_INT); @@ -15095,6 +15096,116 @@ arm_output_function_epilogue (FILE *file ATTRIBUTE_UNUSED, } } +/* STRD in ARM mode needs consecutive registers to be stored. This function + keeps accumulating non-consecutive registers until first consecutive register + pair is found. It then generates multi-reg PUSH for all accumulated + registers, and then generates STRD with write-back for consecutive register + pair. This process is repeated until all the registers are stored on stack. + multi-reg PUSH takes care of lone registers as well. 
*/ +static void +arm_emit_strd_push (unsigned long saved_regs_mask) +{ + int num_regs = 0; + int i, j; + rtx par = NULL_RTX; + rtx dwarf = NULL_RTX; + rtx insn = NULL_RTX; + rtx tmp, tmp1; + unsigned long regs_to_be_pushed_mask; + + for (i = 0; i <= LAST_ARM_REGNUM; i++) +if (saved_regs_mask & (1 << i)) + num_regs++; + + gcc_assert (num_regs && num_regs <= 16); + + for (i=0, j = LAST_ARM_REGNUM, regs_to_be_pushed_mask = 0; i < num_regs; j--) +/* Var j iterates over all registers to gather all registers in + saved_regs_mask. Var i is used to count number of registers stored on + stack. regs_to_be_pushed_mask accumulates non-consecutive registers + that can be pushed using multi-reg PUSH before STRD is generated. */ +if (saved_regs_mask & (1 << j)) + { +gcc_assert (j != SP_REGNUM); +gcc_assert (j != PC_REGNUM); +i++; + +if ((j % 2 == 1) +&& (saved_regs_mask & (1 << (j - 1))) +&& regs_to_be_pushed_mask) + { +/* Current register and previous register form register pair for + which STRD can be generated. Hence, emit PUSH for accumulated + registers and reset regs_to_be_pushed_mask. */ +insn = emit_multi_reg_push (regs_to_be_pushed_mask); +regs_to_be_pushed_mask = 0; +RTX_FRAME_RELATED_P (insn) = 1; +continue; + } + +regs_to_be_pushed_mask |= (1 << j); + +if ((j % 2) == 0 && (saved_regs_mask & (1 << (j + 1 + { +/* We have found 2 consecutive registers, for whi
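The ARM-mode constraint that arm_emit_strd_push works around — STRD needs an even/odd consecutive register pair such as r4/r5 — can be checked with a toy planner. This is a simplified model of my own (a greedy scan over even-aligned pairs), not the accumulate-and-flush loop in the patch: paired registers would each take one STRD, and the leftover registers would be grouped into ordinary PUSH instructions:

```c
#include <assert.h>

/* Count the even/odd consecutive pairs in a save mask; *leftover
   receives the number of unpaired registers (destined for PUSH).  */
static int
toy_arm_strd_pairs (unsigned long saved_regs_mask, int *leftover)
{
  int pairs = 0, rest = 0;
  for (int j = 0; j <= 14; j += 2)
    {
      int lo = (saved_regs_mask >> j) & 1;
      int hi = (saved_regs_mask >> (j + 1)) & 1;
      if (lo && hi)
        pairs++;          /* r_j / r_{j+1}: one STRD */
      else
        rest += lo + hi;  /* unpaired: goes into a PUSH */
    }
  *leftover = rest;
  return pairs;
}
```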
[RFA/ARM][Patch 05/05]: LDRD generation instead of POP in A15 ARM epilogue.
Hi! This patch generates LDRD instead of POP in epilogue for A15 ARM mode. For optimize_size, original epilogue is generated for A15. The work involves defining new functions, predicates and patterns. In this patch we keep on accumulating non-consecutive registers till register-pair to be popped is found. Then, first POP all the accumulated registers, followed by LDRD with post-stack update for register-pair. We repeat this until all the registers in register-list are POPPed. The patch is tested with check-gcc, check-gdb and bootstrap with no regression. Changelog entry for Patch to emit LDRD for ARM epilogue in A15: 2011-10-11 Sameera Deshpande * config/arm/arm.c (arm_emit_ldrd_pop): New static function. (arm_expand_epilogue): Update. * config/arm/ldmstm.md (arm_ldrd_base): New pattern. (arm_ldr_with_update): Likewise. -- diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index 08fa0d5..0b9fd93 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -967,7 +967,7 @@ const struct tune_params arm_cortex_a9_tune = ARM_PREFETCH_BENEFICIAL(4,32,32), false, /* Prefer constant pool. */ arm_default_branch_cost, - false /* Prefer LDRD/STRD. */ + true /* Prefer LDRD/STRD. */ }; const struct tune_params arm_fa726te_tune = @@ -15664,6 +15664,145 @@ bad_reg_pair_for_thumb_ldrd_strd (rtx src1, rtx src2) || (REGNO (src2) == SP_REGNUM)); } +/* LDRD in ARM mode needs consecutive registers to be stored. This function + keeps accumulating non-consecutive registers until first consecutive register + pair is found. It then generates multi-reg POP for all accumulated + registers, and then generates LDRD with write-back for consecutive register + pair. This process is repeated until all the registers are loaded from + stack. multi-reg POP takes care of lone registers as well. However, LDRD + cannot be generated for PC, as results are unpredictable. 
Hence, if PC is + in SAVED_REGS_MASK, generate multi-reg POP with RETURN or LDR with RETURN + depending upon number of registers in REGS_TO_BE_POPPED_MASK. */ +static void +arm_emit_ldrd_pop (unsigned long saved_regs_mask, bool really_return) +{ + int num_regs = 0; + int i, j; + rtx par = NULL_RTX; + rtx insn = NULL_RTX; + rtx dwarf = NULL_RTX; + rtx tmp, tmp1; + unsigned long regs_to_be_popped_mask = 0; + bool pc_in_list = false; + + for (i = 0; i <= LAST_ARM_REGNUM; i++) +if (saved_regs_mask & (1 << i)) + num_regs++; + + gcc_assert (num_regs && num_regs <= 16); + + for (i = 0, j = 0; i < num_regs; j++) +if (saved_regs_mask & (1 << j)) + { +i++; +if ((j % 2) == 0 +&& (saved_regs_mask & (1 << (j + 1))) +&& (j + 1) != SP_REGNUM +&& (j + 1) != PC_REGNUM +&& regs_to_be_popped_mask) + { +/* Current register and next register form register pair for which + LDRD can be generated. Generate POP for accumulated registers + and reset regs_to_be_popped_mask. SP should be handled here as + the results are unpredictable if register being stored is same + as index register (in this case, SP). PC is always the last + register being popped. Hence, we don't have to worry about PC + here. */ +arm_emit_multi_reg_pop (regs_to_be_popped_mask, pc_in_list); +pc_in_list = false; +regs_to_be_popped_mask = 0; +continue; + } + +if (j == PC_REGNUM) + { +gcc_assert (really_return); +pc_in_list = 1; + } + +regs_to_be_popped_mask |= (1 << j); + +if ((j % 2) == 1 +&& (saved_regs_mask & (1 << (j - 1))) +&& j != SP_REGNUM +&& j != PC_REGNUM) + { + /* Generate a LDRD for register pair R_, R_. The pattern +generated here is +[(SET SP, (PLUS SP, 8)) + (SET R_, (MEM SP)) + (SET R_, (MEM (PLUS SP, 4)))]. 
*/ + par = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (3)); + dwarf = gen_rtx_SEQUENCE (VOIDmode, rtvec_alloc (3)); + + tmp = gen_rtx_SET (VOIDmode, +stack_pointer_rtx, +plus_constant (stack_pointer_rtx, 8)); + tmp1 = gen_rtx_SET (VOIDmode, + stack_pointer_rtx, + plus_constant (stack_pointer_rtx, 8)); + RTX_FRAME_RELATED_P (tmp) = 1; +
[RFA/ARM][Patch]: Fix NEG_POOL_RANGE
Hi! Please find attached the patch updating NEG_POOL_RANGE from 1008 to 1020 -(8 + ). Tested with check-gcc with no regression. The test-case failing for patch 'STRD generation instead of PUSH in A15 ARM prologue' (http://gcc.gnu.org/ml/gcc-patches/2011-11/msg01158.html) passes with this fix. gcc/ChangeLog entry: 2011-11-17 Sameera Deshpande * config/arm/arm.md (arm_movdi): Update NEG_POOL_RANGE. (movdf_soft_insn): Likewise. * config/arm/fpa.md (thumb2_movdf_fpa): Likewise. * config/arm/neon.md (neon_mov): Likewise. * config/arm/vfp.md (movdi_vfp): Likewise. (movdi_vfp_cortexa8): Likewise. (movdf_vfp): Likewise. - Thanks and regards, Sameera D.*** gcc/config/arm/.svn/text-base/arm.md.svn-base Mon Oct 31 14:59:55 2011 --- gcc/config/arm/arm.md Thu Nov 17 11:52:38 2011 *** (define_insn "*arm_movdi" *** 5223,5229 [(set_attr "length" "8,12,16,8,8") (set_attr "type" "*,*,*,load2,store2") (set_attr "arm_pool_range" "*,*,*,1020,*") !(set_attr "arm_neg_pool_range" "*,*,*,1008,*") (set_attr "thumb2_pool_range" "*,*,*,4096,*") (set_attr "thumb2_neg_pool_range" "*,*,*,0,*")] ) --- 5223,5229 [(set_attr "length" "8,12,16,8,8") (set_attr "type" "*,*,*,load2,store2") (set_attr "arm_pool_range" "*,*,*,1020,*") !(set_attr "arm_neg_pool_range" "*,*,*,1004,*") (set_attr "thumb2_pool_range" "*,*,*,4096,*") (set_attr "thumb2_neg_pool_range" "*,*,*,0,*")] ) *** (define_insn "*movdf_soft_insn" *** 6583,6589 [(set_attr "length" "8,12,16,8,8") (set_attr "type" "*,*,*,load2,store2") (set_attr "pool_range" "*,*,*,1020,*") !(set_attr "arm_neg_pool_range" "*,*,*,1008,*") (set_attr "thumb2_neg_pool_range" "*,*,*,0,*")] ) --- 6583,6589 [(set_attr "length" "8,12,16,8,8") (set_attr "type" "*,*,*,load2,store2") (set_attr "pool_range" "*,*,*,1020,*") !(set_attr "arm_neg_pool_range" "*,*,*,1004,*") (set_attr "thumb2_neg_pool_range" "*,*,*,0,*")] ) *** gcc/config/arm/.svn/text-base/neon.md.svn-base Mon Oct 31 14:59:54 2011 --- gcc/config/arm/neon.md Thu Nov 17 11:52:38 2011 *** (define_insn 
"*neon_mov" *** 198,204 (set_attr "insn" "*,*,*,*,*,*,mov,*,*") (set_attr "length" "4,4,4,4,4,4,8,8,8") (set_attr "pool_range" "*,*,*,1020,*,*,*,1020,*") ! (set_attr "neg_pool_range" "*,*,*,1008,*,*,*,1008,*")]) (define_insn "*neon_mov" [(set (match_operand:VQXMOV 0 "nonimmediate_operand" --- 198,204 (set_attr "insn" "*,*,*,*,*,*,mov,*,*") (set_attr "length" "4,4,4,4,4,4,8,8,8") (set_attr "pool_range" "*,*,*,1020,*,*,*,1020,*") ! (set_attr "neg_pool_range" "*,*,*,1004,*,*,*,1004,*")]) (define_insn "*neon_mov" [(set (match_operand:VQXMOV 0 "nonimmediate_operand" *** (define_insn "*neon_mov" *** 243,249 (set_attr "insn" "*,*,*,*,*,*,mov,*,*") (set_attr "length" "4,8,4,8,8,8,16,8,16") (set_attr "pool_range" "*,*,*,1020,*,*,*,1020,*") !(set_attr "neg_pool_range" "*,*,*,1008,*,*,*,1008,*")]) (define_expand "movti" [(set (match_operand:TI 0 "nonimmediate_operand" "") --- 243,249 (set_attr "insn" "*,*,*,*,*,*,mov,*,*") (set_attr "length" "4,8,4,8,8,8,16,8,16") (set_attr "pool_range" "*,*,*,1020,*,*,*,1020,*") !(set_attr "neg_pool_range" "*,*,*,996,*,*,*,996,*")]) (define_expand "movti" [(set (match_operand:TI 0 "nonimmediate_operand" "") *** gcc/config/arm/.svn/text-base/vfp.md.svn-base Thu Sep 29 10:31:13 2011 --- gcc/config/arm/vfp.md Thu Nov 17 11:52:38 2011 *** (define_insn "*movdi_vfp" *** 178,184 (const_int 4))] (const_int 4))) (set_attr "pool_range" "*,*,*,*,1020,4096,*,*,*,*,1020,*") !(set_attr "neg_pool_range" "*,*,*,*,1008,0,*,*,*,*,1008,*") (set_attr "arch" "t2,any,any,any,a,t2,any,any,any,any,any,an
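The arithmetic behind this fix is the rule stated in the review exchange: the negative pool range of an insn should be 1020 minus (8 plus the insn length in bytes). A tiny helper (the function name is mine, for illustration) makes it explicit, and reproduces the values in the patch — 1004 for the 8-byte moves, 996 for the 16-byte NEON moves, 1008 for a 4-byte load:

```c
#include <assert.h>

/* neg_pool_range = 1020 - (8 + insn length in bytes), per the rule
   this patch applies to arm.md, fpa.md, neon.md and vfp.md.  */
static int
toy_neg_pool_range (int insn_length)
{
  return 1020 - (8 + insn_length);
}
```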
Re: Ping! Re: [RFA/ARM][Patch 02/02]: ARM epilogues in RTL
On Fri, 2011-11-18 at 21:45 +, Ramana Radhakrishnan wrote: > On 5 October 2011 17:04, Sameera Deshpande wrote: > > Ping! > > > > http://gcc.gnu.org/ml/gcc-patches/2011-09/msg01855.html > > > This should now be rebased given your other changes to the Thumb2 > epilogues patch or is there a newer version of this patch . This > doesn't apply cleanly on top of the other patch. Ramana, I have already sent the rebased patch at gcc.gnu.org/ml/gcc-patches/2011-11/msg00885.html. Can you please see if it is working? --
RE: [RFA/ARM][Patch 01/02]: Thumb2 epilogue in RTL
On Tue, 2011-11-22 at 01:55 +, Xinyu Qi wrote: > At 2011-11-19 07:11:17,"Ramana Radhakrishnan" > wrote: > > On 10 November 2011 18:07, Sameera Deshpande > > wrote: > > > Please find attached the reworked patch. > > > > OK but for a very small bit . > > > > I'll note that we need to add support for the iwMMXt registers but the > > attached patch (untested) should be what is needed to support the iwMMxt > > registers and this should be something that should be incorporated into your > > ARM RTL epilogues patch as well. My understanding is that this is all you > > should need to do as there is a wldrd as a part of *iwmmx*_movdi instruction > > and therefore this should just work in this form. > > > > Xinyu can you help in testing this patch once this is applied along with > > your > > other patches to turn on Thumb2 for iwmmxt ? > > > > cheers > > Ramana > > Hi, Ramana, > > Though I don't fully understand this patch, I think it is unnecessary to > consider turn on Thumb2 for iwmmxt, > because there is no chip that supports WMMX instructions with thumb2 mode. > > Thanks, > Xinyu > Ramana, in that case, should I add the change you suggested in ARM RTL epilogue patch only? --
Re: [RFA/ARM][Patch]: Fix NEG_POOL_RANGE
On Fri, 2011-11-18 at 23:12 +, Ramana Radhakrishnan wrote:
> On 17 November 2011 15:16, Sameera Deshpande wrote:
> > Hi!
> >
> > Please find attached the patch updating NEG_POOL_RANGE from 1008 to
> > 1020 - (8 + ).
>
> This is OK - can you add a comment around the neg_pool_range attribute
> in arm.md stating that the limit should essentially be - (8 + )?

Hi Ramana,

Thanks for your comment. Please find attached the updated patch.

--

*** /work/spec-test/local-checkouts/gcc-fsf/gcc/config/arm/arm.md	2011-11-22 17:20:36.0 +
--- gcc/config/arm/arm.md	2011-11-22 17:14:48.0 +
*** (define_attr "enabled" "no,yes"
*** 268,274 ****
  ; can be placed.  If the distance is zero, then this insn will never
  ; reference the pool.
  ; NEG_POOL_RANGE is nonzero for insns that can reference a constant pool entry
! ; before its address.
  (define_attr "arm_pool_range" "" (const_int 0))
  (define_attr "thumb2_pool_range" "" (const_int 0))
  (define_attr "arm_neg_pool_range" "" (const_int 0))
--- 268,274 ----
  ; can be placed.  If the distance is zero, then this insn will never
  ; reference the pool.
  ; NEG_POOL_RANGE is nonzero for insns that can reference a constant pool entry
! ; before its address.  It is set to - (8 + ).
(define_attr "arm_pool_range" "" (const_int 0)) (define_attr "thumb2_pool_range" "" (const_int 0)) (define_attr "arm_neg_pool_range" "" (const_int 0)) *** (define_insn "*arm_movdi" *** 5223,5229 [(set_attr "length" "8,12,16,8,8") (set_attr "type" "*,*,*,load2,store2") (set_attr "arm_pool_range" "*,*,*,1020,*") !(set_attr "arm_neg_pool_range" "*,*,*,1008,*") (set_attr "thumb2_pool_range" "*,*,*,4096,*") (set_attr "thumb2_neg_pool_range" "*,*,*,0,*")] ) --- 5223,5229 [(set_attr "length" "8,12,16,8,8") (set_attr "type" "*,*,*,load2,store2") (set_attr "arm_pool_range" "*,*,*,1020,*") !(set_attr "arm_neg_pool_range" "*,*,*,1004,*") (set_attr "thumb2_pool_range" "*,*,*,4096,*") (set_attr "thumb2_neg_pool_range" "*,*,*,0,*")] ) *** (define_insn "*movdf_soft_insn" *** 6583,6589 [(set_attr "length" "8,12,16,8,8") (set_attr "type" "*,*,*,load2,store2") (set_attr "pool_range" "*,*,*,1020,*") !(set_attr "arm_neg_pool_range" "*,*,*,1008,*") (set_attr "thumb2_neg_pool_range" "*,*,*,0,*")] ) --- 6583,6589 [(set_attr "length" "8,12,16,8,8") (set_attr "type" "*,*,*,load2,store2") (set_attr "pool_range" "*,*,*,1020,*") !(set_attr "arm_neg_pool_range" "*,*,*,1004,*") (set_attr "thumb2_neg_pool_range" "*,*,*,0,*")] ) *** /work/spec-test/local-checkouts/gcc-fsf/gcc/config/arm/fpa.md 2011-11-22 17:18:37.0 + --- gcc/config/arm/fpa.md 2011-11-22 17:14:48.0 + *** (define_insn "*thumb2_movdf_fpa" *** 671,677 (set_attr "type" "load1,store2,*,store2,load1,ffarith,ffarith,f_fpa_load,f_fpa_store,r_mem_f,f_mem_r") (set_attr "pool_range" "*,*,*,*,4092,*,*,1024,*,*,*") !(set_attr "neg_pool_range" "*,*,*,*,0,*,*,1020,*,*,*")] ) ;; Saving and restoring the floating point registers in the prologue should --- 671,677 (set_attr "type" "load1,store2,*,store2,load1,ffarith,ffarith,f_fpa_load,f_fpa_store,r_mem_f,f_mem_r") (set_attr "pool_range" "*,*,*,*,4092,*,*,1024,*,*,*") !(set_attr "neg_pool_range" "*,*,*,*,0,*,*,1008,*,*,*")] ) ;; Saving and restoring the floating point registers in the prologue 
should *** /work/spec-test/local-checkouts/gcc-fsf/gcc/config/arm/neon.md 2011-11-22 17:18:37.0 + --- gcc/config/arm/neon.md 2011-11-22 17:14:48.0 + *** (define_insn "*neon_mov" *** 198,204 (set_attr "insn" "*,*,*,*,*,*,mov,*,*") (set_attr "length" "4,4,4,4,4,4,8,8,8") (set_attr "pool_range" "*,*,*,1020,*,*,*,1020,*") ! (set_attr "neg_pool_range" "*,*,*,1008,*,*,*,1008,*")]) (define_insn "*neon_mov" [(set (match_operand:VQXMOV 0 "nonimmediate_operand" --- 198,204 (set_attr "
[Patch] Fix Bug 51162
Hi,

Please find attached the patch fixing bugzilla issue
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51162.

The ARM architecture implements vec_[load|store]_lanes, which are
implemented as internal function calls. The function gimple_call_fn ()
returns NULL for internal calls. Hence, this patch guards the dereferences
of 'fn' in dump_gimple_call (). Tests in gcc.dg/vect that were failing with
a segmentation fault pass with this patch.

gcc/ChangeLog entry:

2011-11-24  Sameera Deshpande

	* gimple-pretty-print.c (dump_gimple_call): Check if fn is NULL
	before dereferencing.

--

diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index f0e7c50..6d96868 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -699,11 +699,12 @@ dump_gimple_call (pretty_printer *buffer, gimple gs, int spc, int flags)
     pp_string (buffer, " [tail call]");
 
   /* Dump the arguments of _ITM_beginTransaction sanely.  */
-  if (TREE_CODE (fn) == ADDR_EXPR)
+  if (fn != NULL && TREE_CODE (fn) == ADDR_EXPR)
     fn = TREE_OPERAND (fn, 0);
-  if (TREE_CODE (fn) == FUNCTION_DECL && decl_is_tm_clone (fn))
+  if (fn != NULL && TREE_CODE (fn) == FUNCTION_DECL && decl_is_tm_clone (fn))
     pp_string (buffer, " [tm-clone]");
-  if (TREE_CODE (fn) == FUNCTION_DECL
+  if (fn != NULL
+      && TREE_CODE (fn) == FUNCTION_DECL
       && DECL_BUILT_IN_CLASS (fn) == BUILT_IN_NORMAL
       && DECL_FUNCTION_CODE (fn) == BUILT_IN_TM_START
       && gimple_call_num_args (gs) > 0)
Added myself to MAINTAINERS: write after approval
Committed.

--

Index: MAINTAINERS
===
--- MAINTAINERS (revision 181721)
+++ MAINTAINERS (working copy)
@@ -345,6 +345,7 @@
 David Daney			david.da...@caviumnetworks.com
 Bud Davis			jmda...@link.com
 Chris Demetriou		c...@google.com
+Sameera Deshpande		sameera.deshpa...@arm.com
 François Dumont		fdum...@gcc.gnu.org
 Benoit Dupont de Dinechin	benoit.dupont-de-dinec...@st.com
 Michael Eager			ea...@eagercon.com
Re: [RFA/ARM][Patch 01/02]: Thumb2 epilogue in RTL
On Tue, 2011-11-22 at 10:37 +, Ramana Radhakrishnan wrote: > Xinyu: I seem to have mis-remembered that one of your patches was > turning on Thumb2 for wMMX. > > > > Ramana, in that case, should I add the change you suggested in ARM RTL > > epilogue patch only? > > The comment in Thumb2 epilogues should remain and yes - it should be > added to the ARM RTL epilogue patch only. I'm also ok with that being > in with a #if 0 around it but given it's in the epilogue whoever tries > turning on Thumb2 for iwMMX will surely notice that in the first > testrun :) Ramana, Please find attached updated patch which sets CFA_RESTORE note for single register pop and fixing new ICEs in check-gcc at trunk. The patch is tested with check-gcc, bootstrap and check-gdb without regression. -- diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 23a29c6..2c38883 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -65,6 +65,7 @@ extern int thumb1_legitimate_address_p (enum machine_mode, rtx, int); extern int arm_const_double_rtx (rtx); extern int neg_const_double_rtx_ok_for_fpa (rtx); extern int vfp3_const_double_rtx (rtx); +extern bool load_multiple_operation_p (rtx, bool, enum machine_mode, bool); extern int neon_immediate_valid_for_move (rtx, enum machine_mode, rtx *, int *); extern int neon_immediate_valid_for_logic (rtx, enum machine_mode, int, rtx *, int *); @@ -176,10 +177,13 @@ extern int arm_float_words_big_endian (void); /* Thumb functions. 
*/ extern void arm_init_expanders (void); -extern const char *thumb_unexpanded_epilogue (void); +extern const char *thumb1_unexpanded_epilogue (void); extern void thumb1_expand_prologue (void); extern void thumb1_expand_epilogue (void); extern const char *thumb1_output_interwork (void); +extern void thumb2_expand_epilogue (void); +extern void thumb2_output_return (rtx); +extern void thumb2_expand_return (void); #ifdef TREE_CODE extern int is_called_in_ARM_mode (tree); #endif diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index e3b0b88..40c8b44 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -8906,6 +8906,139 @@ neon_valid_immediate (rtx op, enum machine_mode mode, int inverse, #undef CHECK } +/* Return true if OP is a valid load multiple operation for MODE mode. + CONSECUTIVE is true if the registers in the operation must form + a consecutive sequence in the register bank. STACK_ONLY is true + if the base register must be the stack pointer. RETURN_PC is true + if value is to be loaded in PC. */ +bool +load_multiple_operation_p (rtx op, bool consecutive, enum machine_mode mode, + bool return_pc) +{ + HOST_WIDE_INT count = XVECLEN (op, 0); + unsigned dest_regno, first_dest_regno; + rtx src_addr; + HOST_WIDE_INT i = 1, base = 0; + HOST_WIDE_INT offset = 0; + rtx elt; + bool addr_reg_loaded = false; + bool update = false; + int reg_increment, regs_per_val; + int offset_adj; + + /* If DFmode, we must be asking for consecutive, + since fldmdd can only do consecutive regs. */ + gcc_assert ((mode != DFmode) || consecutive); + + /* Set up the increments and the regs per val based on the mode. */ + reg_increment = GET_MODE_SIZE (mode); + regs_per_val = mode == DFmode ? 2 : 1; + offset_adj = return_pc ? 1 : 0; + + if (count <= 1 + || GET_CODE (XVECEXP (op, 0, offset_adj)) != SET + || !REG_P (SET_DEST (XVECEXP (op, 0, offset_adj +return false; + + /* Check to see if this might be a write-back. 
*/ + elt = XVECEXP (op, 0, offset_adj); + if (GET_CODE (SET_SRC (elt)) == PLUS) +{ + i++; + base = 1; + update = true; + + /* The offset adjustment should be same as number of registers being + popped * size of single register. */ + if (!REG_P (SET_DEST (elt)) + || !REG_P (XEXP (SET_SRC (elt), 0)) + || !CONST_INT_P (XEXP (SET_SRC (elt), 1)) + || INTVAL (XEXP (SET_SRC (elt), 1)) != + ((count - 1 - offset_adj) * reg_increment)) +return false; +} + + i = i + offset_adj; + base = base + offset_adj; + /* Perform a quick check so we don't blow up below. */ + if (GET_CODE (XVECEXP (op, 0, i - 1)) != SET + || !REG_P (SET_DEST (XVECEXP (op, 0, i - 1))) + || !MEM_P (SET_SRC (XVECEXP (op, 0, i - 1 +return false; + + /* If only one reg being loaded, success depends on the type: + FLDMDD can do just one reg, LDM must do at least two. */ + if (count <= i) +return mode == DFmode ? true : false; + + first_dest_regno = REGNO (SET_DEST (XVECEXP (op, 0, i - 1))); + dest_regno = first_dest_regno; + + src_addr = XEXP (SET_SRC (XVECEXP (op, 0, i - 1)), 0); + + if (GET_CODE (src_addr) == PLUS) +{ + if (!CONST_INT_P (XEXP (src_addr, 1))) +return false; + offset = INTVAL (XEXP (src_addr, 1)); + src_addr = XEXP (src_addr, 0); +} + + if (!REG_P (src_addr)) +return false; + + /* T
Re: [Patch] Fix Bug 51162
On Wed, 2011-11-30 at 19:43 +, Jason Merrill wrote:
> On 11/24/2011 05:42 AM, Sameera Deshpande wrote:
> > -  if (TREE_CODE (fn) == ADDR_EXPR)
> > +  if (fn != NULL && TREE_CODE (fn) == ADDR_EXPR)
> >      fn = TREE_OPERAND (fn, 0);
> > -  if (TREE_CODE (fn) == FUNCTION_DECL && decl_is_tm_clone (fn))
> > +  if (fn != NULL && TREE_CODE (fn) == FUNCTION_DECL && decl_is_tm_clone (fn))
> >      pp_string (buffer, " [tm-clone]");
> > -  if (TREE_CODE (fn) == FUNCTION_DECL
> > +  if (fn != NULL
>
> I'd rather not add the null check so many times. How about just
> returning if fn is null?
>
> Jason

Jason,

Thanks for your comment. Please find attached the reworked patch, which
returns early if fn is NULL. The patch is tested with check-gcc for ARM.

--

diff --git a/gcc/gimple-pretty-print.c b/gcc/gimple-pretty-print.c
index f0e7c50..3b5f670 100644
--- a/gcc/gimple-pretty-print.c
+++ b/gcc/gimple-pretty-print.c
@@ -698,6 +698,9 @@ dump_gimple_call (pretty_printer *buffer, gimple gs, int spc, int flags)
   if (gimple_call_tail_p (gs))
     pp_string (buffer, " [tail call]");
 
+  if (fn == NULL)
+    return;
+
   /* Dump the arguments of _ITM_beginTransaction sanely.  */
   if (TREE_CODE (fn) == ADDR_EXPR)
     fn = TREE_OPERAND (fn, 0);
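Jason's suggestion (one early return instead of repeating the NULL guard at
every use) generalizes beyond this function. The sketch below illustrates
only the pattern; `struct mini_call` and `dump_mini_call` are hypothetical
stand-ins invented for this example, not GCC's gimple or pretty-printer
API.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical miniature of a dump routine (not GCC code).  fn_name is
   NULL for "internal" calls, mirroring gimple_call_fn () returning NULL. */
struct mini_call
{
  const char *fn_name;   /* NULL for internal calls.  */
  int tail_call_p;
};

static void
dump_mini_call (const struct mini_call *call, char *buf, size_t size)
{
  buf[0] = '\0';
  if (call->tail_call_p)
    snprintf (buf + strlen (buf), size - strlen (buf), " [tail call]");

  /* Internal call: nothing below can be printed, so bail out once
     instead of guarding each subsequent dereference individually.  */
  if (call->fn_name == NULL)
    return;

  snprintf (buf + strlen (buf), size - strlen (buf), " %s", call->fn_name);
}
```

The early return keeps the later conditions focused on what they actually
test, which is the readability point of Jason's review comment.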
Re: [RFA/ARM][Patch 02/05]: LDRD generation instead of POP in A15 Thumb2 epilogue.
Hi! Please find attached revised LDRD generation patch for A15 Thumb-2 mode. Because of the major rework in ARM and Thumb-2 RTL epilogue patches, this patch has undergone some changes. The patch is tested with check-gcc, bootstrap and check-gdb without regression. Ok for trunk? -- diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 64d5993..49aae52 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -201,6 +201,7 @@ extern void thumb_reload_in_hi (rtx *); extern void thumb_set_return_address (rtx, rtx); extern const char *thumb1_output_casesi (rtx *); extern const char *thumb2_output_casesi (rtx *); +extern bool bad_reg_pair_for_thumb_ldrd_strd (rtx, rtx); #endif /* Defined in pe.c. */ diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index d671281..6d008c5 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -15847,6 +15847,154 @@ arm_emit_vfp_multi_reg_pop (int first_reg, int num_regs, rtx base_reg) REG_NOTES (par) = dwarf; } +bool +bad_reg_pair_for_thumb_ldrd_strd (rtx src1, rtx src2) +{ + return (GET_CODE (src1) != REG + || GET_CODE (src2) != REG + || (REGNO (src1) == PC_REGNUM) + || (REGNO (src1) == SP_REGNUM) + || (REGNO (src1) == REGNO (src2)) + || (REGNO (src2) == PC_REGNUM) + || (REGNO (src2) == SP_REGNUM)); +} + +/* Generate and emit a pattern that will be recognized as LDRD pattern. If even + number of registers are being popped, multiple LDRD patterns are created for + all register pairs. If odd number of registers are popped, last register is + loaded by using LDR pattern. 
*/ +static void +thumb2_emit_ldrd_pop (unsigned long saved_regs_mask, bool really_return) +{ + int num_regs = 0; + int i, j; + rtx par = NULL_RTX; + rtx dwarf = NULL_RTX; + rtx tmp, reg, tmp1; + + for (i = 0; i <= LAST_ARM_REGNUM; i++) +if (saved_regs_mask & (1 << i)) + num_regs++; + + gcc_assert (num_regs && num_regs <= 16); + gcc_assert (really_return || ((saved_regs_mask & (1 << PC_REGNUM)) == 0)); + + /* We cannot generate ldrd for PC. Hence, reduce the count if PC is + to be popped. So, if num_regs is even, now it will become odd, + and we can generate pop with PC. If num_regs is odd, it will be + even now, and ldr with return can be generated for PC. */ + if (really_return && (saved_regs_mask & (1 << PC_REGNUM))) +num_regs--; + + /* Var j iterates over all the registers to gather all the registers in + saved_regs_mask. Var i gives index of saved registers in stack frame. + A PARALLEL RTX of register-pair is created here, so that pattern for + LDRD can be matched. As PC is always last register to be popped, and + we have already decremented num_regs if PC, we don't have to worry + about PC in this loop. */ + for (i = 0, j = 0; i < (num_regs - (num_regs % 2)); j++) +if (saved_regs_mask & (1 << j)) + { +gcc_assert (j != SP_REGNUM); + +/* Create RTX for memory load. */ +reg = gen_rtx_REG (SImode, j); +tmp = gen_rtx_SET (SImode, + reg, + gen_frame_mem (SImode, + plus_constant (stack_pointer_rtx, 4 * i))); +RTX_FRAME_RELATED_P (tmp) = 1; + +if (i % 2 == 0) + { +/* When saved-register index (i) is even, the RTX to be emitted is + yet to be created. Hence create it first. The LDRD pattern we + are generating is : + [ (SET (reg_t0) (MEM (PLUS (SP) (NUM + (SET (reg_t1) (MEM (PLUS (SP) (NUM + 4 ] + where target registers need not be consecutive. */ +par = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (2)); +dwarf = NULL_RTX; + } + +/* ith register is added in PARALLEL RTX. 
If i is even, the reg_i is + added as 0th element and if i is odd, reg_i is added as 1st element + of LDRD pattern shown above. */ +XVECEXP (par, 0, (i % 2)) = tmp; +dwarf = alloc_reg_note (REG_CFA_RESTORE, reg, dwarf); + +if ((i % 2) == 1) + { +/* When saved-register index (i) is odd, RTXs for both the registers + to be loaded are generated in above given LDRD pattern, and the + pattern can be emitted now. */ +par = emit_insn (par); +REG_NOTES (par) = dwarf; + } + +i++; + } + + /* If the number of registers pushed is odd AND really_return is false OR + number of registers are even AND really_return is true, last register is + popped using LDR. It can be PC as well. Hence, adjust the stack first and + then LDR with post increment. */ + + /* Increment the stack pointer, based on there being + num_regs 4-byte registers to restore. */ + tmp = gen_rtx_SET (VOIDmode, + stack_pointer_rtx, +
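The popping strategy described in the comments above (LDRD for pairs of
not-necessarily-consecutive registers, a single LDR for an odd leftover
register or for PC) can be sketched as a small mask-counting helper. This
is a hedged illustration of the instruction-count arithmetic only;
`count_thumb2_pop_insns` and the simplified PC handling are assumptions
made for the sketch, not GCC's actual interfaces.

```c
#define PC_REGNUM 15

/* Illustrative sketch only (not the GCC code): given SAVED_REGS_MASK
   (bit N set means register rN is restored), count how many LDRD pairs
   and how many single loads a Thumb-2 epilogue in the style above would
   emit.  In Thumb-2 the two LDRD destinations need not be consecutive
   registers, so any two saved registers can be paired; LDRD can never
   target PC, so PC is always loaded on its own.  */
static void
count_thumb2_pop_insns (unsigned long saved_regs_mask, int really_return,
                        int *ldrd_pairs, int *single_loads)
{
  int num_regs = 0;
  int i;

  for (i = 0; i <= PC_REGNUM; i++)
    if (saved_regs_mask & (1UL << i))
      num_regs++;

  /* PC cannot go in an LDRD: take it out of the pairing count.  */
  int pc_popped = (really_return
                   && (saved_regs_mask & (1UL << PC_REGNUM))) ? 1 : 0;
  num_regs -= pc_popped;

  *ldrd_pairs = num_regs / 2;                 /* arbitrary register pairs */
  *single_loads = (num_regs % 2) + pc_popped; /* odd leftover, plus PC */
}
```

Under this model, popping {r4, r5, r6, pc} with a return gives one LDRD
pair plus two single loads (the leftover register and pc).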
Re: [RFA/ARM][Patch 05/05]: LDRD generation instead of POP in A15 ARM epilogue.
Hi Ramana, Please find attached revised LDRD generation patch for A15 ARM mode. Because of the major rework in ARM RTL epilogue patch, this patch has undergone some changes. The patch is tested with check-gcc, bootstrap and check-gdb without regression. Ok for trunk? -- diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index d5c651c..46becfb 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -16101,6 +16101,135 @@ bad_reg_pair_for_thumb_ldrd_strd (rtx src1, rtx src2) || (REGNO (src2) == SP_REGNUM)); } +/* LDRD in ARM mode needs consecutive registers to be stored. This function + keeps accumulating non-consecutive registers until first consecutive register + pair is found. It then generates multi-reg POP for all accumulated + registers, and then generates LDRD with write-back for consecutive register + pair. This process is repeated until all the registers are loaded from + stack. multi register POP takes care of lone registers as well. However, + LDRD cannot be generated for PC, as results are unpredictable. Hence, if PC + is in SAVED_REGS_MASK, generate multi-reg POP with RETURN or LDR with RETURN + depending upon number of registers in REGS_TO_BE_POPPED_MASK. */ +static void +arm_emit_ldrd_pop (unsigned long saved_regs_mask, bool really_return) +{ + int num_regs = 0; + int i, j; + rtx par = NULL_RTX; + rtx insn = NULL_RTX; + rtx dwarf = NULL_RTX; + rtx tmp; + unsigned long regs_to_be_popped_mask = 0; + bool pc_in_list = false; + + for (i = 0; i <= LAST_ARM_REGNUM; i++) +if (saved_regs_mask & (1 << i)) + num_regs++; + + gcc_assert (num_regs && num_regs <= 16); + + for (i = 0, j = 0; i < num_regs; j++) +if (saved_regs_mask & (1 << j)) + { +i++; +if ((j % 2) == 0 +&& (saved_regs_mask & (1 << (j + 1))) +&& (j + 1) != SP_REGNUM +&& (j + 1) != PC_REGNUM +&& regs_to_be_popped_mask) + { +/* Current register and next register form register pair for which + LDRD can be generated. Generate POP for accumulated registers + and reset regs_to_be_popped_mask. 
SP should be handled here as + the results are unpredictable if register being stored is same + as index register (in this case, SP). PC is always the last + register being popped. Hence, we don't have to worry about PC + here. */ +arm_emit_multi_reg_pop (regs_to_be_popped_mask, pc_in_list); +pc_in_list = false; +regs_to_be_popped_mask = 0; +continue; + } + +if (j == PC_REGNUM) + { +gcc_assert (really_return); +pc_in_list = 1; + } + +regs_to_be_popped_mask |= (1 << j); + +if ((j % 2) == 1 +&& (saved_regs_mask & (1 << (j - 1))) +&& j != SP_REGNUM +&& j != PC_REGNUM) + { + /* Generate a LDRD for register pair R_, R_. The pattern +generated here is +[(SET SP, (PLUS SP, 8)) + (SET R_, (MEM SP)) + (SET R_, (MEM (PLUS SP, 4)))]. */ + par = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (3)); + + tmp = gen_rtx_SET (VOIDmode, +stack_pointer_rtx, +plus_constant (stack_pointer_rtx, 8)); + RTX_FRAME_RELATED_P (tmp) = 1; + XVECEXP (par, 0, 0) = tmp; + + tmp = gen_rtx_SET (SImode, +gen_rtx_REG (SImode, j - 1), +gen_frame_mem (SImode, stack_pointer_rtx)); + RTX_FRAME_RELATED_P (tmp) = 1; + XVECEXP (par, 0, 1) = tmp; + dwarf = alloc_reg_note (REG_CFA_RESTORE, + gen_rtx_REG (SImode, j - 1), + dwarf); + + tmp = gen_rtx_SET (SImode, + gen_rtx_REG (SImode, j), + gen_frame_mem (SImode, + plus_constant (stack_pointer_rtx, 4))); + RTX_FRAME_RELATED_P (tmp) = 1; + XVECEXP (par, 0, 2) = tmp; + dwarf = alloc_reg_note (REG_CFA_RESTORE, + gen_rtx_REG (SImode, j), + dwarf); + + insn = emit_insn (par); + REG_NOTES (insn) = dwarf; + pc_in_list = false; + regs_to_be_popped_mask = 0; + dwarf = NULL_RTX; + } + } + + if (regs_to_be_popped_mask) +{ + /* single PC pop can happen here. Take care of that. */ + if (pc_in_list && (regs_to_be_popped_mask == (1 << PC_REGNUM))) +{ + /* Only PC is to be popped. */ + par = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (2)); + XVECEXP (par
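Unlike Thumb-2, ARM-mode LDRD needs an even-numbered register paired with
the next consecutive register. The grouping the comment describes
(consecutive pairs become LDRDs, lone registers accumulate into a
multi-register POP) can be sketched with another hypothetical helper;
`count_arm_ldrd_pop` and the simplified SP/PC handling are assumptions for
illustration, not the patch's actual code.

```c
/* Illustrative sketch only (not the GCC code): ARM-mode LDRD requires a
   consecutive {r2k, r2k+1} pair.  Walk the saved-register mask in even/odd
   pairs: a complete pair becomes one LDRD; lone registers are counted for
   a multi-register POP.  SP (r13) can never take part in an LDRD here.  */
static void
count_arm_ldrd_pop (unsigned long saved_regs_mask,
                    int *ldrd_pairs, int *pop_regs)
{
  int j;

  *ldrd_pairs = 0;
  *pop_regs = 0;

  for (j = 0; j <= 12; j += 2)
    {
      int lo = (saved_regs_mask >> j) & 1;
      int hi = (saved_regs_mask >> (j + 1)) & 1;

      if (lo && hi && j + 1 != 13)   /* complete pair not involving SP */
        (*ldrd_pairs)++;
      else
        *pop_regs += lo + hi;        /* lone registers go via POP/LDR */
    }

  /* LR (r14) would pair with PC (r15), and LDRD cannot target PC, so in
     this simplification both are always popped/loaded individually.  */
  *pop_regs += (int) ((saved_regs_mask >> 14) & 1);
  *pop_regs += (int) ((saved_regs_mask >> 15) & 1);
}
```

For example, a saved mask of {r4, r5, r7, lr} yields one LDRD (r4/r5) and
two POP-path registers (r7 and lr) under this model, matching the
accumulate-then-pop behaviour the comment describes.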
[Patch ARM] Fix PR 49069.
Hi,

Please find attached the patch fixing bug 49069. This patch is tested with
check-gcc on trunk and 4.6 without regression. OK for trunk? Is it fine to
backport to the 4.6 branch?

gcc/ChangeLog:

2012-01-24  Sameera Deshpande

	PR target/49069
	* config/arm/arm.md (cstoredi4): Handle the case when both
	operands are const_int.

gcc/testsuite/ChangeLog:

2012-01-24  Sameera Deshpande

	PR target/49069
	* gcc.target/arm/pr49069.c: New compile-only test.

- Thanks and regards,
  Sameera D.

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 751997f..e3dc98f 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -7911,8 +7911,9 @@
   enum rtx_code code = GET_CODE (operands[1]);
 
   /* We should not have two constants.  */
-  gcc_assert (GET_MODE (operands[2]) == DImode
-	      || GET_MODE (operands[3]) == DImode);
+  if (!(GET_MODE (operands[2]) == DImode || GET_MODE (operands[3]) == DImode)
+      && !(reload_in_progress || reload_completed))
+    operands[3] = force_reg (DImode, operands[3]);
 
   /* Flip unimplemented DImode comparisons to a form that
      arm_gen_compare_reg can handle.  */
diff --git a/gcc/testsuite/gcc.target/arm/pr49069.c b/gcc/testsuite/gcc.target/arm/pr49069.c
new file mode 100644
index 000..3cc903e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pr49069.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -mfloat-abi=softfp -mfpu=vfpv3-d16" } */
+
+__extension__ typedef unsigned long long int uint64_t;
+
+static int
+func2 (int a, int b)
+{
+  return a == 0 ? a : a / b;
+}
+
+int array1[1];
+const uint64_t array2[1] = { 1 };
+
+void
+foo (void)
+{
+  for (array1[0] = 0; array1[0] == 1; array1[0]++)
+    {
+    }
+  if (bar (array2[0] == func2 (array1[0], 0)) == 0)
+    {
+    }
+}