[PATCH][AARCH64] Fix for PR86901
Hi, Adding support for extv and extzv on aarch64 as described in PR86901. I also changed extract_bit_field_using_extv to use gen_lowpart_if_possible instead of gen_lowpart directly. Using gen_lowpart directly will fail with an ICE in building libgcc when the compiler fails to successfully do so whereas gen_lowpart_if_possible will bail out of matching this pattern gracefully. I'm looking through past mails and https://gcc.gnu.org/contribute.html which details testing bootstrap. I'm building a cross-compiler (x64_aarch64) and the instructions don't address that scenario. The GCC cross-build is green and there's no regressions on the C/C++ tests (The go/fortran etc. look like they need additional infrastructure built on my side to work). Is there a workflow for cross-builds or should I aim to get an aarch64 native machine for full validation? ChangeLog: 2020-02-03 Di Mo gcc/ * config/aarch64/aarch64.md: Add define_expand for extv and extzv. * expmed.c (extract_bit_field_using_extv): Change gen_lowpart to gen_lowpart_if_possible to avoid compiler assert building libgcc. testsuite/ * gcc.target/aarch64/pr86901.c: Add new test. Best, Modi pr86901.patch Description: pr86901.patch
RE: [PATCH][AARCH64] Fix for PR86901
> I did a quick bootstrap, this shows several failures like: > > gcc/builtins.c:9427:1: error: unrecognizable insn: > 9427 | } > | ^ > (insn 212 211 213 24 (set (reg:SI 207) > (zero_extract:SI (reg:SI 206) > (const_int 26 [0x1a]) > (const_int 6 [0x6]))) "/gcc/builtins.c":9413:44 -1 > (nil)) > > The issue here is that 26+6 = 32 and that's not a valid ubfx encoding. > Currently cases like this are split into a right shift in aarch64.md around > line > 5569: Appreciate you taking a look and the validation. I've gotten access to an aarch64 server and the bootstrap demonstrated the issue you saw. This was caused by my re-definition of the pattern to: + if (width == 0 || (pos + width) > GET_MODE_BITSIZE (mode)) +FAIL; Which meant for SImode only a sum of >32 bit actually triggers the fail condition for the define_expand whereas the existing define_insn fails on >=32 bit. I looked into the architecture reference manual and the bits are available for ubfx/sbfx for that type of encoding and the documentation says you can use [lsb, 32-lsb] for SImode as a legal pair. Checking with the GNU assembler it does accept a sum of 32 but transforms it into a LSR: Assembly file: ubfxw0, w0, 24, 8 Disassembly of section .text: : 0: 53187c00lsr w0, w0, #24 Similarly with the 64 bit version it'll become a 64 bit LSR. Certainly other assemblers could trip over, I've attached a new patch that allows this encoding and bootstrap + testing c/c++ testsuite looks good. I'll defer to you if it's better to explicitly do the transformation in GCC. > ;; When the bit position and width add up to 32 we can use a W-reg LSR ;; > instruction taking advantage of the implicit zero-extension of the X-reg. > (define_split > [(set (match_operand:DI 0 "register_operand") > (zero_extract:DI (match_operand:DI 1 "register_operand") > (match_operand 2 >"aarch64_simd_shift_imm_offset_di") > (match_operand 3 >"aarch64_simd_shift_imm_di")))] > "IN_RANGE (INTVAL (operands[2]) + INTVAL (operands[3]), 1, > GET_MODE_BITSIZE (DImode) - 1) >&& (INTVAL (operands[2]) + INTVAL (operands[3])) >== GET_MODE_BITSIZE (SImode)" > [(set (match_dup 0) > (zero_extend:DI (lshiftrt:SI (match_dup 4) (match_dup 3] > { > operands[4] = gen_lowpart (SImode, operands[1]); > } > > However that only supports DImode, not SImode, so it needs to be changed > to be more general using GPI. > > Your new extv patterns should replace the magic patterns above it: With the previous discovery that a sum of 32/64 will trigger LSR in the assembler I was curious what would happen if I remove this pattern. Turns out, you will end up with a UBFX x0, x0, 24, 8 compared to a LSR w0, w0, 24 in the test case associated with this change (gcc.target/aarch64/ubfx_lsr_1.c) which doesn't get transformed into an LSR by the assembler since it's in 64 bit mode. So this pattern still has value but I don't think it's necessary to extend it to DI since that'll automatically get turned into a LSR by the assembler as I previously mentioned. > ;; --- > ;; Bitfields > ;; --- > > (define_expand "" > > These are the current extv/extzv patterns, but without a mode. They are no > longer used when we start using the new ones. > > Note you can write to combine the extzv and extz patterns. > But please add a comment mentioning the pattern names so they are easy to > find! Good call here, made this change in the new patch. I did see the define_insn of these guys below it but somehow missed that they were expanded just above. I believe, from my understanding of GCC, that the matching pattern below the first line is what constrains into just extv/extsv from the long list of iterators it belongs to. Still, I see that there's constrained iterators elsewhere like: ;; Optab prefix for sign/zero-extending operations (define_code_attr su_optab [(sign_extend "") (zero_extend "u") I added a comment in this patch before the pattern. Thoughts on defining another constrained version to make it clearer (in addition or in lieu of the comment)? > Besides a bootstrap it is always useful to compile a large body of code with > your change (say SPEC2006/2017) and check for differences in at least > codesize. If there is an increase in instruction count then there may be more > issues that need to be resolved. Sounds good. I'll get those setup and running and will report back on findings. What's the preferred way to measure codesize? I'm assuming by default the code pages are aligned so smaller differences would need to trip over the boundary to actually show up. > I find it easiest to develop on a many-core AArch64 ser
RE: [PATCH][AARCH64] Fix for PR86901
> > Sounds good. I'll get those setup and running and will report back on > > findings. What's the preferred way to measure codesize? I'm assuming > > by default the code pages are aligned so smaller differences would need to > > trip > over the boundary to actually show up. > > You can use the size command on the binaries: > > >size /bin/ls >text data bss dec hex filename > 107271 20243472 112767 1b87f /bin/ls > > As you can see it shows the text size in bytes. It is not rounded up to a > page, so it > is an accurate measure of the codesize. Generally -O2 size is most useful to > check (since that is what most applications build with), but -Ofast -flto can > be > useful as well (the global inlining means you get instruction combinations > which > appear less often with -O2). > > Cheers, > Wilco Alrighty, I've got spec 2017 and spec 2006 setup and building. Using default configurations so -O2 in spec2006 and -O3 in spec2017. Testing the patch as last sent showed a 1% code size regression in spec 2017 perlbench which turned out to be a missing pattern for tbnz and all its variants: (define_insn "*tb1" [(set (pc) (if_then_else (EQL (zero_extract:DI (match_operand:GPI 0 "register_operand" "r") <--- only matches against zero_extract:DI (const_int 1) (match_operand 1 "aarch64_simd_shift_imm_" "n")) The zero extract now matching against other modes would generate a test + branch rather than the combined instruction which led to the code size regression. I've updated the patch so that tbnz etc. matches GPI and that brings code size down to <0.2% in spec2017 and <0.4% in spec2006. Spec results are attached for reference. @Wilco I've gotten instruction on my side to set up an individual contributor's license for the time being. Can you send me the necessary documents to make that happen? Thanks! ChangeLog: 2020-02-21 Di Mo gcc/ * config/aarch64/aarch64.md: Add GPI modes to extsv/extv patterns. Allow tb1 pattern to match against zero_extract:GPI. * expmed.c (extract_bit_field_using_extv): Change gen_lowpart to gen_lowpart_if_possible to avoid compiler assert building libgcc. testsuite/ * gcc.target/aarch64/pr86901.c: Add new test. Modi pr86901.patch Description: pr86901.patch spec2006 Description: spec2006 spec2017 Description: spec2017
RE: [PATCH][AARCH64] Fix for PR86901
Hey all-apologies for the long delay. Haven't had time until recently to look into this further. >>> The zero extract now matching against other modes would generate a >>> test + branch rather than the combined instruction which led to the >>> code size regression. I've updated the patch so that tbnz etc. matches GPI >>> and >>that brings code size down to <0.2% in spec2017 and <0.4% in spec2006. >> >>That's looking better indeed. I notice there are still differences, eg. >>tbz/tbnz >>counts are significantly different in perlbench, with ~350 missed cases >>overall >>(mostly tbz reg, #7). >> >>There are also more uses of uxtw, ubfiz, sbfiz - for example I see cases like >>this >>in namd: >> >> 42c7dc: 13007400 sbfx w0, w0, #0, #30 >> 42c7e0: 937c7c00 sbfiz x0, x0, #4, #32 >> >>So it would be a good idea to check any benchmarks where there is still a non- >>trivial codesize difference. You can get a quick idea what is happening by >>grepping for instructions like this: >> >>grep -c sbfiz out1.txt out2.txt >>out1.txt:872 >>out2.txt:934 >> >>grep -c tbnz out1.txt out2.txt >>out1.txt:5189 >>out2.txt:4989 That's really good insight Wilco! I took a look at the tbnz/tbz case in perl and we lose matching against this because allowing SI mode on extv/extzv causes subst in combine.c to generate: (lshiftrt:SI (reg:SI 107 [ _16 ]) (const_int 7 [0x7])) (nil) Instead of: (and:DI (lshiftrt:DI (subreg:DI (reg:SI 107 [ _16 ]) 0) (const_int 7 [0x7])) (const_int 1 [0x1])) The latter case is picked up in make_compound_operation_int to transform into a zero_extract while the new case is left alone. A lshiftrt generally can't be reduced down to a bit-test but in this case it can because we have zero_bit information on it. Given that, looking around try_combine it seems like the best place to detect this pattern is in the 2nd chance code after the first failure of recog_for_combine which I've done in this patch. I think this is the place to put this fix given changing subst/make_compound_operation_int leads to significantly more diffs. After this change the total number of tbnz/tbz lines up near identical to the baseline which is good and overall size within .1% on spec 2017 and spec 2006. However, looking further at ubfiz there's a pretty large increase in certain benchmarks. I looked into spec 2017/blender and we fail to combine this pattern: Trying 653 -> 654: 653: r512:SI=r94:SI 0>>0x8 REG_DEAD r94:SI 654: r513:DI=zero_extend(r512:SI) REG_DEAD r512:SI Failed to match this instruction: (set (reg:DI 513) (zero_extract:DI (reg:SI 94 [ bswapdst_4 ]) (const_int 8 [0x8]) (const_int 8 [0x8]))) Where previously we combined it like this: Trying 653 -> 654: 653: r512:SI=r94:SI 0>>0x8 REG_DEAD r94:SI 654: r513:DI=zero_extend(r512:SI) REG_DEAD r512:SI Successfully matched this instruction: (set (reg:DI 513) (zero_extract:DI (subreg:DI (reg:SI 94 [ bswapdst_4 ]) 0) // subreg used (const_int 8 [0x8]) (const_int 8 [0x8]))) Here's where I'm at an impasse. The code that generates the modes in get_best_reg_extraction_insn looks at the inner mode of SI now that extzvsi is valid and generates a non-subreg use. However, the MD pattern is looking for all modes being DI or SI not a mix. I think a fix could be done to canonicalize these extracts to the same mode but am unsure if in general a mode mismatched extract RTX is valid which would make this a fairly large change. Latest patch with fix for tbnz/tbz is attached alongside the numbers for SPEC and instruction count for SPEC 2017 are attached for reference. >>> Can you send me the necessary documents to make that happen? Thanks! >> >>That's something you need to sort out with the fsf. There is a mailing list >>for this: >>mailto:ass...@gnu.org. I haven't had any response from my previous mail there. Should I add one of you to the CC or mail someone specifically to get traction? Best, Modi pr86901.patch Description: pr86901.patch basediff % increase textdatabss total filenametextdata bss total filename 1038264 243802 12472 1294538 base/benchspec/CPU2006/400.perlbench/exe/perlbench_base.gcc9-base 1038344 243714 12472 1294530 diff/benchspec/CPU2006/400.perlbench/exe/perlbench_base.gcc9-diff 0.01% 72024 16030 432892382 base/benchspec/CPU2006/401.bzip2/exe/bzip2_base.gcc9-base 72016 16030 432892374 diff/benchspec/CPU2006/401.bzip2/exe/bzip2_base.gcc9-diff -0.01% 2651976 816398 749792 4218166 base/benchspec/CPU2006/403.gcc/exe/gcc_base.gcc9-base 2652096 816558 749792 4218446 diff/benchspec/CPU2006/403.gcc/exe/gcc_base.gc