NEON vectorization improvements - preliminary notes
Hi,

In case this is useful in its current (unfinished!) form: here are some notes I made whilst looking at a couple of the items listed for CS308 here:

https://wiki.linaro.org/Internal/Contractors/CodeSourcery

Namely:

* automatic vector size selection (it's currently selected by command-line switch)
* also consider ARMv6 SIMD vectors (see CS309)
* mixed-size vectors (using the most appropriate size in each case)
* ensure that all GCC vectorizer pattern names are implemented in the machine description (those that can be).

I've not even started on looking at:

* loops with more than two basic blocks (caused by if statements -- anything else?)
* use of specialized load instructions
* conversely, perhaps identify NEON capabilities not covered by GCC patterns, and add them to GCC (e.g. vld2/vld3/vld4 insns)
* any other missed opportunities (identify common idioms and teach the compiler to deal with them).

I'm not likely to have time to restart work on the vectorization study for at least a couple of days, because of other CodeSourcery work. But perhaps the attached will still be useful in the meantime.

Do you (Ira) have access to the ARM ISA docs detailing the NEON instructions?

Cheers,

Julian


Automatic vector size selection/mixed-size vectors
==

The "vect256" branch now has a vectorization factor argument for UNITS_PER_SIMD_WORD (allowing selection of different vector sizes). Patches to support that would need backporting to 4.5 if that looks useful. Could investigate the feasibility of doing that.

Currently UNITS_PER_SIMD_WORD is only used in tree-vect-stmts.c:get_vectype_for_scalar_type (which itself is used in several places).

Generally (check assumption) I think that wider vectors may make inner loops more efficient, but may increase the size of setup/teardown code (e.g. setup: increased versioning; teardown: increased insns for reduction ops). More importantly, sometimes larger vectors may inhibit vectorization. We ideally want to calculate costs per vector size, per loop (or per other vectorization opportunity). Using the vect256 bits is probably much easier than the alternatives.
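As a rough illustration of the trade-off (the loop and the lane counts below are only an example, not measurements from any benchmark):

/* Sketch of how vector size changes the vectorization factor (VF) and
   the epilogue size for a trivial element-wise loop.  Illustrative only. */
#include <stdint.h>

void vadd(int32_t *c, const int32_t *a, const int32_t *b, int n)
{
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
  /* With 64-bit (D-register) vectors, int32_t gives VF = 2: at most one
     leftover scalar iteration in the epilogue.
     With 128-bit (Q-register) vectors, VF = 4: half as many vector
     iterations, but up to three leftover iterations, plus potentially
     more alignment/versioning setup; a reduction epilogue likewise needs
     more insns to collapse the wider partial result.  For small n the
     setup/teardown cost can outweigh the narrower loop body.  */
}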
ARMv6 SIMD operations
=

It looks like several of the ARMv6 instructions may be useful to the vectorizer, or even just to regular integer code. Some of the instructions are supported already, but it's possible that we could support more -- particularly if combine is now able to recognize longer instruction sequences. GCC already has V4QI and V2HI modes enabled on ARM.

PKH -- Pack halfword. May be usable by combine (or may be too complicated).

QADD16, QADD8, QASX, QSUB16, QSUB8, QSAX, UQADD16, UQADD8, UQASX, UQSUB16, UQSUB8, UQSAX -- Saturating adds/subtracts. No use to the vectorizer or combine at present.

REV, REV16, REVSH -- Unlikely to be usable without builtins. REV is currently supported like that.

SADD8, SADD16, UADD8, UADD16 -- Packed addition of bytes/halfwords (setting the GE flags). Should be usable by the vectorizer.

SEL -- Select bytes depending on the GE flags. Can probably be used by the vectorizer to implement vcond on core registers.

SHADD8, SHADD16, SHSUB8, SHSUB16, UHADD8, UHADD16, UHSUB8, UHSUB16 -- Packed additions and subtractions, halving the results before writing to the destination register. Probably can't be used by the vectorizer at present.

SMLAD, SMLALD -- Two packed 16-bit multiplies, adding both results to an accumulator (32-bit for SMLAD, 64-bit for SMLALD). The pattern can be written in RTL, and is possibly recognizable by combine.

SMLSD, SMLSLD -- Adds the difference of two packed 16-bit multiplies to an accumulator. Again this can be written in RTL, but will combine be able to do anything with it?

SMMLA, SMMLS, SMMUL -- Can probably be added quite easily, if combine plays nicely.

SMUAD, SMUSD -- Packed multiply with "sideways" add or subtract before writing to the destination. Could probably be recognized by combine.

SMULBB, SMULBT, SMULTB, SMULTT -- (ARMv5TE instructions.) Supported. No unsigned variants of these.

SSAT, SSAT16, USAT, USAT16 -- Saturate (signed or unsigned) to a power-of-two range given by a bit position. No use to the vectorizer.

SSUB8, SSUB16, USUB8, USUB16 -- Packed 8- or 16-bit subtraction, setting flag bits. Could potentially be used by the vectorizer.

SASX, SSAX, UASX, USAX -- [Un]signed add/subtract with exchange, or [un]signed subtract/add with exchange. May be usable from regular code, but might be too much for combine. (Maybe the intermediate pseudo-instruction trick would work, though?)

SXTAB, SXTAH, UXTAB, UXTAH -- Sign- or zero-extend and add byte/halfword. Already supported.

SXTAB16, UXTAB16 -- Extract two 8-bit values, sign- or zero-extend them to 16 bits, and add to packed halfwords.
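For reference, this is roughly the C-level idiom the SMLAD/SMUAD patterns would have to match (a sketch, not taken from the notes above; whether combine actually catches it depends on how the halfword loads and multiplies get expanded):

/* Hypothetical example of the SMLAD idiom: two signed 16-bit multiplies
   whose products are both added into a 32-bit accumulator.  Assumes n is
   even.  */
#include <stdint.h>

int32_t dot16(const int16_t *a, const int16_t *b, int n, int32_t acc)
{
  for (int i = 0; i < n; i += 2)
    {
      /* Ideally the two multiplies plus the add of both products into
         'acc' would combine into a single SMLAD on packed halfwords.  */
      acc += (int32_t) a[i] * b[i] + (int32_t) a[i + 1] * b[i + 1];
    }
  return acc;
}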
Re: NEON vectorization improvements - preliminary notes
On 15/09/10 10:37, Julian Brown wrote:
> The "vect256" branch now has a vectorization factor argument for
> UNITS_PER_SIMD_WORD (allowing selection of different vector sizes).
> Patches to support that would need backporting to 4.5 if that looks
> useful. Could investigate the feasibility of doing that.

Backports to 4.5 would indeed be nice, but the target here is to improve vectorization upstream.

Also, the list in the task was just ideas to get started on; there's no reason to limit investigations to that list if it turns out to be incomplete -- it's not like it was written with any real effort.

Andrew
Thumb2 size optimization report
* Goal

Goal of this work is to look for Thumb-2 code size improvements on FSF GCC trunk.

* Methodology

** Build FSF GCC trunk with and without hardfp, run benchmarks including EEMBC, SPEC2000 and Dhrystone, and check the generated assembly to see if there are any possible size improvements.
** Get input and suggestions from ARM experts.
** Search open PRs in GCC bugzilla.

* Results

Each item has been tracked on Launchpad, and is listed with the following elements:

** Cause: whether the cause of the problem is known or unknown
** Difficulty: estimation of implementation difficulty
** Recommendation: Yao's recommendation for the next step on that bug

1. LP:633233 Push/pop low register rather than high register when keeping stack alignment
   As Richard E. pointed out, this was implemented in gcc-4.5 in 2009, but Yao can still see r8 being used on FSF GCC trunk.
   Cause: Might be a regression if the problem disappears on gcc-4.5.
   Difficulty: Easy; it might not be hard to fix a regression.
   Recommendation: Fix this regression, if that is what it is.

2. LP:633243 Improve regrename to make use of low registers
   Got input from Bernd S. and Julian B. An initial implementation has been suggested by Bernd S.
   Cause: The current regrename pass in GCC treats high and low registers equally.
   Difficulty: Medium.
   Recommendation: Implement it as Bernd suggested, and benchmark to see how much size is improved.

3. LP:634682 Redundant uxth/sxth insns are generated
   Cause: Unknown.
   Difficulty: Unknown.
   Recommendation: No recommendation so far.

4. LP:634696 Function is not inlined properly with -Os
   In consumer/cjpeg/jmemmgr.c, GCC inlined out_of_memory() with -Os, which increases code size.
   Cause: Unknown.
   Difficulty: Unknown.
   Recommendation: Teach GCC to inline more carefully when -Os is turned on.

5. GCC PR40730 LP:634731 Redundant memory load

6. LP:634738 Inefficient code to extract the least significant bits from an integer value
   GCC PR40697 is for Thumb-1; the same problem exists in Thumb-2.
   Cause: Unknown.
   Difficulty: Medium.
   Recommendation: Fix it in a similar way to the fix for GCC PR40697.

7. LP:634891 Replace load/store by memcpy more aggressively
   Difficulty: Should be easy.
   Recommendation: The fix for this problem might be to reduce the threshold value once -Os is turned on.

8. LP:637220 Allocate local variables with fewer instructions
   GCC PR40657 is about this kind of problem, and was fixed. A similar problem exists in gcc with hardfp.
   Cause: Unknown.
   Difficulty: Unknown.
   Recommendation: No recommendation so far.

9. GCC PR43721 Failure to optimize (a/b) and (a%b) into a single __aeabi_idivmod call (see the sketch after this report)
   Difficulty: Medium or easy.
   Recommendation: None.

10. LP:637814 Combine add/move to add
    LP:637882 Combine ldr/mov to ldr
    Possible improvements have been found. No idea how to fix them yet.
    Cause: Unknown.
    Difficulty: Unknown.
    Recommendation: None.

11. LP:638014 Replace memset by memclr when the 2nd parameter is zero
    Difficulty: Easy.
    Recommendation: No recommendation so far.

12. LP:625233 Merge constant pools for small functions
    Cause: Unknown.
    Difficulty: Medium.
    Recommendation: None.

13. LP:638935 Replace multiple vldr by vldm
    Several vldr insns accessing consecutive addresses can be replaced by a single vldm. This is not Thumb-2-specific, but it is related to code size optimization.
    Cause: Unknown.
    Difficulty: Medium.
    Recommendation: None.

-- 
Yao Qi
CodeSourcery
y...@codesourcery.com
(650) 331-3385 x739
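As a reference for item 9 above, this is the kind of source idiom involved (a hypothetical example, not taken from the report): the quotient and remainder of the same operands are both needed, and on ARM EABI targets both can come out of one __aeabi_idivmod library call instead of two separate runtime calls.

/* Hypothetical illustration of GCC PR43721 (item 9): quotient and
   remainder of the same operands.  Ideally the compiler emits a single
   __aeabi_idivmod call rather than two separate division calls.  */
void divmod(int a, int b, int *quot, int *rem)
{
  *quot = a / b;  /* division */
  *rem  = a % b;  /* remainder of the same a, b */
}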
RE: Linaro GCC 4.4 and 4.5 2010.09 released
Hi,

> Also available is an early release of optimised string routines for
> the Cortex-A series, including a mix of NEON and Thumb-2 versions of
> memcpy(), memset(), strcpy(), strcmp(), and strlen(). For more
> information see:
> https://launchpad.net/cortex-strings

My understanding is that the NEON optimisation will give some performance gain *ONLY* on Cortex-A8, and that it will also burn more energy. On other CPUs, e.g. Cortex-A9, there is no performance gain, but it still costs more energy.

The Linaro toolchain doesn't target a specific platform but is generic for ARMv7 platforms. Are you expecting to see those optimisations turned on in the Linaro toolchain?

The NEON-optimised version is beneficial for large copies, but not for short copies when the NEON unit has to be powered up (the Linux kernel takes an exception to turn it on). I guess your benchmark didn't take that into account. Can the NEON-optimised version be changed so that it is not used for small copies?

Guillaume
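One possible shape for that would be a size-threshold dispatch (a rough sketch only, not code from cortex-strings; the 64-byte cutoff and the helper routines are made up and would need measuring on each CPU):

/* Sketch of a size-threshold dispatch for memcpy()-style copies.
   Hypothetical code: the cutoff and helper names are illustrative only.  */
#include <stddef.h>
#include <string.h>

/* Stand-in for the NEON copy loop (here just plain memcpy).  */
static void *copy_neon(void *dst, const void *src, size_t n)
{
  return memcpy(dst, src, n);
}

/* Simple core-register byte copy for small sizes.  */
static void *copy_core(void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  while (n--)
    *d++ = *s++;
  return dst;
}

void *copy_dispatch(void *dst, const void *src, size_t n)
{
  /* Stay on core registers for small copies, so a sleeping NEON unit is
     never woken up just to move a few bytes.  */
  if (n < 64)
    return copy_core(dst, src, n);
  return copy_neon(dst, src, n);
}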
RE: Linaro GCC 4.4 and 4.5 2010.09 released
> > The Linaro toolchain doesn't target a specific platform but is generic
> > for ARMv7 platforms. Are you expecting to see those optimisations
> > turned on in the Linaro toolchain?
>
> Sorry, I don't understand the question. We want to spread these
> routines out and get them integrated into all of the upstream C
> libraries including NewLib, Bionic, and GLIBC.

My concern is that you want to spread them too widely! If the NEON-optimised memcpy() goes into GLIBC then I assume it will be used on any ARMv7 platform (unless I'm mistaken, you don't have a mechanism to detect whether GLIBC is running on a Cortex-A8 or an A9, and you don't have two different versions of the GLIBC library for the two CPUs). So this library might be good for the A8 but not for the other CPUs.

> My understanding is that the NEON unit is on per process, so once
> you've turned it on once it should stay on.

It's turned off by the kernel at context switch. For a thread dealing with a lot of data, that makes sense; turning on NEON for a small copy doesn't make sense on embedded platforms.

> I assume the turn on cost is amortised across a run. Note that if the data
> is not in the L1 cache then the NEON unit wins even for small-ish
> (~64 byte) copies.

Only on Cortex-A8. But it is still expensive power-wise.

Guillaume
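For what it's worth, user space can tell the two cores apart at runtime from the CPU part number the kernel exposes. A sketch, under the assumption that /proc/cpuinfo carries a "CPU part" line as ARM Linux kernels do (0xc08 is Cortex-A8, 0xc09 is Cortex-A9); a library would more likely key off hwcaps or a build-time option:

/* Sketch: distinguish Cortex-A8 (part 0xc08) from Cortex-A9 (part 0xc09)
   by parsing the "CPU part" line in /proc/cpuinfo.  Illustrative only.  */
#include <stdio.h>

static int cpu_part(void)
{
  char line[256];
  int part = -1;
  FILE *f = fopen("/proc/cpuinfo", "r");
  if (!f)
    return -1;
  while (fgets(line, sizeof line, f))
    if (sscanf(line, "CPU part : %i", &part) == 1)
      break;
  fclose(f);
  return part;
}

int main(void)
{
  int part = cpu_part();
  printf("CPU part 0x%x -> %s\n", part,
         part == 0xc08 ? "Cortex-A8 (NEON copy likely wins)"
                       : "not Cortex-A8 (prefer core-register copy)");
  return 0;
}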
Re: NEON vectorization improvements - preliminary notes
Hi,

I need to learn much more about the ARM architecture, but I have some initial comments.

Julian Brown wrote on 15/09/2010 11:37:21 AM:

> * automatic vector size selection (it's currently selected by command
> line switch)
>
> Generally (check assumption) I think that wider vectors may make inner
> loops more efficient, but may increase the size of setup/teardown code
> (e.g. setup: increased versioning; teardown: increased insns for
> reduction ops). More importantly, sometimes larger vectors may inhibit
> vectorization. We ideally want to calculate costs per vector size, per
> loop (or per other vectorization opportunity).

There is a patch, http://gcc.gnu.org/ml/gcc-patches/2010-03/msg00167.html, that was not committed to mainline (and I think not to vect256 either, but I am not sure about that). This patch tries to vectorize for the wider option unless that is impossible because of data dependence constraints. I agree with that cost-model approach.

> * ensure that all gcc vectorizer pattern names are implemented in the
> machine description (those that can be).

In my opinion we had better concentrate on:

> * conversely, perhaps identify NEON capabilities not covered by GCC
> patterns, and add them to GCC (e.g. vld2/vld3/vld4 insns)

Most of the existing vectorizer patterns were inspired by Altivec's capabilities. I think our approach should originate from the architecture and not the other way around. For example, I don't think we should spend time on implementing vect_extract_even/odd and vect_interleave_high/low (even though they seem to match VUZP and VZIP) when we have those amazing VLD2/3/4 and VST2/3/4 instructions.

> I've not even started on looking at:
>
> * loops with more than two basic blocks (caused by if statements
> (anything else?))

What do you mean by that? If-conversion improvements?

> Do you (Ira) have access to the ARM ISA docs detailing the NEON
> instructions?

I have the "ARM® Architecture Reference Manual, ARM®v7-A and ARM®v7-R edition".

Ira

> Cheers,
>
> Julian
[attachment "CS308-vectorization-improvements.txt" deleted by Ira Rosen/Haifa/IBM]
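To make the VLD2/3/4 point concrete, this is the kind of interleaved-access loop they are designed for (an illustrative example, not from the thread): de-interleaving it through generic even/odd extraction costs a chain of extra permutes, whereas a structured load hands back the separated channels directly.

/* Illustrative loop with stride-3 interleaved accesses (e.g. RGB pixels).
   A NEON VLD3 loads and de-interleaves the three channels in one
   instruction.  */
#include <stdint.h>

void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, int n)
{
  for (int i = 0; i < n; i++)
    {
      uint8_t r = rgb[3 * i + 0];
      uint8_t g = rgb[3 * i + 1];
      uint8_t b = rgb[3 * i + 2];
      /* Cheap fixed-point luma approximation: (r + 2g + b) / 4.  */
      gray[i] = (uint8_t) ((r + 2 * g + b) >> 2);
    }
}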
Re: [gnu-linaro-tools] Thumb2 size optimization report
On 15/09/10 14:49, Yao Qi wrote:
> * Goal
>   Goal of this work is to look for Thumb-2 code size improvements on FSF
> GCC trunk.

Thank you Yao, I think we've definitely got some things we can do good work on here. :)

Andrew