NEON vectorization improvements - preliminary notes

2010-09-15 Thread Julian Brown
Hi,

In case this is useful in its current (unfinished!) form: here are some
notes I made whilst looking at a couple of the items listed for CS308
here:

  https://wiki.linaro.org/Internal/Contractors/CodeSourcery

Namely:

  * automatic vector size selection (it's currently selected by command
line switch)

    * also consider ARMv6 SIMD vectors (see CS309)

  * mixed-size vectors (using the most appropriate size in each case)

  * ensure that all gcc vectorizer pattern names are implemented in the
machine description (those that can be).

I've not even started on looking at:

  * loops with more than two basic blocks (caused by if statements
(anything else?))

  * use of specialized load instructions

  * Conversely, perhaps identify NEON capabilities not covered by GCC
patterns, and add them to gcc (e.g. vld2/vld3/vld4 insns)

  * any other missed opportunities (identify common idioms and teach the
compiler to deal with them)

I'm not likely to have time to restart work on the vectorization study
for at least a couple of days, because of other CodeSourcery work. But
perhaps the attached will still be useful in the meantime.

Do you (Ira) have access to the ARM ISA docs detailing the NEON
instructions?

Cheers,

Julian


Automatic vector size selection/mixed-size vectors
==================================================

The "vect256" branch now has a vectorization factor argument for 
UNITS_PER_SIMD_WORD (allowing selection of different vector sizes). Patches to 
support that would need backporting to 4.5 if that looks useful. Could 
investigate the feasibility of doing that.

Currently UNITS_PER_SIMD_WORD is only used in 
tree-vect-stmts.c:get_vectype_for_scalar_type (which itself is used in several 
places).
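
For reference, the current mechanism is roughly of this shape (a sketch only,
not the actual arm.h definition; the flag name below is invented):

  /* Illustrative sketch: the target macro returns one fixed vector width
     in bytes, chosen by command-line flags, so the vectorizer sees a
     single vector size per compilation.  */
  #define UNITS_PER_SIMD_WORD(MODE) \
    (TARGET_NEON ? (flag_prefer_quad_vectors ? 16 : 8) : UNITS_PER_WORD)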

Generally (check assumption) I think that wider vectors may make inner loops 
more efficient, but may increase the size of setup/teardown code (e.g. setup: 
increased versioning; teardown: increased insns for reduction ops). More 
importantly, sometimes larger vectors may inhibit vectorization. We ideally 
want to calculate costs per vector-size per-loop (or per other vectorization 
opportunity).
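
To make the tradeoff concrete, a toy example (illustrative only):

  /* A simple reduction loop.  With 128-bit vectors the vectorized body
     handles 8 uint16_t elements per iteration instead of 4, but the
     scalar epilogue may need more leftover iterations and the final
     horizontal reduction of the vector accumulator takes extra steps.  */
  unsigned short
  sum_u16 (const unsigned short *a, int n)
  {
    unsigned short s = 0;
    int i;
    for (i = 0; i < n; i++)
      s += a[i];
    return s;
  }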

Using the vect256 bits is probably much easier than the alternatives.

ARMv6 SIMD operations
=====================

It looks like several of the ARMv6 instructions may be useful to the 
vectorizer, or even just to regular integer code. Some of the instructions are 
supported already, but it's possible that we could support more -- particularly 
if combine is now able to recognize longer instruction sequences.

GCC already has V4QI and V2HI modes enabled on ARM.

PKH
---

Pack halfword. May be usable by combine (or may be too complicated).

QADD16, QADD8, QASX, QSUB16, QSUB8, QSAX
UQADD16, UQADD8, UQASX, UQSUB16, UQSUB8, UQSAX
----------------------------------------------

Saturating adds/subtracts. No use to vectorizer or combine at present.

REV, REV16, REVSH
-----------------

Unlikely to be usable without builtins. REV is currently supported like that.

SADD8, SADD16, UADD8, UADD16
----------------------------

Packed addition of bytes/halfwords (setting GE flags). Should be usable by 
vectorizer.
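
Illustrative example of the kind of loop this could cover, assuming the
vectorizer were taught to treat a 32-bit core register as a V4QI vector:

  /* Byte-wise addition: four elements fit in one core register, so each
     UADD8 would do four additions at once.  */
  void
  add_u8 (unsigned char *dst, const unsigned char *a,
          const unsigned char *b, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      dst[i] = a[i] + b[i];
  }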

SEL
---

Select bytes depending on GE flags. Can probably be used in vectorizer to 
implement vcond on core registers.
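
For illustration, a loop of the sort that USUB8 (to set the GE flags)
followed by SEL could implement on core registers -- assuming the comparison
can be mapped onto the GE flags:

  /* Per-byte maximum: the per-byte comparison selects between a[i]
     and b[i].  */
  void
  max_u8 (unsigned char *dst, const unsigned char *a,
          const unsigned char *b, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      dst[i] = a[i] >= b[i] ? a[i] : b[i];
  }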

SHADD8, SHADD16, SHSUB8, SHSUB16
UHADD8, UHADD16, UHSUB8, UHSUB16
--------------------------------

Packed additions & subtractions, halving the results before writing to dest 
register. Probably can't be used by vectorizer at present.

SMLAD, SMLALD
-------------

Two packed 16-bit multiplies, adding both results to a 32-bit accumulator. 
Pattern can be written in RTL, possibly recognizable by combine.
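
The source-level idiom is a 16-bit dot product accumulated into 32 bits;
sketch only, assuming an even trip count and no overflow concerns:

  int
  dot16 (const short *a, const short *b, int n, int acc)
  {
    int i;
    /* Each pair of 16-bit products summed into the accumulator is what
       a single SMLAD performs on packed halfwords.  */
    for (i = 0; i < n; i += 2)
      acc += a[i] * b[i] + a[i + 1] * b[i + 1];
    return acc;
  }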

SMLSD, SMLSLD
-------------

Adds difference of two packed 16-bit multiplies to an accumulator. Again can be 
written in RTL, but will combine be able to do anything with it?

SMMLA, SMMLS, SMMUL
-------------------

Can probably be added quite easily, if combine plays nicely.

SMUAD, SMUSD
------------

Packed multiply with "sideways" add or subtract before writing to dest. Could 
probably be recognized by combine.

SMULBB, SMULBT, SMULTB, SMULTT
------------------------------

(ARMv5TE instructions). Supported. No unsigned variants for these.

SSAT, SSAT16, USAT, USAT16
--------------------------

Saturate (signed or unsigned) to power-of-two range given by a bit position. No 
use to vectorizer.

SSUB8, SSUB16, USUB8, USUB16
----------------------------

Packed 8- or 16-bit subtraction, setting flag bits. Could potentially be used 
by vectorizer.

SASX, SSAX, UASX, USAX
----------------------

[Un]signed add/subtract with exchange, or [un]signed subtract/add with 
exchange. May be usable from regular code, but might be too much for combine. 
(Though maybe the intermediate pseudo-instruction trick would work?)

SXTAB, SXTAH, UXTAB, UXTAH
--------------------------

Sign/zero extend and add byte or halfword. Already supported.

SXTAB16, UXTAB16
----------------

Extract two 8-bit values, sign- or zero-extend them to 16 bits, and add them 
to the two halfwords of the destination register.

Re: NEON vectorization improvements - preliminary notes

2010-09-15 Thread Andrew Stubbs
On 15/09/10 10:37, Julian Brown wrote:
> The "vect256" branch now has a vectorization factor argument for
> UNITS_PER_SIMD_WORD (allowing selection of different vector sizes).
> Patches to support that would need backporting to 4.5 if that looks
> useful. Could investigate the feasibility of doing that.

Backports to 4.5 would indeed be nice, but the target here is to improve 
vectorization upstream.

Also, the list in the task was just ideas to get started on; there's no 
reason to limit investigations to that list if it turns out to be 
incomplete - it's not like it was written with any real effort.

Andrew



Thumb2 size optimization report

2010-09-15 Thread Yao Qi
* Goal
  Goal of this work is to look for thumb2 code size improvements on FSF
GCC trunk.

* Methodology
  ** Build FSF GCC trunk w/ and w/o hardfp, run benchmarks including
eembc, spec2000, and dhrystone, and check asm code to see if there are
any possible size improvements.
  ** Get input and suggestions from ARM experts.
  ** Search open PRs in GCC bugzilla.

* Results
Each item is tracked on Launchpad and listed with the following elements:
 ** Cause: whether the cause of the problem is known or unknown
 ** Difficulty: estimate of the implementation difficulty
 ** Recommendation: Yao's recommendation for the next step on that bug

  1. LP:633233 Push/pop low register rather than high register when
keeping stack alignment
  As Richard E. pointed out, it was implemented in gcc-4.5 in 2009, but
Yao can still see r8 being used on FSF GCC trunk.
  Cause: Might be a regression if the problem disappears on gcc-4.5.
  Difficulty: Easy.  It might not be hard to fix a regression.
  Recommendation: Fix this regression, if it is one.

  2. LP:633243 Improve regrename to make use of low registers.
  Got input from Bernd S. and Julian B.  An initial implementation has
been suggested by Bernd S.
  Cause: current regrename in gcc treats high and low registers equally.
  Difficulty: Medium.
  Recommendation: Implement it as Bernd suggested, and do benchmarking
to see how much size is improved.

  3. LP:634682 Redundant uxth/sxth insns are generated
  Cause: Unknown
  Difficulty: Unknown
  Recommendation: No recommendation so far.

  4. LP:634696 Function is not inlined properly with -Os
  In consumer/cjpeg/jmemmgr.c, GCC inlined out_of_memory() with -Os,
which increases code size.
  Cause: Unknown.
  Difficulty: Unknown
  Recommendation: Educate GCC to inline carefully when -Os is turned on.

  5. GCC PR40730 LP:634731 Redundant memory load

  6. LP:634738 inefficient code to extract least bits from an integer value
  GCC PR40697 is for thumb-1.  The same problem exists in thumb-2.
  Cause: Unknown.
  Difficulty: Medium.
  Recommendation: Fix it in a similar way to GCC PR40697.

  7. LP:634891 Replace load/store by memcpy more aggressively
  Difficulty: Should be easy.
  Recommendation: A fix might be to reduce the threshold value once -Os
is turned on.

  8. LP:637220 allocate local variables with fewer instructions
  GCC PR40657 is about this kind of problem, and was fixed.  A similar
problem exists in gcc with hardfp.
  Cause: Unknown.
  Difficulty: Unknown.
  Recommendation: No recommendation so far.

  9. GCC PR 43721 Failure to optimize (a/b) and (a%b) into a single
__aeabi_idivmod call (see the example sketch after this list)
  Difficulty: Medium or easy.
  Recommendation: No recommendation so far.

  10. LP:637814 Combine add/move to add
  LP:637882 Combine ldr/mov to ldr
  Possible improvements have been found.  No idea how to fix it yet.
  Cause: Unknown.
  Difficulty: Unknown.
  Recommendation: No recommendation so far.

  11. LP:638014 Replace memset by memclr when 2nd parameter is zero
  Difficulty: Easy.
  Recommendation: No recommendation so far.

  12. LP:625233 Merge constant pools for small functions
  Cause: Unknown.
  Difficulty: Medium.
  Recommendation: No recommendation so far.

  13. LP:638935 Replace multiple vldr by vldm
  Some vldr insns accessing consecutive addresses can be replaced by a
single vldm.  It is not thumb2-specific, but it is related to code size
optimization.
  Cause: Unknown.
  Difficulty: Medium.
  Recommendation: No recommendation so far.
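
Example for item 9 (GCC PR 43721), purely illustrative: both quotient and
remainder are needed, so ideally gcc would emit one __aeabi_idivmod call
instead of two separate library calls.

  void
  divmod (int a, int b, int *q, int *r)
  {
    *q = a / b;   /* quotient */
    *r = a % b;   /* remainder: ideally reuses the same idivmod result */
  }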

-- 
Yao Qi
CodeSourcery
y...@codesourcery.com
(650) 331-3385 x739



RE: Linaro GCC 4.4 and 4.5 2010.09 released

2010-09-15 Thread Guillaume Letellier
Hi,

> Also available is an early release of optimised string routines for
> the Cortex-A series, including a mix of NEON and Thumb-2 versions of
> memcpy(), memset(), strcpy(), strcmp(), and strlen().  For more
> information see:
>  https://launchpad.net/cortex-strings

My understanding is that the NEON optimisation will give some performance gain 
*ONLY* on Cortex-A8, but it will also burn more energy. On other CPUs, e.g. 
Cortex-A9, there is no performance gain, but it will still cost more energy.
Linaro toolchain doesn't target a specific platform but is generic for armv7 
platforms. Are you expecting to see those optimisations turned on in Linaro 
toolchain?

The NEON-optimised version is beneficial for large copies, but not for short 
copies when the NEON unit has to be powered up (the Linux kernel takes an 
exception to turn it on). I guess your benchmark didn't take that into 
account. Can the NEON-optimised version be changed so that it is not used for 
small copies?
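
What I have in mind is a size-gated dispatch, something like the following
sketch (not the cortex-strings code; the threshold value and function names
are made up):

  #include <stddef.h>

  /* Hypothetical implementations, declared here only for illustration.  */
  void *memcpy_core_regs (void *dst, const void *src, size_t n);
  void *memcpy_neon (void *dst, const void *src, size_t n);

  #define SMALL_COPY_THRESHOLD 64   /* tunable; 64 is just a guess */

  void *
  my_memcpy (void *dst, const void *src, size_t n)
  {
    /* Stay on core registers for short copies so the NEON unit is never
       woken up for them.  */
    if (n < SMALL_COPY_THRESHOLD)
      return memcpy_core_regs (dst, src, n);
    return memcpy_neon (dst, src, n);
  }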

Guillaume




RE: Linaro GCC 4.4 and 4.5 2010.09 released

2010-09-15 Thread Guillaume Letellier
> > Linaro toolchain doesn't target a specific platform but is generic
> > for armv7 platforms. Are you expecting to see those optimisations
> > turned on in Linaro toolchain?
>
> Sorry, I don't understand the question.  We want to spread these
> routines out and get them integrated into all of the upstream C
> libraries including NewLib, Bionic, and GLIBC.

My concern is that you want to spread it too widely!
If the NEON-optimised memcpy() goes into GLIBC then I assume it will be used 
for any armv7 platform (unless I'm mistaken, you don't have a mechanism to 
detect whether GLIBC is running on a Cortex-A8 or an A9, and you don't have 
two different versions of the glibc library for the two CPUs). So this 
library might be good for the A8 but not for the other CPUs.

> My understanding is that the NEON unit is on per process, so once
> you've turned it on once it should stay on.

It's turned off by the kernel at context switch.
For threads dealing with a lot of data, that makes sense.
Turning on NEON for a small copy doesn't make sense on embedded platforms.

> I assume the turn on cost is amortised across a run.  Note that if the data 
> is not in the L1
> cache then the NEON unit wins even for small-ish (~64 byte) copies.

Only on Cortex-A8. But still expensive power-wise.

Guillaume





Re: NEON vectorization improvements - preliminary notes

2010-09-15 Thread Ira Rosen
Hi,

I need to learn much more about ARM architecture, but I have some initial
comments.

Julian Brown  wrote on 15/09/2010 11:37:21 AM:

>   * automatic vector size selection (it's currently selected by command
> line switch)


> Generally (check assumption) I think that wider vectors may make inner
> loops more efficient, but may increase the size of setup/teardown code
> (e.g. setup: increased versioning; teardown: increased insns for reduction
> ops). More importantly, sometimes larger vectors may inhibit vectorization.
> We ideally want to calculate costs per vector-size per-loop (or per other
> vectorization opportunity).

There is a patch http://gcc.gnu.org/ml/gcc-patches/2010-03/msg00167.html
that was not committed to mainline (and I think not to vect256, but I am
not sure about that). This patch tries to vectorize for the wider option
unless it is impossible because of data dependence constraints.

I agree with that cost model approach.

>
>   * ensure that all gcc vectorizer pattern names are implemented in the
> machine description (those that can be).

In my opinion we'd better concentrate on:

>   * Conversely, perhaps identify NEON capabilities not covered by GCC
> patterns, and add them to gcc (e.g. vld2/vld3/vld4 insns)


Most of the existing vectorizer patterns were inspired by Altivec's
capabilities. I think our approach should originate from the architecture
and not the other way around. For example, I don't think we should spend
time on implementation of vect_extract_even/odd and
vect_interleave_high/low (even though they seem to match VUZP and VZIP),
when we have those amazing VLD2/3/4 and VST2/3/4 instructions.
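
For example (illustrative only), this is the kind of interleaved access that
VLD3/VST3 handle directly, and that the extract_even/odd style patterns model
only awkwardly:

  /* De-interleave a packed RGB byte stream into three separate planes.
     With VLD3, each structure load fetches a group of pixels and splits
     the r/g/b components into separate vector registers.  */
  void
  split_rgb (unsigned char *r, unsigned char *g, unsigned char *b,
             const unsigned char *rgb, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      {
        r[i] = rgb[3 * i];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
      }
  }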


>
> I've not even started on looking at:
>
>   * loops with more than two basic blocks (caused by if statements
> (anything else?))

What do you mean by that? If-conversion improvements?

>
> Do you (Ira) have access to the ARM ISA docs detailing the NEON
> instructions?

I have "ARM® Architecture Reference Manual ARM®v7-A and ARM®v7-R edition".

Ira

>
> Cheers,
>
> Julian
>
> [attachment "CS308-vectorization-improvements.txt" deleted by
> Ira Rosen/Haifa/IBM]




Re: [gnu-linaro-tools] Thumb2 size optimization report

2010-09-15 Thread Andrew Stubbs
On 15/09/10 14:49, Yao Qi wrote:
> * Goal
>Goal of this work is to look for thumb2 code size improvements on FSF
> GCC trunk.

Thank you Yao, I think we've definitely got some things we can do good 
work on here. :)

Andrew
