Hi,
In case this is useful in its current (unfinished!) form: here are some
notes I made whilst looking at a couple of the items listed for CS308
here:
https://wiki.linaro.org/Internal/Contractors/CodeSourcery
Namely:
* automatic vector size selection (it's currently selected by a command-line
switch)
* also consider ARMv6 SIMD vectors (see CS309)
* mixed size vectors (using the most appropriate size in each case)
* ensure that all gcc vectorizer pattern names are implemented in the
machine description (those that can be).
I've not even started on looking at:
* loops with more than two basic blocks (caused by if statements
(anything else?))
* use of specialized load instructions
* Conversely, perhaps identify NEON capabilities not covered by GCC
patterns, and add them to gcc (e.g. vld2/vld3/vld4 insns)
* any other missed opportunities (identify common idioms and teach the
compiler to deal with them)
I'm not likely to have time to restart work on the vectorization study
for at least a couple of days, because of other CodeSourcery work. But
perhaps the attached will still be useful in the meantime.
Do you (Ira) have access to the ARM ISA docs detailing the NEON
instructions?
Cheers,
Julian
Automatic vector size selection/mixed-size vectors
==================================================
The "vect256" branch now has a vectorization factor argument for
UNITS_PER_SIMD_WORD (allowing selection of different vector sizes). Patches to
support that would need backporting to 4.5 if that looks useful. Could
investigate the feasibility of doing that.
Currently UNITS_PER_SIMD_WORD is only used in
tree-vect-stmts.c:get_vectype_for_scalar_type (which itself is used in several
places).
Generally (an assumption worth checking) I think that wider vectors may make
inner loops more efficient, but may increase the size of setup/teardown code
(setup: more loop versioning; teardown: more insns for reduction ops). More
importantly, sometimes larger vectors may inhibit vectorization altogether. We
ideally want to calculate costs per vector size, per loop (or per other
vectorization opportunity).
Using the vect256 bits is probably much easier than the alternatives.
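To make the tradeoff concrete, here is the kind of loop the per-loop cost
calculation would have to judge (function name is mine, purely illustrative):
with 128-bit vectors the vectorized body runs half as many iterations as with
64-bit vectors, but the reduction epilogue has to fold four lanes instead of
two. On current GCC for ARM the width is fixed by -mvectorize-with-neon-quad;
the point of the work above is to make that choice automatic.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* A reduction loop whose per-iteration cost favours wider vectors,
   but whose teardown (summing the lanes of the vector accumulator)
   grows with the vector size.  Scalar reference semantics. */
int32_t sum_i32(const int32_t *a, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    return acc;
}
```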
ARMv6 SIMD operations
=====================
It looks like several of the ARMv6 instructions may be useful to the
vectorizer, or even just to regular integer code. Some of the instructions are
supported already, but it's possible that we could support more -- particularly
if combine is now able to recognize longer instruction sequences.
GCC already has V4QI and V2HI modes enabled on ARM.
PKH
---
Pack halfword. May be usable by combine (or may be too complicated).
QADD16, QADD8, QASX, QSUB16, QSUB8, QSAX
UQADD16, UQADD8, UQASX, UQSUB16, UQSUB8, UQSAX
----------------------------------------------
Saturating adds/subtracts. No use to vectorizer or combine at present.
REV, REV16, REVSH
-----------------
Unlikely to be usable without builtins. REV is currently supported like that.
SADD8, SADD16, UADD8, UADD16
----------------------------
Packed addition of bytes/halfwords (setting GE flags). Should be usable by
vectorizer.
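As a scalar sketch of the idiom UADD8/SADD8 would cover (function name is
mine): element-wise byte addition with wrap-around semantics. With V4QI
enabled, four iterations of this loop correspond to one packed add on the core
registers.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Element-wise addition of bytes, modulo 256 -- the operation a
   vectorized V4QI loop would hand to UADD8. */
void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```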
SEL
---
Select bytes depending on GE flags. Can probably be used in vectorizer to
implement vcond on core registers.
SHADD8, SHADD16, SHSUB8, SHSUB16
UHADD8, UHADD16, UHSUB8, UHSUB16
--------------------------------
Packed additions & subtractions, halving the results before writing to dest
register. Probably can't be used by vectorizer at present.
SMLAD, SMLALD
-------------
Two packed 16-bit multiplies, adding both results to a 32-bit accumulator.
Pattern can be written in RTL, possibly recognizable by combine.
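The source idiom SMLAD corresponds to, written out in scalar C (function name
is mine): two 16x16->32 multiplies per step, both products added into a 32-bit
accumulator. The open question is whether combine will match an RTL pattern
that deep.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Paired 16-bit multiply-accumulate: each step does the work of one
   SMLAD (two products folded into a 32-bit accumulator).  n is
   assumed even. */
int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 2)
        acc += (int32_t)a[i] * b[i] + (int32_t)a[i + 1] * b[i + 1];
    return acc;
}
```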
SMLSD, SMLSLD
-------------
Adds difference of two packed 16-bit multiplies to an accumulator. Again can be
written in RTL, but will combine be able to do anything with it?
SMMLA, SMMLS, SMMUL
-------------------
Can probably be added quite easily, if combine plays nicely.
SMUAD, SMUSD
------------
Packed multiply with "sideways" add or subtract before writing to dest. Could
probably be recognized by combine.
SMULBB, SMULBT, SMULTB, SMULTT
------------------------------
(ARMv5TE instructions). Supported. No unsigned variants for these.
SSAT, SSAT16, USAT, USAT16
--------------------------
Saturate (signed or unsigned) to power-of-two range given by a bit position. No
use to vectorizer.
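For reference, the scalar clamping idiom SSAT corresponds to (function name is
mine, and whether combine can match this shape is a separate question from its
lack of use to the vectorizer): saturate a 32-bit value to a signed
power-of-two range, here 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate to the signed 8-bit range [-128, 127] -- the scalar
   behaviour of SSAT with a bit position of 8. */
int32_t ssat8(int32_t x)
{
    if (x > 127)
        return 127;
    if (x < -128)
        return -128;
    return x;
}
```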
SSUB8, SSUB16, USUB8, USUB16
----------------------------
Packed 8- or 16-bit subtraction, setting flag bits. Could potentially be used
by vectorizer.
SASX, SSAX, UASX, USAX
----------------------
[Un]signed add/subtract with exchange, or [un]signed subtract/add with
exchange. May be usable from regular code, but might be too much for combine.
(Perhaps the intermediate pseudo-instruction trick would work, though?)
SXTAB, SXTAH, UXTAB, UXTAH
--------------------------
Signed extend and add halfword. Already supported.
SXTAB16, UXTAB16
----------------
Extract two 8-bit values from a shifted register, sign-extend to 16 bits, and
add to 16-bit values from another register, i.e. bytes can be added into wider
accumulators. May be usable by vectorizer.
SXTB, UXTB, SXTH, UXTH
----------------------
Sign-extend or zero-extend bytes or halfwords. Supported.
SXTB16, UXTB16
--------------
Widening ops. Could potentially be used by the vectorizer.
SHASX, SHSAX, UHASX, UHSAX
--------------------------
[Un]signed halving subtract/add with exchange. 16-bit elements. Probably can't
be used by vectorizer at present. Possibly combinable, but probably too complex.
UMAAL
-----
Unsigned 32x32->64bit multiply with two follow-on 32-bit adds to the 64-bit
result. Might be combinable at a push.
USAD8, USADA8
-------------
Absolute sum of 8-bit differences, writing (or accumulating) result to 32-bit
register. Probably not usable by vectorizer or combine.
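The idiom USADA8 implements, as scalar C (function name is mine): sum of
absolute byte differences accumulated into a 32-bit register. As noted, this
is probably out of reach of the vectorizer and combine today, but it is a
very common idiom in video code (motion estimation), so a builtin or
dedicated pattern might still pay off.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sum of absolute differences over bytes -- four iterations of this
   loop correspond to one USADA8. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (a[i] > b[i]) ? (uint32_t)(a[i] - b[i])
                             : (uint32_t)(b[i] - a[i]);
    return acc;
}
```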
Loops with more than two basic blocks (if statements, etc.)
===========================================================
Use of specialized load instructions
====================================
Unimplemented GCC vector pattern names
======================================
movmisalign<mode>
-----------------
Implemented by: http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00214.html
In SG++/Linaro 4.5 already.
vec_extract_even<mode>
----------------------
Not implemented. This can be done using VUZP, and only keeping the <Dd>/<Qd>
result.
vec_extract_odd<mode>
---------------------
Not implemented. This can be done using VUZP, and only keeping the <Dm>/<Qm>
result. Ideally a vec_extract_even paired with a vec_extract_odd would only
create the one insn...
vec_interleave_high<mode>
-------------------------
Not implemented. Can be done using VZIP, keeping only the <Dm>/<Qm> result.
vec_interleave_low<mode>
------------------------
Not implemented. Can be done using VZIP, keeping only the <Dd>/<Qd> result.
(Similarly paired vec_interleave_low & vec_interleave_high would ideally only
create one insn.)
vec_init<mode>
--------------
Implemented. Probably some scope for adding more cleverness for initialising
values in vectors: arm.c:neon_expand_vector_init knows some tricks already.
sdot_prod<mode>, udot_prod<mode>
--------------------------------
Not implemented. It's not entirely clear what operation this should support: I
think it's several parallel dot-product operations, not one big dot-product. So
the most natural thing to implement would be something like:
VMULL.s8 qTMP, d1, d2
VPADD.s16 dTMP2, dTMPlo, dTMPhi
VADD.s16 d0, dTMP2, d3
We could possibly use VPADAL instead of VPADD, with d3 wider than dTMP2, if my
reading is correct. In that case we wouldn't need the VADD.
We can definitely do something here, though it's a little unclear what at
present.
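My reading of the semantics, written out as scalar C (function name is mine):
multiply narrow elements, widen the products, and fold them into a wider
accumulator, which is what the VMULL + VPADD/VPADAL sequence above computes.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Widening dot-product step: s8 x s8 products accumulated into an
   s16 accumulator, matching the VMULL.s8 / VPADD.s16 sketch above. */
int16_t dot_prod_s8(const int8_t *a, const int8_t *b, size_t n, int16_t acc)
{
    for (size_t i = 0; i < n; i++)
        acc += (int16_t)((int16_t)a[i] * b[i]);
    return acc;
}
```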
ssum_widen<mode>3, usum_widen<mode>3
------------------------------------
Implemented, but called widen_[us]sum<mode>3. Doc or code bug? (Doc, I think.)
vec_pack_trunc_<mode>
---------------------
Not implemented. ARM have a patch:
http://gcc.gnu.org/ml/gcc-patches/2010-08/msg02175.html
vec_pack_ssat_<mode>, vec_pack_usat_<mode>
------------------------------------------
Not implemented (probably easy). VQMOVN. (VQMOVUN wouldn't be needed).
vec_pack_sfix_trunc_<mode>, vec_pack_ufix_trunc_<mode>
------------------------------------------------------
Not implemented. Can be done for D registers converting via a Q register and a
separate narrowing insn:
(massage d1 & d2 into qTMP)
VCVT.s32.f32 qTMP2, qTMP
VMOVN.s32 d0, qTMP2
Usual caveats about register allocation & introducing copies. Used on other
targets for DFmode to SImode conversions: might not be useful for NEON (which
would only be supporting V2SF to V4HI conversions).
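The scalar form of the operation, for reference (function name is mine):
convert floats to signed ints, truncating towards zero, then narrow the
result, i.e. the V2SF -> V4HI case the VCVT/VMOVN sequence above would handle.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* float -> s32 (truncating) -> s16 narrowing, the per-element
   behaviour of vec_pack_sfix_trunc. */
void pack_f32_to_i16(const float *src, int16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)(int32_t)src[i];
}
```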
vec_unpack[su]_{hi,lo}_<mode>
-----------------------------
Not implemented. (Do ARM have a patch for this one?)
vec_unpack[su]_float_{hi,lo}_<mode>
-----------------------------------
Not implemented.
vec_widen_[us]mult_{hi,lo}_<mode>
---------------------------------
Not implemented.
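The operation this pattern family covers, in scalar form (function name is
mine): multiply narrow elements into double-width results. On NEON the natural
mapping would be VMULL (e.g. VMULL.s16 producing 32-bit products), which
consumes a whole D register of inputs at once rather than separate hi/lo
halves.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Widening multiply: s16 x s16 -> s32 per element, the job of
   vec_widen_[us]mult_{hi,lo} (or VMULL on NEON). */
void widen_mul_s16(const int16_t *a, const int16_t *b, int32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int32_t)a[i] * b[i];
}
```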
vrotl<mode>3
vrotr<mode>3
------------
Not implemented (no NEON insns).
vec_set<mode>
vec_extract<mode>
reduc_[us]{min,max}_<mode>
vec_shl_<mode>
vec_shr_<mode>
vashl<mode>3
vashr<mode>3
vlshr<mode>3
------------
Implemented.
NEON capabilities not covered by the vectorizer
===============================================
Any other missed opportunities
==============================
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain