Hi,
In case this is useful in its current (unfinished!) form: here are some
notes I made whilst looking at a couple of the items listed for CS308
here:
https://wiki.linaro.org/Internal/Contractors/CodeSourcery
Namely:
* automatic vector size selection (it's currently selected by a command-line
switch)
* also consider ARMv6 SIMD vectors (see CS309)
* mixed size vectors (using the most appropriate size in each case)
* ensure that all gcc vectorizer pattern names are implemented in the
machine description (those that can be).
I've not even started on looking at:
* loops with more than two basic blocks (caused by if statements
(anything else?))
* use of specialized load instructions
* Conversely, perhaps identify NEON capabilities not covered by GCC
patterns, and add them to gcc (e.g. vld2/vld3/vld4 insns)
* any other missed opportunities (identify common idioms and teach the
compiler to deal with them)
I'm not likely to have time to restart work on the vectorization study
for at least a couple of days, because of other CodeSourcery work. But
perhaps the attached will still be useful in the meantime.
Do you (Ira) have access to the ARM ISA docs detailing the NEON
instructions?
Cheers,
Julian
Automatic vector size selection/mixed-size vectors
==================================================
The "vect256" branch now has a vectorization factor argument for
UNITS_PER_SIMD_WORD (allowing selection of different vector sizes). Patches to
support that would need backporting to 4.5 if that looks useful. Could
investigate the feasibility of doing that.
Currently UNITS_PER_SIMD_WORD is only used in
tree-vect-stmts.c:get_vectype_for_scalar_type (which itself is used in several
places).
Generally (an assumption worth checking) I think that wider vectors may make
inner loops more efficient, but may increase the size of setup/teardown code
(setup: more loop versioning; teardown: more insns for reduction ops). More
importantly, sometimes larger vectors may inhibit vectorization altogether. We
ideally want to calculate costs per vector size, per loop (or per other
vectorization opportunity).
Using the vect256 bits is probably much easier than the alternatives.
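To make the tradeoff concrete, here is the kind of loop the per-loop cost
calculation would have to judge (function name is mine, purely illustrative):
with 128-bit vectors the vectorized body runs half as many iterations as with
64-bit vectors, but the reduction epilogue has to fold four lanes instead of
two. On current GCC for ARM the width is fixed by -mvectorize-with-neon-quad;
the point of the work above is to make that choice automatic.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* A reduction loop whose per-iteration cost favours wider vectors,
   but whose teardown (summing the lanes of the vector accumulator)
   grows with the vector size.  Scalar reference semantics. */
int32_t sum_i32(const int32_t *a, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    return acc;
}
```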
ARMv6 SIMD operations
=====================
It looks like several of the ARMv6 instructions may be useful to the
vectorizer, or even just to regular integer code. Some of the instructions are
supported already, but it's possible that we could support more -- particularly
if combine is now able to recognize longer instruction sequences.
GCC already has V4QI and V2HI modes enabled on ARM.
PKH
---
Pack halfword. May be usable by combine (or may be too complicated).
QADD16, QADD8, QASX, QSUB16, QSUB8, QSAX
UQADD16, UQADD8, UQASX, UQSUB16, UQSUB8, UQSAX
----------------------------------------------
Saturating adds/subtracts. No use to vectorizer or combine at present.
REV, REV16, REVSH
-----------------
Unlikely to be usable without builtins. REV is currently supported like that.
SADD8, SADD16, UADD8, UADD16
----------------------------
Packed addition of bytes/halfwords (setting GE flags). Should be usable by
vectorizer.
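As a scalar sketch of the idiom UADD8/SADD8 would cover (function name is
mine): element-wise byte addition with wrap-around semantics. With V4QI
enabled, four iterations of this loop correspond to one packed add on the core
registers.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Element-wise addition of bytes, modulo 256 -- the operation a
   vectorized V4QI loop would hand to UADD8. */
void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```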
SEL
---
Select bytes depending on GE flags. Can probably be used in vectorizer to
implement vcond on core registers.
SHADD8, SHADD16, SHSUB8, SHSUB16
UHADD8, UHADD16, UHSUB8, UHSUB16
--------------------------------
Packed additions & subtractions, halving the results before writing to dest
register. Probably can't be used by vectorizer at present.
SMLAD, SMLALD
-------------
Two packed 16-bit multiplies, adding both results to a 32-bit accumulator.
Pattern can be written in RTL, possibly recognizable by combine.
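The source idiom SMLAD corresponds to, written out in scalar C (function name
is mine): two 16x16->32 multiplies per step, both products added into a 32-bit
accumulator. The open question is whether combine will match an RTL pattern
that deep.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Paired 16-bit multiply-accumulate: each step does the work of one
   SMLAD (two products folded into a 32-bit accumulator).  n is
   assumed even. */
int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 2)
        acc += (int32_t)a[i] * b[i] + (int32_t)a[i + 1] * b[i + 1];
    return acc;
}
```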
SMLSD, SMLSLD
-------------
Adds difference of two packed 16-bit multiplies to an accumulator. Again can be
written in RTL, but will combine be able to do anything with it?
SMMLA, SMMLS, SMMUL
-------------------
Can probably be added quite easily, if combine plays nicely.
SMUAD, SMUSD
------------
Packed multiply with "sideways" add or subtract before writing to dest. Could
probably be recognized by combine.
SMULBB, SMULBT, SMULTB, SMULTT
------------------------------
(ARMv5TE instructions). Supported. No unsigned variants for these.
SSAT, SSAT16, USAT, USAT16
--------------------------
Saturate (signed or unsigned) to power-of-two range given by a bit position. No
use to vectorizer.
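For reference, the scalar clamping idiom SSAT corresponds to (function name is
mine, and whether combine can match this shape is a separate question from its
lack of use to the vectorizer): saturate a 32-bit value to a signed
power-of-two range, here 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate to the signed 8-bit range [-128, 127] -- the scalar
   behaviour of SSAT with a bit position of 8. */
int32_t ssat8(int32_t x)
{
    if (x > 127)
        return 127;
    if (x < -128)
        return -128;
    return x;
}
```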
SSUB8, SSUB16, USUB8, USUB16
----------------------------
Packed 8- or 16-bit subtraction, setting flag bits. Could potentially be used
by vectorizer.
SASX, SSAX, UASX, USAX
----------------------
[Un]signed add/subtract with exchange, or [un]signed subtract/add with
exchange. May be usable from regular code, but might be too much for combine.
(Perhaps the intermediate pseudo-instruction trick would work, though?)
SXTAB, SXTAH, UXTAB, UXTAH
--------------------------
Signed extend and add halfword. Already supported.
SXTAB16, UXTAB16
----------------
Extract two 8-bit values from a shifted register, sign-extend to 16 bits, and
add to 16-bit values from another register, i.e. bytes can be added into wider
accumulators. May be usable by vectorizer.
SXTB, UXTB, SXTH, UXTH
----------------------
Sign-extend or zero-extend bytes or halfwords. Supported.
SXTB16, UXTB16
--------------
Widening ops. Could potentially be used by the vectorizer.
SHASX, SHSAX, UHASX, UHSAX
--------------------------
[Un]signed halving subtract/add with exchange. 16-bit elements. Probably can't
be used by vectorizer at present. Possibly combinable, but probably too complex.
UMAAL
-----
Unsigned 32x32->64bit multiply with two follow-on 32-bit adds to the 64-bit
result. Might be combinable at a push.
USAD8, USADA8
-------------
Absolute sum of 8-bit differences, writing (or accumulating) result to 32-bit
register. Probably not usable by vectorizer or combine.
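The idiom USADA8 implements, as scalar C (function name is mine): sum of
absolute byte differences accumulated into a 32-bit register. As noted, this
is probably out of reach of the vectorizer and combine today, but it is a
very common idiom in video code (motion estimation), so a builtin or
dedicated pattern might still pay off.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sum of absolute differences over bytes -- four iterations of this
   loop correspond to one USADA8. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (a[i] > b[i]) ? (uint32_t)(a[i] - b[i])
                             : (uint32_t)(b[i] - a[i]);
    return acc;
}
```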
Loops with more than two basic blocks (if statements, etc.)
===========================================================
Use of specialized load instructions
====================================
Unimplemented GCC vector pattern names
======================================
movmisalign<mode>
-----------------
Implemented by: http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00214.html
In SG++/Linaro 4.5 already.
vec_extract_even<mode>
----------------------
Not implemented. This can be done using VUZP, and only keeping the <Dd>/<Qd>
result.
vec_extract_odd<mode>
---------------------
Not implemented. This can be done using VUZP, and only keeping the <Dm>/<Qm>
result. Ideally a vec_extract_even paired with a vec_extract_odd would only
create the one insn...
vec_interleave_high<mode>
-------------------------
Not implemented. Can be done using VZIP, keeping only the <Dm>/<Qm> result.
vec_interleave_low<mode>
------------------------
Not implemented. Can be done using VZIP, keeping only the <Dd>/<Qd> result.
(Similarly paired vec_interleave_low & vec_interleave_high would ideally only
create one insn.)
vec_init<mode>
--------------
Implemented. Probably some scope for adding more cleverness for initialising
values in vectors: arm.c:neon_expand_vector_init knows some tricks already.
sdot_prod<mode>, udot_prod<mode>
--------------------------------
Not implemented. It's not entirely clear what operation this should support: I
think it's several parallel dot-product operations, not one big dot-product. So
the most natural thing to implement would be something like:
VMULL.s8 qTMP, d1, d2
VPADD.s16 dTMP2, dTMPlo, dTMPhi
VADD.s16 d0, dTMP2, d3
We could possibly use VPADAL instead of VPADD, with d3 wider than dTMP2, if my
reading is correct. In that case we wouldn't need the VADD.
We can definitely do something here, though it's a little unclear what at
present.
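My reading of the semantics, written out as scalar C (function name is mine):
multiply narrow elements, widen the products, and fold them into a wider
accumulator, which is what the VMULL + VPADD/VPADAL sequence above computes.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Widening dot-product step: s8 x s8 products accumulated into an
   s16 accumulator, matching the VMULL.s8 / VPADD.s16 sketch above. */
int16_t dot_prod_s8(const int8_t *a, const int8_t *b, size_t n, int16_t acc)
{
    for (size_t i = 0; i < n; i++)
        acc += (int16_t)((int16_t)a[i] * b[i]);
    return acc;
}
```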
ssum_widen<mode>3, usum_widen<mode>3
------------------------------------
Implemented, but called widen_[us]sum<mode>3. Doc or code bug? (Doc, I think.)
vec_pack_trunc_<mode>
---------------------
Not implemented. ARM have a patch:
http://gcc.gnu.org/ml/gcc-patches/2010-08/msg02175.html
vec_pack_ssat_<mode>, vec_pack_usat_<mode>
------------------------------------------
Not implemented (probably easy). VQMOVN. (VQMOVUN wouldn't be needed).
vec_pack_sfix_trunc_<mode>, vec_pack_ufix_trunc_<mode>
------------------------------------------------------
Not implemented. Can be done for D registers converting via a Q register and a
separate narrowing insn:
(massage d1 & d2 into qTMP)
VCVT.s32.f32 qTMP2, qTMP
VMOVN.s32 d0, qTMP2
Usual caveats about register allocation & introducing copies. Used on other
targets for DFmode to SImode conversions: might not be useful for NEON (which
would only be supporting V2SF to V4HI conversions).
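The scalar form of the operation, for reference (function name is mine):
convert floats to signed ints, truncating towards zero, then narrow the
result, i.e. the V2SF -> V4HI case the VCVT/VMOVN sequence above would handle.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* float -> s32 (truncating) -> s16 narrowing, the per-element
   behaviour of vec_pack_sfix_trunc. */
void pack_f32_to_i16(const float *src, int16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)(int32_t)src[i];
}
```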
vec_unpack[su]_{hi,lo}_<mode>
-----------------------------
Not implemented. (Do ARM have a patch for this one?)
vec_unpack[su]_float_{hi,lo}_<mode>
-----------------------------------
Not implemented.
vec_widen_[us]mult_{hi,lo}_<mode>
---------------------------------
Not implemented.
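The operation this pattern family covers, in scalar form (function name is
mine): multiply narrow elements into double-width results. On NEON the natural
mapping would be VMULL (e.g. VMULL.s16 producing 32-bit products), which
consumes a whole D register of inputs at once rather than separate hi/lo
halves.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Widening multiply: s16 x s16 -> s32 per element, the job of
   vec_widen_[us]mult_{hi,lo} (or VMULL on NEON). */
void widen_mul_s16(const int16_t *a, const int16_t *b, int32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int32_t)a[i] * b[i];
}
```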
vrotl<mode>3
vrotr<mode>3
------------
Not implemented (no NEON insns).
vec_set<mode>
vec_extract<mode>
reduc_[us]{min,max}_<mode>
vec_shl_<mode>
vec_shr_<mode>
vashl<mode>3
vashr<mode>3
vlshr<mode>3
------------
Implemented.
NEON capabilities not covered by the vectorizer
===============================================
Any other missed opportunities
==============================
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain