Re: Notes on mixing D16/D32 code

2010-11-25 Thread Dave Martin
Hi,

On Wed, Nov 24, 2010 at 9:49 PM, Michael Hope  wrote:
> It's a bit of a newbie question, but I've been wondering if you can
> intermix hard float VFPv3-D16 code with VFPv3-D32 code.  You can, because:
>
> According to the ABI:
>  * d0-d15 are used for floating point parameters, no matter if you are
> D16 or D32
>  * d0-d15 are not preserved across function calls
>  * d16-d31 must be preserved across function calls

No, I don't think that's correct - see the procedure call standard
section 5.1.2.1
"VFP register usage conventions (VFP v2, v3 and the Advanced SIMD Extension)"

It's not too hard to misread ... my understanding is as follows---

 * d0-d7 (s0-s15; q0-q3) are not callee-saved and are the only regs
used for parameter and return value exchange in the standard ABI
variants
 * d8-d15 (s16-s31; q4-q7) are _callee-saved_
 * d16-d31 (q8-q15) are _not callee-saved_

Any time a function calls an unknown function with live data in d0-d7
or d16-d31, the caller must preserve and restore them around the call
to avoid losing the data.
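
To make the three register classes concrete, here is a minimal sketch in
Python. It only models the rules as I have summarised them above; the
helper names are made up for illustration and are not part of any
toolchain or of the ABI document.

```python
# Illustrative model of the VFP register classes summarised above.
# The helper names are hypothetical, chosen only for this sketch.

def is_callee_saved(d):
    """True if double register d<n> must be preserved by the callee."""
    if not 0 <= d <= 31:
        raise ValueError(f"no such register: d{d}")
    # Only d8-d15 are callee-saved; d0-d7 and d16-d31 are not.
    return 8 <= d <= 15


def caller_must_save(live):
    """Registers the caller has to spill around a call to an unknown function."""
    return sorted(d for d in live if not is_callee_saved(d))


# A caller with live data in d1, d9 and d20 only has to spill d1 and d20.
print(caller_must_save({1, 9, 20}))  # [1, 20]
```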

By definition, a D16 function never has live data in d16-d31 (since
the registers don't exist from its point of view) - so such functions
don't need to care about any related saving/restoring.  If the D16
function's own caller had live data in those regs, that caller would
already have had to preserve them in order to call the D16 function -
so the D16 function can call other functions (including D32) with
impunity.
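
The argument above can be played through with a toy simulation of a
D32 function A calling a D16 function B, which calls a D32 function C.
This is only a sketch: the register file is a Python list, the function
names are hypothetical, and C is assumed to clobber every register the
ABI allows it to clobber.

```python
# Toy simulation of A (D32) -> B (D16) -> C (D32). All names are hypothetical.
NUM_REGS = 32


def call_c(regs):
    """C is D32 code: it may clobber d0-d7 and d16-d31, but keeps d8-d15."""
    for d in list(range(0, 8)) + list(range(16, 32)):
        regs[d] = "clobbered by C"


def call_b(regs):
    """B is D16 code: it never touches d16-d31 itself, then calls C."""
    regs[0] = "B scratch"
    call_c(regs)


def call_a():
    """A is D32 code with live values in d9 (callee-saved) and d20 (not)."""
    regs = [None] * NUM_REGS
    regs[9] = "A value in d9"
    regs[20] = "A value in d20"
    saved_d20 = regs[20]   # A spills d20 before making an external call
    call_b(regs)
    regs[20] = saved_d20   # ... and reloads it afterwards
    return regs[9], regs[20]


print(call_a())  # ('A value in d9', 'A value in d20')
```

B does nothing special about d16-d31, yet A's data survives, because A
already had to spill d20 before calling anything external.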

So basically, D32 code just gets to use d16-d31 for extra scratchpad
bandwidth _in between_ external function call sites.  (Of course,
compiler-generated or hand-written code can relax the rules locally in
some circumstances, just as for the integer ABI)


Note that in your scenarios below, it is d0-d7 and not d0-d15 which
are used to pass arguments.

Cheers
---Dave

>
> The scenarios are:
> A D32 function calls a D16 function:
>  * The first 16 (!) parameters are passed in D0-D15
>  * Any remaining are passed on the stack
>  * The D16 function doesn't know about D16-D31, doesn't use them, and
> hence preserves them
>
> A D16 function calls a D32 function:
>  * The first 16 parameters are passed in D0-D15
>  * Any remaining are passed on the stack
>  * The D32 function preserves any of the D16-D31 registers that it
> uses.  Redundant, but fine.
>
> A D32 function (A) calls a D16 function (B) which calls a D32 function (C):
>  * Parameters are OK, as above
>  * B doesn't use D16-D31 and hence preserves them
>  * C preserves any of the D16-D31 that it uses, which preserves them
> from A's point of view
>
> -- Michael

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain


[ACTIVITY] November 21-25

2010-11-25 Thread Ira Rosen
Hi,

- the struggle with the board took a lot of time
- continued to investigate special loads/stores
- looked for benchmarks:
  EEMBC Consumer filters rgbcmy and rgbyiq should be vectorizable
once vld3, vst3/4 are supported
  EEMBC Telecom viterbi is supposed to give 4x on NEON once
vectorized (according to
http://www.jp.arm.com/event/pdf/forum2007/t1-5.pdf slide 29). My old
version of viterbi is not vectorizable because of if-conversion
problems. I'd be really happy to check the new version (it is supposed
to be slightly different).
  Looking into other EEMBC benchmarks.
  FFMPEG http://www.ffmpeg.org/  (got this from Rony Nandy from
User Platforms). It contains hand-vectorized code for NEON.
Investigating.

I am probably taking a day off on Sunday.

Ira



Re: [ACTIVITY] November 21-25

2010-11-25 Thread Michael Hope
On Fri, Nov 26, 2010 at 2:35 AM, Ira Rosen  wrote:
>      FFMPEG http://www.ffmpeg.org/  (got this from Rony Nandy from
> User Platforms). It contains hand-vectorized code for NEON.
> Investigating.

I'm building and running a few variants of ffmpeg as part of the
continuous build. See:
 http://builds.linaro.org/toolchain/gcc-linaro-4.5+bzr99435/logs/armv7l-maverick-cbuild17-pavo3/
and check out ffmpegbench*run.txt.

See the first line of ffmpegbench*configure.txt for the configure
options used to build it.

The test files are here:
 http://ex.seabright.co.nz/misc/cbuild/live/cbuild/files/ffmpeg/

And the scripts that tie it together:
 http://bazaar.launchpad.net/~cbuild/%2Bjunk/cbuild/annotate/head%3A/lib/ffmpegbench-variants.mk
 http://bazaar.launchpad.net/~cbuild/%2Bjunk/cbuild/annotate/head%3A/lib/ffmpegbench.mk

Note that there's the 'plain' -O2, 'asm' with assembler backend,
'size' -Os, and 'neon' -mfpu=neon versions.  Note also that ffmpeg
turns on '-fno-tree-vectorize' by default.

-- Michael



Re: Notes on mixing D16/D32 code

2010-11-25 Thread Michael Hope
On Fri, Nov 26, 2010 at 12:00 AM, Dave Martin  wrote:
> Hi,
>
> On Wed, Nov 24, 2010 at 9:49 PM, Michael Hope  wrote:
>> It's a bit of a newbie question, but I've been wondering if you can
>> intermix hard float VFPv3-D16 code with VFPv3-D32 code.  You can, because:
>>
>> According to the ABI:
>>  * d0-d15 are used for floating point parameters, no matter if you are
>> D16 or D32
>>  * d0-d15 are not preserved across function calls
>>  * d16-d31 must be preserved across function calls
>
> No, I don't think that's correct - see the procedure call standard
> section 5.1.2.1
> "VFP register usage conventions (VFP v2, v3 and the Advanced SIMD Extension)"
>
> It's not too hard to misread ... my understanding is as follows---
>
>  * d0-d7 (s0-s15; q0-q3) are not callee-saved and are the only regs
> used for parameter and return value exchange in the standard ABI
> variants
>  * d8-d15 (s16-s31; q4-q7) are _callee-saved_
>  * d16-d31 (q8-q15) are _not callee-saved_

Ah, I got the s* and d* registers mixed up.  So if you have a function
which takes doubles, the first eight parameters go in registers.  If
the function takes floats, then the first sixteen go in registers.
D8-D15 are preserved across calls, D16+ aren't.
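
As a sanity check on the eight-doubles / sixteen-floats counts, here is
a rough Python sketch of how I understand the hard-float argument
allocation to work. It is a simplification under stated assumptions: it
handles only scalar 'float' and 'double' arguments, and the function
name and the string type tags are made up for this example.

```python
def allocate_vfp_args(types):
    """Assign 'float'/'double' arguments to s/d registers or the stack.

    Simplified model: scalar float and double arguments only, with
    s0-s15 (aliasing d0-d7) available for parameter passing.
    """
    for t in types:
        if t not in ("float", "double"):
            raise ValueError(f"unsupported argument type: {t}")

    free = [True] * 16   # s0..s15; d<n> overlays s<2n> and s<2n+1>
    vfp_open = True      # once one argument spills, later ones spill too
    slots = []
    for t in types:
        slot = None
        if vfp_open and t == "float":
            # Lowest free single register (this can back-fill a gap).
            for s in range(16):
                if free[s]:
                    free[s] = False
                    slot = f"s{s}"
                    break
        elif vfp_open and t == "double":
            # Lowest double register whose two halves are both free.
            for d in range(8):
                if free[2 * d] and free[2 * d + 1]:
                    free[2 * d] = free[2 * d + 1] = False
                    slot = f"d{d}"
                    break
        if slot is None:
            vfp_open = False
            slot = "stack"
        slots.append(slot)
    return slots


print(allocate_vfp_args(["double"] * 9).count("stack"))  # 1
print(allocate_vfp_args(["float"] * 17).count("stack"))  # 1
```

With mixed arguments such as float, double, float, the sketch places
them in s0, d1 and s1: the second float back-fills the gap next to s0.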

> So basically, D32 code just gets to use d16-d31 for extra scratchpad
> bandwidth _in between_ external function call sites.  (Of course,
> compiler-generated or hand-written code can relax the rules locally in
> some circumstances, just as for the integer ABI)

I think the conclusion is the same:  you can intermix VFP-D16 and
VFP-D32 code as D16 code doesn't use D16-D31 and D32 code doesn't
expect D16-D31 to be preserved across function calls.

-- Michael
