Re: Generic Linux cross toolchain for tests

2011-08-24 Thread Marcin Juszkiewicz
W dniu 22.08.2011 15:21, Marcin Juszkiewicz pisze:
> Hi
> 
> Some time ago we agreed that not everyone here uses the Ubuntu
> distribution and decided to provide a so-called 'generic linux' cross
> toolchain. Recently I managed to get it done, and now I need brave
> testers to tell me whether or not it works.
> 
> Get it here: http://people.linaro.org/~hrw/generic-linux/

Added a 32-bit tarball.


___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain


Effect of SMS register move scheduling

2011-08-24 Thread Richard Sandiford
Following on from yesterday's call about what it would take to enable
SMS by default: one of the problems I was seeing with the SMS+IV patch
was that we ended up with excessive moves.  E.g. a loop such as:

void
foo (int *__restrict a, int n)
{
  int i;

  for (i = 0; i < n; i += 2)
a[i] = a[i] * a[i + 1];
}

would end up being scheduled with an ii (initiation interval) of 3, which
ideal case, each loop iteration would take 3 cycles.  However, we then
added ~8 register moves to the loop in order to satisfy dependencies.
Obviously those 8 moves add considerably to the iteration time.

I played around with a heuristic to see whether there were enough
free slots in the original schedule to accommodate the moves.
That avoided the problem, but it was a hack: the moves weren't
actually scheduled in those slots.  (In current trunk, the moves
generated for an instruction are inserted immediately before that
instruction.)

I mentioned this to Revital, who told me that Mustafa Hagog had
tried a more complete approach that really did schedule the moves.
That patch was quite old, so I ended up reimplementing the same kind
of idea in a slightly different way.  (The main functional changes
from Mustafa's version were to schedule from the end of the window
rather than the start, and to use a cyclic window.  E.g. moves for
an instruction in row 0 column 0 should be scheduled starting at
row ii-1 downwards.)

The effect on my flawed libav microbenchmarks was much greater
than I imagined.  I used the options:

-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
-fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec

The "before" code was from trunk, the "after" code was trunk + the
register scheduling patch alone (not the IV patch).  Only the tests
that have different "before" and "after" code are run.  The results were:

a3dec
  before:  50 runs take 4.68384s
  after:   50 runs take 4.61395s
  speedup: x1.02
aes
  before:  50 runs take 20.0523s
  after:   50 runs take 16.9722s
  speedup: x1.18
avs
  before:  100 runs take 15.4698s
  after:   100 runs take 2.23676s
  speedup: x6.92
dxa
  before:  200 runs take 18.5848s
  after:   200 runs take 4.40607s
  speedup: x4.22
mjpegenc
  before:  50 runs take 28.6987s
  after:   50 runs take 7.31342s
  speedup: x3.92
resample
  before:  100 runs take 10.418s
  after:   100 runs take 1.91016s
  speedup: x5.45
rgb2rgb-rgb24tobgr16
  before:  100 runs take 1.60513s
  after:   100 runs take 1.15643s
  speedup: x1.39
rgb2rgb-yv12touyvy
  before:  150 runs take 3.50122s
  after:   150 runs take 3.49887s
  speedup: x1
twinvq
  before:  50 runs take 0.452423s
  after:   50 runs take 0.452454s
  speedup: x1

Taking resample as an example: before the patch we had an ii of 27,
stage count of 6, and 12 vector moves.  Vector moves can't be dual
issued, and there was only one free slot, so even in theory, this loop
takes 27 + 12 - 1 = 38 cycles.  Unfortunately, there were so many new
registers that we spilled quite a few.

After the patch we have an ii of 28, a stage count of 3, and no moves,
so in theory, one iteration should take 28 cycles.  We also don't spill.
So I think the difference really is genuine.  (The large difference
in moves between ii=27 and ii=28 is because in the ii=27 schedule,
a lot of A--(T,N,0)-->B (intra-cycle true) dependencies were scheduled
with time(B) == time(A) + ii + 1.)

I also saw benefits in one test in a "real" benchmark, which I can't
post here.

Richard



Re: Turn on SMS by default on ARM

2011-08-24 Thread Ramana Radhakrishnan
On 23 August 2011 12:20, Revital Eres wrote:

> SMS flags to use for testing:
>
> -O3 -fmodulo-sched-allow-regmoves -fmodulo-sched
> -funsafe-loop-optimizations  -fno-auto-inc-dec

Would we expect -funsafe-loop-optimizations to be turned on by default
as well? I would expect that to be used for peak tuning rather than
anything else.

Also, I suspect we only want to turn this on for v7-A cores, not for
anything earlier, and probably not for the M-class cores, given that
code size matters more there than the wins we'd get from SMS.

cheers
Ramana


>
> Thanks,
> Revital
>
