On Tue, Aug 16, 2011 at 11:32 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> Michael Hope <michael.h...@linaro.org> writes:
>> I put a build harness around libav and gathered some profiling data.  See:
>>  bzr branch lp:~linaro-toolchain-dev/+junk/libav-suite
>>
>> It includes a Makefile that builds a C only, h.264 only decoder and
>> two Creative Commons licensed videos to use as input.
>
> Thanks for putting this together.
>
>> README.rst has the basic commands for running ffmpeg and initial perf
>> results showing the hot functions.  Dave, 20 % of the time is spent in
>> memcpy() so you might want to have a look.
>>
>> The vectoriser has no effect.  GCC 4.5 is ~17 % faster than 4.6.  I'll
>> look into extracting and harnessing the functions themselves later
>> this week.
>
> I had a look why auto-vectorisation wasn't having much effect.
> It looks from your profile that most of the hot functions are
> operating on 16x16 blocks of pixels with an unknown line stride.
> So the C code looks like:
>
>    for (i = 0; i < 16; i++)
>      {
>        x[0] = OP (x[0]);
>        ...
>        x[15] = OP (x[15]);
>        x += stride;
>      }
>
> Because of the unknown stride, we're relying on SLP rather than
> loop-based vectorisation to handle this kind of loop.  The problem
> is that SLP is being run _as_ a loop optimisation.  At the moment,
> the gimple data-ref analysis code assumes that, during a loop
> optimisation, only simple induction variables are of interest,
> so it treats all of the x[...] references above as unrepresentable.
> If I move SLP outside the loop optimisations (just as a proof of concept),
> then that problem goes away.
>
> I talked about this with Ira, who said that SLP had been placed
> where it is because ivopts (a later loop optimisation) obfuscates
> things too much.  As Ira said, we should probably look at (conditionally)
> removing the assumption that only IVs are of interest during loop
> optimisations.
>
> Another problem is that SLP supports a much smaller range of
> optimisations than the loop-based vectoriser.  There's no support
> for promotion, demotion, or conditional expressions.  This affects
> things like the weight_h264_pixels* functions, which contain
> conditional moves.
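
To make the loop shape in the quoted text concrete, here is a minimal compilable sketch. The function name block_op and the body of OP are assumptions made up for illustration (the real libav functions use a variety of per-pixel operations); only the structure, sixteen written-out statements per row followed by a step of a run-time stride, comes from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for OP: invert the pixel value. */
#define OP(v) ((uint8_t)(255 - (v)))

/*
 * Apply OP to a 16x16 block of pixels.  The 16 statements in each row
 * touch adjacent bytes, which makes the row a candidate for SLP.
 * Consecutive rows are `stride` bytes apart, and `stride` is only
 * known at run time.
 */
void block_op(uint8_t *x, ptrdiff_t stride)
{
    for (int i = 0; i < 16; i++) {
        x[0]  = OP(x[0]);  x[1]  = OP(x[1]);  x[2]  = OP(x[2]);  x[3]  = OP(x[3]);
        x[4]  = OP(x[4]);  x[5]  = OP(x[5]);  x[6]  = OP(x[6]);  x[7]  = OP(x[7]);
        x[8]  = OP(x[8]);  x[9]  = OP(x[9]);  x[10] = OP(x[10]); x[11] = OP(x[11]);
        x[12] = OP(x[12]); x[13] = OP(x[13]); x[14] = OP(x[14]); x[15] = OP(x[15]);
        x += stride;  /* step to the next row of the block */
    }
}
```

Calling block_op(buf, 20) on a 16-row buffer with 20 bytes per row changes the first 16 bytes of every row and leaves the 4 padding bytes alone.
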
I had a poke about.  GCC isn't too happy about unrolled loops either.
put_h264_chroma_mc8_8_c() is defined via a macro in dsputil_template.c
and is manually unrolled by eight as:

    for(i=0; i<h; i++){\
        OP(dst[0], (A*src[0] + B*src[1] + C*src[stride+0] + D*src[stride+1]));\
        OP(dst[1], (A*src[1] + B*src[2] + C*src[stride+1] + D*src[stride+2]));\
        OP(dst[2], (A*src[2] + B*src[3] + C*src[stride+2] + D*src[stride+3]));\
        OP(dst[3], (A*src[3] + B*src[4] + C*src[stride+3] + D*src[stride+4]));\
        OP(dst[4], (A*src[4] + B*src[5] + C*src[stride+4] + D*src[stride+5]));\
        OP(dst[5], (A*src[5] + B*src[6] + C*src[stride+5] + D*src[stride+6]));\
        OP(dst[6], (A*src[6] + B*src[7] + C*src[stride+6] + D*src[stride+7]));\
        OP(dst[7], (A*src[7] + B*src[8] + C*src[stride+7] + D*src[stride+8]));\
        dst+= stride;\
        src+= stride;\
    }\

where OP is an assignment.  Reducing this to:

    #include <stdint.h>

    #define A 3
    #define B 4

    void unrolled(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
    {
      h /= 8;

      for (int i = 0; i < h; i++) {
        dst[0] = A*src[0] + B*src[0+1];
        dst[1] = A*src[1] + B*src[1+1];
        dst[2] = A*src[2] + B*src[2+1];
        dst[3] = A*src[3] + B*src[3+1];
        dst[4] = A*src[4] + B*src[4+1];
        dst[5] = A*src[5] + B*src[5+1];
        dst[6] = A*src[6] + B*src[6+1];
        dst[7] = A*src[7] + B*src[7+1];
        dst += 8;
        src += 8;
      }
    }

    void plain(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
    {
      for (int i = 0; i < h; i++) {
        dst[i] = A*src[i] + B*src[i+1];
      }
    }

plain() gets vectorised where unrolled() doesn't.

-- Michael
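
For anyone wanting to reproduce the comparison, the sketch below is one way to check that the two reduced functions really compute the same thing before looking at what the vectoriser does with each. The helper same_output() and its test pattern are assumptions added for illustration and are not part of libav or the original reduction. The two functions are copied from the reduction in this mail, with const added to src.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define A 3
#define B 4

/* Copy of the reduced, manually unrolled loop (h is the element count). */
void unrolled(uint8_t * __restrict dst, const uint8_t * __restrict src, int h)
{
    h /= 8;
    for (int i = 0; i < h; i++) {
        dst[0] = A*src[0] + B*src[1];
        dst[1] = A*src[1] + B*src[2];
        dst[2] = A*src[2] + B*src[3];
        dst[3] = A*src[3] + B*src[4];
        dst[4] = A*src[4] + B*src[5];
        dst[5] = A*src[5] + B*src[6];
        dst[6] = A*src[6] + B*src[7];
        dst[7] = A*src[7] + B*src[8];
        dst += 8;
        src += 8;
    }
}

/* Copy of the reduced rolled loop. */
void plain(uint8_t * __restrict dst, const uint8_t * __restrict src, int h)
{
    for (int i = 0; i < h; i++)
        dst[i] = A*src[i] + B*src[i+1];
}

/* Hypothetical helper: returns 1 if both versions agree on blocks*8 elements. */
int same_output(int blocks)
{
    enum { MAX = 1024 };
    uint8_t src[MAX + 1], a[MAX], b[MAX];
    int n = blocks * 8;

    assert(n >= 0 && n <= MAX);
    for (int i = 0; i <= n; i++)        /* both loops read src[n] */
        src[i] = (uint8_t)(i * 7 + 3);  /* arbitrary test pattern */
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);

    unrolled(a, src, n);
    plain(b, src, n);
    return memcmp(a, b, (size_t)n) == 0;
}
```

The two versions only agree when the element count is a multiple of eight, which is why same_output() takes a block count.  On the GCC 4.5/4.6 series, building with something like -std=c99 -O3 -ftree-vectorizer-verbose=2 should report which loops were vectorised; I believe later releases replaced that flag with -fopt-info-vec.
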