On 18 August 2011 02:43, Michael Hope <michael.h...@linaro.org> wrote:
> On Thu, Aug 18, 2011 at 11:11 AM, Michael Hope <michael.h...@linaro.org> wrote:
>> On Tue, Aug 16, 2011 at 11:32 PM, Richard Sandiford
>> <richard.sandif...@linaro.org> wrote:
>>> Michael Hope <michael.h...@linaro.org> writes:
>>>> I put a build harness around libav and gathered some profiling data.  See:
>>>>  bzr branch lp:~linaro-toolchain-dev/+junk/libav-suite
>>>>
>>>> It includes a Makefile that builds a C only, h.264 only decoder and
>>>> two Creative Commons licensed videos to use as input.
>>>
>>> Thanks for putting this together.
>>>
>>>> README.rst has the basic commands for running ffmpeg and initial perf
>>>> results showing the hot functions.  Dave, 20 % of the time is spent in
>>>> memcpy() so you might want to have a look.
>>>>
>>>> The vectoriser has no effect.  GCC 4.5 is ~17 % faster than 4.6.  I'll
>>>> look into extracting and harnessing the functions themselves later
>>>> this week.
>>>
>>> I had a look why auto-vectorisation wasn't having much effect.
>>> It looks from your profile that most of the hot functions are
>>> operating on 16x16 blocks of pixels with an unknown line stride.
>>> So the C code looks like:
>>>
>>>   for (i = 0; i < 16; i++)
>>>     {
>>>       x[0] = OP (x[0]);
>>>       ...
>>>       x[15] = OP (x[15]);
>>>       x += stride;
>>>     }
>>>
>>> Because of the unknown stride, we're relying on SLP rather than
>>> loop-based vectorisation to handle this kind of loop.  The problem
>>> is that SLP is being run _as_ a loop optimisation.  At the moment,
>>> the gimple data-ref analysis code assumes that, during a loop
>>> optimisation, only simple induction variables are of interest,
>>> so it treats all of the x[...] references above as unrepresentable.
>>> If I move SLP outside the loop optimisations (just as a proof of concept),
>>> then that problem goes away.
>>>
>>> I talked about this with Ira, who said that SLP had been placed
>>> where it is because ivopts (a later loop optimisation) obfuscates
>>> things too much.  As Ira said, we should probably look at (conditionally)
>>> removing the assumption that only IVs are of interest during loop
>>> optimisations.
>>>
>>> Another problem is that SLP supports a much smaller range of
>>> optimisations than the loop-based vectoriser.  There's no support
>>> for promotion, demotion, or conditional expressions.  This affects
>>> things like the weight_h264_pixels* functions, which contain
>>> conditional moves.
>>
>> I had a poke about.  GCC isn't too happy about unrolled loops either.
>> put_h264_chroma_mc8_8_c() is defined via a macro in dsputil_template.c
>> and is manually unwound by eight as:
>>
>>    for(i=0; i<h; i++){\
>>        OP(dst[0], (A*src[0] + B*src[1] + C*src[stride+0] + D*src[stride+1]));\
>>        OP(dst[1], (A*src[1] + B*src[2] + C*src[stride+1] + D*src[stride+2]));\
>>        OP(dst[2], (A*src[2] + B*src[3] + C*src[stride+2] + D*src[stride+3]));\
>>        OP(dst[3], (A*src[3] + B*src[4] + C*src[stride+3] + D*src[stride+4]));\
>>        OP(dst[4], (A*src[4] + B*src[5] + C*src[stride+4] + D*src[stride+5]));\
>>        OP(dst[5], (A*src[5] + B*src[6] + C*src[stride+5] + D*src[stride+6]));\
>>        OP(dst[6], (A*src[6] + B*src[7] + C*src[stride+6] + D*src[stride+7]));\
>>        OP(dst[7], (A*src[7] + B*src[8] + C*src[stride+7] + D*src[stride+8]));\
>>        dst+= stride;\
>>        src+= stride;\
>>    }\
>>
>> where OP is an assignment.
>>
>> Reducing this to:
>>
>> #define A 3
>> #define B 4
>>
>> void unrolled(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
>> {
>>   h /= 8;
>>   for (int i = 0; i < h; i++) {
>>     dst[0] = A*src[0] + B*src[0+1];
>>     dst[1] = A*src[1] + B*src[1+1];
>>     dst[2] = A*src[2] + B*src[2+1];
>>     dst[3] = A*src[3] + B*src[3+1];
>>     dst[4] = A*src[4] + B*src[4+1];
>>     dst[5] = A*src[5] + B*src[5+1];
>>     dst[6] = A*src[6] + B*src[6+1];
>>     dst[7] = A*src[7] + B*src[7+1];
>>     dst += 8;
>>     src += 8;
>>   }
>> }
>>
>> void plain(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
>> {
>>   for (int i = 0; i < h; i++) {
>>     dst[i] = A*src[i] + B*src[i+1];
>>   }
>> }
>>
>> plain() gets vectorised where unrolled() doesn't.
>
> How can I tell the vectoriser that an input is a multiple of something?

Unfortunately, I don't think you can.

> For example, this code:
>
> struct image
> {
>   uint8_t d[4096];
> } __attribute__((aligned(128)));
>
> void fixed(struct image * __restrict dst, struct image * __restrict src, int h)
> {
>   for (int i = 0; i < 16; i++) {
>     dst->d[i] = A*src->d[i] + B*src->d[i+1];
>   }
> }
>
> is lovely with no peeling or argument checking.
>
> I'd like to do a specialisation of a function where I assert that the
> height is a multiple of 16 without unrolling the loop myself.
> Something like:
>
> void multiple(struct image * __restrict dst, struct image * __restrict src, int h)
> {
>   h &= ~15;
>
>   for (int i = 0; i < h; i++) {
>     dst->d[i] = A*src->d[i] + B*src->d[i+1];
>   }
> }
>
> The inner loop looks good but it still includes a prologue that tests
> for h < vector size and an epilogue that handles any remaining bytes.
> The epilogue is only a code size problem as it's normally skipped.
> Still, the skipping requires a branch...

Yes, that would be a nice feature, although I think such hints are rare.

Ira

>
> -- Michael
>
> _______________________________________________
> linaro-toolchain mailing list
> linaro-toolchain@lists.linaro.org
> http://lists.linaro.org/mailman/listinfo/linaro-toolchain
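
For anyone who wants to reproduce the plain()/unrolled() comparison quoted
in this thread, here is a minimal sketch of a self-contained test file.
The two kernels are copied from the reduced test case; outputs_match() is
a hypothetical helper added only for illustration, and the fill pattern for
src is arbitrary. It only checks that the two variants compute the same
bytes when h is a multiple of 8. It says nothing about which one a given
compiler vectorises.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define A 3
#define B 4

/* Hand-unrolled by eight, as in the reduced test case from the thread. */
void unrolled(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
{
  h /= 8;
  for (int i = 0; i < h; i++) {
    dst[0] = A*src[0] + B*src[0+1];
    dst[1] = A*src[1] + B*src[1+1];
    dst[2] = A*src[2] + B*src[2+1];
    dst[3] = A*src[3] + B*src[3+1];
    dst[4] = A*src[4] + B*src[4+1];
    dst[5] = A*src[5] + B*src[5+1];
    dst[6] = A*src[6] + B*src[6+1];
    dst[7] = A*src[7] + B*src[7+1];
    dst += 8;
    src += 8;
  }
}

/* The same computation written as a simple counted loop. */
void plain(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
{
  for (int i = 0; i < h; i++) {
    dst[i] = A*src[i] + B*src[i+1];
  }
}

/* Hypothetical helper (not from the thread): returns 1 if both variants
   produce identical output for h elements, 0 otherwise.  h must be a
   positive multiple of 8, because unrolled() drops any remainder. */
int outputs_match(int h)
{
  assert(h > 0 && h % 8 == 0);

  /* Both kernels read src[i+1], so src needs h + 1 bytes. */
  uint8_t *src = (uint8_t *)malloc((size_t)h + 1);
  uint8_t *d1 = (uint8_t *)calloc((size_t)h, 1);
  uint8_t *d2 = (uint8_t *)calloc((size_t)h, 1);
  int same = 0;

  if (src && d1 && d2) {
    for (int i = 0; i <= h; i++)
      src[i] = (uint8_t)(i * 7 + 3);   /* arbitrary test pattern */
    unrolled(d1, src, h);
    plain(d2, src, h);
    same = memcmp(d1, d2, (size_t)h) == 0;
  }

  free(src);
  free(d1);
  free(d2);
  return same;
}
```

Calling outputs_match(64) from a small main() is enough to exercise both
kernels. To see what the vectoriser does with each function, one option is
to build with -O3 and the compiler's vectoriser report enabled; the exact
flag depends on the GCC version, so check the manual for the release in use.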