> 128 is about upper bound you can expand with sse moves.
> Tuning did not take into account code size and measured only when code
> is in tight loop.
> For GPR-moves limit is around 64.

Thanks for the data - I've not performed measurements with this
implementation yet, but we surely should adjust thresholds to avoid
performance degradations on small sizes.

Michael

On 10 April 2013 22:53, Ondřej Bílka <[email protected]> wrote:
> On Wed, Apr 10, 2013 at 09:53:09PM +0400, Michael Zolotukhin wrote:
>> > Hi, I am writing memcpy for libc. It avoids computed jump and is
>> > much faster on small strings (variant for sandy bridge attached).
>>
>> I'm not sure I get what you meant - could you please explain what
>> computed jumps are?
> computed goto. See Duff's device, it works almost exactly the same.
>>
>> > You must also check performance with cold instruction cache.
>> > Now memcpy(x,y,128) takes 126 bytes which is too much.
>>
>> > Do not align for small sizes. Dependency caused by this erases any gains
>> > that you might get. Keep in mind that in 55% of cases data are already
>> > aligned.
>>
>> Other algorithms are still available and we can use them for small
>> sizes. E.g. for sizes <128 we could emit a loop with GPR-moves and not
>> use vector instructions in it.
>
> 128 is about upper bound you can expand with sse moves.
> Tuning did not take into account code size and measured only when code
> is in tight loop.
> For GPR-moves limit is around 64.
>
> What matters is which code has best performance/size ratio.
>> But that's tuning and I haven't worked on it yet - I'm going to
>> measure performance of all algorithms on all sizes and thus define on
>> which sizes which algorithm is preferable.
>> What I did in this patch is introducing some infrastructure to allow
>> emitting of vector moves in movmem expanding - tuning is certainly
>> possible and needed, but that's out of the scope of the patch.
>>
>> On 10 April 2013 21:43, Ondřej Bílka <[email protected]> wrote:
>> > On Wed, Apr 10, 2013 at 08:14:30PM +0400, Michael Zolotukhin wrote:
>> >> Hi,
>> >> This patch adds a new algorithm of expanding movmem in x86 and
>> >> refactors the existing implementation a bit.
>> >> This is a reincarnation of the patch
>> >> that was sent but wasn't checked a couple of years ago - now I reworked it
>> >> from scratch and divided it into several more manageable parts.
>> >>
>> > Hi, I am writing memcpy for libc. It avoids computed jump and is
>> > much faster on small strings (variant for sandy bridge attached).
>> >
>> >> For now this algorithm isn't used, because cost_models are tuned to
>> >> use existing ones. I believe the new algorithm will give better
>> >> performance, but I'll leave cost-models tuning for a separate patch.
>> >>
>> > You must also check performance with cold instruction cache.
>> > Now memcpy(x,y,128) takes 126 bytes which is too much.
>> >
>> >> Also, I changed get_mem_align_offset to make it handle MEM_REFs as
>> >> well. Probably, there is another way of getting info about alignment -
>> >> if so, please let me know.
>> >>
>> > Do not align for small sizes. Dependency caused by this erases any gains
>> > that you might get. Keep in mind that in 55% of cases data are already
>> > aligned.
>> >
>> > Also in my tests best way to handle prologue is first copy last 16
>> > bytes and then loop.
>> >
>> >> Similar improvements could be done in expanding of memset, but that's
>> >> in progress now and I'm going to proceed with it if this patch is ok.
>> >>
>> >> Bootstrap/make check/Specs2k are passing on i686 and x86_64.
>> >>
>> >> Is it ok for trunk?
>> >>
>> >> Changelog entry:
>> >>
>> >> 2013-04-10  Michael Zolotukhin  <[email protected]>
>> >>
>> >>         * config/i386/i386-opts.h (enum stringop_alg): Add vector_loop.
>> >>         * config/i386/i386.c (expand_set_or_movmem_via_loop): Use
>> >>         adjust_address instead of change_address to keep info about
>> >>         alignment.
>> >>         (emit_strmov): Remove.
>> >>         (emit_memmov): New function.
>> >>         (expand_movmem_epilogue): Refactor to properly handle bigger
>> >>         sizes.
>> >>         (expand_movmem_epilogue): Likewise and return updated rtx for
>> >>         destination.
>> >>         (expand_constant_movmem_prologue): Likewise and return updated
>> >>         rtx for destination and source.
>> >>         (decide_alignment): Refactor, handle vector_loop.
>> >>         (ix86_expand_movmem): Likewise.
>> >>         (ix86_expand_setmem): Likewise.
>> >>         * config/i386/i386.opt (Enum): Add vector_loop to option
>> >>         stringop_alg.
>> >>         * emit-rtl.c (get_mem_align_offset): Compute alignment for
>> >>         MEM_REF.

-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.
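
For readers following the thread: the suggestion quoted above ("first copy last 16 bytes and then loop") could be read roughly as in the C sketch below. This is only one interpretation, not the code attached to the thread and not what the patch emits. The names copy16 and copy_tail_first are hypothetical, and the fixed-size memcpy calls merely stand in for a single unaligned 16-byte vector load/store pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: move one 16-byte block. A compiler will usually
   lower these fixed-size memcpy calls to one unaligned vector load and
   one unaligned vector store, but that is an assumption, not a guarantee. */
static void copy16(unsigned char *dst, const unsigned char *src)
{
    unsigned char tmp[16];
    memcpy(tmp, src, 16);
    memcpy(dst, tmp, 16);
}

/* Copy n bytes (n >= 16, buffers must not overlap) without a byte-wise
   remainder loop and without aligning the destination first. */
static void copy_tail_first(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    assert(n >= 16);

    /* Copy the last 16 bytes first; this covers the n % 16 remainder. */
    copy16(dst + n - 16, src + n - 16);

    /* Then copy 16-byte blocks from the start. The loop stops once the
       remaining bytes are already covered by the tail copy, so the last
       block and the tail may overlap, which is harmless here. */
    for (size_t i = 0; i < n - 16; i += 16)
        copy16(dst + i, src + i);
}
```

The point of this shape is that the remainder is handled by one overlapping block copy instead of an alignment prologue or a size-dependent jump, which is in line with the "do not align for small sizes" remark in the thread.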
