> > Do you know what of the three changes (preferring reps/stosb,
> > CLEAR_RATIO and algorithm choice changes) cause the two speedups
> > on eembc?
>
> An extracted testcase from nnet_test is in https://godbolt.org/z/c8KdsohTP
>
> This loop is transformed to builtin_memcpy and builtin_memset with size 280.
>
> The current strategy for Skylake is {512, unrolled_loop, false} for such
> a size, so it will generate unrolled loops with mov, while the patch
> generates a memcpy/memset libcall and uses vector moves.
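The loop in question is of the kind sketched below (a minimal stand-in, not
the exact code behind the godbolt link): a fixed-count copy/zero loop over
280 bytes that GCC's loop-distribution pass (-ftree-loop-distribute-patterns)
rewrites as __builtin_memcpy/__builtin_memset, which the stringop strategy
table then has to expand.

/* Minimal sketch (not the exact nnet_test code): fixed-size copy and
   clear loops over 280 bytes (35 doubles).  Loop distribution rewrites
   them as __builtin_memcpy (dst, src, 280) and __builtin_memset
   (src, 0, 280), which are then expanded per the strategy table.  */

#define N 35                    /* 35 * sizeof (double) == 280 bytes */

double dst[N], src[N];

void
copy_and_clear (void)
{
  for (int i = 0; i < N; i++)
    dst[i] = src[i];            /* becomes __builtin_memcpy of 280 bytes */
  for (int i = 0; i < N; i++)
    src[i] = 0.0;               /* becomes __builtin_memset of 280 bytes */
}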
This is good - I originally set the table based on this micro-benchmarking
script, and apparently the glibc in use at that time had a more expensive
memcpy for small blocks.

One thing to consider, however, is that calling an external memcpy also has
the additional cost of clobbering all caller-saved registers.  Especially for
code that uses SSE this is painful, since everything then has to go to the
stack.  So I am not completely sure how representative the micro-benchmark is
in this respect, since it does not use any SSE and its register pressure is
generally small.

So with current glibc it seems a libcall is a win for blocks larger than 64
or 128 bytes, at least when the register pressure is not big.  In this
respect your change looks good.

> > My patch generates "rep movsb" only in very limited cases:
> >
> > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> >    load and store for up to 16 * 16 (256) bytes when the data size is
> >    fixed and known.
> > 2. Inline only if data size is known to be <= 256.
> >    a. Use "rep movsb/stosb" with a simple code sequence if the data size
> >       is a constant.
> >    b. Use loop if data size is not a constant.

Aha, this is very hard to read from the algorithm descriptor.  So we still
have the check that maxsize == minsize and use rep movsb only for
constant-sized blocks when the corresponding TARGET macro is defined.  I
think it would be more readable if we introduced rep_1_byte_constant.

The descriptor is supposed to read as a sequence of rules where the first one
that applies wins.  It is not obvious that we have another TARGET_* macro
that causes rep_1_byte to be ignored in some cases.  (The TARGET macro will
also interfere with the micro-benchmarking script.)

Still, I do not understand why a compile-time constant size makes rep
movsb/stosb better than a loop.  Is the CPU special-casing it at decode time
and requiring an explicit mov instruction?  Or is it only because rep movsb
is not good for blocks smaller than 128 bits?

> > As a result, "rep stosb" is generated only when 128 < data size < 256
> > with -mno-sse.
> >
> > > Do you have some data for blocks of size 8...256 being faster with rep1
> > > compared to the unrolled loop, perhaps for more real-world benchmarks?
> >
> > "rep movsb" isn't generated with my patch in this case since
> > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
> > XMM registers.

OK, so I guess:

  {libcall, {{256, rep_1_byte, true}, {256, unrolled_loop, false},
             {-1, libcall, false}}},
  {libcall, {{256, rep_1_loop, true}, {256, unrolled_loop, false},
             {-1, libcall, false}}}};

may still perform better, but the difference between loop and unrolled loop
is within a 10% margin.

So I guess the patch is OK, and we should look into cleaning up the
descriptors.  I can make a patch for that once I understand the logic above.

Honza

> > > The difference seems to get quite big for small blocks in the range
> > > 8...16 bytes.  I noticed that before and sort of concluded that it is
> > > probably the branch prediction playing relatively well for those small
> > > block sizes.  On the other hand, winding up the relatively long
> > > unrolled loop is not very cool just to catch this case.
> > >
> > > Do you know what of the three changes (preferring reps/stosb,
> > > CLEAR_RATIO and algorithm choice changes) cause the two speedups
> > > on eembc?
> >
> > Hongyu, can you find out where the speedup came from?
> >
> > Thanks.
> >
> > --
> > H.J.
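For readers following the descriptor discussion above, the sketch below is a
self-contained, cut-down illustration (not GCC's actual i386 expander code;
pick_alg, MAX_ALGS and main are invented names for the example) of how such a
stringop_algs table is meant to be read: the first size[] entry whose max
covers a known block size applies, max == -1 terminates the list, and
non-constant sizes fall back to unknown_size.  In the real expander the
rep_1_byte entry is additionally gated by a TARGET_* macro, which is exactly
the readability problem raised above.

/* Self-contained sketch, not the real GCC code: a cut-down version of
   the stringop_algs descriptor from i386.h and of the "first rule that
   covers the size applies" reading described above.  */

enum stringop_alg { libcall, rep_1_byte, rep_1_loop, loop, unrolled_loop };

#define MAX_ALGS 4

struct stringop_algs
{
  enum stringop_alg unknown_size;    /* used when the size is not known */
  struct { int max; enum stringop_alg alg; int noalign; } size[MAX_ALGS];
};

/* The memcpy/memset descriptors suggested above, written as a table.  */
static const struct stringop_algs suggested[2] = {
  {libcall, {{256, rep_1_byte, 1}, {256, unrolled_loop, 0},
             {-1, libcall, 0}}},
  {libcall, {{256, rep_1_loop, 1}, {256, unrolled_loop, 0},
             {-1, libcall, 0}}}};

/* Pick the first entry whose max covers the block size; non-constant
   sizes use the unknown_size algorithm.  */
static enum stringop_alg
pick_alg (const struct stringop_algs *algs, long size, int size_known)
{
  if (!size_known)
    return algs->unknown_size;
  for (int i = 0; i < MAX_ALGS; i++)
    if (algs->size[i].max == -1 || size <= algs->size[i].max)
      return algs->size[i].alg;
  return libcall;
}

int
main (void)
{
  /* For the 280-byte block from the testcase above, both 256-byte
     entries are too small, so the suggested memcpy descriptor falls
     through to the {-1, libcall} terminator.  */
  return pick_alg (&suggested[0], 280, 1) == libcall ? 0 : 1;
}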