Hi, I have been thinking about optimizing memcpy and have an idea for transforming copy-loop patterns without having to resolve aliasing. When we are not sure about aliasing, we can still replace the loop with a call to this function (provided we know that n is large):
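
    static void
    __memcpy_loop (char *to, char *from, size_t n, int diff)
    {
      size_t i;
      /* Assumption: diff is to - from; only 0 < diff < n changes
         what a forward byte loop would produce.  */
      int overlap = diff > 0 && (size_t) diff < n;
      if (!overlap)
        memmove (to, from, n);  /* memmove, as to may still precede from */
      else
        {
          /* Copy in chunks of diff bytes, as the forward loop would.  */
          for (i = 0; i + diff <= n; i += diff)
            {
              memmove (to, from, diff);
              from += diff;
              to += diff;
            }
          memmove (to, from, n - i);  /* remaining tail */
        }
    }

We could extract a bit more performance by making the function non-static, so that it is resolved at link time. gcc would then provide its own version, glibc could add an optimized one, and through symbol resolution the glibc version would be called when present.

A second improvement is that patterns like

    short x[n]; // or int x[n];
    for (i = 0; i < n; i++)
      x[i] = c;

could be replaced with a call to wmemset (where the element size matches sizeof (wchar_t)). For initializing with blocks of 8 or 16 bytes it would be easy to add memset8/memset16 taking suitable arguments. We could apply the same symbol-resolution trick for compatibility. Performance would be nearly identical to memset, as they could be implemented as a short prolog followed by a jump to memset.

Comments?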
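
To make the memset8 idea a bit more concrete, here is a rough sketch in portable C. The name memset8_sketch and its signature are assumptions for illustration only, not an existing gcc or glibc interface. Portable C cannot jump into the middle of memset, so the "short prolog" is approximated here by a check that hands the work to memset when all eight bytes of the value are equal; a real libc version would presumably broadcast the value into a register and enter memset's main loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (name and signature are assumptions):
   store the 8-byte value c into dst n times and return dst.  */
void *
memset8_sketch (void *dst, uint64_t c, size_t n)
{
  unsigned char *p = dst;
  size_t i;

  /* Prolog: if all eight bytes of c are identical, a plain memset
     of n * 8 bytes produces the same result.  */
  if (c == (c & 0xff) * UINT64_C (0x0101010101010101))
    return memset (dst, (int) (c & 0xff), n * sizeof c);

  /* General case: copy the pattern element by element.  Using memcpy
     keeps this valid even if dst is not 8-byte aligned.  */
  for (i = 0; i < n; i++)
    memcpy (p + i * sizeof c, &c, sizeof c);
  return dst;
}
```

With this, a loop such as "for (i = 0; i < n; i++) x[i] = c;" over a uint64_t array would become memset8_sketch (x, c, n).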