On Wed, Jul 30, 2008 at 5:57 PM, Denys Vlasenko <[EMAIL PROTECTED]> wrote:
> On Fri, Jul 25, 2008 at 9:08 AM, Agner Fog <[EMAIL PROTECTED]> wrote:
>> Raksit Ashok wrote:
>>> There is a more optimized version for 64-bit:
>>> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s
>>> I think this looks similar to your implementation, Agner.
>>
>> Yes, it is similar to my code.
>
> A 3164-line source file which implements memcpy().
> You've got to be kidding.
> How much of the L1 icache does it blow away in the process?
> I bet it performs wonderfully on microbenchmarks, though.
>
> 2991         .balign 16                          # sadistic alignment strikes again
> 2992 L(bkPxQx):   .int L(bkP0Q0)-L(bkPxQx)       # why use two bytes when we can use four?
>
> Seriously, what possible reason can there be to align
> a randomly accessed data table to 16 bytes?
> 4 bytes I can understand, but 16?
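For what it's worth, here is a rough sketch of the kind of table I mean. It is
my own toy illustration, not code taken from the Solaris file: the same
relative-offset dispatch idea, but with 2-byte .short entries and 4-byte
alignment, on the assumption that every case label sits within +/-32KB of the
table (for a single function that is almost certainly the case). The table
shrinks by half and no padding is wasted on 16-byte alignment.

        .text
        .globl  dispatch_example
        .type   dispatch_example, @function
# int dispatch_example(int idx) -- toy dispatcher, idx must be 0..2
dispatch_example:
        movl    %edi, %edi              # zero-extend the index to 64 bits
        leaq    .Ltbl(%rip), %r10       # table base doubles as offset base
        movswq  (%r10,%rdi,2), %rax     # load sign-extended 2-byte offset
        addq    %r10, %rax              # absolute address of the case label
        jmp     *%rax
.Lcase0:
        movl    $0, %eax
        ret
.Lcase1:
        movl    $1, %eax
        ret
.Lcase2:
        movl    $2, %eax
        ret

        .balign 4                       # 4-byte alignment is plenty here
.Ltbl:
        .short  .Lcase0-.Ltbl           # 2-byte entries instead of .int
        .short  .Lcase1-.Ltbl
        .short  .Lcase2-.Ltbl

        .section .note.GNU-stack,"",@progbits

This assembles with plain GNU as and can be called from C. Whether the
original table really needs 4-byte entries I don't know without reading the
whole file, but a single function's labels cannot be that far from its own
table.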
I'm afraid I sounded a bit confrontational above, so here is a clarification.

I have nothing against making code faster. But there should be some balance
between the -O999 mindset and the -Os mindset. If you have just found a tweak
that gives you a 1.2% speedup in a microbenchmark but the code grew four times
bigger, *stop*. Think about it. "We unrolled the loop two gazillion times and
it's 3% faster now" is a similarly bad idea.

I must admit that I didn't look too closely at
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s
but at first glance it sure looks like someone got carried away a bit.
--
vda