Herve,
I think your code just confirms what I said -- for small nt for() wins,
otherwise memcpy wins. Taking your measurements (they are a bit crude since
they measure overhead as well):
ginaz:sandbox$ time ./hmc.mc 1
real 0m7.294s
user 0m7.239s
sys 0m0.054s
ginaz:sandbox$ time ./hmc 1
real 0m3.773s
user 0m3.746s
sys 0m0.024s
so for() is about 2x faster
ginaz:sandbox$ time ./hmc 3
real 0m4.751s
user 0m4.718s
sys 0m0.023s
ginaz:sandbox$ time ./hmc.mc 3
real 0m3.098s
user 0m3.051s
sys 0m0.045s
memcpy is about 50% faster.
It also proves me right when I said we should only special-case the common case
of scalar recycling and use memcpy for everything else.
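A minimal sketch of that split, assuming a hypothetical helper called copy_ints_hybrid() (the name and signature are made up for illustration and are not in the R sources). It special-cases only a length-1 source and hands everything else to memcpy(); it also assumes src and dest do not overlap, as memcpy() requires:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from the R sources: fill dest
 * (dest_length ints) from src (src_length ints), recycling src as
 * often as needed. src and dest must not overlap. */
void copy_ints_hybrid(int *dest, size_t dest_length,
                      const int *src, size_t src_length)
{
    size_t i, done, chunk;

    assert(src_length > 0);

    /* Common case: scalar recycling. A plain loop avoids paying for
     * one memcpy() call per element. */
    if (src_length == 1) {
        const int value = src[0];
        for (i = 0; i < dest_length; i++)
            dest[i] = value;
        return;
    }

    /* Everything else: copy whole blocks of src, then a final
     * partial block if dest_length is not a multiple of src_length. */
    for (done = 0; done < dest_length; done += chunk) {
        chunk = dest_length - done;
        if (chunk > src_length)
            chunk = src_length;
        memcpy(dest + done, src, chunk * sizeof(int));
    }
}
```

For example, recycling {1, 2, 3} into a destination of length 7 should leave 1 2 3 1 2 3 1 there. Whether a length-1 cutoff is the right threshold is exactly what the timings have to decide; the sketch only shows the shape of the branch.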
Cheers,
Simon
On Apr 23, 2010, at 9:21 PM, Hervé Pagès wrote:
> Follow up...
>
> Hervé Pagès wrote:
>> Hi Matthew,
>> Matthew Dowle wrote:
>>> Just to add some clarification, the suggestion wasn't motivated by speeding
>>> up a length 3 vector being recycled 3.3 million times. But it's a good
>>> point that any change should not make that case slower. I don't know how
>>> much vectorCopy is called really, DUPLICATE_ATOMIC_VECTOR seems more
>>> significant, which doesn't recycle, and already had the FIXME next to it.
>>>
>>> Where copyVector is passed a large source though, then memcpy should be
>>> faster than any of the methods using a for loop through each element
>>> (whether recycling or not), allowing for the usual caveats. What are the
>>> timings like if you repeat the for loop 100 times to get a more robust
>>> timing ? It needs to be a repeat around the for loop only, not the
>>> allocVector whose variance looks to be included in those timings below.
>>> Then increase the size of the source vector, and compare to memcpy.
>> On my system (DELL LATITUDE laptop with 64-bit 9.04 Ubuntu):
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>> void *memcpy2(char *dest, const char *src, size_t n)
>> {
>>     size_t i;
>>     char *d = dest;
>>     /* byte-by-byte copy; return the start of dest, as memcpy() does */
>>     for (i = 0; i < n; i++) *(d++) = *(src++);
>>     return dest;
>> }
>> int main()
>> {
>>     int n, kmax, k;
>>     char *x, *y;
>>     n = 2500;
>>     kmax = 100;
>>     x = (char *) malloc(n);
>>     y = (char *) malloc(n);
>>     for (k = 0; k < kmax; k++)
>>         //memcpy2(y, x, n);
>>         memcpy(y, x, n);
>>     return 0;
>> }
>> Benchmarks:
>> n = 2500, kmax = 100, memcpy2:
>> real 0m8.123s
>> user 0m8.077s
>> sys 0m0.040s
>> n = 2500, kmax = 100, memcpy:
>> real 0m1.076s
>> user 0m1.004s
>> sys 0m0.060s
>> n = 25000, kmax = 10, memcpy2:
>> real 0m8.033s
>> user 0m8.005s
>> sys 0m0.012s
>> n = 25000, kmax = 10, memcpy:
>> real 0m0.353s
>> user 0m0.352s
>> sys 0m0.000s
>> n = 25, kmax = 1, memcpy2:
>> real 0m8.351s
>> user 0m8.313s
>> sys 0m0.008s
>> n = 25, kmax = 1, memcpy:
>> real 0m0.628s
>> user 0m0.624s
>> sys 0m0.004s
>> So depending on the size of the memory area to copy, GNU memcpy() is
>> between 7.5x and 22x faster than using a for() loop. You can reasonably
>> expect that the authors of memcpy() have done their best to optimize
>> the code for most platforms they support, for big and small memory
>> areas, and that if there was a need to branch based on the size of the
>> area, that's already done *inside* memcpy() (I'm just speculating here,
>> I didn't look at memcpy's source code).
>
> So for copying a vector of integers (with recycling of the source),
> yes, a memcpy-based implementation is much faster, for long and short
> vectors (even for a length 3 vector being recycled 3.3 million
> times ;-) ), at least on my system:
>
> nt = 3; ns = 1000; kmax = 100; copy_ints:
>
> real 0m1.206s
> user 0m1.168s
> sys 0m0.040s
>
> nt = 3; ns = 1000; kmax = 100; copy_ints2:
>
> real 0m6.326s
> user 0m6.264s
> sys 0m0.052s
>
>
> Code:
> ===
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> void memcpy_with_recycling_of_src(char *dest, size_t dest_nblocks,
>                                   const char *src, size_t src_nblocks,
>                                   size_t blocksize)
> {
>     int i, imax, q;
>     size_t src_size;
>
>     imax = dest_nblocks - src_nblocks;
>     src_size = src_nblocks * blocksize;
>     /* copy whole blocks of src; i advances once per block */
>     for (i = 0; i <= imax; i += src_nblocks) {
>         memcpy(dest, src, src_size);
>         dest += src_size;
>     }
>     /* copy the leading part of src into whatever space is left */
>     q = dest_nblocks - i;
>     if (q > 0)
>         memcpy(dest, src, q * blocksize);
>     return;
> }
>
> void copy_ints(int *dest, int dest_length,
>                const int *src, int src_length)
> {
>     memcpy_with_recycling_of_src((char *) dest, dest_length,
>                                  (char *) src, src_length,
>                                  sizeof(int));
> }
>
> /* the copyVector() way */
> void copy_ints2(int