Re: [zeromq-dev] Memory pool for zmq_msg_t

Jens Auer Wed, 14 Aug 2019 12:41:21 -0700

Hi,

Maybe this can be combined with a request that I have seen a couple of times to 
be able to configure the allocator used in libzmq? I am thinking of something 
like


struct zmq_allocator {
    void* obj;  // user data handed back to both callbacks
    void* (*allocate)(size_t n, void* obj);
    void (*release)(void* ptr, void* obj);
};

void* useMalloc(size_t n, void*) {return malloc(n);}
void freeMalloc(void* ptr, void*) {free(ptr);}

zmq_allocator& zmq_default_allocator() {
    static zmq_allocator defaultAllocator = {nullptr, useMalloc, freeMalloc};
    return defaultAllocator;
}

The context could then store the allocator for libzmq, and users could set a 
specific allocator as a context option, e.g. with a zmq_ctx_set. A socket 
created for a context can then inherit the default allocator or set a special 
allocator as a socket option.
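
The inheritance rule described above could look roughly like the following. This is only a sketch under my own assumptions: Allocator, Context and Socket are stand-in types invented for illustration, not libzmq internals, and libzmq has no such option today.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator hook (not a libzmq type).
struct Allocator {
    void* obj;
    void* (*allocate)(std::size_t n, void* obj);
    void (*release)(void* ptr, void* obj);
};

static void* mallocAllocate(std::size_t n, void*) { return std::malloc(n); }
static void mallocRelease(void* ptr, void*) { std::free(ptr); }

// Library-wide fallback backed by malloc/free.
static Allocator defaultAllocator = {nullptr, mallocAllocate, mallocRelease};

// Stand-in for a context: holds the allocator its sockets inherit.
class Context {
public:
    void setAllocator(Allocator* a) { assert(a != nullptr); allocator_ = a; }
    Allocator* allocator() const { return allocator_; }
private:
    Allocator* allocator_ = &defaultAllocator;
};

// Stand-in for a socket: uses its own allocator if one was set,
// otherwise whatever the owning context currently provides.
class Socket {
public:
    explicit Socket(const Context& ctx) : ctx_(ctx) {}
    void setAllocator(Allocator* a) { override_ = a; }
    Allocator* allocator() const {
        return override_ ? override_ : ctx_.allocator();
    }
private:
    const Context& ctx_;
    Allocator* override_ = nullptr;
};
```

Here the socket resolves its allocator on every lookup, so a later change on the context is seen by existing sockets. Copying the pointer when the socket is created would be the other obvious choice, and is probably safer if memory allocated before the change must be released by the same allocator.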

class MemoryPool {…}; // hopefully thread-safe

MemoryPool pool;

void* allocatePool(size_t n, void* pool) {
    return static_cast<MemoryPool*>(pool)->allocate(n);
}
void releasePool(void* ptr, void* pool) {
    static_cast<MemoryPool*>(pool)->release(ptr);
}

zmq_allocator pooledAllocator {
    &pool, allocatePool, releasePool
};

void* ctx = zmq_ctx_new();
// ZMQ_ALLOCATOR is the proposed option, it does not exist in libzmq today
// (and zmq_ctx_set currently takes an int value, not a pointer).
zmq_ctx_set(ctx, ZMQ_ALLOCATOR, &pooledAllocator);
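
For anyone who wants to try the idea outside of libzmq, here is a minimal sketch of what a pool behind those two callbacks might look like. The assumptions are mine: fixed-size blocks, a mutex for thread safety, and nullptr when the pool is exhausted or the request is too large. The struct name zmq_allocator_sketch is hypothetical and only mirrors the proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical allocator hook, mirroring the struct proposed in this mail.
struct zmq_allocator_sketch {
    void* obj;
    void* (*allocate)(std::size_t n, void* obj);
    void (*release)(void* ptr, void* obj);
};

// Minimal fixed-size block pool. All storage is allocated once up front.
// Blocks are only suitably aligned for arbitrary types if blockSize is a
// multiple of the required alignment.
class MemoryPool {
public:
    MemoryPool(std::size_t blockSize, std::size_t blockCount)
        : blockSize_(blockSize), storage_(blockSize * blockCount) {
        free_.reserve(blockCount);
        for (std::size_t i = 0; i < blockCount; ++i)
            free_.push_back(storage_.data() + i * blockSize);
    }

    // Returns nullptr if n exceeds the block size or no block is free.
    void* allocate(std::size_t n) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (n > blockSize_ || free_.empty())
            return nullptr;
        void* p = free_.back();
        free_.pop_back();
        return p;
    }

    // Puts a block previously returned by allocate() back on the free list.
    void release(void* ptr) {
        assert(ptr != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(static_cast<unsigned char*>(ptr));
    }

    std::size_t available() {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    std::size_t blockSize_;
    std::vector<unsigned char> storage_;
    std::vector<unsigned char*> free_;
    std::mutex mutex_;
};

// Trampolines that recover the pool from the opaque obj pointer.
static void* allocatePool(std::size_t n, void* pool) {
    return static_cast<MemoryPool*>(pool)->allocate(n);
}
static void releasePool(void* ptr, void* pool) {
    static_cast<MemoryPool*>(pool)->release(ptr);
}
```

A mutex is the simplest way to make this safe between the application thread and the IO thread. A lock-free queue like the one Francesco used in his PR would be the obvious next step if the lock shows up in profiles.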

Cheers,
Jens

> On 13.08.2019 at 13:24, Francesco <[email protected]> wrote:
> 
> Hi all,
> 
> today I've taken some time to attempt building a memory-pooling
> mechanism in ZMQ local_thr/remote_thr benchmarking utilities.
> Here's the result:
> https://github.com/zeromq/libzmq/pull/3631 
> This PR is a work in progress and is a simple modification to show the
> effects of avoiding malloc/free when creating zmq_msg_t with the
> standard benchmark utils of ZMQ.
> 
> In particular the very fast, zero-lock,
> single-producer/single-consumer queue from:
> https://github.com/cameron314/readerwriterqueue 
> is used to maintain between the "remote_thr" main thread and its ZMQ
> background IO thread a list of free buffers that can be used.
> 
> Here are the graphical results:
> with mallocs / no memory pool:
> https://cdn1.imggmi.com/uploads/2019/8/13/9f009b91df394fa945cd2519fd993f50-full.png
> with memory pool:
> https://cdn1.imggmi.com/uploads/2019/8/13/f3ae0d6d58e9721b63129c23fe7347a6-full.png
> 
> Doing the math, the memory-pooled approach shows:
> 
> - mostly the same performance for messages <= 32B
> - +15% pps/throughput increase @ 64B
> - +60% pps/throughput increase @ 128B
> - +70% pps/throughput increase @ 210B
> 
> [the tests were stopped at 210B because my current quick-and-dirty memory
> pool approach has a fixed max msg size of about 210B].
> 
> Honestly this is not a huge speedup, even if still interesting.
> Indeed with these changes the performance now seems to be bounded by
> the "local_thr" side and not by "remote_thr" anymore. The
> zmq background IO thread for local_thr is the only thread at 100% in
> the 2 systems and its "perf top" now shows:
> 
>  15,02%  libzmq.so.5.2.3     [.] zmq::metadata_t::add_ref
>  14,91%  libzmq.so.5.2.3     [.] zmq::v2_decoder_t::size_ready
>   8,94%  libzmq.so.5.2.3     [.] zmq::ypipe_t<zmq::msg_t, 256>::write
>   6,97%  libzmq.so.5.2.3     [.] zmq::msg_t::close
>   5,48%  libzmq.so.5.2.3     [.] zmq::decoder_base_t<zmq::v2_decoder_t, zmq::shared_message_memory_allo
>   5,40%  libzmq.so.5.2.3     [.] zmq::pipe_t::write
>   4,94%  libzmq.so.5.2.3     [.] zmq::shared_message_memory_allocator::inc_ref
>   2,59%  libzmq.so.5.2.3     [.] zmq::msg_t::init_external_storage
>   1,63%  [kernel]            [k] copy_user_enhanced_fast_string
>   1,56%  libzmq.so.5.2.3     [.] zmq::msg_t::data
>   1,43%  libzmq.so.5.2.3     [.] zmq::msg_t::init
>   1,34%  libzmq.so.5.2.3     [.] zmq::pipe_t::check_write
>   1,24%  libzmq.so.5.2.3     [.] zmq::stream_engine_base_t::in_event_internal
>   1,24%  libzmq.so.5.2.3     [.] zmq::msg_t::size
> 
> Do you know what this stacktrace might mean?
> I would expect to have that ZMQ background thread topping in its
> read() system call (from TCP socket)...
> 
> Thanks,
> Francesco
> 
> 
> On Fri, 19 Jul 2019 at 18:15, Francesco
> <[email protected]> wrote:
>> 
>> Hi Yan,
>> Unfortunately I have interrupted my attempts in this area after getting some 
>> strange results (possibly due to the fact that I tried in a complex 
>> application context... I should probably try hacking a simple zeromq example 
>> instead!).
>> 
>> I'm also a bit surprised that nobody has tried and posted online a way to 
>> achieve something similar (Memory pool zmq send) ... But anyway it remains 
>> in my plans to try that out when I have a bit more spare time...
>> If you manage to have some results earlier, I would be eager to know :-)
>> 
>> Francesco
>> 
>> 
>> On Fri, 19 Jul 2019, 04:02 Yan, Liming (NSB - CN/Hangzhou) 
>> <[email protected]> wrote:
>>> 
>>> Hi Francesco,
>>>   Could you please share the final solution and benchmark results for plan 
>>> 2?  Many thanks.
>>>   I'm asking because I tried something similar before with 
>>> zmq_msg_init_data() and zmq_msg_send(), but failed because of two issues.  
>>> 1)  My process runs in the background for a long time, and I found that 
>>> it occupies more and more memory until it exhausts the system memory. It 
>>> seems there is a memory leak with this approach.   2) I provided *ffn for 
>>> deallocation, but buffers are freed much more slowly than they are 
>>> consumed, so my own customized pool could also be exhausted. How do you 
>>> solve this?
>>>   I had to go back to using zmq_send(). I know it has a memory copy 
>>> penalty, but it's the easiest and most stable way to send a message. I'm 
>>> still using 0MQ 4.1.x.
>>>   Thanks.
>>> 
>>> BR
>>> Yan Limin
>>> 
>>> -----Original Message-----
>>> From: zeromq-dev [mailto:[email protected]] On Behalf Of 
>>> Luca Boccassi
>>> Sent: Friday, July 05, 2019 4:58 PM
>>> To: ZeroMQ development list <[email protected]>
>>> Subject: Re: [zeromq-dev] Memory pool for zmq_msg_t
>>> 
>>> There's no need to change the source for experimenting, you can just use 
>>> _init_data without a callback and with a callback (yes the first case will 
>>> leak memory but it's just a test), and measure the difference between the 
>>> two cases. You can then immediately see if it's worth pursuing further 
>>> optimisations or not.
>>> 
>>> _external_storage is an implementation detail, and it's non-shared because 
>>> it's used in the receive case only, as it's used with a reference to the 
>>> TCP buffer used in the system call for zero-copy receives. Exposing that 
>>> means that those kind of messages could not be used with pub-sub or 
>>> radio-dish, as they can't have multiple references without copying them, 
>>> which means there would be a semantic difference between the different 
>>> message initialisation APIs, unlike now when the difference is only in who 
>>> owns the buffer. It would make the API quite messy in my opinion, and be 
>>> quite confusing as pub/sub is probably the most well known pattern.
>>> 
>>> On Thu, 2019-07-04 at 23:20 +0200, Francesco wrote:
>>>> Hi Luca,
>>>> thanks for the details. Indeed I understand why the "content_t" needs
>>>> to be allocated dynamically: it's just like the control block used by
>>>> STL's std::shared_ptr<>.
>>>> 
>>>> And you're right: I'm not sure how much gain there is in removing 100%
>>>> of malloc operations from my TX path... still I would be curious to
>>>> find it out but right now it seems I need to patch ZMQ source code to
>>>> achieve that.
>>>> 
>>>> Anyway I wonder if it could be possible to expose in the public API a
>>>> method like "zmq::msg_t::init_external_storage()" that, AFAICS, allows
>>>> to create a non-shared zero-copy long message... it appears to be used
>>>> only by v2 decoder internally right now...
>>>> Is there a specific reason why that's not accessible from the public
>>>> API?
>>>> 
>>>> Thanks,
>>>> Francesco
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Thu, 4 Jul 2019 at 20:25, Luca Boccassi <[email protected]> wrote:
>>>>> Another reason for that small struct to be on the heap is so that it
>>>>> can be shared among all the copies of the message (eg: a pub socket
>>>>> has N copies of the message on the stack, one for each subscriber).
>>>>> The struct has an atomic counter in it, so that when all the copies
>>>>> of the message on the stack have been closed, the userspace buffer
>>>>> deallocation callback can be invoked. If the atomic counter were on
>>>>> the stack inlined in the message, this wouldn't work.
>>>>> So even if room were to be found, a malloc would still be needed.
>>>>> 
>>>>> If you _really_ are worried about it, and testing shows it makes a
>>>>> difference, then one option could be to pre-allocate a set of these
>>>>> metadata structures at startup, and just assign them when the
>>>>> message is created. It's possible, but increases complexity quite a
>>>>> bit, so it needs to be worth it.
>>>>> 
>>>>> On Thu, 2019-07-04 at 17:42 +0100, Luca Boccassi wrote:
>>>>>> The second malloc cannot be avoided, but it's tiny and fixed in size at
>>>>>> compile time, so the compiler and glibc will be able to optimize it to
>>>>>> death.
>>>>>> 
>>>>>> The reason for that is that there's not enough room in the 64 bytes to
>>>>>> store that structure, and increasing the message allocation on the
>>>>>> stack past 64 bytes means it will no longer fit in a single cache
>>>>>> line, which will incur a performance penalty far worse than the small
>>>>>> malloc (I tested this some time ago). That is of course unless you are
>>>>>> running on s390 or a POWER with 256 bytes cacheline, but given it's
>>>>>> part of the ABI it would be a bit of a mess for the benefit of very few
>>>>>> users if any.
>>>>>> 
>>>>>> So I'd recommend to just go with the second plan, and compare what the
>>>>>> result is when passing a deallocation function vs not passing it (yes
>>>>>> it will leak the memory but it's just for the test). My bet is that the
>>>>>> difference will not be that large.
>>>>>> 
>>>>>> On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
>>>>>>> Hi Stephan, Hi Luca,
>>>>>>> 
>>>>>>> thanks for your hints. However I inspected
>>>>>>> https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
>>>>>>> and I don't think it's saving from malloc()...  see my point 2)
>>>>>>> below:
>>>>>>> 
>>>>>>> Indeed I realized that probably current ZMQ API does not allow me to
>>>>>>> achieve the 100% of what I intended to do.
>>>>>>> Let me rephrase my target: my target is to be able to
>>>>>>> - memory pool creation: do a large memory allocation of, say, 1M
>>>>>>> zmq_msg_t only at the start of my program; let's say I create all
>>>>>>> these zmq_msg_t of a size of 2k bytes each (let's assume this is the
>>>>>>> max size of message possible in my app)
>>>>>>> - during application lifetime: call zmq_msg_send() at anytime
>>>>>>> always avoiding malloc() operations (just picking the first
>>>>>>> available unused entry of zmq_msg_t from the memory pool).
>>>>>>> 
>>>>>>> Initially I thought that was possible but I think I have identified 2
>>>>>>> blocking issues:
>>>>>>> 1) If I try to recycle zmq_msg_t directly: in this case I will fail
>>>>>>> because I cannot really change only the "size" member of a
>>>>>>> zmq_msg_t without reallocating it... so that I'm forced (in my
>>>>>>> example) to always send 2k bytes out (!!)
>>>>>>> 2) if I do create only a memory pool of buffers of 2k bytes and
>>>>>>> then wrap the first available buffer inside a zmq_msg_t
>>>>>>> (allocated on the stack, not in the heap): in this case I need to
>>>>>>> know when the internals of ZMQ have completed using the zmq_msg_t
>>>>>>> and thus when I can mark that buffer as available again in my
>>>>>>> memory pool. However I see that zmq_msg_init_data() ZMQ code
>>>>>>> contains:
>>>>>>> 
>>>>>>>    //  Initialize constant message if there's no need to deallocate
>>>>>>>    if (ffn_ == NULL) {
>>>>>>> ...
>>>>>>>        _u.cmsg.data = data_;
>>>>>>>        _u.cmsg.size = size_;
>>>>>>> ...
>>>>>>>    } else {
>>>>>>> ...
>>>>>>>        _u.lmsg.content =
>>>>>>>          static_cast<content_t *> (malloc (sizeof (content_t)));
>>>>>>> ...
>>>>>>>        _u.lmsg.content->data = data_;
>>>>>>>        _u.lmsg.content->size = size_;
>>>>>>>        _u.lmsg.content->ffn = ffn_;
>>>>>>>        _u.lmsg.content->hint = hint_;
>>>>>>>        new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
>>>>>>>    }
>>>>>>> 
>>>>>>> So I skip the malloc() operation only if I pass ffn_ == NULL. The
>>>>>>> problem is that if I pass ffn_ == NULL, then I have no way to know
>>>>>>> when the internals of ZMQ have completed using the zmq_msg_t...
>>>>>>> 
>>>>>>> Any way to work around either issue 1) or issue 2) ?
>>>>>>> 
>>>>>>> I understand that the malloc is just of sizeof(content_t) ~= 40B...
>>>>>>> but still I'd like to avoid it...
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> Francesco
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, 4 Jul 2019 at 14:58, Stephan Opfer <[email protected]> wrote:
>>>>>>>> On 04.07.19 14:29, Luca Boccassi wrote:
>>>>>>>>> How users make use of these primitives is up to them though, I don't
>>>>>>>>> think anything special was shared before, as far as I remember.
>>>>>>>> 
>>>>>>>> Some example can be found here:
>>>>>>>> https://github.com/dasys-lab/capnzero/tree/master/capnzero/src
>>>>>>>> 
>>>>>>>> The classes Publisher and Subscriber should replace the publisher and
>>>>>>>> subscriber in a former Robot-Operating-System-based System. I
>>>>>>>> hope that the subscriber is actually using the method Luca is
>>>>>>>> talking about on the receiving side.
>>>>>>>> 
>>>>>>>> The message data here is a Cap'n Proto container that we "simply"
>>>>>>>> serialize and send via ZeroMQ -> therefore the name Cap'nZero ;-)
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> zeromq-dev mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> --
>>> Kind regards,
>>> Luca Boccassi

_______________________________________________
zeromq-dev mailing list
[email protected]
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
