Hi,
Maybe this can be combined with a request that I have seen a couple of times:
being able to configure the allocator used in libzmq. I am thinking of
something like:
struct zmq_allocator {
    void* obj;                               // opaque state handed back to both callbacks
    void* (*allocate)(size_t n, void* obj);  // function pointers, so the struct stays C-compatible
    void (*release)(void* ptr, void* obj);
};
void* useMalloc(size_t n, void*) { return malloc(n); }
void freeMalloc(void* ptr, void*) { free(ptr); }
zmq_allocator& zmq_default_allocator() {
    static zmq_allocator defaultAllocator = {nullptr, useMalloc, freeMalloc};
    return defaultAllocator;
}
The context could then store the allocator for libzmq, and users could set a
specific allocator as a context option, e.g. with zmq_ctx_set. A socket
created for a context could then inherit the context's allocator or set a
special allocator as a socket option.
class MemoryPool {…}; // hopefully thread-safe
MemoryPool pool;
void* allocatePool(size_t n, void* pool) {
    return static_cast<MemoryPool*>(pool)->allocate(n);
}
void releasePool(void* ptr, void* pool) {
    static_cast<MemoryPool*>(pool)->release(ptr);
}
zmq_allocator pooledAllocator {
    &pool, allocatePool, releasePool
};
void* ctx = zmq_ctx_new();
// ZMQ_ALLOCATOR is the proposed option, not an existing one. Today's
// zmq_ctx_set() only takes an int value, so a pointer-valued option would
// need a new setter or an extended signature.
zmq_ctx_set(ctx, ZMQ_ALLOCATOR, &pooledAllocator);
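
To make the idea a bit more concrete, here is a minimal sketch of a pool that could sit behind the allocate/release callback pair proposed above. Everything in it is an assumption for illustration: FixedBlockPool, allocator_callbacks, poolAllocateCb and poolReleaseCb are hypothetical names, none of this exists in libzmq, and a mutex is simply the easiest way to get the thread safety the proposal hopes for.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical pool of equally sized blocks carved from one allocation.
// Blocks are only byte-aligned relative to each other, which is fine for
// raw message payloads but not for arbitrary typed objects.
class FixedBlockPool {
public:
    FixedBlockPool(std::size_t blockSize, std::size_t blockCount)
        : blockSize_(blockSize), storage_(blockSize * blockCount) {
        assert(blockSize > 0);
        free_.reserve(blockCount);
        for (std::size_t i = 0; i < blockCount; ++i)
            free_.push_back(storage_.data() + i * blockSize);
    }

    // Returns a block, or nullptr if n is too large or the pool is empty.
    void* allocate(std::size_t n) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (n > blockSize_ || free_.empty()) return nullptr;
        void* p = free_.back();
        free_.pop_back();
        return p;
    }

    // Puts a block obtained from allocate() back on the free list.
    void release(void* ptr) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(static_cast<unsigned char*>(ptr));
    }

    std::size_t available() {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    std::size_t blockSize_;
    std::vector<unsigned char> storage_;
    std::vector<unsigned char*> free_;
    std::mutex mutex_;
};

// Same callback shape as in the proposal: the pool travels through void*.
struct allocator_callbacks {
    void* obj;
    void* (*allocate)(std::size_t n, void* obj);
    void (*release)(void* ptr, void* obj);
};

void* poolAllocateCb(std::size_t n, void* obj) {
    return static_cast<FixedBlockPool*>(obj)->allocate(n);
}
void poolReleaseCb(void* ptr, void* obj) {
    static_cast<FixedBlockPool*>(obj)->release(ptr);
}
```

Returning nullptr on exhaustion is a deliberate simplification here; a real allocator hook would need a defined policy for that case.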
Cheers,
Jens
> On 13.08.2019 at 13:24, Francesco <[email protected]> wrote:
>
> Hi all,
>
> today I've taken some time to attempt building a memory-pooling
> mechanism in ZMQ local_thr/remote_thr benchmarking utilities.
> Here's the result:
> https://github.com/zeromq/libzmq/pull/3631
> This PR is a work in progress and is a simple modification to show the
> effects of avoiding malloc/free when creating zmq_msg_t with the
> standard benchmark utils of ZMQ.
>
> In particular the very fast, zero-lock,
> single-producer/single-consumer queue from:
> https://github.com/cameron314/readerwriterqueue
> is used to maintain between the "remote_thr" main thread and its ZMQ
> background IO thread a list of free buffers that can be used.
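
For readers who have not looked at the PR: the sketch below shows the general idea of a lock-free single-producer/single-consumer free list of buffers. It is not the readerwriterqueue library used in the PR, just a small bounded ring swapped in for illustration, and SpscRing is a hypothetical name. My assumption is that the thread running the deallocation callback pushes buffers back and the sending thread pops them.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical bounded single-producer/single-consumer ring.
// Exactly one thread may call push() and exactly one may call pop().
template <typename T, std::size_t Capacity>
class SpscRing {
public:
    // Producer side. Returns false when the ring is full.
    bool push(T value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % (Capacity + 1);
        if (next == head_.load(std::memory_order_acquire)) return false;
        slots_[tail] = value;
        // Publish the slot only after it has been written.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false when the ring is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        out = slots_[head];
        // Hand the slot back to the producer only after it has been read.
        head_.store((head + 1) % (Capacity + 1), std::memory_order_release);
        return true;
    }

private:
    T slots_[Capacity + 1] = {};  // one slot stays unused to tell full from empty
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

The initial fill with free buffers has to happen before the second thread starts using the ring, since filling counts as producer activity.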
>
> Here are the graphical results:
> with mallocs / no memory pool:
> https://cdn1.imggmi.com/uploads/2019/8/13/9f009b91df394fa945cd2519fd993f50-full.png
> with memory pool:
> https://cdn1.imggmi.com/uploads/2019/8/13/f3ae0d6d58e9721b63129c23fe7347a6-full.png
>
> Doing the math, the memory-pooled approach shows:
>
> - roughly the same performance for messages <= 32B
> - +15% pps/throughput increase @ 64B
> - +60% pps/throughput increase @ 128B
> - +70% pps/throughput increase @ 210B
>
> [the tests were stopped at 210B because my current quick-and-dirty memory
> pool approach has a fixed max msg size of about 210B].
>
> Honestly this is not a huge speedup, though it is still interesting.
> With these changes the performance now seems to be bounded by
> the "local_thr" side and no longer by "remote_thr". Indeed the
> zmq background IO thread for local_thr is the only thread at 100% in
> the 2 systems and its "perf top" now shows:
>
> 15,02% libzmq.so.5.2.3 [.] zmq::metadata_t::add_ref
> 14,91% libzmq.so.5.2.3 [.] zmq::v2_decoder_t::size_ready
> 8,94% libzmq.so.5.2.3 [.] zmq::ypipe_t<zmq::msg_t, 256>::write
> 6,97% libzmq.so.5.2.3 [.] zmq::msg_t::close
> 5,48% libzmq.so.5.2.3 [.] zmq::decoder_base_t<zmq::v2_decoder_t, zmq::shared_message_memory_allo
> 5,40% libzmq.so.5.2.3 [.] zmq::pipe_t::write
> 4,94% libzmq.so.5.2.3 [.] zmq::shared_message_memory_allocator::inc_ref
> 2,59% libzmq.so.5.2.3 [.] zmq::msg_t::init_external_storage
> 1,63% [kernel] [k] copy_user_enhanced_fast_string
> 1,56% libzmq.so.5.2.3 [.] zmq::msg_t::data
> 1,43% libzmq.so.5.2.3 [.] zmq::msg_t::init
> 1,34% libzmq.so.5.2.3 [.] zmq::pipe_t::check_write
> 1,24% libzmq.so.5.2.3 [.] zmq::stream_engine_base_t::in_event_internal
> 1,24% libzmq.so.5.2.3 [.] zmq::msg_t::size
>
> Do you know what this profile might mean?
> I would expect that ZMQ background thread to spend most of its time in
> its read() system call (from the TCP socket)...
>
> Thanks,
> Francesco
>
>
> On Fri, 19 Jul 2019 at 18:15, Francesco
> <[email protected]> wrote:
>>
>> Hi Yan,
>> Unfortunately I have interrupted my attempts in this area after getting some
>> strange results (possibly because I tried in a complex application
>> context... I should probably try hacking a simple zeromq example
>> instead!).
>>
>> I'm also a bit surprised that nobody has tried and posted online a way to
>> achieve something similar (memory pool for zmq send)... But anyway, it
>> remains in my plans to try that out when I have a bit more spare time...
>> If you manage to get some results earlier, I would be eager to know :-)
>>
>> Francesco
>>
>>
>> On Fri, 19 Jul 2019 at 04:02, Yan, Liming (NSB - CN/Hangzhou)
>> <[email protected]> wrote:
>>>
>>> Hi, Francesco
>>> Could you please share the final solution and benchmark result for plan
>>> 2? Big thanks.
>>> I'm concerned about this because I tried something similar before with
>>> zmq_msg_init_data() and zmq_msg_send() but failed because of two issues.
>>> 1) My process runs in the background for a long time, and I eventually
>>> found that it occupies more and more memory until it exhausts the system
>>> memory. It seems there's a memory leak with this approach.
>>> 2) I provided *ffn for deallocation, but memory is freed back much more
>>> slowly than it is consumed, so my own customized pool could eventually be
>>> exhausted too. How do you solve this?
>>> I had to go back to using zmq_send(). I know it has a memory copy penalty
>>> but it's the easiest and most stable way to send a message. I'm still using
>>> 0MQ 4.1.x.
>>> Thanks.
>>>
>>> BR
>>> Yan Limin
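
One way to cope with the second issue (buffers coming back more slowly than they are handed out) is to let the pool degrade to the heap instead of failing. The sketch below is only an illustration under that assumption: FallbackPool and releaseToPool are hypothetical names, and the only thing taken from libzmq is the shape of the deallocation callback, void (void *data, void *hint), which is what zmq_msg_init_data() expects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical pool that falls back to malloc() when no pooled buffer is free.
class FallbackPool {
public:
    FallbackPool(std::size_t bufSize, std::size_t count)
        : bufSize_(bufSize), storage_(bufSize * count) {
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(storage_.data() + i * bufSize);
    }

    // Serves from the pool when possible, otherwise from the heap.
    void* acquire(std::size_t n) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (n <= bufSize_ && !free_.empty()) {
            void* p = free_.back();
            free_.pop_back();
            return p;
        }
        ++heapFallbacks_;
        return std::malloc(n);
    }

    // Pooled buffers go back on the free list, heap buffers are freed.
    void release(void* data) {
        unsigned char* p = static_cast<unsigned char*>(data);
        if (owns(p)) {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(p);
        } else {
            std::free(data);
        }
    }

    std::size_t heapFallbacks() {
        std::lock_guard<std::mutex> lock(mutex_);
        return heapFallbacks_;
    }

private:
    // std::less gives a well-defined ordering even for unrelated pointers.
    bool owns(const unsigned char* p) const {
        std::less<const unsigned char*> lt;
        const unsigned char* begin = storage_.data();
        return !lt(p, begin) && lt(p, begin + storage_.size());
    }

    std::size_t bufSize_;
    std::vector<unsigned char> storage_;
    std::vector<unsigned char*> free_;
    std::size_t heapFallbacks_ = 0;
    std::mutex mutex_;
};

// Same signature as a zmq_msg_init_data() free function; hint is the pool.
void releaseToPool(void* data, void* hint) {
    static_cast<FallbackPool*>(hint)->release(data);
}
```

This keeps the pool itself from running dry, but it does not bound total memory: if the receiver stays slower than the sender, the heap fallbacks will still pile up unless the sender is throttled in some way.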
>>>
>>> -----Original Message-----
>>> From: zeromq-dev [mailto:[email protected]] On Behalf Of
>>> Luca Boccassi
>>> Sent: Friday, July 05, 2019 4:58 PM
>>> To: ZeroMQ development list <[email protected]>
>>> Subject: Re: [zeromq-dev] Memory pool for zmq_msg_t
>>>
>>> There's no need to change the source for experimenting, you can just use
>>> _init_data without a callback and with a callback (yes the first case will
>>> leak memory but it's just a test), and measure the difference between the
>>> two cases. You can then immediately see if it's worth pursuing further
>>> optimisations or not.
>>>
>>> _external_storage is an implementation detail, and it's non-shared because
>>> it's used in the receive case only, as it's used with a reference to the
>>> TCP buffer used in the system call for zero-copy receives. Exposing that
>>> means that those kinds of messages could not be used with pub-sub or
>>> radio-dish, as they can't have multiple references without copying them,
>>> which means there would be a semantic difference between the different
>>> message initialisation APIs, unlike now when the difference is only in who
>>> owns the buffer. It would make the API quite messy in my opinion, and be
>>> quite confusing as pub/sub is probably the most well known pattern.
>>>
>>> On Thu, 2019-07-04 at 23:20 +0200, Francesco wrote:
>>>> Hi Luca,
>>>> thanks for the details. Indeed I understand why the "content_t" needs
>>>> to be allocated dynamically: it's just like the control block used by
>>>> STL's std::shared_ptr<>.
>>>>
>>>> And you're right: I'm not sure how much gain there is in removing 100%
>>>> of malloc operations from my TX path... still I would be curious to
>>>> find it out but right now it seems I need to patch ZMQ source code to
>>>> achieve that.
>>>>
>>>> Anyway I wonder if it could be possible to expose in the public API a
>>>> method like "zmq::msg_t::init_external_storage()" that, AFAICS, allows
>>>> creating a non-shared zero-copy long message... it appears to be used
>>>> only by the v2 decoder internally right now...
>>>> Is there a specific reason why that's not accessible from the public
>>>> API?
>>>>
>>>> Thanks,
>>>> Francesco
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, 4 Jul 2019 at 20:25, Luca Boccassi <[email protected]> wrote:
>>>>> Another reason for that small struct to be on the heap is so that it
>>>>> can be shared among all the copies of the message (eg: a pub socket
>>>>> has N copies of the message on the stack, one for each subscriber).
>>>>> The struct has an atomic counter in it, so that when all the copies
>>>>> of the message on the stack have been closed, the userspace buffer
>>>>> deallocation callback can be invoked. If the atomic counter were on
>>>>> the stack inlined in the message, this wouldn't work.
>>>>> So even if room were to be found, a malloc would still be needed.
>>>>>
>>>>> If you _really_ are worried about it, and testing shows it makes a
>>>>> difference, then one option could be to pre-allocate a set of these
>>>>> metadata structures at startup, and just assign them when the
>>>>> message is created. It's possible, but increases complexity quite a
>>>>> bit, so it needs to be worth it.
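
The sharing argument above can be sketched roughly as follows. This is not libzmq's actual code: SharedContent, MsgHandle, makeMsg, copyMsg, closeMsg and countFree are hypothetical names, and the sketch only assumes what the paragraph says, namely a small heap block with an atomic counter that all copies point to, with the user's deallocation callback fired when the last copy is closed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

typedef void (free_fn)(void* data, void* hint);

// Hypothetical stand-in for the small heap-allocated metadata block.
struct SharedContent {
    SharedContent(void* d, std::size_t s, free_fn* f, void* h)
        : data(d), size(s), ffn(f), hint(h), refcnt(1) {}
    void* data;
    std::size_t size;
    free_fn* ffn;
    void* hint;
    std::atomic<int> refcnt;  // shared by every copy of the message
};

// Hypothetical message handle: lives on the stack, cheap to copy.
struct MsgHandle {
    SharedContent* content;
};

MsgHandle makeMsg(void* data, std::size_t size, free_fn* ffn, void* hint) {
    return MsgHandle{new SharedContent(data, size, ffn, hint)};
}

// A copy only bumps the shared counter; the payload is not duplicated.
MsgHandle copyMsg(const MsgHandle& m) {
    m.content->refcnt.fetch_add(1, std::memory_order_relaxed);
    return MsgHandle{m.content};
}

// The last close runs the user's callback and frees the metadata block.
void closeMsg(MsgHandle& m) {
    if (m.content->refcnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        if (m.content->ffn) m.content->ffn(m.content->data, m.content->hint);
        delete m.content;
    }
    m.content = nullptr;
}

// Example callback: counts invocations through the hint pointer.
void countFree(void*, void* hint) { ++*static_cast<int*>(hint); }
```

If the counter lived inside each stack copy instead, no copy could tell when the others had been closed, which is why the block has to be shared.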
>>>>>
>>>>> On Thu, 2019-07-04 at 17:42 +0100, Luca Boccassi wrote:
>>>>>> The second malloc cannot be avoided, but it's tiny and fixed in size
>>>>>> at compile time, so the compiler and glibc will be able to optimize
>>>>>> it to death.
>>>>>>
>>>>>> The reason for that is that there's not enough room in the 64 bytes
>>>>>> to store that structure, and increasing the message allocation on the
>>>>>> stack past 64 bytes means it will no longer fit in a single cache
>>>>>> line, which will incur a performance penalty far worse than the small
>>>>>> malloc (I tested this some time ago). That is of course unless you
>>>>>> are running on s390 or a POWER with 256 bytes cacheline, but given
>>>>>> it's part of the ABI it would be a bit of a mess for the benefit of
>>>>>> very few users if any.
>>>>>>
>>>>>> So I'd recommend to just go with the second plan, and compare what
>>>>>> the result is when passing a deallocation function vs not passing it
>>>>>> (yes it will leak the memory but it's just for the test). My bet is
>>>>>> that the difference will not be that large.
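
The cache-line arithmetic can be checked with a few lines. This is a sketch under stated assumptions: Msg64 and MsgInline are hypothetical structs, the 40 bytes of inline metadata are the rough estimate given elsewhere in this thread, and the explicit 64-byte alignment is my own choice to make the effect visible.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 64-byte message, matching the size discussed above.
struct alignas(64) Msg64 {
    unsigned char raw[64];
};

// The same message with about 40 bytes of metadata stored inline (assumed size).
struct alignas(64) MsgInline {
    unsigned char raw[64];
    unsigned char content[40];
};

static_assert(sizeof(Msg64) == 64, "fits exactly one 64-byte cache line");
static_assert(sizeof(MsgInline) == 128, "104 bytes are padded up to two lines");

// Number of cache lines an aligned object of the given size occupies.
constexpr std::size_t cacheLinesSpanned(std::size_t bytes, std::size_t line = 64) {
    return (bytes + line - 1) / line;
}
```

With a 256-byte line the inlined variant would still fit in one line, which is the s390/POWER exception mentioned above.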
>>>>>>
>>>>>> On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
>>>>>>> Hi Stephan, Hi Luca,
>>>>>>>
>>>>>>> thanks for your hints. However I inspected
>>>>>>> https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
>>>>>>> and I don't think it's avoiding malloc()... see my point 2) below:
>>>>>>>
>>>>>>> Indeed I realized that the current ZMQ API probably does not allow
>>>>>>> me to achieve 100% of what I intended to do.
>>>>>>> Let me rephrase my target: my target is to be able to
>>>>>>> - memory pool creation: do a large memory allocation of, say, 1M
>>>>>>> zmq_msg_t only at the start of my program; let's say I create all
>>>>>>> these zmq_msg_t of a size of 2k bytes each (let's assume this is
>>>>>>> the max size of message possible in my app)
>>>>>>> - during application lifetime: call zmq_msg_send() at any time,
>>>>>>> always avoiding malloc() operations (just picking the first
>>>>>>> available unused entry of zmq_msg_t from the memory pool).
>>>>>>>
>>>>>>> Initially I thought that was possible but I think I have identified
>>>>>>> 2 blocking issues:
>>>>>>> 1) If I try to recycle zmq_msg_t directly: in this case I will fail
>>>>>>> because I cannot really change only the "size" member of a
>>>>>>> zmq_msg_t without reallocating it... so that I'm forced (in my
>>>>>>> example) to always send 2k bytes out (!!)
>>>>>>> 2) if I do create only a memory pool of buffers of 2k bytes and
>>>>>>> then wrap the first available buffer inside a zmq_msg_t
>>>>>>> (allocated on the stack, not in the heap): in this case I need to
>>>>>>> know when the internals of ZMQ have completed using the zmq_msg_t
>>>>>>> and thus when I can mark that buffer as available again in my
>>>>>>> memory pool. However I see that zmq_msg_init_data() ZMQ code
>>>>>>> contains:
>>>>>>>
>>>>>>>     // Initialize constant message if there's no need to deallocate
>>>>>>>     if (ffn_ == NULL) {
>>>>>>>         ...
>>>>>>>         _u.cmsg.data = data_;
>>>>>>>         _u.cmsg.size = size_;
>>>>>>>         ...
>>>>>>>     } else {
>>>>>>>         ...
>>>>>>>         _u.lmsg.content =
>>>>>>>             static_cast<content_t *> (malloc (sizeof (content_t)));
>>>>>>>         ...
>>>>>>>         _u.lmsg.content->data = data_;
>>>>>>>         _u.lmsg.content->size = size_;
>>>>>>>         _u.lmsg.content->ffn = ffn_;
>>>>>>>         _u.lmsg.content->hint = hint_;
>>>>>>>         new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
>>>>>>>     }
>>>>>>>
>>>>>>> So I skip the malloc() operation only if I pass ffn_ == NULL. The
>>>>>>> problem is that if I pass ffn_ == NULL, then I have no way to know
>>>>>>> when the internals of ZMQ have completed using the zmq_msg_t...
>>>>>>>
>>>>>>> Any way to work around either issue 1) or issue 2)?
>>>>>>>
>>>>>>> I understand that the malloc is just of sizeof(content_t) ~= 40B...
>>>>>>> but still I'd like to avoid it...
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Francesco
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 4 Jul 2019 at 14:58, Stephan Opfer <[email protected]> wrote:
>>>>>>>> On 04.07.19 14:29, Luca Boccassi wrote:
>>>>>>>>> How users make use of these primitives is up to them though, I don't
>>>>>>>>> think anything special was shared before, as far as I remember.
>>>>>>>>
>>>>>>>> Some example can be found here:
>>>>>>>> https://github.com/dasys-lab/capnzero/tree/master/capnzero/src
>>>>>>>>
>>>>>>>> The classes Publisher and Subscriber should replace the publisher and
>>>>>>>> subscriber in a former Robot-Operating-System-based System. I hope that
>>>>>>>> the subscriber is actually using the method Luca is talking about on the
>>>>>>>> receiving side.
>>>>>>>>
>>>>>>>> The message data here is a Cap'n Proto container that we "simply"
>>>>>>>> serialize and send via ZeroMQ -> therefore the name Cap'nZero ;-)
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> zeromq-dev mailing list
>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>>
>>>>>>>> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>> --
>>> Kind regards,
>>> Luca Boccassi