Hi Brett,
thanks for your email; it helped a lot!
I think you're right. Here's what I get when varying the spin_loop duration:

* spin_loop_time = 2.5 usec (my original test) -> ZMQ background thread at 100% cpu usage
* spin_loop_time = 2.0 usec   -> 100% cpu usage
* spin_loop_time = 1.6 usec   -> 100% cpu usage
* spin_loop_time = 1.590 usec -> 100% cpu usage
* spin_loop_time = 1.585 usec -> 10% cpu usage (a bit weird)
* spin_loop_time = 1.580 usec -> 17% cpu usage
* spin_loop_time = 1.570 usec -> 15% cpu usage
* spin_loop_time = 1.560 usec -> 15% cpu usage
* spin_loop_time = 1.550 usec -> 10% cpu usage
* spin_loop_time = 1.40 usec  -> 10% cpu usage
* spin_loop_time = 0.01 usec  -> 10% cpu usage

All of these use the same pkt size of 300B.
In practice I would say this is a step function: a 10-20 nsec difference
is enough to jump from 10-15% to 100% cpu usage. It's incredible.
My spin loop routine probably has a fixed offset of about 150 nsec due to
the clock_gettime(CLOCK_REALTIME) call, but I think what matters here is
the delta, not the absolute numbers. Any small variation in the pkt size,
in the HW of pub & sub, or in the NIC would probably slightly change the
slope of the "step function".

Many thanks for suggesting this test. TCP "inefficiency" was my first
suspect (see my other email thread "Inefficient TCP connection for my
PUB-SUB zmq communication"), but I was expecting a much, much smoother
transition.
Do you think the cpu load of the zmq background thread could be caused by
the much more frequent TCP ACKs coming from the SUB once the "batching"
suddenly stops happening?

I guess the relevant part of the zmq code is this one:

void zmq::stream_engine_base_t::out_event ()
{
    [...]
        _outpos = NULL;
        _outsize = _encoder->encode (&_outpos, 0);

        while (_outsize < static_cast<size_t> (_options.out_batch_size)) {
            if ((this->*_next_msg) (&_tx_msg) == -1)
                break;
            _encoder->load_msg (&_tx_msg);
            unsigned char *bufptr = _outpos + _outsize;
            const size_t n =
              _encoder->encode (&bufptr, _options.out_batch_size - _outsize);
            zmq_assert (n > 0);
            if (_outpos == NULL)
                _outpos = bufptr;
            _outsize += n;
        }
    [...]
    //  If there are any data to write in write buffer, write as much as
    //  possible to the socket. Note that amount of data to write can be
    //  arbitrarily large. However, we assume that underlying TCP layer has
    //  limited transmission buffer and thus the actual number of bytes
    //  written should be reasonably modest.
    const int nbytes = write (_outpos, _outsize); //  this calls
                                                  //  zmq::tcp_write(), which
                                                  //  issues the send() syscall
    [...]
}

Your suggestion is that if the application thread is fast enough (the spin
loop is "short enough"), then the while() loop body is actually executed
2-3-4 times and we send() a large TCP packet, thereby reducing both the
syscall overhead and the number of TCP ACKs from the SUB (and thus the
kernel overhead).
If instead the application thread is not fast enough (the spin loop is
"too long"), then the while() loop body executes only once and we hand my
300B frames one by one to zmq::tcp_write() and the send() syscall. That
would kill the performance of the zmq background thread.
Is that correct?
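
(For scale: if out_batch_size is at its default of 8192 bytes, which I
still have to double-check in options.cpp, the while() loop can pack up
to ~27 of my 300B frames into a single send(); once batching stops, the
same traffic costs up to ~27x more send() syscalls and TCP segments.)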

Now the other $1M question: if that's the case, is there any tuning I can
do to force the zmq background thread to wait for some time before invoking
send()?
I'm thinking I could try replacing the TCP_NODELAY option that is set on
the TCP socket with TCP_CORK and see what happens. That way I'd basically
move to the opposite end of the throughput-vs-latency tradeoff...
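
A minimal sketch of what I mean, assuming Linux and assuming I patch the
spot in libzmq's tcp.cpp where TCP_NODELAY is set today:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* TCP_CORK makes the kernel coalesce partial frames until a full MSS
   accumulates (or ~200 msec pass): the opposite end of the
   throughput-vs-latency tradeoff from TCP_NODELAY. */
static int cork_socket (int fd, int on)
{
    /* on = 1 corks the socket; on = 0 flushes whatever is pending */
    return setsockopt (fd, IPPROTO_TCP, TCP_CORK, &on, sizeof (on));
}
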
Or maybe I could change the libzmq source code to invoke tcp_write() only,
e.g., every N times out_event() is invoked? Though I think I'd risk leaving
some bytes stuck in the stream engine if at some point I stop sending
messages...
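
Just to illustrate the idea, a completely untested fragment that would sit
right before the write() call in out_event(). It also shows the problem: a
real patch would need a flush timer for the idle case, plus rework so the
encode loop keeps filling the pending buffer while the write is deferred.

//  Illustrative only: skip the syscall until the batch buffer is full
//  or N wakeups have passed.
static unsigned s_deferred = 0; //  hypothetical per-engine counter
const unsigned N = 8;           //  hypothetical batching factor

if (_outsize < static_cast<size_t> (_options.out_batch_size)
    && ++s_deferred < N)
    return; //  POLLOUT stays armed, so out_event() will fire again

s_deferred = 0;
const int nbytes = write (_outpos, _outsize);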

Any other suggestion?

Thanks a lot, again!

Francesco


On Thu, Apr 1, 2021 at 2:51 PM Brett Viren <[email protected]>
wrote:

> Francesco <[email protected]> writes:
>
> > So in first scenario the zmq background thread used only 12% of cpu to
> fill 914Mbps ; in second
> > scenario it uses 97% to fill 700Mbps...
> >
> > how's that possible?
>
> This is a pure guess: are you experiencing ZeroMQ's internal Nagle's
> algorithm?
>
> The guess applies if a) your messages are "small enough" and b) you
> send() them "fast enough".  In this case, libzmq will pack multiple
> messages into a single TCP packet.
>
> If this indeed applies, then your spin_loop() scenario is slowing things
> down beyond "fast enough" and libzmq transitions to sending out more
> (and smaller) TCP packets for the same set of messages.  This means more
> work for the I/O thread and that leads to the near 100% CPU usage and
> finally that becomes a throughput bottleneck.
>
> You could test this guess by repeating your two scenarios with two
> different changes:
>
> 1) rerun with progressively faster spin_loop() and measure CPU% and
> throughput to "detect" the Nagel timeout.  If the guess is right, as you
> make the spin_loop() time delay smaller you'll see a transition as
> libzmq goes from packing one message per TCP packet (message rate is
> "too slow") to as many as will fit (message rate is "fast enough").
> Plotting CPU% vs spin_loop() delay time should show a step function or
> maybe a softer sigmoid.  I expect the slope of the transition depends on
> the message size.
>
> Which brings us to:
>
> 2) rerun with dramatically bigger messages, say 10 kB?  (I don't know
> libzmq's Nagle parameter values, but this is bigger than MTU).  If my
> guess is right, then throughput and CPU% will become linearly and weakly
> sensitive to the spin_loop() delay time.
>
>
> Right or wrong, I'd be curious to hear the results!
>
> -Brett.
>