On Fri, Sep 12, 2025 at 12:27:52PM +0100, Daniel P. Berrangé wrote: > On Thu, Sep 11, 2025 at 05:23:54PM -0400, Peter Xu wrote: > > No issue I hit, the change is only from code observation when I am looking > > at a TLS premature termination issue. > > > > qio_channel_tls_bye() API needs to be synchronous. When it's not, the > > previous impl will attach an asynchronous task retrying but only until when > > the channel gets the relevant GIO event. It may be problematic, because the > > caller of qio_channel_tls_bye() may have invoked channel close() before > > that, leading to premature termination of the TLS session. > > > > Remove the asynchronous handling, instead retry it immediately. Currently, > > the only two possible cases that may lead to async task is either INTERRUPT > > or EAGAIN. It should be suffice to spin retry as of now, until a solid > > proof showing that a more complicated retry logic is needed. > > > > With that, we can remove the whole async model for the bye task. > > > > When at it, making the function return bool, which looks like a common > > pattern in QEMU when errp is used. > > > > Side note on the tracepoints: previously the tracepoint bye_complete() > > isn't used. Start to use it in this patch. bye_pending() and bye_cancel() > > can be dropped now. Adding bye_retry() instead. > > > > Signed-off-by: Peter Xu <[email protected]> > > --- > > include/io/channel-tls.h | 5 ++- > > io/channel-tls.c | 86 +++++----------------------------------- > > io/trace-events | 3 +- > > 3 files changed, 15 insertions(+), 79 deletions(-) > > > > diff --git a/include/io/channel-tls.h b/include/io/channel-tls.h > > index 7e9023570d..bcd14ffbd6 100644 > > --- a/include/io/channel-tls.h > > +++ b/include/io/channel-tls.h > > @@ -49,7 +49,6 @@ struct QIOChannelTLS { > > QCryptoTLSSession *session; > > QIOChannelShutdown shutdown; > > guint hs_ioc_tag; > > - guint bye_ioc_tag; > > }; > > > > /** > > @@ -60,8 +59,10 @@ struct QIOChannelTLS { > > * Perform the TLS session termination. This method will return > > * immediately and the termination will continue in the background, > > * provided the main loop is running. > > + * > > + * Returns: true on success, false on error (with errp set) > > */ > > -void qio_channel_tls_bye(QIOChannelTLS *ioc, Error **errp); > > +bool qio_channel_tls_bye(QIOChannelTLS *ioc, Error **errp); > > > > /** > > * qio_channel_tls_new_server: > > diff --git a/io/channel-tls.c b/io/channel-tls.c > > index 5a2c8188ce..8510a187a8 100644 > > --- a/io/channel-tls.c > > +++ b/io/channel-tls.c > > @@ -253,84 +253,25 @@ void qio_channel_tls_handshake(QIOChannelTLS *ioc, > > qio_channel_tls_handshake_task(ioc, task, context); > > } > > > > -static gboolean qio_channel_tls_bye_io(QIOChannel *ioc, GIOCondition > > condition, > > - gpointer user_data); > > - > > -static void qio_channel_tls_bye_task(QIOChannelTLS *ioc, QIOTask *task, > > - GMainContext *context) > > +bool qio_channel_tls_bye(QIOChannelTLS *ioc, Error **errp) > > { > > - GIOCondition condition; > > - QIOChannelTLSData *data; > > int status; > > - Error *err = NULL; > > > > - status = qcrypto_tls_session_bye(ioc->session, &err); > > + trace_qio_channel_tls_bye_start(ioc); > > +retry: > > + status = qcrypto_tls_session_bye(ioc->session, errp); > > > > if (status < 0) { > > trace_qio_channel_tls_bye_fail(ioc); > > snip > > > + return false; > > + } else if (status != QCRYPTO_TLS_BYE_COMPLETE) { > > + /* BYE event must be synchronous, retry immediately */ > > + trace_qio_channel_tls_bye_retry(ioc, status); > > + goto retry; > > } > > We cannot do this. If the gnutls_bye() API needs to perform > socket I/O, and so when we're running over a non-blocking > socket we must expect EAGAIN. With this handling, QEMU will > busy loop burning 100% CPU when the socket is not ready.
Right. That was the plan when drafting this, the hope is spinning will almost not happen, and even if it happens it'll finish soon (migration is completing, it means network mustn't be so bad), unless the network is stuck exactly at when we send the bye(). > > A second point is that from a QIOChannel POV, we need to > ensure that all APIs can be used in a non-blocking scenario. > This is why in the QIOChannelSocket impl connect/listen APIs > we provide both _sync and _async variants of the APIs, or > in the QIOChannelTLS impl, the handshake API is always > async with a callback to be invokved on completion. I agree. The issue is if so, migration code needs to be always be prepared with a possible async op even if in 99.9999% cases it won't happen... we need to complicate the multifd logic a lot for this, but the gain is little.. This series still used patch 1 to fix the problem (rather than do real BYE on preempt channels, for example) only because it's the easiest, after all it's still a contract in tls channel impl to allow premature termination for explicit shutdown()s on the host. If we want to do 100% graceful shutdowns, we'll need to apply this to all channels, and the async-possible model can definitely add more complexity more than multifd. I hope it won't be necessary.. but just to mention it. > > The QIOChanel 'bye' method is flawed in that it is > asynchronous, but has no callback for completion. > > If migration is /always/ using a blocking socket for the > TLS channels this isn't a problem as gnutls will complete > immediately, but if any async sockets are used we have no > way to wait for completion. This requires improving the > API design in some manner. I recall one of your future series on TLS would start to enable async for all channels? In all cases, we definitely don't want to have this call to be relevant to the blocking mode of the channels. Would it make sense to introduce a _sync() version of it, but keep the original bye(), leaving the rest until a real async user appears? I can also at least drop this patch as of now, because we can still wish it almost always be synchronous. However we have risk forgetting that forever and hit it a few years later.. Thanks, -- Peter Xu
