On Thu, Sep 18, 2025 at 11:47:00AM -0300, Fabiano Rosas wrote:
> Peter Xu <[email protected]> writes:
> 
> > No issue I hit, the change is only from code observation when I am looking
> > at a TLS premature termination issue.
> >
> > qio_channel_tls_bye() API needs to be synchronous.  When it's not, the
> > previous impl will attach an asynchronous task retrying but only until when
> > the channel gets the relevant GIO event. It may be problematic, because the
> > caller of qio_channel_tls_bye() may have invoked channel close() before
> > that, leading to premature termination of the TLS session.
> >
> 
> I'm not super versed on socket APIs, so bear with me: Wouldn't the
> subsequent shutdown() before close() ensure that the io watch gets
> triggered? Assuming we're atomically installing the watch before the
> shutdown() (at the moment, we're not).

I think it won't.

First of all, AFAIU migration_cleanup() must be run in the main thread,
because it can register async tasks like the bye() task, and it registers
against context==NULL (in qio_channel_tls_bye_task(), for example), which I
believe means it'll be registered against the QEMU main loop.

Then, if we do a sequence of this:

  qio_channel_tls_bye()
  shutdown()
  close()

And if we do not yield anywhere within the process, IIUC it means it'll run
in sequence _without_ processing any events on the main loop even if some
events triggered.

So.. I think the close() will see the async task and remove it, never get a
chance to kick it off.

> 
> > Remove the asynchronous handling, instead retry it immediately.  Currently,
> > the only two possible cases that may lead to async task is either INTERRUPT
> > or EAGAIN.  It should be suffice to spin retry as of now, until a solid
> > proof showing that a more complicated retry logic is needed.
> >
> > With that, we can remove the whole async model for the bye task.
> >
> 
> With the bye() being synchronous, do we still have the issue when
> migration fails? I guess it depends on what the answer to my question
> above is...

When migration fails, IMHO it's fine to prematurely terminate the channels,
as I replied to one email that you commented on v1.  But we can discuss, I
am not sure if I missed things alone the lines.

Note, Dan suggested me to change the channel blocking mode as a smaller and
quicker fix, instead of throwing async model away, which seems to be
preferred to keep for any iochannel APIs.  So feel free to ignore this
patch too as of now.  I'll still need to investigate a bit on what would
happen if a concurrent update of fd nonblocking would affect other threads,
though.  In all cases, all results will be reflected in v3, but likely this
patch will be either dropped or replaced.

I know I let you read some of the things that we already planned to throw
away, my apologies. But it's partly your "fault" (to take holidays!). No,
I'm joking. :) It's still good to discuss these.

-- 
Peter Xu


Reply via email to