On Wed, 3 May 2023, [email protected] wrote:
> On 03/05/2023 10:53 pm, Stefano Stabellini wrote:
> > On Wed, 3 May 2023, Julien Grall wrote:
> >> On 03/05/2023 15:38, [email protected] wrote:
> >>> Hello,
> >>>
> >>> After what seems like an unreasonable amount of debugging, we've tracked
> >>> down exactly what is going wrong here.
> >>>
> >>> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944
> >>>
> >>> Of note is the smoke.serial log around:
> >>>
> >>> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
> >>> obj: CREATE connection 0xffff90fff1f0
> >>> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons
> >>> 00000000, rsp_prod 00000000
> >>> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT
> >>> (@introduceDomain domlist )
> >>>
> >>> XS_INTRODUCE (in C xenstored at least, not checked O yet) always
> >>> clobbers the ring pointers.  The added pressure on dom0 that the
> >>> xensconsoled adds with it's 4M hypercall bounce buffer occasionally
> >>> defers xenstored long enough that the XS_INTRODUCE clobbers the first
> >>> message that dom1 wrote into the ring.
> >>>
> >>> The other behaviour seen was xenstored observing a header looking like 
> >>> this:
> >>>
> >>> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len
> >>> 0x6d726f66 }
> >>>
> >>> which was rejected as being too long.  That's "control/platform" in
> >>> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header
> >>> and writing the payload.
> >>>
> >>>
> >>> Anyway, it is buggy for XS_INTRODUCE to be called on a live an
> >>> unsuspecting connection.  It is ultimately init-dom0less's fault for
> >>> telling dom1 it's good to go before having waited for XS_INTRODUCE to
> >>> complete.
> >> So the problem is xenstored will set interface->connection to
> >> XENSTORE_CONNECTED before finalizing the connection. Caqn you try the
> >> following, for now, very hackish patch:
> >>
> >> diff --git a/tools/xenstore/xenstored_domain.c
> >> b/tools/xenstore/xenstored_domain.c
> >> index f62be2245c42..bbf85bbbea3b 100644
> >> --- a/tools/xenstore/xenstored_domain.c
> >> +++ b/tools/xenstore/xenstored_domain.c
> >> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx,
> >>                 talloc_steal(domain->conn, domain);
> >>
> >>                 if (!restore) {
> >> +                       domain_conn_reset(domain);
> >>                         /* Notify the domain that xenstore is available */
> >>                         interface->connection = XENSTORE_CONNECTED;
> >>                         xenevtchn_notify(xce_handle, domain->port);
> >> @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection 
> >> *conn,
> >>         if (!domain)
> >>                 return errno;
> >>
> >> -       domain_conn_reset(domain);
> >> -
> >>         send_ack(conn, XS_INTRODUCE);
> > Following Jurgen's suggestion, I made this slightly modified version of
> > the patch. With it, the problem is solved:
> >
> > https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/856450703
> 
> This fails to solve 3(?) of the 4(?) bugs pointed out between this email
> thread and on IRC.
> 
> Stop with the bull-in-a-china-shop approach.  There is no acceptable fix
> to this mess which starts with anything other than corrections to the
> documentation, and a plan for how to make startup work robustly given
> all the bugs introduced previously by failing to do it properly the
> first time around.

I am not suggesting this is the fix (I didn't add any Signed-off-by or
commit message on purpose). I think it is useful to know if a
theoretical proposal would work in practice as it helps us understand
the problem. In the little time I had in-between meetings I thought I
could help a bit by providing this update.

Like you, I would appreciate a comprehensive fix which includes a
documentation update.

Genuine question: how would you like to proceed? In the project
management sense of who does what and what is the suggested timeline.

Reply via email to