On 07/02/2017 21:14, Eric Blake wrote:
> On 02/03/2017 09:47 AM, Vladimir Sementsov-Ogievskiy wrote:
>> Minimal implementation: always send DF flag, to not deal with fragmented
>> replies.
>
> This works well with your minimal server implementation, but I worry
> that it will cause us to fall over when talking to a fully-compliant
> server that chooses to send EOVERFLOW errors for any request larger than
> 64k when DF is set; it also makes it impossible to benefit from sparse
> reads. I guess that means we need to start thinking about followup
> patches to flesh out our implementation. But maybe I can live with this
> patch as is, since the goal of your series was not so much the full
> power of structured reads, but getting to a point where we could use
> structured reply for block status, even if it means your client can only
> communicate with qemu-nbd as server for now, as long as we do get to the
> rest of the patches for a full-blown structured read.
Can you post a diff that expresses this as a comment? I'll squash the
comment into this commit.

>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
>> ---
>>  block/nbd-client.c  |  47 +++++++++++----
>>  block/nbd-client.h  |   2 +
>>  include/block/nbd.h |  15 +++--
>>  nbd/client.c        | 170 ++++++++++++++++++++++++++++++++++++++++++++++------
>>  qemu-nbd.c          |   2 +-
>>  5 files changed, 203 insertions(+), 33 deletions(-)
>
> Hmm - no change to the testsuite. Structured reads seem like the sort
> of thing that it would be nice to test with some canned server replies,
> particularly with server behavior that is permitted by the NBD protocol
> but does not happen by default in qemu-nbd.

This would require implementing some kind of mock server support. That
would be a very good thing, but not something we have much infrastructure
for (you could use either a real socket or a mock QIOChannel).

Thanks,

Paolo

>>
>> diff --git a/block/nbd-client.c b/block/nbd-client.c
>> index 3779c6c999..ff96bd1635 100644
>> --- a/block/nbd-client.c
>> +++ b/block/nbd-client.c
>> @@ -180,13 +180,20 @@ static void nbd_co_receive_reply(NBDClientSession *s,
>>      *reply = s->reply;
>>      if (reply->handle != request->handle ||
>>          !s->ioc) {
>> +        reply->simple = true;
>>          reply->error = EIO;
>
> I don't think this is quite right - by setting reply->simple to true,
> you are forcing the caller to treat this as the final packet related to
> this request->handle, even though that might not be the case.
>
> As it is, I wonder if this code is correct, even before your patch - the
> server is allowed to give responses out-of-order (if we request multiple
> reads without waiting for the first response) - I don't see how it is
> correct to set reply->error to EIO when the handle mismatch merely
> indicates that we are receiving an out-of-order response to some other
> packet, while our own request is still awaiting traffic.
>
>>      } else {
>> -        if (qiov && reply->error == 0) {
>> -            ret = nbd_wr_syncv(s->ioc, qiov->iov, qiov->niov, request->len,
>> -                               true);
>> -            if (ret != request->len) {
>> -                reply->error = EIO;
>> +        if (qiov) {
>> +            if ((reply->simple ? reply->error == 0 :
>> +                 reply->type == NBD_REPLY_TYPE_OFFSET_DATA)) {
>> +                ret = nbd_wr_syncv(s->ioc, qiov->iov, qiov->niov, request->len,
>> +                                   true);
>
> This works only because you used the DF flag. If we allow fragmenting,
> then you have to be careful to write the reply into the correct offset
> of the iov.
>
>> +                if (ret != request->len) {
>> +                    reply->error = EIO;
>> +                }
>> +            } else if (!reply->simple &&
>> +                       reply->type == NBD_REPLY_TYPE_OFFSET_HOLE) {
>> +                qemu_iovec_memset(qiov, 0, 0, request->len);
>>              }
>
> Up to here, you didn't do any inspection for NBD_REPLY_FLAG_DONE (so you
> don't know if this is the last packet the server is sending for this
> request->handle), and didn't do any special casing for
> NBD_REPLY_TYPE_NONE or for the various error replies. I'm not sure if
> this will always do what you want. In fact, I'm not even sure if
> reply->error is set correctly for all structured packets.
>
>>      }
>>
>> @@ -227,6 +234,7 @@ int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
>>          .type = NBD_CMD_READ,
>>          .from = offset,
>>          .len = bytes,
>> +        .flags = client->structured_reply ? NBD_CMD_FLAG_DF : 0,
>>      };
>>      NBDReply reply;
>>      ssize_t ret;
>> @@ -237,12 +245,30 @@ int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
>>      nbd_coroutine_start(client, &request);
>>      ret = nbd_co_send_request(bs, &request, NULL);
>>      if (ret < 0) {
>> -        reply.error = -ret;
>> -    } else {
>> -        nbd_co_receive_reply(client, &request, &reply, qiov);
>> +        goto out;
>>      }
>> +
>> +    nbd_co_receive_reply(client, &request, &reply, qiov);
>> +    if (reply.error != 0) {
>> +        ret = -reply.error;
>> +    }
>> +
>> +    if (!reply.simple) {
>> +        while (!(reply.flags & NBD_REPLY_FLAG_DONE)) {
>> +            nbd_co_receive_reply(client, &request, &reply, qiov);
>> +            if (reply.error != 0) {
>> +                ret = -reply.error;
>> +            }
>> +            if (reply.simple) {
>
> Hmm. It looks like this part of the loop is only triggered if
> nbd_co_receive_reply() detects a handle mismatch and slams reply.simple
> to true. As long as we use the DF flag, it looks like the server should
> never send an error packet followed by a data packet, and your
> particular server implementation always set the DONE flag on the error
> packet, so it got past your testing. But if we don't rely on the DF
> flag, a server could reasonably send an ERROR_OFFSET packet for half the
> buffer, followed by a data packet for the other half of the buffer,
> which may wipe out reply.error from the error packet.
>
>> +                ret = -EIO;
>> +                goto out;
>> +            }
>> +        }
>> +    }
>> +
>> +out:
>>      nbd_coroutine_end(client, &request);
>> -    return -reply.error;
>> +    return ret;
>>  }
>>
>>  int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
>> @@ -408,7 +434,8 @@ int nbd_client_init(BlockDriverState *bs,
>>                                  &client->nbdflags,
>>                                  tlscreds, hostname,
>>                                  &client->ioc,
>> -                                &client->size, errp);
>> +                                &client->size,
>> +                                &client->structured_reply, errp);
>>      if (ret < 0) {
>>          logout("Failed to negotiate with the NBD server\n");
>>          return ret;
>> diff --git a/block/nbd-client.h b/block/nbd-client.h
>> index f8d6006849..cba1f965bf 100644
>> --- a/block/nbd-client.h
>> +++ b/block/nbd-client.h
>> @@ -32,6 +32,8 @@ typedef struct NBDClientSession {
>>      NBDReply reply;
>>
>>      bool is_unix;
>> +
>> +    bool structured_reply;
>>  } NBDClientSession;
>>
>>  NBDClientSession *nbd_get_client_session(BlockDriverState *bs);
>> diff --git a/include/block/nbd.h b/include/block/nbd.h
>> index 58b864f145..dae2e4bd03 100644
>> --- a/include/block/nbd.h
>> +++ b/include/block/nbd.h
>
> Can you add the use of an order file to list .h files first in your
> diffs? See
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00288.html for
> tips.
>
>> @@ -57,11 +57,16 @@ struct NBDRequest {
>>  };
>>  typedef struct NBDRequest NBDRequest;
>>
>> -struct NBDReply {
>> +typedef struct NBDReply {
>> +    bool simple;
>>      uint64_t handle;
>>      uint32_t error;
>> -};
>> -typedef struct NBDReply NBDReply;
>> +
>> +    uint16_t flags;
>> +    uint16_t type;
>> +    uint32_t length;
>> +    uint64_t offset;
>> +} NBDReply;
>
> I don't know if this is the best way to represent things; I might have
> used a union type, since not all fields are valid in all reply packets.
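For what it's worth, a union-based layout along those lines could look
like the sketch below. It is only an illustration of the idea, not
something this patch has to adopt; the field names mirror the ones
already used in the series, but the union member names (u, simple_reply,
structured) are made up:

typedef struct NBDReply {
    bool simple;                /* simple reply vs. structured chunk */
    uint64_t handle;
    union {
        struct {
            uint32_t error;
        } simple_reply;         /* valid when simple == true */
        struct {
            uint16_t flags;     /* NBD_REPLY_FLAG_* */
            uint16_t type;      /* NBD_REPLY_TYPE_* */
            uint32_t error;     /* meaningful only for error chunk types */
            uint32_t length;    /* payload bytes not yet consumed */
            uint64_t offset;    /* only for OFFSET_DATA / OFFSET_HOLE */
        } structured;           /* valid when simple == false */
    } u;
} NBDReply;

The downside is that code which today just checks reply->error would have
to look at reply->simple first, which may be exactly the churn this patch
is trying to avoid.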
>
>>  struct NBDSimpleReply {
>>      /* uint32_t NBD_SIMPLE_REPLY_MAGIC */
>> @@ -169,10 +174,10 @@ ssize_t nbd_wr_syncv(QIOChannel *ioc,
>>  int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
>>                            QCryptoTLSCreds *tlscreds, const char *hostname,
>>                            QIOChannel **outioc,
>> -                          off_t *size, Error **errp);
>> +                          off_t *size, bool *structured_reply, Error **errp);
>>  int nbd_init(int fd, QIOChannelSocket *sioc, uint16_t flags, off_t size);
>>  ssize_t nbd_send_request(QIOChannel *ioc, NBDRequest *request);
>> -ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply);
>> +int nbd_receive_reply(QIOChannel *ioc, NBDReply *reply);
>>  int nbd_client(int fd);
>>  int nbd_disconnect(int fd);
>>
>> diff --git a/nbd/client.c b/nbd/client.c
>> index 1c274f3012..9225f7e30d 100644
>> --- a/nbd/client.c
>> +++ b/nbd/client.c
>> @@ -472,11 +472,10 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc,
>>      return QIO_CHANNEL(tioc);
>>  }
>>
>> -
>>  int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
>>                            QCryptoTLSCreds *tlscreds, const char *hostname,
>>                            QIOChannel **outioc,
>> -                          off_t *size, Error **errp)
>> +                          off_t *size, bool *structured_reply, Error **errp)
>>  {
>>      char buf[256];
>>      uint64_t magic, s;
>> @@ -584,6 +583,12 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
>>          if (nbd_receive_query_exports(ioc, name, errp) < 0) {
>>              goto fail;
>>          }
>> +
>> +        if (structured_reply != NULL) {
>> +            *structured_reply =
>> +                nbd_receive_simple_option(ioc, NBD_OPT_STRUCTURED_REPLY,
>> +                                          false, NULL) == 0;
>
> Okay, you're allowing the server to reject the option, in which case we
> set structured_reply to false. But re-reading patch 4/18,
> nbd_receive_simple_option() can return -1 for multiple reasons, some of
> them where it is still in sync (for sending more options), but others
> where it is out of sync (such as failure to write, in which case the
> connection MUST be dropped rather than trying to carry on). I don't
> think this handles errors correctly, and therefore I'm not even sure
> that the refactoring in 4/18 is correct.
>
> I think you may be better off with nbd_receive_simple_option() in 4/18
> being tri-state: return -1 if the connection is unrecoverable (such as
> after a write or read error, where we must not send or receive any more
> data), 0 if the server replied with an error but the connection is still
> in sync for trying something else, and 1 if the server replies with
> success. Then this code should check if the return is < 0 (kill
> negotiation), == 0 (*structured_reply = false), or == 1
> (*structured_reply = true).
>
>> +        }
>>      }
>>      /* write the export name request */
>>      if (nbd_send_option_request(ioc, NBD_OPT_EXPORT_NAME, -1, name,
>> @@ -603,6 +608,14 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t *flags,
>>              goto fail;
>>          }
>>          be16_to_cpus(flags);
>> +
>> +        if (!!structured_reply && *structured_reply &&
>
> Why do you need !! to coerce to bool, when && also coerces to bool?
>
>> +            !(*flags & NBD_CMD_FLAG_DF))
>> +        {
>> +            error_setg(errp, "Structured reply is negotiated, "
>> +                             "but DF flag is not.");
>
> No trailing '.' for error_setg() messages.
>
> Also, I'm not quite sure the NBD protocol allows an implementation that
> supports structured read but does not support the DF flag. Maybe that's
> an NBD spec bug that we should get clarified.
> (Ideally, the server always advertises the DF flag if structured
> replies are negotiated, because the common implementation of user-space
> handshake followed by kernel transmission phase works best when the
> already-existing ioctl(NBD_SET_FLAGS) can then be used to tell the
> kernel to use/expect structured replies.)
>
>> +            goto fail;
>> +        }
>>      } else if (magic == NBD_CLIENT_MAGIC) {
>>          uint32_t oldflags;
>>
>> @@ -790,20 +803,33 @@ ssize_t nbd_send_request(QIOChannel *ioc, NBDRequest *request)
>>      return 0;
>>  }
>>
>> -ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply)
>> +static inline int read_sync_check(QIOChannel *ioc, void *buffer, size_t size)
>>  {
>> -    uint8_t buf[NBD_REPLY_SIZE];
>> -    uint32_t magic;
>>      ssize_t ret;
>>
>> -    ret = read_sync(ioc, buf, sizeof(buf));
>> +    ret = read_sync(ioc, buffer, size);
>>      if (ret < 0) {
>>          return ret;
>>      }
>> -
>> -    if (ret != sizeof(buf)) {
>> +    if (ret != size) {
>>          LOG("read failed");
>> -        return -EINVAL;
>> +        return -EIO;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/* nbd_receive_simple_reply
>> + * Read simple reply except magic field (which should be already read)
>> + */
>> +static int nbd_receive_simple_reply(QIOChannel *ioc, NBDReply *reply)
>> +{
>> +    uint8_t buf[NBD_REPLY_SIZE - 4];
>> +    ssize_t ret;
>> +
>> +    ret = read_sync_check(ioc, buf, sizeof(buf));
>
> This patch is fairly long; I might have split the creation and initial
> use of the read_sync_check() function into its own patch.
>
>> +    if (ret < 0) {
>> +        return ret;
>>      }
>>
>>      /* Reply
>> @@ -812,9 +838,124 @@ ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply)
>>         [ 7 .. 15]    handle
>>      */
>>
>> -    magic = ldl_be_p(buf);
>> -    reply->error = ldl_be_p(buf + 4);
>> -    reply->handle = ldq_be_p(buf + 8);
>> +    reply->error = ldl_be_p(buf);
>> +    reply->handle = ldq_be_p(buf + 4);
>> +
>> +    return 0;
>> +}
>> +
>> +/* nbd_receive_structured_reply_chunk
>> + * Read structured reply chunk except magic field (which should be already read)
>> + * Data for NBD_REPLY_TYPE_OFFSET_DATA is not read too.
>> + * Length field of reply out parameter corresponds to unread part of reply.
>
> Awkward wording; maybe:
>
> Read the header portion of a structured reply chunk (except for magic,
> which should already be read). On success, the length field of reply
> corresponds to the number of bytes in the reply that still need to be
> read.
>
>> + */
>> +static int nbd_receive_structured_reply_chunk(QIOChannel *ioc, NBDReply *reply)
>> +{
>> +    NBDStructuredReplyChunk chunk;
>> +    ssize_t ret;
>> +    uint16_t message_size;
>> +
>> +    ret = read_sync_check(ioc, (uint8_t *)&chunk + sizeof(chunk.magic),
>
> This looks like some ugly pointer math. I wonder if you'd be better
> served by having a separate packed struct without the magic field,
> particularly since you have to read the magic in first to decide whether
> to dispatch to simple/structured for the rest of the structure. In
> other words, note how patch 2/18 omits a uint32_t magic in
> NBDSimpleReply; maybe patch 3/18 could do the same with
> NBDStructuredReplyChunk. Then again, that would impact how you code
> things for subclasses like NBDStructuredRead, so I don't know how much
> churn to request here.
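Just to illustrate that suggestion, a magic-less chunk header could be
declared roughly as below. This is only a sketch: the field layout follows
the structured reply header in the NBD spec, QEMU_PACKED is the usual qemu
attribute macro, and I have not checked how much of the rest of the series
such a change would touch:

typedef struct QEMU_PACKED NBDStructuredReplyChunk {
    /* uint32_t NBD_STRUCTURED_REPLY_MAGIC is read separately, before we
     * decide between simple and structured parsing */
    uint16_t flags;             /* combination of NBD_REPLY_FLAG_* */
    uint16_t type;              /* NBD_REPLY_TYPE_* */
    uint64_t handle;            /* handle of the request this chunk answers */
    uint32_t length;            /* length of the payload that follows */
} NBDStructuredReplyChunk;

With that, nbd_receive_structured_reply_chunk() could read the whole
header with read_sync_check(ioc, &chunk, sizeof(chunk)) once the magic has
been consumed, and the sizeof(chunk.magic) arithmetic would disappear.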
>
>> +                          sizeof(chunk) - sizeof(chunk.magic));
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    reply->flags = be16_to_cpu(chunk.flags);
>> +    reply->type = be16_to_cpu(chunk.type);
>> +    reply->handle = be64_to_cpu(chunk.handle);
>> +    reply->length = be32_to_cpu(chunk.length);
>> +
>> +    switch (reply->type) {
>> +    case NBD_REPLY_TYPE_NONE:
>> +        break;
>> +    case NBD_REPLY_TYPE_OFFSET_DATA:
>> +    case NBD_REPLY_TYPE_OFFSET_HOLE:
>> +        ret = read_sync_check(ioc, &reply->offset, sizeof(reply->offset));
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +        be64_to_cpus(&reply->offset);
>> +        reply->length -= sizeof(reply->offset);
>> +        break;
>> +    case NBD_REPLY_TYPE_ERROR:
>> +    case NBD_REPLY_TYPE_ERROR_OFFSET:
>
> Here, I think that you want:
>
> default:
>     if (reply->type & 0x8000) {
>
> rather than specific NBD_REPLY_TYPE_ERROR labels, so that you can
> gracefully handle ALL error packets sent by a server even if you haven't
> explicitly coded for them (a good server should probably not be sending
> an error packet you don't recognize, but the protocol also went to good
> efforts to describe a common behavior of all error packets).
>
> /me reads on...
>
>> +        ret = read_sync_check(ioc, &reply->error, sizeof(reply->error));
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +        be32_to_cpus(&reply->error);
>> +
>> +        ret = read_sync_check(ioc, &message_size, sizeof(message_size));
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +        be16_to_cpus(&message_size);
>> +
>> +        if (message_size > 0) {
>> +            /* TODO: provide error message to user */
>> +            ret = drop_sync(ioc, message_size);
>> +            if (ret < 0) {
>> +                return ret;
>> +            }
>> +        }
>> +
>> +        if (reply->type == NBD_REPLY_TYPE_ERROR_OFFSET) {
>> +            /* drop 64bit offset */
>> +            ret = drop_sync(ioc, 8);
>> +            if (ret < 0) {
>> +                return ret;
>> +            }
>> +        }
>> +        break;
>> +    default:
>> +        if (reply->type & (1 << 15)) {
>
> ...oh, you did, mostly.
>
>> +            /* unknown error */
>> +            ret = drop_sync(ioc, reply->length);
>
> Still, even if it is an unknown error type, you should be able to
> populate reply->error rather than ignoring it.
>
>> +            if (ret < 0) {
>> +                return ret;
>> +            }
>> +
>> +            reply->error = NBD_EINVAL;
>
> Meanwhile, you should also probably ensure that reply->error is non-zero
> even if the server was buggy and sent an error flag without telling you
> an error value (your choice of EINVAL seems reasonable).
>
>> +            reply->length = 0;
>> +        } else {
>> +            /* unknown non-error reply type */
>> +            return -EINVAL;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int nbd_receive_reply(QIOChannel *ioc, NBDReply *reply)
>> +{
>> +    uint32_t magic;
>> +    int ret;
>> +
>> +    ret = read_sync_check(ioc, &magic, sizeof(magic));
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    be32_to_cpus(&magic);
>> +
>> +    switch (magic) {
>> +    case NBD_SIMPLE_REPLY_MAGIC:
>> +        reply->simple = true;
>> +        ret = nbd_receive_simple_reply(ioc, reply);
>> +        break;
>> +    case NBD_STRUCTURED_REPLY_MAGIC:
>> +        reply->simple = false;
>> +        ret = nbd_receive_structured_reply_chunk(ioc, reply);
>> +        break;
>> +    default:
>> +        LOG("invalid magic (got 0x%" PRIx32 ")", magic);
>> +        return -EINVAL;
>
> This part looks good.
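Going back to the error chunks above: putting Eric's suggestions together,
all error types (known and unknown) could funnel through one helper,
roughly like the sketch below. nbd_receive_error_chunk is a made-up name;
read_sync_check(), drop_sync() and NBD_EINVAL are the helpers this patch
already uses, and reply->length is assumed to still hold the full chunk
payload length when the helper is entered:

static int nbd_receive_error_chunk(QIOChannel *ioc, NBDReply *reply)
{
    uint16_t message_size;
    uint32_t remaining;
    ssize_t ret;

    /* Every error chunk type starts with a 32-bit error value and a
     * 16-bit human-readable message length. */
    ret = read_sync_check(ioc, &reply->error, sizeof(reply->error));
    if (ret < 0) {
        return ret;
    }
    be32_to_cpus(&reply->error);
    if (reply->error == 0) {
        /* Buggy server: an error chunk must carry a non-zero error */
        reply->error = NBD_EINVAL;
    }

    ret = read_sync_check(ioc, &message_size, sizeof(message_size));
    if (ret < 0) {
        return ret;
    }
    be16_to_cpus(&message_size);

    if (reply->length <
        sizeof(reply->error) + sizeof(message_size) + message_size) {
        return -EINVAL;
    }

    /* TODO: surface the message to the user instead of dropping it */
    if (message_size > 0) {
        ret = drop_sync(ioc, message_size);
        if (ret < 0) {
            return ret;
        }
    }

    /* Skip any type-specific tail we do not know how to interpret,
     * e.g. the 64-bit offset of NBD_REPLY_TYPE_ERROR_OFFSET. */
    remaining = reply->length - sizeof(reply->error)
                - sizeof(message_size) - message_size;
    if (remaining > 0) {
        ret = drop_sync(ioc, remaining);
        if (ret < 0) {
            return ret;
        }
    }
    reply->length = 0;

    return 0;
}

The NBD_REPLY_TYPE_ERROR/NBD_REPLY_TYPE_ERROR_OFFSET cases and the
unknown-error branch would then collapse into the single
"default: if (reply->type & (1 << 15))" path of the switch.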
>
>> +    }
>> +
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>>
>>      reply->error = nbd_errno_to_system_errno(reply->error);
>>
>> @@ -827,10 +968,5 @@ ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply)
>>             ", handle = %" PRIu64" }",
>>             magic, reply->error, reply->handle);
>>
>> -    if (magic != NBD_SIMPLE_REPLY_MAGIC) {
>> -        LOG("invalid magic (got 0x%" PRIx32 ")", magic);
>> -        return -EINVAL;
>> -    }
>>      return 0;
>>  }
>> -
>> diff --git a/qemu-nbd.c b/qemu-nbd.c
>
> Why are you deleting a blank line from the end of the file? Is that
> rebase churn that should have been avoided in an earlier patch? </me
> goes and reads> - oh, it's been like that since commit 798bfe000 created
> the file. Thanks for cleaning it up (and I'm surprised checkpatch
> doesn't flag it).
>
>> index c734f627b4..de0099e333 100644
>> --- a/qemu-nbd.c
>> +++ b/qemu-nbd.c
>> @@ -272,7 +272,7 @@ static void *nbd_client_thread(void *arg)
>>
>>      ret = nbd_receive_negotiate(QIO_CHANNEL(sioc), NULL, &nbdflags,
>>                                  NULL, NULL, NULL,
>> -                                &size, &local_error);
>> +                                &size, NULL, &local_error);
>
> We could reasonably allow new-style structured reads when we're handling
> the entire client side; but I agree that when qemu-nbd is used as the
> handshaking phase before handing things over to the kernel that we can't
> request structured reads, at least not until a) the kernel nbd module
> implements structured read support, and b) we have a way to tell that
> the kernel is willing to accept structured read negotiation (and that's
> where my earlier comments about the DF flag being a witness of
> structured reads come into play).
>
> Umm, how does this patch compile? You changed the signature of
> nbd_receive_negotiate() to add a parameter, but did not modify the call
> in block/nbd-client.c to pass the new parameter. (So far, I've been
> reviewing based just on email content; I also plan on applying and
> testing your patches before all is said and done, but I sometimes
> surprise myself with my ability to do a quality read-only review even
> without spending time compiling things.)
>
>>      if (ret < 0) {
>>          if (local_error) {
>>              error_report_err(local_error);
>>
>
> We'll definitely need a v2, but I'm grateful that you've tackled this work.