On 17.11.2016 21:13, Eric Blake wrote: > Discard is advisory, so rounding the requests to alignment > boundaries is never semantically wrong from the data that > the guest sees. But at least the Dell Equallogic iSCSI SANs > has an interesting property that its advertised discard > alignment is 15M, yet documents that discarding a sequence > of 1M slices will eventually result in the 15M page being > marked as discarded, and it is possible to observe which > pages have been discarded. > > Between commits 9f1963b and b8d0a980, we converted the block > layer to a byte-based interface that ultimately ignores any > unaligned head or tail based on the driver's advertised > discard granularity, which means that qemu 2.7 refuses to > pass any discard request smaller than 15M down to the Dell > Equallogic hardware. This is a slight regression in behavior > compared to earlier qemu, where a guest executing discards > in power-of-2 chunks used to be able to get every page > discarded, but is now left with various pages still allocated > because the guest requests did not align with the hardware's > 15M pages. > > Since the SCSI specification says nothing about a minimum > discard granularity, and only documents the preferred > alignment, it is best if the block layer gives the driver > every bit of information about discard requests, rather than > rounding it to alignment boundaries early. > > Rework the block layer discard algorithm to mirror the write > zero algorithm: always peel off any unaligned head or tail > and manage that in isolation, then do the bulk of the request > on an aligned boundary. The fallback when the driver returns > -ENOTSUP for an unaligned request is to silently ignore that > portion of the discard request; but for devices that can pass > the partial request all the way down to hardware, this can > result in the hardware coalescing requests and discarding > aligned pages after all. > > Reported by: Peter Lieven <[email protected]> > CC: [email protected] > Signed-off-by: Eric Blake <[email protected]> > > --- > v2: Split at more boundaries. > --- > block/io.c | 45 ++++++++++++++++++++++++++++++++------------- > 1 file changed, 32 insertions(+), 13 deletions(-) > > diff --git a/block/io.c b/block/io.c > index 085ac34..4f00562 100644 > --- a/block/io.c > +++ b/block/io.c > @@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, > int64_t offset, > { > BdrvTrackedRequest req; > int max_pdiscard, ret; > - int head, align; > + int head, tail, align; > > if (!bs->drv) { > return -ENOMEDIUM; > @@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState > *bs, int64_t offset, > return 0; > } > > - /* Discard is advisory, so ignore any unaligned head or tail */ > + /* Discard is advisory, but some devices track and coalesce > + * unaligned requests, so we must pass everything down rather than > + * round here. Still, most devices will just silently ignore > + * unaligned requests (by returning -ENOTSUP), so we must fragment
Returning -ENOTSUP is not quite "silently" in my book.
> + * the request accordingly. */
> align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
> assert(align % bs->bl.request_alignment == 0);
> head = offset % align;
> - if (head) {
> - head = MIN(count, align - head);
> - count -= head;
> - offset += head;
> - }
> - count = QEMU_ALIGN_DOWN(count, align);
> - if (!count) {
> - return 0;
> - }
> + tail = (offset + count) % align;
>
> bdrv_inc_in_flight(bs);
> tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD);
> @@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState
> *bs, int64_t offset,
>
> max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard,
> INT_MAX),
> align);
> - assert(max_pdiscard);
> + assert(max_pdiscard >= bs->bl.request_alignment);
>
> while (count > 0) {
> int ret;
> - int num = MIN(count, max_pdiscard);
> + int num = count;
> +
> + if (head) {
> + /* Make small requests to get to alignment boundaries. */
> + num = MIN(count, align - head);
> + if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
> + num %= bs->bl.request_alignment;
> + }
Could be written as
num = num % bs->bl.request_alignment ?: num;
But that's up to you.
More importantly, is it possible that request_alignment >
pdiscard_alignment? In that case, align would be request_alignment, head
would be less than request_alignment but could be more than
pdiscard_alignment.
Thus, if that is possible, this might create a request longer than
pdiscard_alignment.
> + head = (head + num) % align;
> + assert(num < max_pdiscard);
> + } else if (tail) {
> + if (num > align) {
> + /* Shorten the request to the last aligned cluster. */
> + num -= tail;
> + } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) &&
> + tail > bs->bl.request_alignment) {
> + tail %= bs->bl.request_alignment;
> + num -= tail;
Same here.
Max
> + }
> + }
> + /* limit request size */
> + if (num > max_pdiscard) {
> + num = max_pdiscard;
> + }
>
> if (bs->drv->bdrv_co_pdiscard) {
> ret = bs->drv->bdrv_co_pdiscard(bs, offset, num);
>
signature.asc
Description: OpenPGP digital signature
