On Sep 29 15:43, Dmitry Fomichev wrote: > > -----Original Message----- > > From: Qemu-block <qemu-block- > > [email protected]> On Behalf Of Klaus > > Jensen > > Sent: Tuesday, September 29, 2020 6:47 AM > > To: Damien Le Moal <[email protected]> > > Cc: Fam Zheng <[email protected]>; Kevin Wolf <[email protected]>; qemu- > > [email protected]; Niklas Cassel <[email protected]>; Klaus Jensen > > <[email protected]>; [email protected]; Alistair Francis > > <[email protected]>; Keith Busch <[email protected]>; Philippe > > Mathieu-Daudé <[email protected]>; Matias Bjorling > > <[email protected]> > > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types > > and Zoned Namespace Command Set > > > > On Sep 28 22:54, Damien Le Moal wrote: > > > On 2020/09/29 6:25, Keith Busch wrote: > > > > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote: > > > >> On Sep 28 02:33, Dmitry Fomichev wrote: > > > >>> You are making it sound like the entire WDC series relies on this > > approach. > > > >>> Actually, the persistency is introduced in the second to last patch > > > >>> in the > > > >>> series and it only adds a couple of lines of code in the i/o path to > > > >>> mark > > > >>> zones dirty. This is possible because of using mmap() and I find the > > > >>> way > > > >>> it is done to be quite elegant, not ugly :) > > > >>> > > > >> > > > >> No, I understand that your implementation works fine without > > > >> persistance, but persistance is key. That is why my series adds it in > > > >> the first patch. Without persistence it is just a toy. And the QEMU > > > >> device is not just an "NVMe-version" of null_blk. > > > > > > > > I really think we should be a bit more cautious of commiting to an > > > > on-disk format for the persistent state. Both this and Klaus' persistent > > > > state feels a bit ad-hoc, and with all the other knobs provided, it > > > > looks too easy to have out-of-sync states, or just not being able to > > > > boot at all if a qemu versions have different on-disk formats. > > > > > > > > Is anyone really considering zone emulation for production level stuff > > > > anyway? I can't imagine a real scenario where you'd want put yourself > > > > through that: you are just giving yourself all the downsides of a zoned > > > > block device and none of the benefits. AFAIK, this is provided as a > > > > development vehicle, closer to a "toy". > > > > > > > > I think we should consider trimming this down to a more minimal set that > > > > we *do* agree on and commit for inclusion ASAP. We can iterate all the > > > > bells & whistles and flush out the meta data's data marshalling scheme > > > > for persistence later. > > > > > > +1 on this. Removing the persistence also removes the debate on > > endianess. With > > > that out of the way, it should be straightforward to get agreement on a > > series > > > that can be merged quickly to get developers started with testing ZNS > > software > > > with QEMU. That is the most important goal here. 5.9 is around the corner, > > we > > > need something for people to get started with ZNS quickly. > > > > > > > Wait. What. No. Stop! > > > > It is unmistakably clear that you are invalidating my arguments about > > portability and endianness issues by suggesting that we just remove > > persistent state and deal with it later, but persistence is the killer > > feature that sets the QEMU emulated device apart from other emulation > > options. It is not about using emulation in production (because yeah, > > why would you?), but persistence is what makes it possible to develop > > and test "zoned FTLs" or something that requires recovery at power up. > > This is what allows testing of how your host software deals with opened > > zones being transitioned to FULL on power up and the persistent tracking > > of LBA allocation (in my series) can be used to properly test error > > recovery if you lost state in the app. > > > > Please, work with me on this instead of just removing such an essential > > feature. Since persistence seems to be the only thing we are really > > discussing, we should have plenty of time until the soft-freeze to come > > up with a proper solution on that. > > > > I agree that my version had a format that was pretty ad-hoc and that > > won't fly - it needs magic and version capabilities like in Dmitry's > > series, which incidentially looks a lot like what we did in the > > OpenChannel implementation, so I agree with the strategy. > > Are you insinuating that I somehow took stuff from OCSSD code and try > to claim priority this way? I am not at all that familiar with that code. > And I've already sent you the link to tcmu-runner code that served me > as an inspiration for implementing persistence in WDC patchset. > That code has been around for years, uses mmap, works great and has > nothing to do with you. >
No. I am not insinuating anything. The OpenChannel device also used a
blockdev, but, yes, incidentially (and sorry, I should not have used
that word), it looked like how we did it there and I noted that I agreed
with the strategy.
> >
> > ZNS-wise, the only thing my implementation stores is the zone
> > descriptors (in spec-native little-endian format) and the zone
> > descriptor extensions. So there are no endian issues with those. The
> > allocation tracking bitmap is always stored in little endian, but
> > converted to big-endian if running on a big-endian host.
> >
> > Let me just conjure something up.
> >
> > #define NVME_PSTATE_MAGIC ...
> > #define NVME_PSTATE_V1 1
> >
> > typedef struct NvmePstateHeader {
> > uint32_t magic;
> > uint32_t version;
> >
> > uint64_t blk_len;
> >
> > uint8_t lbads;
> > uint8_t iocs;
> >
> > uint8_t rsvd18[3054];
> >
> > struct {
> > uint64_t zsze;
> > uint8_t zdes;
> > } QEMU_PACKED zns;
> >
> > uint8_t rsvd3089[1007];
> > } QEMU_PACKED NvmePstateHeader;
> >
>
> Why conjure something that already exists in WDC patchset? And that part
> has been published in the very first version of our patches, weeks before
> your entire ZNS series was posted. Add an rsvd[] here and there and that code
> can be as suitable to achieve the stated goals as what you have above.
>
Yes, I read your code. I know you have a header and I also noted above
that "it needs magic and version capabilities like in Dmitry's series".
> > series,
> > With such a header we have all we need. We can bail out if any
> > parameters do not match and similar to nvme data structures it contains
> > reserved areas for future use. I'll be posting a v2 with this. If this
> > still feels too ad-hoc, we can be inspired by QCOW2 and the "extension"
> > feature.
> >
> > I can agree that we drop other optional features like zone excursions
> > and the reset/finish recommended limit simulation, but PLEASE DO NOT
> > remove persistence and upstream a half-baked version when we are so
> > close and have time to get it right.
>
> One important thing IMO is to reduce future need for metadata versioning.
> This requires a really good effort to design and review the proper metadata
> format that would become stable for some time. Think about portability.
> To get out something "conjured up" now and then have to move to V2
> metadata in the next release is even worse than no persistence at all.
> So maybe is makes sense to go with Keith's suggestion.
As I've said, we have time until the soft-freeze to get this right. I
"conjured" something up to show a point. The reason we review and
iterate is to NOT upstream something thats is conjured up.
But we gotta start somewhere, no? So what is so bad about me posting a
v2?
signature.asc
Description: PGP signature
