On Mon, Nov 19, 2018 at 12:48:01PM +0200, Leon Romanovsky wrote:
> Date: Mon, 19 Nov 2018 12:48:01 +0200
> From: Leon Romanovsky <l...@kernel.org>
> To: Kenneth Lee <liguo...@hisilicon.com>
> CC: Tim Sell <timothy.s...@unisys.com>, linux-...@vger.kernel.org,
>  Alexander Shishkin <alexander.shish...@linux.intel.com>, Zaibo Xu
>  <xuza...@huawei.com>, zhangfei....@foxmail.com, linux...@huawei.com,
>  haojian.zhu...@linaro.org, Christoph Lameter <c...@linux.com>, Hao Fang
>  <fangha...@huawei.com>, Gavin Schenk <g.sch...@eckelmann.de>, RDMA mailing
>  list <linux-r...@vger.kernel.org>, Vinod Koul <vk...@kernel.org>, Jason
>  Gunthorpe <j...@ziepe.ca>, Doug Ledford <dledf...@redhat.com>, Uwe
>  Kleine-König <u.kleine-koe...@pengutronix.de>, David Kershner
>  <david.kersh...@unisys.com>, Kenneth Lee <nek.in...@gmail.com>, Johan
>  Hovold <jo...@kernel.org>, Cyrille Pitchen
>  <cyrille.pitc...@free-electrons.com>, Sagar Dharia
>  <sdha...@codeaurora.org>, Jens Axboe <ax...@kernel.dk>,
>  guodong...@linaro.org, linux-netdev <net...@vger.kernel.org>, Randy Dunlap
>  <rdun...@infradead.org>, linux-ker...@vger.kernel.org, Zhou Wang
>  <wangzh...@hisilicon.com>, linux-crypto@vger.kernel.org, Philippe
>  Ombredanne <pombreda...@nexb.com>, Sanyog Kale <sanyog.r.k...@intel.com>,
>  "David S. Miller" <da...@davemloft.net>,
>  linux-accelerat...@lists.ozlabs.org, Jerome Glisse <jgli...@redhat.com>
> Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
> User-Agent: Mutt/1.10.1 (2018-07-13)
> Message-ID: <20181119104801.gf8...@mtr-leonro.mtl.com>
> 
> On Mon, Nov 19, 2018 at 05:19:10PM +0800, Kenneth Lee wrote:
> > On Mon, Nov 19, 2018 at 05:14:05PM +0800, Kenneth Lee wrote:
> > > Date: Mon, 19 Nov 2018 17:14:05 +0800
> > > From: Kenneth Lee <liguo...@hisilicon.com>
> > > To: Leon Romanovsky <l...@kernel.org>
> > > CC: Tim Sell <timothy.s...@unisys.com>, linux-...@vger.kernel.org,
> > >  Alexander Shishkin <alexander.shish...@linux.intel.com>, Zaibo Xu
> > >  <xuza...@huawei.com>, zhangfei....@foxmail.com, linux...@huawei.com,
> > >  haojian.zhu...@linaro.org, Christoph Lameter <c...@linux.com>, Hao Fang
> > >  <fangha...@huawei.com>, Gavin Schenk <g.sch...@eckelmann.de>, RDMA 
> > > mailing
> > >  list <linux-r...@vger.kernel.org>, Vinod Koul <vk...@kernel.org>, Jason
> > >  Gunthorpe <j...@ziepe.ca>, Doug Ledford <dledf...@redhat.com>, Uwe
> > >  Kleine-König <u.kleine-koe...@pengutronix.de>, David Kershner
> > >  <david.kersh...@unisys.com>, Kenneth Lee <nek.in...@gmail.com>, Johan
> > >  Hovold <jo...@kernel.org>, Cyrille Pitchen
> > >  <cyrille.pitc...@free-electrons.com>, Sagar Dharia
> > >  <sdha...@codeaurora.org>, Jens Axboe <ax...@kernel.dk>,
> > >  guodong...@linaro.org, linux-netdev <net...@vger.kernel.org>, Randy 
> > > Dunlap
> > >  <rdun...@infradead.org>, linux-ker...@vger.kernel.org, Zhou Wang
> > >  <wangzh...@hisilicon.com>, linux-crypto@vger.kernel.org, Philippe
> > >  Ombredanne <pombreda...@nexb.com>, Sanyog Kale <sanyog.r.k...@intel.com>,
> > >  "David S. Miller" <da...@davemloft.net>,
> > >  linux-accelerat...@lists.ozlabs.org
> > > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
> > > User-Agent: Mutt/1.5.21 (2010-09-15)
> > > Message-ID: <20181119091405.GE157308@Turing-Arch-b>
> > >
> > > On Thu, Nov 15, 2018 at 04:54:55PM +0200, Leon Romanovsky wrote:
> > > > Date: Thu, 15 Nov 2018 16:54:55 +0200
> > > > From: Leon Romanovsky <l...@kernel.org>
> > > > To: Kenneth Lee <liguo...@hisilicon.com>
> > > > CC: Kenneth Lee <nek.in...@gmail.com>, Tim Sell 
> > > > <timothy.s...@unisys.com>,
> > > >  linux-...@vger.kernel.org, Alexander Shishkin
> > > >  <alexander.shish...@linux.intel.com>, Zaibo Xu <xuza...@huawei.com>,
> > > >  zhangfei....@foxmail.com, linux...@huawei.com, 
> > > > haojian.zhu...@linaro.org,
> > > >  Christoph Lameter <c...@linux.com>, Hao Fang <fangha...@huawei.com>, 
> > > > Gavin
> > > >  Schenk <g.sch...@eckelmann.de>, RDMA mailing list
> > > >  <linux-r...@vger.kernel.org>, Zhou Wang <wangzh...@hisilicon.com>, 
> > > > Jason
> > > >  Gunthorpe <j...@ziepe.ca>, Doug Ledford <dledf...@redhat.com>, Uwe
> > > >  Kleine-König <u.kleine-koe...@pengutronix.de>, David Kershner
> > > >  <david.kersh...@unisys.com>, Johan Hovold <jo...@kernel.org>, Cyrille
> > > >  Pitchen <cyrille.pitc...@free-electrons.com>, Sagar Dharia
> > > >  <sdha...@codeaurora.org>, Jens Axboe <ax...@kernel.dk>,
> > > >  guodong...@linaro.org, linux-netdev <net...@vger.kernel.org>, Randy 
> > > > Dunlap
> > > >  <rdun...@infradead.org>, linux-ker...@vger.kernel.org, Vinod Koul
> > > >  <vk...@kernel.org>, linux-crypto@vger.kernel.org, Philippe Ombredanne
> > > >  <pombreda...@nexb.com>, Sanyog Kale <sanyog.r.k...@intel.com>, "David 
> > > > S.
> > > >  Miller" <da...@davemloft.net>, linux-accelerat...@lists.ozlabs.org
> > > > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
> > > > User-Agent: Mutt/1.10.1 (2018-07-13)
> > > > Message-ID: <20181115145455.gn3...@mtr-leonro.mtl.com>
> > > >
> > > > On Thu, Nov 15, 2018 at 04:51:09PM +0800, Kenneth Lee wrote:
> > > > > On Wed, Nov 14, 2018 at 06:00:17PM +0200, Leon Romanovsky wrote:
> > > > > > Date: Wed, 14 Nov 2018 18:00:17 +0200
> > > > > > From: Leon Romanovsky <l...@kernel.org>
> > > > > > To: Kenneth Lee <nek.in...@gmail.com>
> > > > > > CC: Tim Sell <timothy.s...@unisys.com>, linux-...@vger.kernel.org,
> > > > > >  Alexander Shishkin <alexander.shish...@linux.intel.com>, Zaibo Xu
> > > > > >  <xuza...@huawei.com>, zhangfei....@foxmail.com, 
> > > > > > linux...@huawei.com,
> > > > > >  haojian.zhu...@linaro.org, Christoph Lameter <c...@linux.com>, Hao 
> > > > > > Fang
> > > > > >  <fangha...@huawei.com>, Gavin Schenk <g.sch...@eckelmann.de>, RDMA 
> > > > > > mailing
> > > > > >  list <linux-r...@vger.kernel.org>, Zhou Wang 
> > > > > > <wangzh...@hisilicon.com>,
> > > > > >  Jason Gunthorpe <j...@ziepe.ca>, Doug Ledford 
> > > > > > <dledf...@redhat.com>, Uwe
> > > > > >  Kleine-König <u.kleine-koe...@pengutronix.de>, David Kershner
> > > > > >  <david.kersh...@unisys.com>, Johan Hovold <jo...@kernel.org>, 
> > > > > > Cyrille
> > > > > >  Pitchen <cyrille.pitc...@free-electrons.com>, Sagar Dharia
> > > > > >  <sdha...@codeaurora.org>, Jens Axboe <ax...@kernel.dk>,
> > > > > >  guodong...@linaro.org, linux-netdev <net...@vger.kernel.org>, 
> > > > > > Randy Dunlap
> > > > > >  <rdun...@infradead.org>, linux-ker...@vger.kernel.org, Vinod Koul
> > > > > >  <vk...@kernel.org>, linux-crypto@vger.kernel.org, Philippe 
> > > > > > Ombredanne
> > > > > >  <pombreda...@nexb.com>, Sanyog Kale <sanyog.r.k...@intel.com>, 
> > > > > > Kenneth Lee
> > > > > >  <liguo...@hisilicon.com>, "David S. Miller" <da...@davemloft.net>,
> > > > > >  linux-accelerat...@lists.ozlabs.org
> > > > > > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for 
> > > > > > WarpDrive/uacce
> > > > > > User-Agent: Mutt/1.10.1 (2018-07-13)
> > > > > > Message-ID: <20181114160017.gi3...@mtr-leonro.mtl.com>
> > > > > >
> > > > > > On Wed, Nov 14, 2018 at 10:58:09AM +0800, Kenneth Lee wrote:
> > > > > > >
> > > > > > > 在 2018/11/13 上午8:23, Leon Romanovsky 写道:
> > > > > > > > On Mon, Nov 12, 2018 at 03:58:02PM +0800, Kenneth Lee wrote:
> > > > > > > > > From: Kenneth Lee <liguo...@hisilicon.com>
> > > > > > > > >
> > > > > > > > > WarpDrive is a general accelerator framework for the user 
> > > > > > > > > application to
> > > > > > > > > access the hardware without going through the kernel in data 
> > > > > > > > > path.
> > > > > > > > >
> > > > > > > > > The kernel component to provide kernel facility to driver for 
> > > > > > > > > expose the
> > > > > > > > > user interface is called uacce. It a short name for
> > > > > > > > > "Unified/User-space-access-intended Accelerator Framework".
> > > > > > > > >
> > > > > > > > > This patch add document to explain how it works.
> > > > > > > > + RDMA and netdev folks
> > > > > > > >
> > > > > > > > Sorry, to be late in the game, I don't see other patches, but 
> > > > > > > > from
> > > > > > > > the description below it seems like you are reinventing RDMA 
> > > > > > > > verbs
> > > > > > > > model. I have hard time to see the differences in the proposed
> > > > > > > > framework to already implemented in drivers/infiniband/* for 
> > > > > > > > the kernel
> > > > > > > > space and for the https://github.com/linux-rdma/rdma-core/ for 
> > > > > > > > the user
> > > > > > > > space parts.
> > > > > > >
> > > > > > > Thanks Leon,
> > > > > > >
> > > > > > > Yes, we tried to solve similar problem in RDMA. We also learned a 
> > > > > > > lot from
> > > > > > > the exist code of RDMA. But we we have to make a new one because 
> > > > > > > we cannot
> > > > > > > register accelerators such as AI operation, encryption or 
> > > > > > > compression to the
> > > > > > > RDMA framework:)
> > > > > >
> > > > > > Assuming that you did everything right and still failed to use RDMA
> > > > > > framework, you was supposed to fix it and not to reinvent new 
> > > > > > exactly
> > > > > > same one. It is how we develop kernel, by reusing existing code.
> > > > >
> > > > > Yes, but we don't force other system such as NIC or GPU into RDMA, do 
> > > > > we?
> > > >
> > > > You don't introduce new NIC or GPU, but proposing another interface to
> > > > directly access HW memory and bypass kernel for the data path. This is
> > > > whole idea of RDMA and this is why it is already present in the kernel.
> > > >
> > > > Various hardware devices are supported in our stack allow a ton of crazy
> > > > stuff, including GPUs interconnections and NIC functionalities.
> > >
> > > Yes. We don't want to invent new wheel. That is why we did it behind VFIO 
> > > in RFC
> > > v1 and v2. But finally we were persuaded by Mr. Jerome Glisse that VFIO 
> > > was not
> > > a good place to solve the problem.
> 
> I saw a couple of his responses, he constantly said to you that you are
> reinventing the wheel.
> https://lore.kernel.org/lkml/20180904150019.ga4...@redhat.com/
> 

No. I think he asked me did not create trouble in VFIO but just use common
interface from dma_buf and iommu itself. That is exactly what I am doing.

> > >
> > > And currently, as you see, IB is bound with devices doing RDMA. The 
> > > register
> > > function, ib_register_device() hint that it is a netdev (get_netdev() 
> > > callback), it know
> > > about gid, pkey, and Memory Window. IB is not simply a address space 
> > > management
> > > framework. And verbs to IB are not transparent. If we start to add
> > > compression/decompression, AI (RNN, CNN stuff) operations, and 
> > > encryption/decryption
> > > to the verbs set. It will become very complexity. Or maybe I 
> > > misunderstand the
> > > IB idea? But I don't see compression hardware is integrated in the 
> > > mainline
> > > Kernel. Could you directly point out which one I can used as a reference?
> > >
> 
> I strongly advise you to read the code, not all drivers are implementing
> gids, pkeys and get_netdev() callback.
> 
> Yes, you are misunderstanding drivers/infiniband subsystem. We have
> plenty options to expose APIs to the user space applications, starting
> from standard verbs API and ending with private objects which are
> understandable by specific device/driver.
> 
> IB stack provides secure FD to access device, by creating context,
> after that you can send direct commands to the FW (see mlx5 DEVX
> or hfi1) in sane way.
> 
> So actually, you will need to register your device, declare your own
> set of objects (similar to mlx5 include/uapi/rdma/mlx5_user_ioctl_*.h).
> 
> In regards to reference of compression hardware, I don't have.
> But there is an example of how T10-DIF can be implemented in verbs
> layer:
> https://www.openfabrics.org/images/2018workshop/presentations/307_TOved_T10-DIFOffload.pdf
> Or IPsec crypto:
> https://www.spinics.net/lists/linux-rdma/msg48906.html
> 

OK. I will spend some time on it first. But according to current discussion,
Don't you think I should avoid all these complexities but simply use SVM/SVA on
iommu or let the user application use the kernel-allocated VMA and page? It
does not create anything new. Just a new user of IOMMU and its SVM/SVA
capability.

> > > >
> > > > >
> > > > > I assume you would not agree to register a zip accelerator to 
> > > > > infiniband? :)
> > > >
> > > > "infiniband" name in the "drivers/infiniband/" is legacy one and the
> > > > current code supports IB, RoCE, iWARP and OmniPath as a transport 
> > > > layers.
> > > > For a lone time, we wanted to rename that folder to be "drivers/rdma",
> > > > but didn't find enough brave men/women to do it, due to backport mess
> > > > for such move.
> > > >
> > > > The addition of zip accelerator to RDMA is possible and depends on how
> > > > you will model such new functionality - new driver, or maybe new ULP.
> > > >
> > > > >
> > > > > Further, I don't think it is wise to break an exist system (RDMA) to 
> > > > > fulfill a
> > > > > totally new scenario. The better choice is to let them run in 
> > > > > parallel for some
> > > > > time and try to merge them accordingly.
> > > >
> > > > Awesome, so please run your code out-of-tree for now and once you are 
> > > > ready
> > > > for submission let's try to merge it.
> > >
> > > Yes, yes. We know trust need time to gain. But the fact is that there is 
> > > no
> > > accelerator user driver can be added to mainline kernel. We should raise 
> > > the
> > > topic time to time. So to help the communication to fix the gap, right?
> > >
> > > We are also opened to cooperate with IB to do it within the IB framework. 
> > > But
> > > please let me know where to start. I feel it is quite wired to make a
> > > ib_register_device for a zip or RSA accelerator.
> 
> Most of ib_ prefixes in drivers/infinband/ are legacy names. You can
> rename them to be rdma_register_device() if it helps.
> 
> So from implementation point of view, as I wrote above.
> Create minimal driver to register, expose MR to user space, add your own
> objects and capabilities through our new KABI and implement user space part
> in github.com/linux-rdma/rdma-core.

I don't think it is just a name. But anyway, let me spend some time to try the
possibility.

> 
> > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Another problem we tried to address is the way to pin the memory 
> > > > > > > for dma
> > > > > > > operation. The RDMA way to pin the memory cannot avoid the page 
> > > > > > > lost due to
> > > > > > > copy-on-write operation during the memory is used by the device. 
> > > > > > > This may
> > > > > > > not be important to RDMA library. But it is important to 
> > > > > > > accelerator.
> > > > > >
> > > > > > Such support exists in drivers/infiniband/ from late 2014 and
> > > > > > it is called ODP (on demand paging).
> > > > >
> > > > > I reviewed ODP and I think it is a solution bound to infiniband. It 
> > > > > is part of
> > > > > MR semantics and required a infiniband specific hook
> > > > > (ucontext->invalidate_range()). And the hook requires the device to 
> > > > > be able to
> > > > > stop using the page for a while for the copying. It is ok for 
> > > > > infiniband
> > > > > (actually, only mlx5 uses it). I don't think most accelerators can 
> > > > > support
> > > > > this mode. But WarpDrive works fully on top of IOMMU interface, it 
> > > > > has no this
> > > > > limitation.
> > > >
> > > > 1. It has nothing to do with infiniband.
> > >
> > > But it must be a ib_dev first.
> 
> It is just a name.
> 
> > >
> > > > 2. MR and uncontext are verbs semantics and needed to ensure that host
> > > > memory exposed to user is properly protected from security point of 
> > > > view.
> > > > 3. "stop using the page for a while for the copying" - I'm not fully
> > > > understand this claim, maybe this article will help you to better
> > > > describe : https://lwn.net/Articles/753027/
> > >
> > > This topic was being discussed in RFCv2. The key problem here is that:
> > >
> > > The device need to hold the memory for its own calculation, but the 
> > > CPU/software
> > > want to stop it for a while for synchronizing with disk or COW.
> > >
> > > If the hardware support SVM/SVA (Shared Virtual Memory/Address), it is 
> > > easy, the
> > > device share page table with CPU, the device will raise a page fault when 
> > > the
> > > CPU downgrade the PTE to read-only.
> > >
> > > If the hardware cannot share page table with the CPU, we then need to have
> > > some way to change the device page table. This is what happen in ODP. It
> > > invalidates the page table in device upon mmu_notifier call back. But 
> > > this cannot
> > > solve the COW problem: if the user process A share a page P with device, 
> > > and A
> > > forks a new process B, and it continue to write to the page. By COW, the
> > > process B will keep the page P, while A will get a new page P'. But you 
> > > have
> > > no way to let the device know it should use P' rather than P.
> 
> I didn't hear about such issue and we supported fork for a long time.
> 
> > >
> > > This may be OK for RDMA application. Because RDMA is a big thing and we 
> > > can ask
> > > the programmer to avoid the situation. But for a accelerator, I don't 
> > > think we
> > > can ask a programmer to care for this when use a zlib.
> > >
> > > In WarpDrive/uacce, we make this simple. If you support IOMMU and it 
> > > support
> > > SVM/SVA. Everything will be fine just like ODP implicit mode. And you 
> > > don't need
> > > to write any code for that. Because it has been done by IOMMU framework. 
> > > If it
> > > dose not, you have to use the kernel allocated memory which has the same 
> > > IOVA as
> > > the VA in user space. So we can still maintain a unify address space 
> > > among the
> > > devices and the applicatin.
> > >
> > > > 4. mlx5 supports ODP not because of being partially IB device,
> > > > but because HW performance oriented implementation is not an easy task.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Hope this can help the understanding.
> > > > > >
> > > > > > Yes, it helped me a lot.
> > > > > > Now, I'm more than before convinced that this whole patchset 
> > > > > > shouldn't
> > > > > > exist in the first place.
> > > > >
> > > > > Then maybe you can tell me how I can register my accelerator to the 
> > > > > user space?
> > > >
> > > > Write kernel driver and write user space part of it.
> > > > https://github.com/linux-rdma/rdma-core/
> > > >
> > > > I have no doubts that your colleagues who wrote and maintain
> > > > drivers/infiniband/hw/hns driver know best how to do it.
> > > > They did it very successfully.
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > >
> > > > > > To be clear, NAK.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > > >
> > > > > > > > Hard NAK from RDMA side.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > > Signed-off-by: Kenneth Lee <liguo...@hisilicon.com>
> > > > > > > > > ---
> > > > > > > > >   Documentation/warpdrive/warpdrive.rst       | 260 +++++++
> > > > > > > > >   Documentation/warpdrive/wd-arch.svg         | 764 
> > > > > > > > > ++++++++++++++++++++
> > > > > > > > >   Documentation/warpdrive/wd.svg              | 526 
> > > > > > > > > ++++++++++++++
> > > > > > > > >   Documentation/warpdrive/wd_q_addr_space.svg | 359 +++++++++
> > > > > > > > >   4 files changed, 1909 insertions(+)
> > > > > > > > >   create mode 100644 Documentation/warpdrive/warpdrive.rst
> > > > > > > > >   create mode 100644 Documentation/warpdrive/wd-arch.svg
> > > > > > > > >   create mode 100644 Documentation/warpdrive/wd.svg
> > > > > > > > >   create mode 100644 
> > > > > > > > > Documentation/warpdrive/wd_q_addr_space.svg
> > > > > > > > >
> > > > > > > > > diff --git a/Documentation/warpdrive/warpdrive.rst 
> > > > > > > > > b/Documentation/warpdrive/warpdrive.rst
> > > > > > > > > new file mode 100644
> > > > > > > > > index 000000000000..ef84d3a2d462
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/Documentation/warpdrive/warpdrive.rst
> > > > > > > > > @@ -0,0 +1,260 @@
> > > > > > > > > +Introduction of WarpDrive
> > > > > > > > > +=========================
> > > > > > > > > +
> > > > > > > > > +*WarpDrive* is a general accelerator framework for the user 
> > > > > > > > > application to
> > > > > > > > > +access the hardware without going through the kernel in data 
> > > > > > > > > path.
> > > > > > > > > +
> > > > > > > > > +It can be used as the quick channel for accelerators, 
> > > > > > > > > network adaptors or
> > > > > > > > > +other hardware for application in user space.
> > > > > > > > > +
> > > > > > > > > +This may make some implementation simpler.  E.g.  you can 
> > > > > > > > > reuse most of the
> > > > > > > > > +*netdev* driver in kernel and just share some ring buffer to 
> > > > > > > > > the user space
> > > > > > > > > +driver for *DPDK* [4] or *ODP* [5]. Or you can combine the 
> > > > > > > > > RSA accelerator with
> > > > > > > > > +the *netdev* in the user space as a https reversed proxy, 
> > > > > > > > > etc.
> > > > > > > > > +
> > > > > > > > > +*WarpDrive* takes the hardware accelerator as a 
> > > > > > > > > heterogeneous processor which
> > > > > > > > > +can share particular load from the CPU:
> > > > > > > > > +
> > > > > > > > > +.. image:: wd.svg
> > > > > > > > > +        :alt: WarpDrive Concept
> > > > > > > > > +
> > > > > > > > > +The virtual concept, queue, is used to manage the requests 
> > > > > > > > > sent to the
> > > > > > > > > +accelerator. The application send requests to the queue by 
> > > > > > > > > writing to some
> > > > > > > > > +particular address, while the hardware takes the requests 
> > > > > > > > > directly from the
> > > > > > > > > +address and send feedback accordingly.
> > > > > > > > > +
> > > > > > > > > +The format of the queue may differ from hardware to 
> > > > > > > > > hardware. But the
> > > > > > > > > +application need not to make any system call for the 
> > > > > > > > > communication.
> > > > > > > > > +
> > > > > > > > > +*WarpDrive* tries to create a shared virtual address space 
> > > > > > > > > for all involved
> > > > > > > > > +accelerators. Within this space, the requests sent to queue 
> > > > > > > > > can refer to any
> > > > > > > > > +virtual address, which will be valid to the application and 
> > > > > > > > > all involved
> > > > > > > > > +accelerators.
> > > > > > > > > +
> > > > > > > > > +The name *WarpDrive* is simply a cool and general name 
> > > > > > > > > meaning the framework
> > > > > > > > > +makes the application faster. It includes general user 
> > > > > > > > > library, kernel
> > > > > > > > > +management module and drivers for the hardware. In kernel, 
> > > > > > > > > the management
> > > > > > > > > +module is called *uacce*, meaning 
> > > > > > > > > "Unified/User-space-access-intended
> > > > > > > > > +Accelerator Framework".
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > > +How does it work
> > > > > > > > > +================
> > > > > > > > > +
> > > > > > > > > +*WarpDrive* uses *mmap* and *IOMMU* to play the trick.
> > > > > > > > > +
> > > > > > > > > +*Uacce* creates a chrdev for the device registered to it. A 
> > > > > > > > > "queue" will be
> > > > > > > > > +created when the chrdev is opened. The application access 
> > > > > > > > > the queue by mmap
> > > > > > > > > +different address region of the queue file.
> > > > > > > > > +
> > > > > > > > > +The following figure demonstrated the queue file address 
> > > > > > > > > space:
> > > > > > > > > +
> > > > > > > > > +.. image:: wd_q_addr_space.svg
> > > > > > > > > +        :alt: WarpDrive Queue Address Space
> > > > > > > > > +
> > > > > > > > > +The first region of the space, device region, is used for 
> > > > > > > > > the application to
> > > > > > > > > +write request or read answer to or from the hardware.
> > > > > > > > > +
> > > > > > > > > +Normally, there can be three types of device regions mmio 
> > > > > > > > > and memory regions.
> > > > > > > > > +It is recommended to use common memory for request/answer 
> > > > > > > > > descriptors and use
> > > > > > > > > +the mmio space for device notification, such as doorbell. 
> > > > > > > > > But of course, this
> > > > > > > > > +is all up to the interface designer.
> > > > > > > > > +
> > > > > > > > > +There can be two types of device memory regions, kernel-only 
> > > > > > > > > and user-shared.
> > > > > > > > > +This will be explained in the "kernel APIs" section.
> > > > > > > > > +
> > > > > > > > > +The Static Share Virtual Memory region is necessary only 
> > > > > > > > > when the device IOMMU
> > > > > > > > > +does not support "Share Virtual Memory". This will be 
> > > > > > > > > explained after the
> > > > > > > > > +*IOMMU* idea.
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > > +Architecture
> > > > > > > > > +------------
> > > > > > > > > +
> > > > > > > > > +The full *WarpDrive* architecture is represented in the 
> > > > > > > > > following class
> > > > > > > > > +diagram:
> > > > > > > > > +
> > > > > > > > > +.. image:: wd-arch.svg
> > > > > > > > > +        :alt: WarpDrive Architecture
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > > +The user API
> > > > > > > > > +------------
> > > > > > > > > +
> > > > > > > > > +We adopt a polling style interface in the user space: ::
> > > > > > > > > +
> > > > > > > > > +        int wd_request_queue(struct wd_queue *q);
> > > > > > > > > +        void wd_release_queue(struct wd_queue *q);
> > > > > > > > > +
> > > > > > > > > +        int wd_send(struct wd_queue *q, void *req);
> > > > > > > > > +        int wd_recv(struct wd_queue *q, void **req);
> > > > > > > > > +        int wd_recv_sync(struct wd_queue *q, void **req);
> > > > > > > > > +        void wd_flush(struct wd_queue *q);
> > > > > > > > > +
> > > > > > > > > +wd_recv_sync() is a wrapper to its non-sync version. It will 
> > > > > > > > > trapped into
> > > > > > > > > +kernel and waits until the queue become available.
> > > > > > > > > +
> > > > > > > > > +If the queue do not support SVA/SVM. The following helper 
> > > > > > > > > function
> > > > > > > > > +can be used to create Static Virtual Share Memory: ::
> > > > > > > > > +
> > > > > > > > > +        void *wd_preserve_share_memory(struct wd_queue *q, 
> > > > > > > > > size_t size);
> > > > > > > > > +
> > > > > > > > > +The user API is not mandatory. It is simply a suggestion and 
> > > > > > > > > hint what the
> > > > > > > > > +kernel interface is supposed to support.
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > > +The user driver
> > > > > > > > > +---------------
> > > > > > > > > +
> > > > > > > > > +The queue file mmap space will need a user driver to wrap 
> > > > > > > > > the communication
> > > > > > > > > +protocol. *UACCE* provides some attributes in sysfs for the 
> > > > > > > > > user driver to
> > > > > > > > > +match the right accelerator accordingly.
> > > > > > > > > +
> > > > > > > > > +The *UACCE* device attribute is under the following 
> > > > > > > > > directory:
> > > > > > > > > +
> > > > > > > > > +/sys/class/uacce/<dev-name>/params
> > > > > > > > > +
> > > > > > > > > +The following attributes is supported:
> > > > > > > > > +
> > > > > > > > > +nr_queue_remained (ro)
> > > > > > > > > +        number of queue remained
> > > > > > > > > +
> > > > > > > > > +api_version (ro)
> > > > > > > > > +        a string to identify the queue mmap space format and 
> > > > > > > > > its version
> > > > > > > > > +
> > > > > > > > > +device_attr (ro)
> > > > > > > > > +        attributes of the device, see UACCE_DEV_xxx flag 
> > > > > > > > > defined in uacce.h
> > > > > > > > > +
> > > > > > > > > +numa_node (ro)
> > > > > > > > > +        id of numa node
> > > > > > > > > +
> > > > > > > > > +priority (rw)
> > > > > > > > > +        Priority or the device, bigger is higher
> > > > > > > > > +
> > > > > > > > > +(This is not yet implemented in RFC version)
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > > +The kernel API
> > > > > > > > > +--------------
> > > > > > > > > +
> > > > > > > > > +The *uacce* kernel API is defined in uacce.h. If the 
> > > > > > > > > hardware support SVM/SVA,
> > > > > > > > > +The driver need only the following API functions: ::
> > > > > > > > > +
> > > > > > > > > +        int uacce_register(uacce);
> > > > > > > > > +        void uacce_unregister(uacce);
> > > > > > > > > +        void uacce_wake_up(q);
> > > > > > > > > +
> > > > > > > > > +*uacce_wake_up* is used to notify the process who epoll() on 
> > > > > > > > > the queue file.
> > > > > > > > > +
> > > > > > > > > +According to the IOMMU capability, *uacce* categories the 
> > > > > > > > > devices as follow:
> > > > > > > > > +
> > > > > > > > > +UACCE_DEV_NOIOMMU
> > > > > > > > > +        The device has no IOMMU. The user process cannot use 
> > > > > > > > > VA on the hardware
> > > > > > > > > +        This mode is not recommended.
> > > > > > > > > +
> > > > > > > > > +UACCE_DEV_SVA (UACCE_DEV_PASID | UACCE_DEV_FAULT_FROM_DEV)
> > > > > > > > > +        The device has IOMMU which can share the same page 
> > > > > > > > > table with user
> > > > > > > > > +        process
> > > > > > > > > +
> > > > > > > > > +UACCE_DEV_SHARE_DOMAIN
> > > > > > > > > +        The device has IOMMU which has no multiple page 
> > > > > > > > > table and device page
> > > > > > > > > +        fault support
> > > > > > > > > +
> > > > > > > > > +If the device works in mode other than UACCE_DEV_NOIOMMU, 
> > > > > > > > > *uacce* will set its
> > > > > > > > > +IOMMU to IOMMU_DOMAIN_UNMANAGED. So the driver must not use 
> > > > > > > > > any kernel
> > > > > > > > > +DMA API but the following ones from *uacce* instead: ::
> > > > > > > > > +
> > > > > > > > > +        uacce_dma_map(q, va, size, prot);
> > > > > > > > > +        uacce_dma_unmap(q, va, size, prot);
> > > > > > > > > +
> > > > > > > > > +*uacce_dma_map/unmap* is valid only for UACCE_DEV_SVA 
> > > > > > > > > device. It creates a
> > > > > > > > > +particular PASID and page table for the kernel in the IOMMU 
> > > > > > > > > (Not yet
> > > > > > > > > +implemented in the RFC)
> > > > > > > > > +
> > > > > > > > > +For the UACCE_DEV_SHARE_DOMAIN device, uacce_dma_map/unmap 
> > > > > > > > > is not valid.
> > > > > > > > > +*Uacce* call back start_queue only when the DUS and DKO 
> > > > > > > > > region is mmapped. The
> > > > > > > > > +accelerator driver must use those dma buffer, via 
> > > > > > > > > uacce_queue->qfrs[], on
> > > > > > > > > +start_queue call back. The size of the queue file region is 
> > > > > > > > > defined by
> > > > > > > > > +uacce->ops->qf_pg_start[].
> > > > > > > > > +
> > > > > > > > > +We have to do it this way because most of current IOMMU 
> > > > > > > > > cannot support the
> > > > > > > > > +kernel and user virtual address at the same time. So we have 
> > > > > > > > > to let them both
> > > > > > > > > +share the same user virtual address space.
> > > > > > > > > +
> > > > > > > > > +If the device have to support kernel and user at the same 
> > > > > > > > > time, both kernel
> > > > > > > > > +and the user should use these DMA API. This is not 
> > > > > > > > > convenient. A better
> > > > > > > > > +solution is to change the future DMA/IOMMU design to let 
> > > > > > > > > them separate the
> > > > > > > > > +address space between the user and kernel space. But it is 
> > > > > > > > > not going to be in
> > > > > > > > > +a short time.
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > > +Multiple processes support
> > > > > > > > > +==========================
> > > > > > > > > +
> > > > > > > > > +In the latest mainline kernel (4.19) when this document is 
> > > > > > > > > written, the IOMMU
> > > > > > > > > +subsystem do not support multiple process page tables yet.
> > > > > > > > > +
> > > > > > > > > +Most IOMMU hardware implementation support multi-process 
> > > > > > > > > with the concept
> > > > > > > > > +of PASID. But they may use different name, e.g. it is call 
> > > > > > > > > sub-stream-id in
> > > > > > > > > +SMMU of ARM. With PASID or similar design, multi page table 
> > > > > > > > > can be added to
> > > > > > > > > +the IOMMU and referred by its PASID.
> > > > > > > > > +
> > > > > > > > > +*JPB* has a patchset to enable this[1]_. We have tested it 
> > > > > > > > > with our hardware
> > > > > > > > > +(which is known as *D06*). It works well. *WarpDrive* rely 
> > > > > > > > > on them to support
> > > > > > > > > +UACCE_DEV_SVA. If it is not enabled, *WarpDrive* can still 
> > > > > > > > > work. But it
> > > > > > > > > +support only one process, the device will be set to 
> > > > > > > > > UACCE_DEV_SHARE_DOMAIN
> > > > > > > > > +even it is set to UACCE_DEV_SVA initially.
> > > > > > > > > +
> > > > > > > > > +Static Share Virtual Memory is mainly used by 
> > > > > > > > > UACCE_DEV_SHARE_DOMAIN device.
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > > +Legacy Mode Support
> > > > > > > > > +===================
> > > > > > > > > +For the hardware without IOMMU, WarpDrive can still work, 
> > > > > > > > > the only problem is
> > > > > > > > > +VA cannot be used in the device. The driver should adopt 
> > > > > > > > > another strategy for
> > > > > > > > > +the shared memory. It is only for testing, and not 
> > > > > > > > > recommended.
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > > +The Folk Scenario
> > > > > > > > > +=================
> > > > > > > > > +For a process with allocated queues and shared memory, what 
> > > > > > > > > happen if it forks
> > > > > > > > > +a child?
> > > > > > > > > +
> > > > > > > > > +The fd of the queue will be duplicated on folk, so the child 
> > > > > > > > > can send request
> > > > > > > > > +to the same queue as its parent. But the requests which is 
> > > > > > > > > sent from processes
> > > > > > > > > +except for the one who open the queue will be blocked.
> > > > > > > > > +
> > > > > > > > > +It is recommended to add O_CLOEXEC to the queue file.
> > > > > > > > > +
> > > > > > > > > +The queue mmap space has a VM_DONTCOPY in its VMA. So the 
> > > > > > > > > child will lost all
> > > > > > > > > +those VMAs.
> > > > > > > > > +
> > > > > > > > > +This is why *WarpDrive* does not adopt the mode used in 
> > > > > > > > > *VFIO* and *InfiniBand*.
> > > > > > > > > +Both solutions can set any user pointer for hardware 
> > > > > > > > > sharing. But they cannot
> > > > > > > > > +support fork when the dma is in process. Or the 
> > > > > > > > > "Copy-On-Write" procedure will
> > > > > > > > > +make the parent process lost its physical pages.
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > > +The Sample Code
> > > > > > > > > +===============
> > > > > > > > > +There is a sample user land implementation with a simple 
> > > > > > > > > driver for Hisilicon
> > > > > > > > > +Hi1620 ZIP Accelerator.
> > > > > > > > > +
> > > > > > > > > +To test, do the following in samples/warpdrive (for the case 
> > > > > > > > > of PC host): ::
> > > > > > > > > +        ./autogen.sh
> > > > > > > > > +        ./conf.sh       # or simply ./configure if you build 
> > > > > > > > > on target system
> > > > > > > > > +        make
> > > > > > > > > +
> > > > > > > > > +Then you can get test_hisi_zip in the test subdirectory. 
> > > > > > > > > Copy it to the target
> > > > > > > > > +system and make sure the hisi_zip driver is enabled (the 
> > > > > > > > > major and minor of
> > > > > > > > > +the uacce chrdev can be gotten from the dmesg or sysfs), and 
> > > > > > > > > run: ::
> > > > > > > > > +        mknod /dev/ua1 c <major> <minior>
> > > > > > > > > +        test/test_hisi_zip -z < data > data.zip
> > > > > > > > > +        test/test_hisi_zip -g < data > data.gzip
> > > > > > > > > +
> > > > > > > > > +
> > > > > > > > > +References
> > > > > > > > > +==========
> > > > > > > > > +.. [1] https://patchwork.kernel.org/patch/10394851/
> > > > > > > > > +
> > > > > > > > > +.. vim: tw=78
> > > > > [...]
> > > > > > > > > --
> > > > > > > > > 2.17.1
> > > > > > > > >
> >
> > I don't know if Mr. Jerome Glisse in the list. I think I should cc him for 
> > my
> > respectation  to his help on last RFC.
> >
> > - Kenneth


Reply via email to