Thanks @hjiang for the comments.
[quote="hjiang, post:6, topic:6676"] #1 about “cloud device may use PCIE instead of memory share”, that make sense, but seems like a new driver with pcie support would can fix and no need such big change, [/quote]

#1 A new driver with PCIe support is not enough, as there is no mechanism to deal with a mix of CPU and FPGA ops. We have to insert a *device_copy* op whenever two adjacent layers reside on different devices. The current VTA allocates all memory on the FPGA, and both the CPU (ARM) ops and the FPGA ops access the same memory area.

[quote="hjiang, post:6, topic:6676"] #2 about “different programming models”, could you help to give more detailed information about this part? do we have any plan to address scalibility issue for cloud fpga performance concern? [/quote]

#2 "Different programming models" mainly refers to the differences in hardware implementation (e.g., OpenCL vs. Xilinx HLS). What do you mean by the scalability issue? Could you give more details?

[quote="hjiang, post:6, topic:6676"] but this 2 part “any OpenCL-compatible devices” and “vendor-specific optimization” are conflict, could you give more detail about what the plan here to balance this 2 parts and how to reduce related complexity to minus developer efforts? [/quote]

@remotego Could you help elaborate on this part a bit?

[quote="hjiang, post:6, topic:6676"] for “major work”, about “To avoid frequent PCIe copies” “we propose to let all middle layers of a computation graph to completely run in FPGA devices”, about this part, I have couple questions first does that means this proposal would put all params data(input data, weights, bias) into FPGA sram one time? in such case if the model params size is bigger then FPGA capability how to handle such issue? [/quote]

It is not necessary to put all the params data into FPGA SRAM at once. Actually, we do not change the original behaviour: all the params data are put into FPGA DRAM during initialisation, and we run the graph layer by layer.
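To make that initialisation behaviour concrete, here is a minimal pure-Python sketch (all names here are hypothetical stand-ins, not actual VTA/TVM APIs): params are copied to FPGA DRAM once at initialisation, and subsequent inferences run layer by layer without re-uploading weights.

```python
# Hypothetical sketch of the behaviour described above (no real VTA APIs):
# all params go to FPGA DRAM once at init; layers then run sequentially.

class FPGADram:
    """Stand-in for device DRAM; counts how often params are uploaded."""
    def __init__(self):
        self.store = {}
        self.uploads = 0

    def upload(self, name, data):
        self.store[name] = data
        self.uploads += 1

def init_params(dram, params):
    # One-time transfer of all weights/bias during initialisation.
    for name, data in params.items():
        dram.upload(name, data)

def run_graph(dram, layers, x):
    # Layers execute one by one, reading params already resident in DRAM.
    for name, fn in layers:
        x = fn(x, dram.store[name])
    return x

params = {"conv1": 2, "conv2": 3}
layers = [("conv1", lambda x, w: x * w),
          ("conv2", lambda x, w: x + w)]

dram = FPGADram()
init_params(dram, params)          # params uploaded exactly once
out1 = run_graph(dram, layers, 1)  # first inference
out2 = run_graph(dram, layers, 5)  # second inference: no re-upload
```

So even if the model is large, only the per-layer working set needs to fit in SRAM; the full param set lives in device DRAM.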
The only thing we do is ensure that all ops of the middle layers can run on the FPGA (i.e., implement VTA compute and schedule for all middle layers).

[quote="hjiang, post:6, topic:6676"] second, the data transfer may cause big latency, could I know do we have any solution to hiding the memory latency? [/quote]

I think we do not change anything for this part compared with the original VTA. Since weights/bias are transferred only once per model, the cost should be acceptable.

[quote="hjiang, post:6, topic:6676"] third, even with PCIE device, DMA should still work, could I know some detail about which “PCIe transmission is costly”? [/quote]

Yes, DMA is used for PCIe transmission, but the setup cost of DMA is non-negligible. Compared with the DRAM bus, PCIe DMA is costly.

[quote="hjiang, post:6, topic:6676"] #4 about “auto-copy between layers” seems like this is talking about inter-operator parallel, as I know tvm currently not analysis and do inter-operator parallel yet, do this proposal plan to add such support to tvm? [/quote]

I think "auto-copy" here does not deal with inter-operator parallelism. It is used to make data accessible by the corresponding devices. Here is an example:

`MaxPool (on CPU) -> Conv2D (on FPGA) -> xxx`

In order for this to work, we have to insert a *device_copy* between the CPU op and the FPGA op. After the insertion, it becomes:

`MaxPool (on CPU) -> device_copy -> Conv2D (on FPGA) -> xxx`

[quote="hjiang, post:6, topic:6676"] for “Major changes” , about “there is no need to launch additional service”(e.g., rpc server), this is a existing feature related deploy, after build network moodle, vta can running locally with any language (c++/python etc), here is a deploy example https://github.com/apache/incubator-tvm-vta/pull/5 , [/quote]

Thanks for the information. By this *major change*, we actually mean the end-user side, i.e., how end users run an inference on the FPGA. There are not many code changes.
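The insertion step above can be illustrated with a small self-contained sketch (hypothetical function and names, not the actual Relay device-annotation pass): walk the op sequence and insert a *device_copy* marker wherever two adjacent ops are placed on different devices.

```python
def insert_device_copies(ops):
    """ops: list of (op_name, device) pairs in execution order.
    Insert a ("device_copy", "src->dst") marker between any two
    adjacent ops that run on different devices.
    Hypothetical sketch, not the real Relay pass."""
    result = [ops[0]]
    for op in ops[1:]:
        prev_dev = result[-1][1]
        if op[1] != prev_dev:
            result.append(("device_copy", f"{prev_dev}->{op[1]}"))
        result.append(op)
    return result

graph = [("max_pool2d", "cpu"), ("conv2d", "fpga"),
         ("relu", "fpga"), ("dense", "cpu")]
annotated = insert_device_copies(graph)
# max_pool2d -> device_copy -> conv2d -> relu -> device_copy -> dense
```

Note that adjacent FPGA ops need no copy, which is exactly why keeping all middle layers on the FPGA avoids the PCIe round trips.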
Currently, we re-use the simulation code (i.e., LocalSession). We'll take a look at the new feature and see what we can borrow. Thanks.

[quote="hjiang, post:6, topic:6676"] about “Change VTA runtime to support batch queue synchronization”, seems like this is current VTA logic, could I know some detail about the different between existing logic and this new synchronization logic? [/quote]

The current VTA synchronizes after every layer. We propose to provide an option to synchronize once per inference (i.e., across multiple layers).

[quote="hjiang, post:6, topic:6676"] about “DFS traversal”, as I know tvm seems like do network node compute sequentially instead of DFS, could I know what does this “DFS traversal” means? [/quote]

By this, I mean the `device annotation code` [here](https://github.com/apache/incubator-tvm/blob/c55ed371693740291b82cc8d88bf09c830d029c7/src/relay/transforms/device_annotation.cc#L369).

[quote="hjiang, post:6, topic:6676"] about “except first layer and last layer (in FPGA)”, currently lot solution include vta(fist conv) do this, but there are also some solution offloaded all conv include #1 layer into FPGA,could I know what is the concern for putting first/last layer in cpu at this proposal? [/quote]

Actually, we do put the first/last layer on the CPU. Currently, I think VTA does not support channel sizes < BLOCK, and max/avg pools are not supported either, so we just let these layers run on the CPU. Did I get your point?

[quote="hjiang, post:6, topic:6676"] for “Limitations” , about “all instructions are running sequentially”, this may cause big performance problem because memory hiding by pipe line TLPP. [/quote]

Yes, this may be a potential performance issue. @remotego Could you elaborate more on this?
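To make the proposed synchronization option concrete, here is a small sketch (hypothetical names, not the real VTA runtime): instead of draining the instruction queue after every layer, the runtime can optionally drain it once per inference.

```python
# Hypothetical sketch of per-layer vs. per-inference synchronization.
# None of these names are real VTA runtime APIs.

class InsnQueue:
    """Stand-in for the VTA instruction queue; counts synchronizations."""
    def __init__(self):
        self.pending = 0
        self.syncs = 0

    def push(self, n_insns):
        self.pending += n_insns

    def sync(self):
        # Block until the device has drained all pending instructions.
        self.pending = 0
        self.syncs += 1

def run_inference(queue, layer_insns, sync_per_layer):
    for n in layer_insns:
        queue.push(n)
        if sync_per_layer:      # current behaviour: one sync per layer
            queue.sync()
    if not sync_per_layer:      # proposed option: one sync per inference
        queue.sync()

q1, q2 = InsnQueue(), InsnQueue()
run_inference(q1, [10, 20, 30], sync_per_layer=True)   # 3 syncs
run_inference(q2, [10, 20, 30], sync_per_layer=False)  # 1 sync
```

Fewer synchronization points means fewer host/device round trips over PCIe, at the cost of coarser-grained error reporting and progress visibility.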