Hi zhanghaohit,
Thanks for this proposal; it is a very interesting topic. It also seems like a very big change, and there are some parts I do not quite understand, so I need your help to clarify them.

First, about the "Motivation" part:

1. The proposal mentions that a "cloud device may use PCIE instead of memory share". That makes sense, but it seems a new driver with PCIe support could fix that without such a big change.
2. About "different programming models": could you give more detailed information about this part?
3. Do we have any plan to address the scalability concern for cloud FPGA performance?

Based on the current motivation, it is a little unclear why we need to make this big change.

For the "Proposal" part: it mentions a "framework where any OpenCL-compatible devices can be easily integrated" and "Vendor-specific optimizations are built-in .. SDK", but also that it "does not limit to specific SDK". The goal seems to be a cross-platform framework, and this idea is really awesome, but the two parts "any OpenCL-compatible devices" and "vendor-specific optimization" conflict with each other. Could you give more detail about the plan to balance these two parts, and about how to reduce the related complexity to minimize developer effort?

For the "Major work" part, about "To avoid frequent PCIe copies ... we propose to let all middle layers of a computation graph to completely run in FPGA devices", I have a couple of questions:

1. Does that mean this proposal would put all parameter data (input data, weights, bias) into FPGA SRAM at one time? In that case, if the model's parameter size is bigger than the FPGA's capacity, how is that handled? (See the first sketch in the P.S. below for why I think the sizes may not fit.)
2. The data transfer may cause big latency. Do we have any solution for hiding the memory latency?
3. Even with a PCIe device, DMA should still work. Could you give some detail about exactly which "PCIe transmission is costly"?
4. About "auto-copy between layers": this seems to be talking about inter-operator parallelism. As far as I know, TVM does not currently analyze or perform inter-operator parallelism. Does this proposal plan to add such support to TVM?

For the "Major changes" part:

1. About "there is no need to launch additional service" (e.g., an RPC server): this is an existing deployment feature. After building the network model, VTA can run locally from any language (C++/Python, etc.). Here is a deploy example: https://github.com/apache/incubator-tvm-vta/pull/5 (the second sketch in the P.S. shows that flow).
2. About "Change VTA runtime to support batch queue synchronization": this seems to be the current VTA logic. Could you give some detail about the difference between the existing logic and this new synchronization logic?
3. About "DFS traversal": as far as I know, TVM computes the network nodes sequentially instead of doing a DFS. Could you explain what this "DFS traversal" means? (The third sketch in the P.S. shows my mental model of the current behavior.)
4. About "except first layer and last layer (in FPGA)": currently a lot of solutions, including VTA (first conv), do this, but there are also solutions that offload all conv layers, including the first one, into the FPGA, for example Xilinx Vitis. What is the concern behind putting the first/last layer on the CPU in this proposal?

For the "Limitations" part, about "all instructions are running sequentially": this may cause a big performance problem, because memory latency is hidden by the TLPP pipeline. (See the last sketch in the P.S.)

Regards,
Hua
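P.S. To make a few of my questions more concrete, here are some quick sketches, all in plain Python. First, a back-of-the-envelope calculation behind question 1 of "Major work". All the numbers are my rough assumptions (a ResNet-50-scale parameter count, fp32 weights, tens of MB of on-chip SRAM, an effective PCIe 3.0 x16 bandwidth), not figures from the proposal:

```python
# Rough sizing sketch: can all weights live in FPGA on-chip SRAM at once?
NUM_PARAMS = 25.6e6        # assumption: ResNet-50-scale parameter count
BYTES_PER_PARAM = 4        # assumption: fp32 weights
ONCHIP_SRAM_BYTES = 40e6   # assumption: tens of MB of on-chip SRAM
PCIE_BYTES_PER_S = 12e9    # assumption: effective PCIe 3.0 x16 bandwidth

weights_bytes = NUM_PARAMS * BYTES_PER_PARAM
print(f"weights: {weights_bytes / 1e6:.0f} MB vs "
      f"on-chip SRAM: {ONCHIP_SRAM_BYTES / 1e6:.0f} MB")
# -> 102 MB vs 40 MB: the weights do not fit at once, so some tiling or
#    streaming scheme is needed, which reintroduces PCIe traffic.

print(f"one full weight transfer over PCIe: "
      f"{weights_bytes / PCIE_BYTES_PER_S * 1e3:.1f} ms")
```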
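Second, for the "no additional service" point: this is roughly how a built model already runs locally today, without any RPC server. The artifact names (`deploy_lib.so`, `deploy_graph.json`, `deploy_params.params`) are placeholders for whatever `relay.build` exported, and I am using the 0.6/0.7-era Python API; this is a sketch of the existing flow, not of the proposal:

```python
import numpy as np
import tvm
from tvm.contrib import graph_runtime

# Placeholder artifact names: use whatever your relay.build export produced.
lib = tvm.runtime.load_module("deploy_lib.so")
graph_json = open("deploy_graph.json").read()
params_bytes = bytearray(open("deploy_params.params", "rb").read())

ctx = tvm.cpu(0)  # for VTA this would be the accelerator's context instead
module = graph_runtime.create(graph_json, lib, ctx)
module.load_params(params_bytes)

module.set_input("data", tvm.nd.array(np.zeros((1, 3, 224, 224), "float32")))
module.run()
print(module.get_output(0).asnumpy().shape)
```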
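Third, on "DFS traversal" and on the inter-operator parallelism question: my understanding is that the graph runtime stores the nodes in a topologically sorted list and executes the packed functions strictly one after another. A toy model of that behavior (plain Python, not actual TVM code):

```python
# Toy model of graph execution as I understand it today: a topologically
# sorted node list, executed strictly one node at a time.
graph = [
    ("conv1", [],                 lambda: "conv1-out"),
    ("conv2", ["conv1"],          lambda: "conv2-out"),  # depends on conv1
    ("conv3", ["conv1"],          lambda: "conv3-out"),  # also only on conv1
    ("add",   ["conv2", "conv3"], lambda: "add-out"),
]
# conv2 and conv3 are independent and could in principle run in parallel on
# two compute units, but the sequential walk below never exploits that.
results = {}
for name, deps, run_op in graph:
    assert all(d in results for d in deps)  # inputs ready: purely sequential
    results[name] = run_op()
    print("ran", name)
```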
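Finally, a tiny model of why "all instructions are running sequentially" worries me in the "Limitations" part. With TLPP-style overlap of load/compute/store, the steady-state cost per tile is the maximum of the stage latencies rather than their sum. The stage latencies below are made-up numbers, just to show the shape of the effect:

```python
# Sequential vs. pipelined (TLPP-style) execution of N tiles.
LOAD, COMPUTE, STORE = 5, 8, 5  # made-up stage latencies, arbitrary units
N = 100                         # number of tiles / instruction groups

sequential = N * (LOAD + COMPUTE + STORE)
# Perfect overlap: fill/drain the pipeline once, then finish one tile per
# slot of the slowest stage.
pipelined = (LOAD + COMPUTE + STORE) + (N - 1) * max(LOAD, COMPUTE, STORE)

print("sequential:", sequential)              # 1800
print("pipelined: ", pipelined)               # 18 + 99 * 8 = 810
print(f"speedup:    {sequential / pipelined:.2f}x")
```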