Hi zhanghaohit, 

thanks for this proposal, it is a very interesting topic. It seems like a very big change, though, and there are some parts I do not quite understand, so I need your help to clarify them.

First, about the motivation part, the proposal mentions two points:

#1, "cloud device may use PCIE instead of memory share": that makes sense, but it seems like a new driver with PCIe support could fix this, with no need for such a big change.

#2, "different programming models": could you give more detailed information about this part? Do we have any plan to address the scalability issues behind the cloud FPGA performance concern?

Based on the current motivation section, it is a little unclear why we need to make such a big change.

For the "Proposal" section: it mentions a "framework where any OpenCL-compatible devices can be easily integrated" and "Vendor-specific optimizations are built-in .. SDK", but also says it "does not limit to specific SDK". The goal seems to be a cross-platform framework, which is a really awesome idea, but these two parts, "any OpenCL-compatible devices" and "vendor-specific optimization", conflict with each other. Could you give more detail about how you plan to balance them, and how to reduce the related complexity to minimize developer effort?

For "major work", about "To avoid frequent PCIe copies" ... "we propose to let all middle layers of a computation graph to completely run in FPGA devices", I have a couple of questions.

First, does that mean this proposal would put all parameter data (input data, weights, bias) into FPGA SRAM at once? In that case, if the model's parameter size is bigger than the FPGA's capacity, how would that be handled?

Second, the data transfer may cause big latency. Do we have any solution for hiding the memory latency? A sketch of the kind of scheme I have in mind follows below.
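To make the question concrete, here is a minimal sketch of one common scheme, tiling plus double buffering, where the transfer of the next weight tile overlaps the compute on the current one. All names here (`load_async`, `wait`, `compute`) are hypothetical placeholders, not real VTA APIs:

```python
# Hypothetical sketch: stream weights tile by tile when they exceed SRAM,
# overlapping the PCIe/DMA transfer of tile i+1 with the compute on tile i.
def run_layer(weight_tiles, load_async, wait, compute):
    handle = load_async(weight_tiles[0])              # prefetch the first tile
    for i in range(len(weight_tiles)):
        wait(handle)                                  # tile i is now resident in SRAM
        if i + 1 < len(weight_tiles):
            handle = load_async(weight_tiles[i + 1])  # prefetch the next tile
        compute(weight_tiles[i])                      # overlaps with the prefetch
```

Is something along these lines planned, or does the proposal assume the whole model fits on-chip?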

Third, even with a PCIe device, DMA should still work. Could you give some detail on why "PCIe transmission is costly"?

About #4, "auto-copy between layers": this seems to be talking about inter-operator parallelism. As far as I know, TVM currently does not analyze or perform inter-operator parallelism yet. Does this proposal plan to add such support to TVM?

For "Major changes", about "there is no need to launch additional service" (e.g., an RPC server): this is an existing deployment feature. After building the network model, VTA can run locally from any language (C++/Python, etc.). Here is a deploy example: https://github.com/apache/incubator-tvm-vta/pull/5. A minimal sketch of such a local run is shown below.
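For reference, a local run with the standard TVM runtime looks roughly like this, with no extra service launched. This is only a sketch: the library name `model.so` and the input name `data` are placeholders for whatever the build step actually produced.

```python
import numpy as np
import tvm
from tvm.contrib import graph_executor

# Load the compiled library and create a graph executor on the local device.
lib = tvm.runtime.load_module("model.so")
dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))

# Run inference locally; no RPC server needs to be launched.
data = np.random.rand(1, 3, 224, 224).astype("float32")
module.set_input("data", tvm.nd.array(data))
module.run()
out = module.get_output(0).numpy()
```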

About "Change VTA runtime to support batch queue synchronization": this seems to be the current VTA logic already. Could you give some detail on the difference between the existing logic and this new synchronization logic?

About "DFS traversal": as far as I know, TVM computes the network nodes sequentially rather than via DFS. Could you explain what "DFS traversal" means here? A toy illustration of my confusion follows below.
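To show where the confusion comes from: on a DAG, a post-order DFS from the output node already yields a valid sequential (topological) execution order, so the two can coincide. A toy sketch (the graph and node names are made up):

```python
# Toy DAG: each node maps to the list of nodes it depends on.
graph = {"out": ["conv2"], "conv2": ["conv1"], "conv1": []}

def dfs_postorder(g, node, seen=None, order=None):
    # Post-order DFS from `node`; the visit order is a topological order.
    seen = set() if seen is None else seen
    order = [] if order is None else order
    if node in seen:
        return order
    seen.add(node)
    for dep in g[node]:
        dfs_postorder(g, dep, seen, order)
    order.append(node)
    return order

print(dfs_postorder(graph, "out"))  # ['conv1', 'conv2', 'out']
```

So I would like to understand what the proposed "DFS traversal" does differently from the existing sequential execution.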

About "except first layer and last layer (in FPGA)": currently many solutions, including VTA (first conv), do this, but there are also solutions that offload all convolutions, including the first layer, onto the FPGA, for example Xilinx Vitis. Could I know the concern behind keeping the first/last layer on the CPU in this proposal?


For "Limitations", about "all instructions are running sequentially": this may cause a big performance problem, because VTA hides memory latency by pipelining the load/compute/store stages (TLPP). A back-of-the-envelope comparison follows below.
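As a rough illustration of the cost (the stage times are made-up numbers, not measurements), compare sequential issue against a TLPP-style pipeline where the load of tile i+1 overlaps the compute on tile i:

```python
def sequential_time(n_tiles, t_load, t_compute, t_store):
    # Sequential issue: every stage waits for the previous one to finish.
    return n_tiles * (t_load + t_compute + t_store)

def pipelined_time(n_tiles, t_load, t_compute, t_store):
    # Pipelined issue: steady-state throughput is bounded by the slowest
    # stage; pipeline fill and drain costs are paid only once.
    return t_load + n_tiles * max(t_load, t_compute, t_store) + t_store

print(sequential_time(64, 3, 5, 2))  # 640 time units
print(pipelined_time(64, 3, 5, 2))   # 3 + 64 * 5 + 2 = 325 time units
```

With these toy numbers the sequential schedule is roughly 2x slower, so losing TLPP is a real concern.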


Regards

Hua




