Thanks @hjiang for the comments. 

[quote="hjiang, post:6, topic:6676"]
#1 about “cloud device may use PCIE instead of memory share”, that make sense, 
but seems like a new driver with pcie support would can fix and no need such 
big change,
[/quote]

#1 A new driver with PCIe support is not enough, as there is no mechanism to 
deal with a mix of CPU and FPGA ops. We have to insert a *device_copy* op 
whenever two adjacent layers reside on different devices (see the example near 
the end of this reply). The current VTA allocates all memory on the FPGA, and 
both CPU (ARM) ops and FPGA ops access the same memory area.

[quote="hjiang, post:6, topic:6676"]
#2 about “different programming models”, could you help to give more detailed 
information about this part? do we have any plan to address scalibility issue 
for cloud fpga performance concern?
[/quote]

#2 "Different programming models" mainly refers to the differences in hardware 
implementation (e.g., OpenCL vs. Xilinx HLS). What do you mean by the 
scalability issue? Could you give more details?

[quote="hjiang, post:6, topic:6676"]
but this 2 part “any OpenCL-compatible devices” and “vendor-specific 
optimization” are conflict, could you give more detail about what the plan here 
to balance this 2 parts and how to reduce related complexity to minus developer 
efforts?
[/quote]

@remotego Could you help elaborate on this part a bit?

[quote="hjiang, post:6, topic:6676"]
for “major work”, about “To avoid frequent PCIe copies” “we propose to let all 
middle layers of a computation graph to completely run in FPGA devices”, about 
this part, I have couple questions first does that means this proposal would 
put all params data(input data, weights, bias) into FPGA sram one time? in such 
case if the model params size is bigger then FPGA capability how to handle such 
issue?
[/quote]

It is not necessary to put all param data into FPGA SRAM at once. Actually, we 
do not change the original behaviour: all param data is placed in FPGA DRAM 
during initialisation, and we run the graph layer by layer. The only thing we 
do is ensure that all ops of the middle layers can run on the FPGA (i.e., 
implement VTA compute and schedule definitions for all middle layers).
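
For clarity, here is a rough sketch of the flow we have in mind. It follows the 
standard `relay.build` / `graph_runtime` APIs of the TVM version this RFC 
targets; `relay_prog`, `params`, and `batches` are assumed to be prepared 
elsewhere, so treat the details as illustrative rather than final:

```python
import tvm
from tvm import relay
from tvm.contrib import graph_runtime
import vta

env = vta.get_env()

# Build the (already packed) Relay program for VTA; middle layers are lowered
# to VTA compute/schedules, the remaining layers stay as target_host (CPU) code.
with vta.build_config():
    graph, lib, built_params = relay.build(
        relay_prog, target=env.target, params=params, target_host=env.target_host)

# Parameters (weights/bias) are copied into FPGA DRAM once, at load time.
ctx = tvm.ext_dev(0)
m = graph_runtime.create(graph, lib, ctx)
m.set_input(**built_params)

# Each inference then simply runs the graph layer by layer.
for batch in batches:
    m.set_input("data", batch)
    m.run()
    out = m.get_output(0)
```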

[quote="hjiang, post:6, topic:6676"]
second, the data transfer may cause big latency, could I know do we have any 
solution to hiding the memory latency?
[/quote]

I think we do not change anything for this part compared with the original VTA. 
Since weights/bias are only transferred once per model, the cost should be 
acceptable.

[quote="hjiang, post:6, topic:6676"]
third, even with PCIE device, DMA should still work, could I know some detail 
about which “PCIe transmission is costly”?
[/quote]

Yes, DMA is used for PCIe transmission, but the setup cost of each DMA transfer 
is non-negligible. Compared with the on-board DRAM bus, a PCIe copy is costly.
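
To make the trade-off concrete, here is a back-of-envelope model (the latency 
and bandwidth numbers are purely illustrative assumptions, not measurements):

```python
# total_time ~= dma_setup_latency + payload_bytes / pcie_bandwidth
DMA_SETUP_US = 10.0          # assumed fixed per-transfer DMA setup cost (us)
PCIE_BW_BYTES_PER_US = 8e3   # assumed ~8 GB/s effective PCIe throughput

def pcie_copy_us(num_bytes):
    """Rough cost of one host<->FPGA copy: fixed setup plus transfer time."""
    return DMA_SETUP_US + num_bytes / PCIE_BW_BYTES_PER_US

# A small per-layer activation is dominated by the setup cost, which is why
# we want to avoid a PCIe copy between every pair of adjacent layers.
print(pcie_copy_us(64 * 1024))         # ~18 us (setup is more than half)
print(pcie_copy_us(64 * 1024 * 1024))  # ~8400 us (setup is negligible)
```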


[quote="hjiang, post:6, topic:6676"]
#4 about “auto-copy between layers” seems like this is talking about 
inter-operator parallel, as I know tvm currently not analysis and do 
inter-operator parallel yet, do this proposal plan to add such support to tvm?
[/quote]

I think "auto-copy" here does not deal with inter-operator parallelism; it is 
used to make data accessible to the corresponding devices. Here is an example.


`MaxPool (on CPU) -> Conv2D (on FPGA) -> xxx`

In order for this to work, we have to insert a *device_copy* between the 
CPU op and the FPGA op. After the insertion, it becomes:

`MaxPool (on CPU) -> device_copy -> Conv2D (on FPGA) -> xxx`
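
For reference, here is a minimal Relay sketch of how such a copy is 
materialized, using `relay.annotation.on_device` and the `RewriteAnnotatedOps` 
pass from the TVM revision linked above. The shapes and ops are made up for 
illustration, and exact pass semantics should be double-checked against the 
version in use:

```python
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 16, 56, 56))
pooled = relay.nn.max_pool2d(data, pool_size=(2, 2), strides=(2, 2))
pooled = relay.annotation.on_device(pooled, tvm.cpu(0))      # MaxPool on CPU
weight = relay.var("weight", shape=(32, 16, 3, 3))
conv = relay.nn.conv2d(pooled, weight, channels=32, kernel_size=(3, 3))
conv = relay.annotation.on_device(conv, tvm.ext_dev(0))      # Conv2D on FPGA

mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))
# The pass walks the annotated graph and inserts device_copy nodes wherever
# a producer and its consumer are assigned to different devices.
mod = relay.transform.RewriteAnnotatedOps(tvm.cpu(0).device_type)(mod)
```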

[quote="hjiang, post:6, topic:6676"]
for “Major changes” , about “there is no need to launch additional 
service”(e.g., rpc server), this is a existing feature related deploy, after 
build network moodle, vta can running locally with any language (c++/python 
etc), here is a deploy example 
https://github.com/apache/incubator-tvm-vta/pull/5 ,
[/quote]

Thanks for the information. For this *major change*, we actually mean the 
end-user experience, i.e., how end users run an inference on the FPGA. There 
are not many code changes; currently we re-use the simulation code (i.e., 
LocalSession). We'll take a look at the new feature and see what we can 
borrow. Thanks.

[quote="hjiang, post:6, topic:6676"]
about “Change VTA runtime to support batch queue synchronization”, seems like 
this is current VTA logic, could I know some detail about the different between 
existing logic and this new synchronization logic?
[/quote]

The current VTA synchronizes after every layer. We propose to provide an 
option to synchronize once per inference (i.e., across multiple layers).
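
To illustrate the difference, here is pseudocode; the helper names are 
hypothetical, not the actual VTA runtime API:

```python
def run_per_layer_sync(layers):
    # Existing behaviour: each layer pushes its instruction stream and then
    # blocks until the FPGA has drained the queue before the next layer starts.
    for layer in layers:
        push_instructions(layer)   # hypothetical helper
        synchronize()              # hypothetical helper: wait for FPGA idle

def run_per_inference_sync(layers):
    # Proposed option: queue the instructions of all middle layers and
    # synchronize once per inference, saving host<->FPGA round trips.
    for layer in layers:
        push_instructions(layer)
    synchronize()
```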

[quote="hjiang, post:6, topic:6676"]
about “DFS traversal”, as I know tvm seems like do network node compute 
sequentially instead of DFS, could I know what does this “DFS traversal” means?
[/quote]

For this, I mean the `device annotation code` 
[here](https://github.com/apache/incubator-tvm/blob/c55ed371693740291b82cc8d88bf09c830d029c7/src/relay/transforms/device_annotation.cc#L369).

[quote="hjiang, post:6, topic:6676"]
about “except first layer and last layer (in FPGA)”, currently lot solution 
include vta(fist conv) do this, but there are also some solution offloaded all 
conv include #1 layer into FPGA,could I know what is the concern for putting 
first/last layer in cpu at this proposal?
[/quote]

Actually, we do put the first/last layers on the CPU. Currently, I think VTA 
does not support channel sizes < BLOCK, and max/avg pools are not supported 
either, so we just let these layers run on the CPU. Did I get your point?
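
For context, this is how the existing VTA flow keeps those boundary layers on 
the CPU, following the pattern in the VTA deploy tutorials. Here `mod` is an 
assumed Relay module, and the exact argument names should be checked against 
the version in use:

```python
import vta
from vta.top import graph_pack

env = vta.get_env()
# Only the region between start_name and stop_name is packed and offloaded to
# VTA; the first conv (channels < BLOCK) and the final layers stay as regular
# CPU ops.
relay_prog = graph_pack(
    mod["main"],
    env.BATCH,
    env.BLOCK_OUT,
    env.WGT_WIDTH,
    start_name="nn.max_pool2d",
    stop_name="nn.global_avg_pool2d",
)
```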

[quote="hjiang, post:6, topic:6676"]
for “Limitations” , about “all instructions are running sequentially”, this may 
cause big performance problem because memory hiding by pipe line TLPP.
[/quote]

Yes, this may be a potential performance issue. @remotego Could you elaborate 
more on this?




