The VTA hardware guide says that the dependence tokens (which are used to
synchronize modules through the RAW and WAR queues) are information-less. This
effectively kills all task-level parallelism because, whenever the producer
queue still has items left to be notified, the consumer cannot continue.
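
To make the question concrete, here is a minimal toy model of two stages (load and compute) synchronized only by information-less tokens. It is a sketch under my own assumptions, not VTA's actual implementation: the function name `simulate`, the parameters `num_tiles` and `num_buffers`, the one-step-per-operation timing, and the pre-filled WAR counter are all hypothetical. Because a token carries no payload, each queue is modeled as a plain counter.

```python
def simulate(num_tiles, num_buffers):
    """Count the steps needed to load and compute `num_tiles` tiles.

    Toy model (hypothetical, not the VTA RTL): each stage does at most one
    operation per step, and both stages look at the token counts as they
    were at the start of the step.
    """
    if num_buffers < 1:
        raise ValueError("at least one buffer is needed to make progress")

    raw = 0            # RAW tokens, load -> compute: "data is ready to read"
    war = num_buffers  # WAR tokens, compute -> load: "buffer may be overwritten"
    loaded = computed = steps = 0

    while computed < num_tiles:
        # Decide what each stage may do, based on tokens available right now.
        can_load = loaded < num_tiles and war > 0
        can_compute = raw > 0

        if can_load:
            war -= 1       # consume a "buffer is free" token
            raw += 1       # announce that new data is ready
            loaded += 1
        if can_compute:
            raw -= 1       # consume a "data is ready" token
            war += 1       # announce that the buffer is free again
            computed += 1
        steps += 1

    return steps


if __name__ == "__main__":
    for buffers in (1, 2):
        print(buffers, "buffer(s):", simulate(3, buffers), "steps")
```

In this toy model the only thing a stage can learn from a queue is whether at least one token is present, so whether the two stages overlap depends on how many tokens are in flight, not on what the tokens say. Whether the real hardware and the compiler-inserted push/pop flags behave like this is exactly what I am unsure about.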
I am going through the VTA matrix multiplication tutorial at
https://tvm.apache.org/docs/vta/tutorials/matrix_multiply.html
A_2 is the input vector tensor and B_2 is the weight tensor.
As we can see in the lowered code below, the last parameter's value (which
corresponds to memory st