@yzhliu let me explain it another way.
Suppose we work only at the scalar level. People have proved that reverse mode 
only takes about 3x more computation than evaluating the function itself. This 
does not need any optimization - the gradient of f will only be a small 
constant factor (typically less than 10) slower than the computation of f itself.
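To make that constant-factor claim concrete, here is a minimal hand-written sketch (plain Python, not TVM code, function chosen only for illustration): the reverse pass visits each primitive op once and reuses the forward intermediates, so it does roughly the same number of operations as the forward pass.

```python
import math

def f(x):
    # forward pass: three primitive ops, intermediates kept for the reverse pass
    a = math.sin(x)   # a = sin(x)
    b = a * a         # b = a^2
    y = b + x         # y = sin(x)^2 + x
    return y, (x, a)

def grad_f(x):
    # reverse pass: one local derivative per primitive, reusing forward values,
    # so the total cost is a small constant multiple of evaluating f itself
    y, (x_, a) = f(x)
    dy = 1.0
    db = dy * 1.0            # adjoint through b + x w.r.t. b
    dx = dy * 1.0            # adjoint through b + x w.r.t. x
    da = db * 2.0 * a        # adjoint through a * a
    dx += da * math.cos(x_)  # adjoint through sin(x)
    return y, dx

print(grad_f(0.5))  # gradient of sin(x)^2 + x is 2*sin(x)*cos(x) + 1
```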

In this PR, the gradient of f might be MANY times more expensive.
This is because it calculates the full Jacobian, rather than the product of a 
vector/matrix/tensor with that Jacobian. The product can be fused into the 
backward computation, so it can be expressed in a much simpler form.
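A rough way to see the gap (plain numpy, not this PR's code; linear ops chosen so their Jacobians are just constant matrices): for a composition of two ops, materializing the full Jacobian forces a Jacobian-times-Jacobian matrix product, while the vector-Jacobian product that reverse mode actually needs is two matrix-vector products and never builds the big matrix.

```python
import numpy as np

n, k, m = 1000, 1000, 1000
A = np.random.randn(k, n)   # Jacobian of the first (linear) op  h(x) = A @ x
B = np.random.randn(m, k)   # Jacobian of the second (linear) op g(u) = B @ u
v = np.random.randn(m)      # the output adjoint that reverse mode propagates

# Full Jacobian of g(h(x)): an m x n matrix, O(m*k*n) work to form.
J = B @ A

# Vector-Jacobian product v^T J: two matvecs, O(m*k + k*n) work, no m x n matrix.
vjp = (v @ B) @ A

print(np.allclose(vjp, v @ J))  # same result, wildly different cost
```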
