We discussed this RFC at the TVM Community meeting. Here are some notes beyond 
the [presentation 
content](https://docs.google.com/presentation/d/1sRz8hVy_619VmbSRwJPGmcMfgBzaqMOAVaVNxlSgGFM/edit#slide=id.p):
- @mbs-octoml notes that this RFC isn't particularly set in stone and it may 
grow/change as this effort proceeds. It's a bit speculative right now.
- note: "on TVM" means running regular Relay operators on TVM
- in the autotuning case, would you do autotuning on each path from beginning to end?
  - currently we are doing autotuning on the fly. for AutoTVM this isn't too 
bad. for the new MetaSchedule, every candidate kernel will be treated as its own 
tuning task, so there are a lot more candidate kernels to explore (TVM's 
present FuseOps pass is greedy and always combines kernels, but MetaSchedule 
does not necessarily do so).
- will we still use the cache mechanism to avoid re-tuning identical operators? 
  - yes, and we hope to get a good hit rate from the cache.
- by cache, do you mean something you've created yourself? or is it an online 
service or a tuning log on GitHub?
  - @mbs-octoml: there's an abstract cost-estimator interface (given an 
IRModule, return a double). in the prototype, there's only one instantiation of 
this interface, which runs using the TVM runners and the standard benchmarking 
machinery in Python. inside OctoML, we'll have different instantiations of this 
interface which consult our internal infrastructure. the caching in the 
prototype works via a naïve in-memory cache, coupled with a bit of a hack to 
populate it with standard AutoTVM tuning records (a sketch of this interface 
and cache appears after these notes).
- @manupa-arm: once the search over candidate partitions is done, will Collage 
consider merging adjacent subgraphs offloaded to the same compiler?
  - yes, there is a cleanup pass that merges adjacent subgraphs; this can 
happen due to heuristics in how partitions are chosen. it's a little like how 
MergeCompilerRegions works now (see the merge sketch after these notes). 
generally speaking, we expect this to be orthogonal to the results of the 
search, because we expect transition latency to be small; if we do see a 
difference, it means Collage was not trying to offload the right size of 
subgraphs.
- if we identify
  - when I say partitioning, what do I mean? can you explore serial vs 
parallel? what about notions of inlining (e.g. a reshape shared 38 times)? 
should I be exploring doing the reshape once and sharing the result? none of 
this is being done right now.
- are the changes needed just to expand the search model and the way we 
measure end-to-end latency?
  - the constraint we need to stay within is that the transform you want to 
try is local: the changes can be confined to a subgraph, and you can time that 
subgraph in isolation. beyond that, you get outside the bounds of dynamic 
programming (see the dynamic-programming sketch after these notes).
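
For concreteness, here is a minimal Python sketch of the kind of cost-estimator interface and naïve in-memory cache described above. All names here (`CostEstimator`, `CachingEstimator`, `estimate_seconds`) are hypothetical illustrations, not the prototype's actual API; only the shape (IRModule in, latency out, cached by structural hash) comes from the discussion.

```python
from abc import ABC, abstractmethod

import tvm


class CostEstimator(ABC):
    """Abstract cost estimator: given an IRModule (and target), return an
    estimated latency as a double. Hypothetical name, per the notes above."""

    @abstractmethod
    def estimate_seconds(self, mod: tvm.IRModule, target: tvm.target.Target) -> float:
        ...


class CachingEstimator(CostEstimator):
    """Naive in-memory cache keyed on the structural hash of the candidate
    module, so structurally identical subgraphs are only measured once."""

    def __init__(self, inner: CostEstimator):
        self.inner = inner
        self._cache: dict[tuple[int, str], float] = {}

    def estimate_seconds(self, mod: tvm.IRModule, target: tvm.target.Target) -> float:
        key = (tvm.ir.structural_hash(mod), str(target))
        if key not in self._cache:
            self._cache[key] = self.inner.estimate_seconds(mod, target)
        return self._cache[key]
```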
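
Next, a toy sketch of the cleanup pass that merges adjacent partitions assigned to the same compiler, in the spirit of MergeCompilerRegions. The `Partition` representation and `edges` input are hypothetical simplifications; a real pass must also keep the partitioned graph acyclic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    compiler: str     # e.g. "tvm", "tensorrt" (hypothetical representation)
    nodes: frozenset  # ids of the Relay nodes this partition covers


def adjacent(a: Partition, b: Partition, edges) -> bool:
    """True if a dataflow edge runs directly between the two partitions."""
    return any((u in a.nodes and v in b.nodes) or (u in b.nodes and v in a.nodes)
               for u, v in edges)


def merge_same_compiler(parts, edges):
    """Repeatedly fuse adjacent partitions that target the same compiler.
    NOTE: a real pass must also check that each merge keeps the graph of
    partitions acyclic; that check is omitted here for brevity."""
    merged_something = True
    while merged_something:
        merged_something = False
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                a, b = parts[i], parts[j]
                if a.compiler == b.compiler and adjacent(a, b, edges):
                    fused = Partition(a.compiler, a.nodes | b.nodes)
                    parts = [p for k, p in enumerate(parts) if k not in (i, j)]
                    parts.append(fused)
                    merged_something = True
                    break
            if merged_something:
                break
    return parts
```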
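
Finally, a toy sketch of why locality matters for the dynamic-programming search: because each candidate subgraph can be timed on its own and costs are additive, picking the best partitioning reduces to a shortest-path problem. The linear `candidates`/`estimate` inputs are hypothetical simplifications; Collage's real search runs over dataflow graphs, not a simple linear order.

```python
import heapq


def cheapest_partitioning(num_nodes, candidates, estimate):
    """Shortest-path / DP search over how much of a (topologically ordered)
    graph has been covered so far. candidates[start] is a list of
    (end, subgraph) pairs, each a candidate partition covering nodes
    [start, end); estimate(subgraph) returns the measured latency of that
    subgraph alone. Both inputs are hypothetical simplifications."""
    best = {0: 0.0}
    frontier = [(0.0, 0)]
    while frontier:
        cost, covered = heapq.heappop(frontier)
        if covered == num_nodes:
            return cost  # every node assigned to some partition
        if cost > best.get(covered, float("inf")):
            continue  # stale queue entry
        for end, subgraph in candidates.get(covered, []):
            # Local, additive cost: this is exactly the constraint that a
            # candidate transform must be confinable to a timeable subgraph.
            new_cost = cost + estimate(subgraph)
            if new_cost < best.get(end, float("inf")):
                best[end] = new_cost
                heapq.heappush(frontier, (new_cost, end))
    return float("inf")
```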
