Yes, it is unfriendly for Ansor. However, I don't think the two points 
contradict each other. We cannot expect to generate the same asm as ACL, but we 
can expect to achieve the same optimizations. Taking your example: we cannot 
easily do the `register blocking` optimization, but we can expect to apply the 
`FMA` optimization as ACL does, so that we generate `fmla` correctly too. For 
the CPU part, in my opinion, even if we cannot generate the same asm snippet, we 
may still reach the same level of performance as long as we generate key 
instructions like `fmla`. If we cannot, there must be some factor we are 
overlooking, perhaps unfriendly memory access that causes a high cache miss 
rate, or something else.

Back to Ansor: of course we should improve Ansor's performance. However, for 
the most performance-critical GEMM micro-kernel part, I think the most practical 
approach right now is to leverage a micro GEMM kernel (4x4/8x8) and let Ansor 
or MetaSchedule schedule the other parts (such as tiling parameters, unrolling, 
parallelization, and so on).
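
To make the split concrete, here is a rough sketch in plain Python (not TVM code). The function names and the `tile_m` / `tile_n` / `tile_k` knobs are hypothetical and only for illustration: the 4x4 micro-kernel stands in for the fixed, hand-tuned part, and the outer tile sizes stand in for what a tuner like Ansor or MetaSchedule would search over. It assumes the tile sizes divide the matrix dimensions evenly.

```python
# Illustrative sketch only: these names are hypothetical, not TVM or ACL APIs.
MR, NR = 4, 4  # fixed micro-kernel shape (the hand-written part)


def micro_kernel_4x4(A, B, C, i0, j0, k0, kc):
    """Accumulate a 4x4 block of C over kc steps of K, starting at k0."""
    acc = [[0.0] * NR for _ in range(MR)]  # stands in for register-held accumulators
    for k in range(k0, k0 + kc):
        for i in range(MR):
            a = A[i0 + i][k]
            for j in range(NR):
                # The multiply-add that a real kernel would map to fmla.
                acc[i][j] += a * B[k][j0 + j]
    for i in range(MR):
        for j in range(NR):
            C[i0 + i][j0 + j] += acc[i][j]


def gemm(A, B, M, N, K, tile_m=8, tile_n=8, tile_k=4):
    """Blocked GEMM; tile_m/tile_n/tile_k are the knobs a tuner would pick."""
    assert M % tile_m == 0 and N % tile_n == 0 and K % tile_k == 0
    assert tile_m % MR == 0 and tile_n % NR == 0
    C = [[0.0] * N for _ in range(M)]
    for mo in range(0, M, tile_m):          # outer tiling: searchable
        for no in range(0, N, tile_n):
            for ko in range(0, K, tile_k):
                for mi in range(mo, mo + tile_m, MR):   # inner: fixed kernel
                    for ni in range(no, no + tile_n, NR):
                        micro_kernel_4x4(A, B, C, mi, ni, ko, tile_k)
    return C


def gemm_reference(A, B, M, N, K):
    """Naive triple loop, used only to check the blocked version."""
    return [[sum((A[i][k] * B[k][j] for k in range(K)), 0.0)
             for j in range(N)] for i in range(M)]
```

Changing the tile sizes changes only the loop structure around the kernel, not the kernel itself or the result, which is the property that lets the scheduler tune the outer part freely.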




