Yeah, it is unfriendly for Ansor. However, I don't think the two points contradict each other. We can't expect to generate the same asm as ACL, but we can expect to achieve the same optimizations. For example, as in your example, we can't easily do the `register blocking` optimization, but we can expect to apply the `FMA` optimization like ACL does, so that we generate `fmla` correctly too. For the CPU part, in my opinion, even if we can't generate the same asm snippet, we may still reach the same level of performance as long as we generate the key instructions like `fmla`. If we can't, there must be some factor we are overlooking, for example cache-unfriendly memory access that leads to a high cache miss rate, or something else.
Back to Ansor: of course we should improve Ansor's performance. However, for the most performance-critical part, the GEMM micro-kernel, I think the most practical approach at the moment is to leverage a micro GEMM kernel (4x4 / 8x8) and let Ansor or MetaSchedule schedule the other parts (tiling parameters, unrolling, parallelization, and so on).
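
To make the split concrete, here is a minimal pure-Python sketch of the idea, not TVM code: a fixed 4x4 micro-kernel does the inner multiply-adds, and the outer loops take tiling parameters that a tuner could search over. The names `micro_kernel_4x4` and `tiled_gemm` and the parameters `tile_m`, `tile_n`, `tile_k` are hypothetical, and the sketch assumes the matrix dimensions divide evenly by the tile sizes.

```python
# Illustrative sketch only: these names are hypothetical, not TVM/Ansor APIs.
MR, NR = 4, 4  # fixed micro-kernel shape (rows x cols of the C block)


def micro_kernel_4x4(A, B, C, i0, j0, k0, kc):
    """Update the 4x4 block of C at (i0, j0) using kc steps along K from k0."""
    # The 16 accumulators stand in for the registers a real kernel would use.
    acc = [[0.0] * NR for _ in range(MR)]
    for k in range(k0, k0 + kc):
        for i in range(MR):
            a = A[i0 + i][k]
            for j in range(NR):
                # The multiply-add that a real kernel would issue as fmla.
                acc[i][j] += a * B[k][j0 + j]
    for i in range(MR):
        for j in range(NR):
            C[i0 + i][j0 + j] += acc[i][j]


def tiled_gemm(A, B, tile_m=8, tile_n=8, tile_k=4):
    """Compute A @ B; the tile sizes are the knobs left to the scheduler."""
    M, K, N = len(A), len(A[0]), len(B[0])
    # Assumption of this sketch: no remainder handling.
    assert M % tile_m == 0 and N % tile_n == 0 and K % tile_k == 0
    assert tile_m % MR == 0 and tile_n % NR == 0
    C = [[0.0] * N for _ in range(M)]
    for mo in range(0, M, tile_m):          # outer tiling: tunable
        for no in range(0, N, tile_n):
            for ko in range(0, K, tile_k):
                for i0 in range(mo, mo + tile_m, MR):
                    for j0 in range(no, no + tile_n, NR):
                        micro_kernel_4x4(A, B, C, i0, j0, ko, tile_k)  # fixed
    return C
```

Changing `tile_m`, `tile_n`, or `tile_k` changes only the loop structure around the kernel, not the result, which is the part I would leave to Ansor or MetaSchedule.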