andykaylor wrote: I gave this some more thought, and I'm still not sure doing this in CIR is a good idea.
I verified that the X86 backend (at least) does generate worse code for `<3 x i32>` loads and stores than it does for `<4 x i32>`: https://godbolt.org/z/fGs8Exza6

The reason, I think, is that nothing in the IR gives a clear indication that it's OK to write garbage to the fourth memory slot, so the x86 backend takes extra measures to preserve whatever value is already there. It feels to me like there is a certain ambiguity in the LLVM IR specification here. Consider:

```
@a = dso_local global <3 x i32> zeroinitializer, align 16
@b = dso_local global <3 x i32> zeroinitializer, align 16

define void @store_load() {
entry:
  %loadVecN = load <3 x i32>, ptr @a, align 16
  store <3 x i32> %loadVecN, ptr @b, align 16
  ret void
}
```

How much data is represented by `<3 x i32>`? At least for x86 targets, I believe it has a store size of 12 bytes but an alloc size of 16 bytes. The x86 backend correctly reserves 16 bytes for it when you use it for an alloca or global variable, but the backend seems to think it only has permission to write 12 bytes of data for the `store <3 x i32>` instruction, even though 16 bytes are reserved for the object. I'm undecided whether this is a bug in the x86 backend or just the way things are, because if we assume that `<3 x i32>` occupies 16 bytes, I think we'd have to say the last four bytes are always poison, and I couldn't find any place where we say that.

The LangRef definition for the `store` instruction says:

> If `<value>` is of scalar type then the number of bytes written does not exceed the minimum number of bytes needed to hold all bits of the type. For example, storing an i24 writes at most three bytes. When writing a value of a type like i20 with a size that is not an integral number of bytes, it is unspecified what happens to the extra bits that do not belong to the type, but they will typically be overwritten. If `<value>` is of aggregate type, padding is filled with `undef`

Notice that it doesn't say what happens for vector types. I guess there is some consensus that they act like scalar types for stores?

So far, this is just a brain dump of what I thought about while trying to understand the simple case. My concern is that if we generate `<4 x i32>` loads and stores in CIR, it puts a barrier in the way of any optimization that might want to reorganize vector sizes to handle a large number of `<3 x i32>` operations more efficiently. I'm not sure how much of this is purely theoretical and how much will be possible in the near future. What I'm imagining is a block of memory that contains a huge number of triples, and we're loading them into vectors and doing something with them. There's a good chance that using `<3 x i32>` will be the most natural way to write the code, but depending on the target hardware, we may want to slice up the operations entirely differently. If we generate CIR that says we're loading and storing `<4 x i32>`, the optimizer is going to have to deal with that.

@AnastasiaStulova, can you provide any input on whether `<3 x i32>` loads and stores would be more useful to the optimizer than `<4 x i32>` loads and stores with poison values being shuffled in and out of the dead lane?

https://github.com/llvm/llvm-project/pull/161232
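For reference, the widened alternative I'm asking about looks roughly like this, as I understand it: the load is done as `<4 x i32>` and the extra lane is dropped with a `shufflevector`, then the value is widened again with a poison lane before the store so the full 16 bytes get written. This is just a sketch for comparison with the example above; the value names are made up and I haven't checked that this matches byte-for-byte what the classic codegen path emits.

```
@a = dso_local global <3 x i32> zeroinitializer, align 16
@b = dso_local global <3 x i32> zeroinitializer, align 16

define void @store_load() {
entry:
  ; @a and @b still occupy 16 bytes each (alloc size), so a 16-byte load/store stays in bounds.
  ; Load all 16 bytes, then drop the fourth lane to recover the <3 x i32> value.
  %loadVec4 = load <4 x i32>, ptr @a, align 16
  %vec3 = shufflevector <4 x i32> %loadVec4, <4 x i32> poison, <3 x i32> <i32 0, i32 1, i32 2>
  ; Widen back to <4 x i32>; lane 3 selects from the poison operand, so the
  ; store is free to clobber all 16 bytes at @b.
  %storeVec4 = shufflevector <3 x i32> %vec3, <3 x i32> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  store <4 x i32> %storeVec4, ptr @b, align 16
  ret void
}
```

Written this way, the store clobbers the padding bytes outright, which is presumably why the backend does better with it; the open question is whether the extra shuffles get in the optimizer's way.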
