andykaylor wrote:

I gave this some more thought, and I'm still not sure doing this in CIR is a 
good idea.

I verified that the x86 backend (at least) generates worse code for <3 x i32> 
loads and stores than it does for <4 x i32>.

https://godbolt.org/z/fGs8Exza6

The reason, I think, is that nothing in the IR gives a clear indication that 
it's OK to write garbage to the fourth memory slot, so the x86 backend takes 
extra measures to preserve whatever value is already there.
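
As a rough illustration of why the narrower store is observably different, here is a small sketch in plain C++ standing in for the two store widths. The helper names (`store3`, `store4_widened`, `fourth_slot_after`) are made up for this example and are not LLVM or Clang APIs; the `-1` in the dead lane is just an arbitrary stand-in for "garbage".

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only; not LLVM or Clang APIs.

// Models a <3 x i32> store that touches exactly 12 bytes.
void store3(std::int32_t *dst, const std::array<std::int32_t, 3> &v) {
  std::memcpy(dst, v.data(), 3 * sizeof(std::int32_t));
}

// Models a widened <4 x i32> store: the fourth lane carries a don't-care
// value, but it still lands in memory.
void store4_widened(std::int32_t *dst, const std::array<std::int32_t, 3> &v,
                    std::int32_t dead_lane) {
  const std::array<std::int32_t, 4> wide{v[0], v[1], v[2], dead_lane};
  std::memcpy(dst, wide.data(), sizeof(wide));
}

// Returns what is left in the fourth memory slot after one of the two
// stores, given that the slot held `sentinel` beforehand.
std::int32_t fourth_slot_after(bool widened, std::int32_t sentinel) {
  std::int32_t mem[4] = {0, 0, 0, sentinel};
  const std::array<std::int32_t, 3> v{1, 2, 3};
  if (widened)
    store4_widened(mem, v, /*dead_lane=*/-1);
  else
    store3(mem, v);
  return mem[3];
}
```

With the 12-byte store the sentinel survives; with the widened store it is overwritten. Unless something tells the backend that the fourth slot is dead, it has to assume the first behavior is required.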

It feels to me like there is a certain ambiguity in the LLVM IR specification 
here. Consider:

```
@a = dso_local global <3 x i32> zeroinitializer, align 16
@b = dso_local global <3 x i32> zeroinitializer, align 16

define void @store_load() {
entry:
  %loadVecN = load <3 x i32>, ptr @a, align 16
  store <3 x i32> %loadVecN, ptr @b, align 16
  ret void
}
```

How much data is represented by `<3 x i32>`? At least for x86 targets, I 
believe it has an alloc size of 16 bytes but a store size of 12 bytes. The x86 
backend correctly reserves 16 bytes for it when you use it for an alloca or 
global variable, but it seems to think it only has permission to write 12 bytes 
of data for the `store <3 x i32>` instruction, even though 16 bytes are 
allocated.
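
To make the 12 versus 16 byte figures concrete, here is a sketch of the arithmetic as I understand it: the bits of the value are rounded up to whole bytes, and the space reserved per object is that figure rounded up to the alignment. The function names are hypothetical, and the 16-byte alignment is taken from the `align 16` in the IR above rather than queried from any real data layout.

```cpp
#include <cstdint>

// Hypothetical helpers that mirror the size rules described above;
// these are not LLVM APIs.

// Bytes needed to hold all bits of a vector value.
constexpr std::uint64_t raw_bytes(std::uint64_t num_elts,
                                  std::uint64_t elt_bits) {
  return (num_elts * elt_bits + 7) / 8;
}

// Bytes reserved for an object: the raw size rounded up to a multiple
// of the alignment.
constexpr std::uint64_t reserved_bytes(std::uint64_t raw,
                                       std::uint64_t align) {
  return (raw + align - 1) / align * align;
}

// <3 x i32>: 12 bytes of data, 16 bytes reserved at align 16.
static_assert(raw_bytes(3, 32) == 12, "three i32 lanes occupy 12 bytes");
static_assert(reserved_bytes(raw_bytes(3, 32), 16) == 16,
              "padded to 16 bytes at align 16");
// <4 x i32>: data and reserved space coincide.
static_assert(reserved_bytes(raw_bytes(4, 32), 16) == 16,
              "four i32 lanes fill the slot exactly");
```

The 4-byte gap between the two numbers for `<3 x i32>` is exactly the fourth slot whose contents are in question.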

I'm undecided whether this is a bug in the x86 backend or just the way things 
are, because if we assume that <3 x i32> represents 16 bytes, I think we'd have 
to say the last four bytes are always poison, and I couldn't find anywhere that 
we say that. The LangRef definition for the `store` instruction says:

"If <value> is of scalar type then the number of bytes written does not exceed 
the minimum number of bytes needed to hold all bits of the type. For example, 
storing an i24 writes at most three bytes. When writing a value of a type like 
i20 with a size that is not an integral number of bytes, it is unspecified what 
happens to the extra bits that do not belong to the type, but they will 
typically be overwritten. If <value> is of aggregate type, padding is filled 
with `undef`."

Notice that it doesn't say what happens for vector types. I guess there is some 
consensus that they act like scalar types for stores?

So far, this is just a brain dump of what I thought about while trying to 
understand the simple case.

The thing that I'm concerned about is that if we generate <4 x i32> loads and 
stores in CIR, it puts a barrier in the way of any optimization that might be 
trying to reorganize vector sizes to handle a large number of <3 x i32> 
operations more efficiently. I'm not sure how much of this is purely 
theoretical and how much will be possible in the near future. What I'm 
imagining is a block of memory that contains a huge 
number of triples, and we're loading them into vectors and doing something with 
them. There's a good chance that using <3 x i32> will be the most natural way 
to write the code, but depending on the target hardware, we may want to slice 
up the operations entirely differently. If we generate CIR that says we're 
loading and storing <4 x i32>, the optimizer is going to have to deal with that.
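
A toy version of the kind of re-slicing I have in mind, written as plain C++ loops rather than real vector code. It assumes tightly packed triples (12-byte stride) and a lane-agnostic operation (scaling every component), which is the easy case; the function names are invented for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical example: scale every component of packed (x, y, z) triples.

// "Natural" form: one 3-wide operation per triple.
void scale_by_triple(std::vector<std::int32_t> &data, std::int32_t k) {
  for (std::size_t t = 0; t + 3 <= data.size(); t += 3)
    for (std::size_t lane = 0; lane < 3; ++lane)
      data[t + lane] *= k;
}

// Re-sliced form: treat the same memory as a flat array and walk it in
// 4-wide chunks, with a scalar loop for any leftover elements.
void scale_by_chunks_of_4(std::vector<std::int32_t> &data, std::int32_t k) {
  std::size_t i = 0;
  for (; i + 4 <= data.size(); i += 4)
    for (std::size_t lane = 0; lane < 4; ++lane)
      data[i + lane] *= k;
  for (; i < data.size(); ++i)
    data[i] *= k;
}
```

Four packed triples are twelve `i32` values, so they can be covered by three full 4-wide chunks with no dead lane at all. That regrouping is easy to see when the code still says "three elements per operation" and harder once every access has already been widened to four.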

@AnastasiaStulova can you provide any input on whether <3 x i32> loads and 
stores would be more useful to the optimizer than <4 x i32> loads and stores 
with poison values being shuffled in and out of the dead lane?

https://github.com/llvm/llvm-project/pull/161232
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
