On Tue, 5 Mar 2024, Jakub Jelinek wrote:

> On Tue, Mar 05, 2024 at 09:27:22AM +0100, Richard Biener wrote:
> > On Tue, 5 Mar 2024, Jakub Jelinek wrote:
> > > The following patch adds support for BIT_FIELD_REF lowering with
> > > large/huge _BitInt lhs. BIT_FIELD_REF requires its first operand to
> > > have a mode, so that operand can't be a huge _BitInt.
> > > If we only access limbs from inside of BIT_FIELD_REF using constant
> > > indexes, we can just create a new BIT_FIELD_REF to extract the limb,
> > > but if we need to use a variable index in a loop, I'm afraid we need
> > > to spill it into memory, which is what the following patch does.
> >
> > :/
> >
> > If it's only ever "small" _BitInt and we'd want to optimize, we could
> > fully unroll the loop at code generation time and thus avoid the
> > variable indices? You could also lower the BIT_FIELD_REF to
> > variable shifts & masking, I suppose.
>
> Not really sure if one can have some of the SVE/RISCV modes in there;
> that couldn't be small anymore. But otherwise yes, likely right now at most
> 64-byte vectors, i.e. 512 bits. Now, if it is, say, an extraction of a
> _BitInt(448) out of it (so that it isn't just a VCE instead), that would
> still mean e.g. on ia32 unrolling the loop with 7 iterations handling
> 2 limbs each. 14 limbs is already huge, I'm afraid, especially when it can
> be hidden somewhere in the middle of a large expression which is all
> mergeable.
> But more importantly, currently there are simple rules: large _BitInt
> implies straight-line code, huge _BitInt implies a loop, and the loop
> handles just 2 limbs (for other operations just 1 limb) per iteration.
> Changing that depending on which trees are used somewhere would be a
> nightmare. The idea was that if it is worth unrolling, the unroller can
> unroll it later, and at that point I'd think e.g. FRE would optimize away
> the temporary memory.
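For readers following the thread, a rough C model of the spill approach described above may help. It is only a sketch under stated assumptions, not the code GCC's _BitInt lowering actually emits: it assumes 32-bit limbs (as on ia32), a 64-byte (512-bit) vector operand, a _BitInt(448) result, and a limb-aligned field start. The names extract_bitint, limb_t, SRC_BYTES, RES_BITS and RES_LIMBS are all hypothetical. It shows why the spill is needed (the loop reads limbs at a variable index) and where the 7 iterations of 2 limbs each come from (448 / 32 = 14 limbs).

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t limb_t;                    /* hypothetical: ia32-style 32-bit limb */
#define LIMB_BITS 32
#define SRC_BYTES 64                        /* a 512-bit vector operand */
#define RES_BITS 448                        /* _BitInt(448) result */
#define RES_LIMBS (RES_BITS / LIMB_BITS)    /* 14 limbs */

/* Hypothetical model of the lowered BIT_FIELD_REF: the mode-sized operand
   is copied into a temporary ("spilled") so that the huge-_BitInt loop can
   read limbs at a variable index.  OFF_LIMBS is the limb-aligned start of
   the extracted field.  Returns the number of loop iterations executed,
   or 0 if the field would not fit inside the operand.  */
static unsigned
extract_bitint (const unsigned char vec[SRC_BYTES], unsigned off_limbs,
                limb_t res[RES_LIMBS])
{
  limb_t tmp[SRC_BYTES / sizeof (limb_t)];  /* the spill slot: 16 limbs */
  unsigned iters = 0;

  if (off_limbs > SRC_BYTES / sizeof (limb_t) - RES_LIMBS)
    return 0;
  memcpy (tmp, vec, SRC_BYTES);

  /* Huge _BitInt style: a loop handling 2 limbs per iteration, with the
     limb index only known at run time.  */
  for (unsigned i = 0; i < RES_LIMBS; i += 2, ++iters)
    {
      res[i] = tmp[off_limbs + i];
      res[i + 1] = tmp[off_limbs + i + 1];
    }
  return iters;
}
```

Fully unrolling this loop would remove the variable indices (and the need for tmp), at the cost of 14 straight-line limb copies per such extraction, which is the trade-off being weighed in the thread.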
Yeah, I would also guess FRE would optimize it, though the question is
whether the unroller heuristics anticipate that or whether the loop is small
enough. I guess we can worry about it when it shows up as a problem.

> For variable shifts/masking I'd need some type in which I can do it.

Ah, sure ... OTOH somehow RTL expansion manages to do it ;)

> Sure, perhaps if the inner operand is a vector I could use some non-constant
> permutations or similar.

If the extraction is byte-aligned, sure; and if the extraction is from a
single limb, it could perhaps be lowered without an extra temporary.

Richard.
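To illustrate the shifts & masking alternative and the objection that it needs "some type in which I can do it", here is a minimal C sketch. It assumes the whole operand fits in a scalar integer type, modeled here as a uint64_t holding two 32-bit limbs. The helper names limb_by_shift and field_by_shift_mask are hypothetical. With such a type, a limb or a short field at a variable position can be extracted with a variable shift and a mask, without a memory temporary. For a 512-bit vector operand there is generally no scalar integer type that wide to shift in, which is the difficulty raised above.

```c
#include <stdint.h>

/* Hypothetical: extract 32-bit limb IDX (0 or 1) of a two-limb value V
   using a variable shift.  IDX must be 0 or 1, since shifting a uint64_t
   by 64 or more is undefined behavior.  */
static uint32_t
limb_by_shift (uint64_t v, unsigned idx)
{
  return (uint32_t) (v >> (idx * 32u));
}

/* Hypothetical: extract a WIDTH-bit field at variable bit position POS
   from V with shift & mask.  Requires 1 <= WIDTH <= 63 and POS < 64;
   the field may straddle the boundary between the two 32-bit limbs.  */
static uint64_t
field_by_shift_mask (uint64_t v, unsigned pos, unsigned width)
{
  return (v >> pos) & ((UINT64_C (1) << width) - 1);
}
```

The second helper shows why a single wide type is convenient: a field crossing the limb boundary (e.g. bits 28..35) comes out of one shift and one mask, whereas with separate limbs it would need two loads and a combine.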