On Tue, 5 Mar 2024, Jakub Jelinek wrote:

> On Tue, Mar 05, 2024 at 09:27:22AM +0100, Richard Biener wrote:
> > On Tue, 5 Mar 2024, Jakub Jelinek wrote:
> > > The following patch adds support for BIT_FIELD_REF lowering with
> > > large/huge _BitInt lhs.  BIT_FIELD_REF requires mode argument first
> > > operand, so the operand shouldn't be any huge _BitInt.
> > > If we only access limbs from inside of BIT_FIELD_REF using constant
> > > indexes, we can just create a new BIT_FIELD_REF to extract the limb,
> > > but if we need to use variable index in a loop, I'm afraid we need
> > > to spill it into memory, which is what the following patch does.
> > 
> > :/
> > 
> > If it's only ever "small" _BitInt and we'd want to optimize we could
> > fully unroll the loop at code generation time and thus avoid the
> > variable indices?  You could also lower the BIT_FIELD_REF to
> > variable shifts & masking I suppose.
> 
> Not really sure if one can have some of the SVE/RISCV modes in there,
> that couldn't be small anymore.  But otherwise yes, likely right now at most
> 64 byte vectors aka 512 bits.  Now, if it is say extraction of _BitInt(448)
> out of it (so that it isn't just VCE instead), that would still mean
> e.g. on ia32 unrolling the loop with 7 iterations handling 2 limbs each.
> 14 is already huge I'm afraid especially when it can be hidden somewhere in
> the middle of a large expression which is all mergeable.
> But more importantly, currently there are simple rules, large _BitInt
> implies straight line code, huge _BitInt implies a loop and the loop handles
> just 2 limbs (for other operations just 1 limb) per iteration.  Changing
> that depending on what trees are somewhere used would be a nightmare.
> The idea was that if it is worth unrolling, unroller can unroll it later
> and at that point I'd think e.g. FRE would optimize away the temporary
> memory.

Yeah, I would also guess FRE would optimize it, though the question is
whether the unroller heuristic anticipates that or whether the loop is
small enough.  I guess we can worry about it when it proves to be a problem.
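
For illustration only, the spill-then-loop shape discussed above can be
sketched in plain C.  This is not the code the patch generates; the
function name extract_448, the 32-bit limb type, and the assumption that
the extraction starts on a limb boundary are all hypothetical.  It models
the ia32 case from the thread: a 512-bit value is spilled to a temporary
in memory, and a 448-bit result (14 limbs) is filled in by a loop of
7 iterations that handles 2 limbs each using a variable index.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t limb_t;                 /* assumption: 32-bit limbs, as on ia32 */
enum { SRC_BYTES = 64, DST_LIMBS = 14 }; /* 512-bit source, 448-bit result */

/* Hypothetical helper: copy 448 bits out of a 512-bit value, starting at
   limb FIRST_LIMB (must be 0, 1 or 2 so the read stays in bounds).  */
void
extract_448 (const unsigned char src[SRC_BYTES], unsigned first_limb,
             limb_t dst[DST_LIMBS])
{
  /* Spill the whole source into addressable memory so that limbs can be
     read with a non-constant index.  */
  limb_t tmp[SRC_BYTES / sizeof (limb_t)];
  memcpy (tmp, src, SRC_BYTES);

  /* 7 iterations, 2 limbs per iteration.  */
  for (unsigned i = 0; i < DST_LIMBS; i += 2)
    {
      dst[i] = tmp[first_limb + i];
      dst[i + 1] = tmp[first_limb + i + 1];
    }
}
```

If such a loop is later fully unrolled, every index into tmp becomes a
constant, which is the situation in which one would hope for the
temporary to be optimized away.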

> For variable shifts/masking I'd need some type in which I can do it.

Ah, sure ... OTOH somehow RTL expansion manages to do it ;)

> Sure, perhaps if the inner operand is a vector I could use some non-constant
> permutations or similar.

If the extraction is byte aligned, sure.  And if the extraction is from
a single limb, then maybe it can be lowered without an extra temporary.
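
As a rough sketch of the shifts-and-masking idea, assuming 64-bit limbs
stored least significant limb first, an extraction of up to 64 bits at a
variable bit position could look like the following.  The name
extract_bits and its signature are made up for this example and do not
correspond to anything in the patch.  When the requested bits lie within
one limb, only that limb is read; otherwise the next limb supplies the
high part.

```c
#include <stddef.h>
#include <stdint.h>

#define LIMB_BITS 64

/* Hypothetical helper: return WIDTH bits (1..64) starting at bit POS of
   the value held in SRC[0..NLIMBS-1], least significant limb first.
   POS must lie inside the array.  */
uint64_t
extract_bits (const uint64_t *src, size_t nlimbs, size_t pos, unsigned width)
{
  size_t idx = pos / LIMB_BITS;       /* limb holding the lowest bit */
  unsigned shift = pos % LIMB_BITS;   /* bit offset inside that limb */
  uint64_t mask = width == LIMB_BITS
                  ? ~(uint64_t) 0
                  : ((uint64_t) 1 << width) - 1;

  uint64_t val = src[idx] >> shift;

  /* The field crosses into the next limb.  SHIFT is nonzero here because
     WIDTH <= LIMB_BITS, so the left shift count is in 1..63.  */
  if (shift + width > LIMB_BITS && idx + 1 < nlimbs)
    val |= src[idx + 1] << (LIMB_BITS - shift);

  return val & mask;
}
```

The single-limb case needs no temporary at all, while the two-limb case
only needs the two adjacent limbs.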

Richard.
