So AVX512 has vcompressp{d,s} and vexpandp{d,s} (but nothing for smaller
integer element types). Those could be used for this but they have
a vector result (and element zero would be the first active).
But don't you possibly want the last inactive element as well,
depending on whether this is a peeled or non-peeled exit? We could
shift the mask by one for either case.
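
To make the compress semantics concrete, here is a rough scalar model in C. It is only a sketch: compress_model is a made-up name, it models the zeroing form of the instruction, and the mask is assumed to fit in 64 bits. Element zero of the result is the first active lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a zero-masking compress (illustrative, not a GCC or
   intrinsics API): pack the lanes of SRC whose mask bit is set into the
   low lanes of DST and zero the remaining lanes.  Returns the number of
   active lanes.  */
static size_t
compress_model (const int32_t *src, uint64_t mask, size_t nunits,
                int32_t *dst)
{
  assert (nunits <= 64);
  size_t n = 0;
  for (size_t i = 0; i < nunits; i++)
    if ((mask >> i) & 1)
      dst[n++] = src[i];
  for (size_t i = n; i < nunits; i++)
    dst[i] = 0;
  return n;
}
```

In this model, compressing with the mask shifted right by one leaves the lane just before the first active one in element zero, provided lane zero itself is not the first active lane.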
vcompresspd is not available on AVX2 or SSE4.1; using a vector-vector
permute to get element 'i' % nunits into lane zero would be another
possibility. For elements that are not float or double sized we need
something like this anyway.
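
The permute alternative could look roughly like the following scalar model in C. All names here (permute_model, extract_via_permute, MAX_UNITS) are hypothetical, and the rotating selector is just one possible choice; any selector whose lane zero is 'i' would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_UNITS = 64 };

/* Scalar model of a variable vector-vector permute:
   dst[j] = src[sel[j] % nunits].  */
static void
permute_model (const int16_t *src, const size_t *sel, size_t nunits,
               int16_t *dst)
{
  for (size_t j = 0; j < nunits; j++)
    dst[j] = src[sel[j] % nunits];
}

/* Hypothetical helper: bring element I % NUNITS to lane zero with a
   rotating selector, then read lane zero.  */
static int16_t
extract_via_permute (const int16_t *src, size_t i, size_t nunits)
{
  size_t sel[MAX_UNITS];
  int16_t tmp[MAX_UNITS];

  assert (nunits > 0 && nunits <= MAX_UNITS);
  for (size_t j = 0; j < nunits; j++)
    sel[j] = i + j;
  permute_model (src, sel, nunits, tmp);
  return tmp[0];
}
```

The model uses 16-bit elements only to stress that nothing here depends on the element being float or double sized.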
I do wonder whether we want to have the compress/expand as actual
optabs when we use them. Having an extract_first (without _active,
following extract_last_optab) is probably OK to abstract this to
some extent. extract_last doesn't specify what happens if no
bit is set in the mask; fold_extract_last seems to be the same
but with an else value. I wonder whether we should canonicalize
those and thus have an else value for extract_first as well.
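
As a sketch of what the canonicalized semantics might be, here are scalar models in C of fold_extract_last as I read it and of a hypothetical extract_first with an else value. The function names and operand order are illustrative only and are not the optab interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of fold_extract_last: the element at the last set mask bit, or
   ELSE_VAL if no bit is set.  */
static int32_t
fold_extract_last_model (int32_t else_val, uint64_t mask,
                         const int32_t *vec, size_t nunits)
{
  assert (nunits <= 64);
  for (size_t i = nunits; i-- > 0;)
    if ((mask >> i) & 1)
      return vec[i];
  return else_val;
}

/* Hypothetical extract_first with an else value: the element at the
   first set mask bit, or ELSE_VAL if no bit is set.  */
static int32_t
extract_first_model (int32_t else_val, uint64_t mask,
                     const int32_t *vec, size_t nunits)
{
  assert (nunits <= 64);
  for (size_t i = 0; i < nunits; i++)
    if ((mask >> i) & 1)
      return vec[i];
  return else_val;
}
```

With an else value the empty-mask case is well defined for both directions, which plain extract_last leaves open.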
Without having read the rest yet: riscv has a vcompress for all element
sizes with similar semantics. It also still needs to extract element zero.
--
Regards
Robin