I think it would make sense to leave the exact vector layout, i.e. vlen and lmul, to the caller. Attached is an attempt at a vectorized implementation of sin and cos that supports lmul values of m1 and m2 while using no more than a quarter of the vector registers. The function could live in libgcc and be used via a special pattern in the machine description that lists the exact set of clobbers.
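To put rough numbers on the register budget, here is a small plain-C sketch (not RVV code, and not taken from the attachment). It only relies on two facts from the vector spec: there are 32 vector registers, and VLMAX = VLEN / SEW * LMUL. The helper names are made up for illustration, and only integer LMUL is modelled.

```c
#include <stddef.h>

/* RVV defines 32 architectural vector registers, v0..v31. */
enum { RVV_NUM_VREGS = 32 };

/* Hypothetical helper: the "quarter of the vector registers" budget. */
static size_t quarter_budget(void)
{
    return RVV_NUM_VREGS / 4;
}

/* Number of register groups that fit in budget_regs registers when each
   group spans lmul registers (integer LMUL only; fractional LMUL is not
   modelled in this sketch). */
static size_t groups_in_budget(size_t budget_regs, size_t lmul)
{
    return budget_regs / lmul;
}

/* Elements held by one register group: VLMAX = VLEN / SEW * LMUL. */
static size_t vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul)
{
    return vlen_bits / sew_bits * lmul;
}
```

So a quarter of the register file is 8 registers, which gives 8 usable groups at m1 but only 4 at m2. Assuming, purely as an example, VLEN=128 and 32-bit elements, one group holds 4 elements at m1 and 8 at m2. The callee has to get by with the m2 figure of 4 groups if the same code is meant to serve both settings.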
sin.S