Richard Biener <richard.guent...@gmail.com> writes:
> On Wed, Oct 23, 2019 at 2:12 PM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>>
>> Richard Biener <richard.guent...@gmail.com> writes:
>> > On Wed, Oct 23, 2019 at 1:51 PM Richard Sandiford
>> > <richard.sandif...@arm.com> wrote:
>> >>
>> >> Richard Biener <richard.guent...@gmail.com> writes:
>> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
>> >> > <richard.sandif...@arm.com> wrote:
>> >> >>
>> >> >> This patch is the first of a series that tries to remove two
>> >> >> assumptions:
>> >> >>
>> >> >> (1) that all vectors involved in vectorisation must be the same size
>> >> >>
>> >> >> (2) that there is only one vector mode for a given element mode and
>> >> >> number of elements
>> >> >>
>> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
>> >> >> that require the number of elements to stay the same. E.g. if we're
>> >> >> vectorising code that operates on narrow and wide elements, and the
>> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
>> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
>> >> >> for the wide elements.
>> >> >>
>> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to produce
>> >> >> fixed-length code for SVE. It also allows unpacked/half-size SVE
>> >> >> vectors to work with -msve-vector-bits=256.
>> >> >>
>> >> >> The patch adds a new hook that targets can use to control how we
>> >> >> move from one vector mode to another. The hook takes a starting vector
>> >> >> mode, a new element mode, and (optionally) a new number of elements.
>> >> >> The flexibility needed for (1) comes in when the number of elements
>> >> >> isn't specified.
>> >> >>
>> >> >> All callers in this patch specify the number of elements, but a later
>> >> >> vectoriser patch doesn't. I won't be posting the vectoriser patch
>> >> >> for a few days, hence the RFC/A tag.
>> >> >>
>> >> >> Tested individually on aarch64-linux-gnu and as a series on
>> >> >> x86_64-linux-gnu. OK to install? Or if not yet, does the idea
>> >> >> look OK?
>> >> >
>> >> > In isolation the idea looks good but maybe a bit limited? I see
>> >> > how it works for the same-size case but if you consider x86
>> >> > where we have SSE, AVX256 and AVX512 what would it return
>> >> > for related_vector_mode (V4SImode, SImode, 0)? Or is this
>> >> > kind of query not intended (where the component modes match
>> >> > but nunits is zero)?
>> >>
>> >> In that case we'd normally get V4SImode back. It's an allowed
>> >> combination, but not very useful.
>> >>
>> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then? Or is
>> >> > it just used to stay in the same register set for different component
>> >> > modes?
>> >>
>> >> Yeah, the idea is to use the original vector mode as essentially
>> >> a base architecture.
>> >>
>> >> The follow-on patches replace vec_info::vector_size with
>> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
>> >> with targetm.vectorize.autovectorize_vector_modes. These are the
>> >> starting modes that would be passed to the hook in the nunits==0 case.
>> >>
>> >> E.g. for Advanced SIMD on AArch64, it would make more sense for
>> >> related_mode (V4HImode, SImode, 0) to be V4SImode rather than V2SImode.
>> >> I think things would work in a similar way for the x86_64 vector archs.
>> >>
>> >> For SVE we'd add both VNx16QImode (the SVE mode) and V16QImode (the
>> >> Advanced SIMD mode) to autovectorize_vector_modes, even though they
>> >> happen to be the same size for 128-bit SVE. We can then compare
>> >> 128-bit SVE with 128-bit Advanced SIMD, with related_mode ensuring
>> >> that we consistently use all-SVE modes or all-Advanced SIMD modes
>> >> for each attempt.
>> >>
>> >> The plan for SVE is to add 4(!) modes to autovectorize_vector_modes:
>> >>
>> >> - VNx16QImode (full vector)
>> >> - VNx8QImode (half vector)
>> >> - VNx4QImode (quarter vector)
>> >> - VNx2QImode (eighth vector)
>> >>
>> >> and then pick the one with the lowest cost. related_mode would
>> >> keep the number of units the same for nunits==0, within the limit
>> >> of the vector size. E.g.:
>> >>
>> >> - related_mode (VNx16QImode, HImode, 0) == VNx8HImode (full vector)
>> >> - related_mode (VNx8QImode, HImode, 0) == VNx8HImode (full vector)
>> >> - related_mode (VNx4QImode, HImode, 0) == VNx4HImode (half vector)
>> >> - related_mode (VNx2QImode, HImode, 0) == VNx2HImode (quarter vector)
>> >>
>> >> and:
>> >>
>> >> - related_mode (VNx16QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx8QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx4QImode, SImode, 0) == VNx4SImode (full vector)
>> >> - related_mode (VNx2QImode, SImode, 0) == VNx2SImode (half vector)
>> >>
>> >> So when operating on multiple element sizes, the tradeoff is between
>> >> trying to make full use of the vector size (higher base nunits) vs.
>> >> trying to remove packs and unpacks between multiple vector copies
>> >> (lower base nunits). The latter is useful because extending within
>> >> a vector is an in-lane rather than cross-lane operation and truncating
>> >> within a vector is a no-op.
>> >>
>> >> With a couple of tweaks, we seem to do a good job of guessing which
>> >> version has the lowest cost, at least for the simple cases I've tried
>> >> so far.
>> >>
>> >> Obviously there's going to be a bit of a compile-time cost
>> >> for SVE targets, but I think it's worth paying for.
>> >
>> > I would guess that immediate benefit could be seen with
>> > basic-block vectorization which simply fails when conversions
>> > are involved. x86_64 should now always support V4SImode
>> > and V2SImode so eventually a testcase can be crafted for that
>> > as well.
>>
>> I'd hoped so too, but the problem is that if the cost saving is good
>> enough, BB vectorisation simply stops at the conversion frontiers and
>> vectorises the rest, rather than considering other vector mode
>> combinations that might be able to do more.
>
> Sure, but when SLP build fails because it thinks it needs to unroll
> it could ask for a vector type with the same number of lanes.
Do you mean for loop or BB vectorisation? For loop vectorisation the
outer loop iterating over vector sizes/modes works fine: if we need to
unroll beyond the maximum VF, we'll just retry vectorisation with a
different vector size/mode combination, with the narrowest element
having smaller vectors.

But yeah, I guess if get_vectype_for_scalar_type returns a vector with
too many units for BB vectorisation, we could try asking for a
different type with the "right" number of units, rather than failing
and falling back to the next iteration of the outer loop. I'll give
that a go.

> (that's probably what we should do from the start - get vector types
> with the same number of lanes in BB vectorization).

It's still useful to have different numbers of lanes for both loop and
BB vectorisation. E.g. if we're applying SLP to an operation on 8 ints
and 8 shorts, it's still better to mix V4SI and V8HI for 128-bit
vector archs.

> It's when you introduce multiple sizes that the outer loop over all
> sizes comparing costs becomes somewhat obsolete... this outer loop
> should then instead compare different VFs (which also means possible
> extra unrolling beyond the vector size is needed?).

This is effectively what we're doing for the SVE arrangement above.
But replacing targetm.vectorize.autovectorize_vector_sizes with
targetm.vectorize.autovectorize_vector_modes means that we can also
try multiple vector archs with the same VF (e.g. 128-bit SVE vs.
128-bit Advanced SIMD).

Thanks,
Richard
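
For what it's worth, here's a throwaway, self-contained model of the
clamping rule described above: keep the base number of units, but cap
it so the result still fits in a full vector. This is obviously not
the GCC implementation; the function name is made up and the full
vector size is hard-coded to 16 bytes on the assumption of 128-bit SVE.

// Toy model of the nunits rule, not GCC code.  The full vector size
// is assumed to be 16 bytes (128-bit SVE).
#include <algorithm>
#include <cstdio>
#include <initializer_list>

static unsigned
related_nunits (unsigned base_nunits, unsigned elem_bytes,
                unsigned full_vector_bytes = 16)
{
  // Keep the base number of units, clamped so that
  // nunits * elem_bytes <= full_vector_bytes.
  return std::min (base_nunits, full_vector_bytes / elem_bytes);
}

int
main ()
{
  // Mirrors the HImode list: VNx16QI/VNx8QI/VNx4QI/VNx2QI bases.
  for (unsigned base : { 16u, 8u, 4u, 2u })
    std::printf ("VNx%uQI + HI -> VNx%uHI\n",
                 base, related_nunits (base, 2));
  // And the SImode list.
  for (unsigned base : { 16u, 8u, 4u, 2u })
    std::printf ("VNx%uQI + SI -> VNx%uSI\n",
                 base, related_nunits (base, 4));
  return 0;
}

Compiled as C++11, it prints the same mode pairs as the two lists
quoted above.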