Tamar Christina <[email protected]> writes:
>> -----Original Message-----
>> From: Richard Sandiford <[email protected]>
>> Sent: Thursday, February 1, 2024 4:42 PM
>> To: Tamar Christina <[email protected]>
>> Cc: Andrew Pinski <[email protected]>; [email protected];
>> nd <[email protected]>; Richard Earnshaw <[email protected]>;
>> Marcus Shawcroft <[email protected]>; Kyrylo Tkachov <[email protected]>
>> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
>>
>> Tamar Christina <[email protected]> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <[email protected]>
>> >> Sent: Thursday, February 1, 2024 2:24 PM
>> >> To: Andrew Pinski <[email protected]>
>> >> Cc: Tamar Christina <[email protected]>; [email protected];
>> >> nd <[email protected]>; Richard Earnshaw <[email protected]>;
>> >> Marcus Shawcroft <[email protected]>; Kyrylo Tkachov <[email protected]>
>> >> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
>> >>
>> >> Andrew Pinski <[email protected]> writes:
>> >> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina <[email protected]> wrote:
>> >> >>
>> >> >> Hi All,
>> >> >>
>> >> >> In the vget_set_lane_1.c test the following entries now generate a
>> >> >> zip1 instead of an INS:
>> >> >>
>> >> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>> >> >> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
>> >> >> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
>> >> >>
>> >> >> This is because the non-Q variant for indices 0 and 1 is just
>> >> >> shuffling values.  There is no perf difference between SIMD-to-SIMD
>> >> >> INS and ZIP, so just update the test file.
>> >> > Hmm, is this true on all cores?  I suspect there is a core out there
>> >> > where INS is implemented with a much lower latency than ZIP.
>> >> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
>> >> > while ZIP is 6 cycles (3/7 for the Q versions).
>> >> > Now I don't have any vested interest in that core any more, but I
>> >> > just wanted to point out that it is not exactly true for all cores.
>> >>
>> >> Thanks for the pointer.  In that case, perhaps we should prefer
>> >> aarch64_evpc_ins over aarch64_evpc_zip in aarch64_expand_vec_perm_const_1?
>> >> That's enough to fix this failure, but it'll probably require other
>> >> tests to be adjusted...
>> >
>> > I think that, given ThunderX is a 10-year-old micro-architecture with
>> > several cases where often-used instructions have very high latencies,
>> > generic codegen should not be blocked from progressing because of it.
>> >
>> > We use ZIPs in many things, and if ThunderX codegen is really of that
>> > much importance then I think the old codegen should be gated behind
>> > -mcpu=thunderx rather than preventing generic changes.
>>
>> But you said there was no perf difference between INS and ZIP, so it
>> sounds like for all known cases, using INS rather than ZIP is either
>> neutral or better.
>>
>> There's also the possible secondary benefit that the INS patterns use
>> standard RTL operations whereas the ZIP patterns use unspecs.
>>
>> Keeping ZIP seems OK if there's a specific reason to prefer it over INS
>> for more modern cores, though.
>
> Ok, that's a fair point.  Doing some due diligence, the Neoverse-E1 and
> Cortex-A65 SWoGs seem to imply that their ZIPs have better throughput
> than their INSs.
> However the entries are inconsistent and I can't measure the difference,
> so I believe this to be a documentation bug.
>
> That said, trying INS before ZIP seems to show one issue: preferring INS
> degrades the code in cases where we insert the top bits of the first
> parameter into the bottom of the second parameter and return the result.
>
> ZIP, being a three-operand instruction, allows us to put the result into
> the final destination register with one operation, whereas INS requires
> an extra fmov:
>
> foo_uzp1_s32:
>         ins     v0.s[1], v1.s[0]
>         fmov    d0, d0
>         ret
> foo_uzp2_s32:
>         ins     v1.s[0], v0.s[1]
>         fmov    d0, d1
>         ret
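For reference, a self-contained C sketch of the kind of even/odd-lane
permutes that produce the code above (hypothetical source, not necessarily
the exact functions behind the quoted output; __builtin_shuffle is enough
to reach the aarch64_expand_vec_perm_const_1 path discussed earlier):

#include <arm_neon.h>

/* Even lanes of the concatenation {a, b}: returns { a[0], b[0] }.  */
int32x2_t
foo_uzp1_s32 (int32x2_t a, int32x2_t b)
{
  return __builtin_shuffle (a, b, (int32x2_t) { 0, 2 });
}

/* Odd lanes of the concatenation {a, b}: returns { a[1], b[1] }.  */
int32x2_t
foo_uzp2_s32 (int32x2_t a, int32x2_t b)
{
  return __builtin_shuffle (a, b, (int32x2_t) { 1, 3 });
}

Because zip1/zip2 take the destination as a separate operand from the two
sources, each of these permutes can be done in a single instruction with
the result placed directly in v0, whereas INS has to modify one of its
inputs in place, which is where the extra fmov comes from.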
Ah, yeah, I should have thought about that.  In that case, the original
patch is OK, thanks.

Richard
