Re: [Committed] Tweak new gcc.target/i386/pr107548-1.c for -march=cascadelake.
On Dec 24 2022, Roger Sayle wrote:

> +/* { dg-final { scan-assembler-times "v?paddd" 6 } } */

Since this is not anchored, the v? pattern is redundant.

> +/* { dg-final { scan-assembler-times "v?paddq" 2 } } */
> +/* { dg-final { scan-assembler "v?psrldq" } } */

Likewise.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
[committed] wwwdocs: gcc-12: Spelling fixes
A case of British to American English, too, for consistency.

Gerald
---
 htdocs/gcc-12/changes.html | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/htdocs/gcc-12/changes.html b/htdocs/gcc-12/changes.html
index 30fa4d6e..b3775f82 100644
--- a/htdocs/gcc-12/changes.html
+++ b/htdocs/gcc-12/changes.html
@@ -252,7 +252,7 @@ You may also want to check out our
 will "slide" to the correct lower bound of the subtype.
-Generalized Object.Operand notation. The follwing
+Generalized Object.Operand notation. The following
 code is now valid V.Add_Element(42);, with V
 being a vector, for example.
 Additional when constructs. Keywords
@@ -551,13 +551,13 @@ function Multiply (S1, S2 : Sign) return Sign is
 -fcheck=, enables or disables the code generation of
 specific run-time contract checks.
- -fcheckaction=, controls the run-time behaviour on an
+ -fcheckaction=, controls the run-time behavior on an
 assert, array bounds check, or final switch contract failure.
 The default is -fcheckaction=throw.
 -fdump-c++-spec=, dumps all compiled extern(C++)
 declarations as C++ code to the given file.
- The supplimentary option -fdump-c++-spec-verbose turns on
+ The supplementary option -fdump-c++-spec-verbose turns on
 emission of comments for ignored declarations in the generated spec.
 -fextern-std=, controls which C++ standard
@@ -611,7 +611,7 @@ function Multiply (S1, S2 : Sign) return Sign is
 option now selects the IEEE 128-bit floating point format for
 REAL(KIND=16). R16_IBM and R16_IEEE have been added to the
--fconvert option, the CONVERT specifyer of
+-fconvert option, the CONVERT specifier of
 the OPEN statement and the GFORTRAN_CONVERT_UNIT
 environment variable.
@@ -625,7 +625,7 @@ function Multiply (S1, S2 : Sign) return Sign is
 The libgccjit API gained 30 new entry points:
- 17 new "reflection" entrypoints for querying functions and types (https://gcc.gnu.org/onlinedocs/gcc-12.1.0/jit/topics/compatibility.html#libgccjit-abi-16";>LIBGCCJIT_ABI_16)
+ 17 new "reflection" entry points for querying functions and types (https://gcc.gnu.org/onlinedocs/gcc-12.1.0/jit/topics/compatibility.html#libgccjit-abi-16";>LIBGCCJIT_ABI_16)
 https://gcc.gnu.org/onlinedocs/gcc-12.1.0/jit/topics/expressions.html#c.gcc_jit_lvalue_set_tls_model";>gcc_jit_lvalue_set_tls_model
@@ -638,7 +638,7 @@ function Multiply (S1, S2 : Sign) return Sign is
 https://gcc.gnu.org/onlinedocs/gcc-12.1.0/gcc/Common-Variable-Attributes.html#index-section-variable-attribute";>__attribute__((section(".section")))
 (https://gcc.gnu.org/onlinedocs/gcc-12.1.0/jit/topics/compatibility.html#libgccjit-abi-18";>LIBGCCJIT_ABI_18)
- 4 new entrypoints for initializing global variables and creating
+ 4 new entry points for initializing global variables and creating
 constructors for rvalues
 (https://gcc.gnu.org/onlinedocs/gcc-12.1.0/jit/topics/compatibility.html#libgccjit-abi-19";>LIBGCCJIT_ABI_19)
@@ -659,7 +659,7 @@ function Multiply (S1, S2 : Sign) return Sign is
 (https://gcc.gnu.org/onlinedocs/gcc-12.1.0/jit/topics/compatibility.html#libgccjit-abi-23";>LIBGCCJIT_ABI_23)
- 2 new entrypoints for setting the alignment of a variable
+ 2 new entry points for setting the alignment of a variable
 (https://gcc.gnu.org/onlinedocs/gcc-12.1.0/jit/topics/compatibility.html#libgccjit-abi-24";>LIBGCCJIT_ABI_24)
@@ -984,7 +984,7 @@ function Multiply (S1, S2 : Sign) return Sign is
 "tainted" values, for consistency with the new warnings.
 A new https://gcc.gnu.org/onlinedocs/gcc-12.1.0/gcc/Common-Function-Attributes.html#index-tainted_005fargs-function-attribute";>__attribute__ ((tainted_args)) has been
-added to the C and C++ frontends, usable on functions, and on
+added to the C and C++ front ends, usable on functions, and on
 function pointer callback fields in structs. The analyzer's
 taint mode will treat all parameters and buffers pointed to by
 parameters of such functions as being attacker-controlled, such as for
-- 
2.38.1
RE: [x86 PATCH] Use movss/movsd to implement V4SI/V2DI VEC_PERM.
Hi Uros,
Many thanks and merry Christmas.  Here's the version as committed,
implemented using your preferred idiom with mode iterators for
movss/movsd.  Thanks again.

2022-12-25  Roger Sayle
            Uroš Bizjak

gcc/ChangeLog
        * config/i386/i386-builtin.def (__builtin_ia32_movss): Update
        CODE_FOR_sse_movss to CODE_FOR_sse_movss_v4sf.
        (__builtin_ia32_movsd): Likewise, update CODE_FOR_sse2_movsd
        to CODE_FOR_sse2_movsd_v2df.
        * config/i386/i386-expand.cc (split_convert_uns_si_sse): Update
        gen_sse_movss call to gen_sse_movss_v4sf, and gen_sse2_movsd
        call to gen_sse2_movsd_v2df.
        (expand_vec_perm_movs): Also allow V4SImode with TARGET_SSE
        and V2DImode with TARGET_SSE2.
        * config/i386/sse.md (avx512fp16_fcmaddcsh_v8hf_mask3): Update
        gen_sse_movss call to gen_sse_movss_v4sf.
        (avx512fp16_fmaddcsh_v8hf_mask3): Likewise.
        (sse_movss_): Renamed from sse_movss using VI4F_128 mode
        iterator to handle both V4SF and V4SI.
        (sse2_movsd_): Likewise, renamed from sse2_movsd using
        VI8F_128 mode iterator to handle both V2DF and V2DI.

gcc/testsuite/ChangeLog
        * gcc.target/i386/sse-movss-4.c: New test case.
        * gcc.target/i386/sse2-movsd-3.c: New test case.

Roger
--

> -----Original Message-----
> From: Uros Bizjak
> Sent: 23 December 2022 17:18
> To: Roger Sayle
> Cc: GCC Patches
> Subject: Re: [x86 PATCH] Use movss/movsd to implement V4SI/V2DI VEC_PERM.
>
> On Fri, Dec 23, 2022 at 5:46 PM Roger Sayle
> wrote:
> >
> > This patch tweaks the x86 backend to use the movss and movsd
> > instructions to perform some vector permutations on integer vectors
> > (V4SI and V2DI) in the same way they are used for floating point
> > vectors (V4SF and V2DF).
> >
> > As a motivating example, consider:
> >
> > typedef unsigned int v4si __attribute__((vector_size(16)));
> > typedef float v4sf __attribute__((vector_size(16)));
> > v4si foo(v4si x,v4si y) { return (v4si){y[0],x[1],x[2],x[3]}; }
> > v4sf bar(v4sf x,v4sf y) { return (v4sf){y[0],x[1],x[2],x[3]}; }
> >
> > which is currently compiled with -O2 to:
> >
> > foo:    movdqa  %xmm0, %xmm2
> >         shufps  $80, %xmm0, %xmm1
> >         movdqa  %xmm1, %xmm0
> >         shufps  $232, %xmm2, %xmm0
> >         ret
> >
> > bar:    movss   %xmm1, %xmm0
> >         ret
> >
> > with this patch both functions compile to the same form.
> > Likewise for the V2DI case:
> >
> > typedef unsigned long v2di __attribute__((vector_size(16)));
> > typedef double v2df __attribute__((vector_size(16)));
> >
> > v2di foo(v2di x,v2di y) { return (v2di){y[0],x[1]}; }
> > v2df bar(v2df x,v2df y) { return (v2df){y[0],x[1]}; }
> >
> > which currently generates:
> >
> > foo:    shufpd  $2, %xmm0, %xmm1
> >         movdqa  %xmm1, %xmm0
> >         ret
> >
> > bar:    movsd   %xmm1, %xmm0
> >         ret
> >
> > There are two possible approaches to adding integer vector forms of
> > the sse_movss and sse2_movsd instructions.  One is to use a mode
> > iterator (VI4F_128 or VI8F_128) on the existing define_insn patterns,
> > but this requires renaming the patterns to sse_movss_ which then
> > requires changes to i386-builtins.def and through-out the backend to
> > reflect the new naming of gen_sse_movss_v4sf.  The alternate approach
> > (taken here) is to simply clone and specialize the existing patterns.
> > Uros, if you'd prefer the first approach, I'm happy to
> > make/test/commit those changes.
>
> I would really prefer the variant with VI4F_128/VI8F_128, these two
> iterators were introduced specifically for this case (see e.g.
> sse_shufps_ and sse2_shufpd_.  The internal name of the pattern is
> fairly irrelevant and a trivial search and replace operation can
> replace the grand total of 6 occurrences ...)
>
> Also, changing sse2_movsd to use VI8F_128 mode iterator would enable
> more alternatives besides movsd, so we give combine pass some more
> opportunities with memory operands.
>
> So, the patch with those two iterators is pre-approved.
>
> Uros.
>
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32},
> > with no new failures.  Ok for mainline?
> >
> > 2022-12-23  Roger Sayle
> >
> > gcc/ChangeLog
> >         * config/i386/i386-expand.cc (expand_vec_perm_movs): Also allow
> >         V4SImode with TARGET_SSE and V2DImode with TARGET_SSE2.
> >         * config/i386/sse.md (sse_movss_v4si): New define_insn, a V4SI
> >         specialization of sse_movss.
> >         (sse2_movsd_v2di): Likewise, a V2DI specialization of sse2_movsd.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/sse-movss-4.c: New test case.
> >         * gcc.target/i386/sse2-movsd-3.c: New test case.
> >
> > Thanks in advance,
> > Roger
> > --

diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index d85b175..0d1fc34 100644
-
Re: [PATCH V2] Disable sched1 in functions that call setjmp
Hi!

On Sat, Dec 24, 2022 at 10:58:41AM +0100, Jose E. Marchesi via Gcc-patches wrote:
> Allright, so we have two short-term alternatives for at least remove the
> possibility that GCC generates wrong code for valid C when the scheduler
> is turned on:
>
> a) To disable sched1 in functions that call setjmp.

That is a heavy hammer.

> b) To change deps_analyze_insn so instructions are not moved across
>    function calls before register allocation (!reload_completed).

And this is way heavier still.  OTOH, it is possible b) actually improves
code (improves performance) in general (and maybe even without such a
reload_completed check).

> Both patches fix our particular use cases and are regression free in
> aarch64-linux-gnu.

Did you also check for performance regressions?


Segher
Re: [PATCH] libgfortran: Replace mutex with rwlock
On Wed, Dec 21, 2022 at 07:27:11PM -0500, Lipeng Zhu via Fortran wrote:
> This patch try to introduce the rwlock and split the read/write to
> unit_root tree and unit_cache with rwlock instead of the mutex to
> increase CPU efficiency. In the get_gfc_unit function, the percentage
> to step into the insert_unit function is around 30%, in most instances,
> we can get the unit in the phase of reading the unit_cache or unit_root
> tree. So split the read/write phase by rwlock would be an approach to
> make it more parallel.
>
> BTW, the IPC metrics can increase from 0.25 to 2.2 in the Intel
> SRP server with 220 cores. The benchmark we used is
> https://github.com/rwesson/NEAT
>

The patch fails bootstrap on x86_64-*-freebsd.

gmake[6]: Entering directory '/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/src/c++17'
/bin/sh ../../libtool --tag CXX --tag disable-shared --mode=compile /home/kargl/gcc/obj/./gcc/xgcc -shared-libgcc -B/home/kargl/gcc/obj/./gcc -nostdinc++ -L/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/src -L/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/src/.libs -L/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/libsupc++/.libs -B/home/kargl/work/x86_64-unknown-freebsd14.0/bin/ -B/home/kargl/work/x86_64-unknown-freebsd14.0/lib/ -isystem /home/kargl/work/x86_64-unknown-freebsd14.0/include -isystem /home/kargl/work/x86_64-unknown-freebsd14.0/sys-include -fno-checking -I/home/kargl/gcc/gcc/libstdc++-v3/../libgcc -I/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/x86_64-unknown-freebsd14.0 -I/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include -I/home/kargl/gcc/gcc/libstdc++-v3/libsupc++ -std=gnu++17 -nostdinc++ -prefer-pic -D_GLIBCXX_SHARED -fno-implicit-templates -Wall -Wextra -Wwrite-strings -Wcast-qual -Wabi=2 -fdiagnostics-show-location=once -ffunction-sections -fdata-sections -frandom-seed=floating_from_chars.lo -fimplicit-templates -g -O2 -c -o floating_from_chars.lo ../../../../../gcc/libstdc++-v3/src/c++17/floating_from_chars.cc
libtool: compile: /home/kargl/gcc/obj/./gcc/xgcc -shared-libgcc -B/home/kargl/gcc/obj/./gcc -nostdinc++ -L/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/src -L/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/src/.libs -L/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/libsupc++/.libs -B/home/kargl/work/x86_64-unknown-freebsd14.0/bin/ -B/home/kargl/work/x86_64-unknown-freebsd14.0/lib/ -isystem /home/kargl/work/x86_64-unknown-freebsd14.0/include -isystem /home/kargl/work/x86_64-unknown-freebsd14.0/sys-include -fno-checking -I/home/kargl/gcc/gcc/libstdc++-v3/../libgcc -I/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/x86_64-unknown-freebsd14.0 -I/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include -I/home/kargl/gcc/gcc/libstdc++-v3/libsupc++ -std=gnu++17 -nostdinc++ -D_GLIBCXX_SHARED -fno-implicit-templates -Wall -Wextra -Wwrite-strings -Wcast-qual -Wabi=2 -fdiagnostics-show-location=once -ffunction-sections -fdata-sections -frandom-seed=floating_from_chars.lo -fimplicit-templates -g -O2 -c ../../../../../gcc/libstdc++-v3/src/c++17/floating_from_chars.cc -fPIC -DPIC -D_GLIBCXX_SHARED -o floating_from_chars.o
In file included from /home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/memory_resource:40,
                 from ../../../../../gcc/libstdc++-v3/src/c++17/floating_from_chars.cc:37:
/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/shared_mutex: In function 'int std::__glibcxx_rwlock_rdlock(pthread_rwlock**)':
/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/shared_mutex:80:3: error: call of overloaded '__gthrw_pthread_rwlock_rdlock(pthread_rwlock**&)' is ambiguous
   80 |   _GLIBCXX_GTHRW(rwlock_rdlock)
      |   ^
In file included from /home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/x86_64-unknown-freebsd14.0/bits/gthr.h:148,
                 from /home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/bits/std_mutex.h:41,
                 from /home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/shared_mutex:41:
/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/shared_mutex:80:3: note: candidate: 'int std::__gthrw_pthread_rwlock_rdlock(pthread_rwlock**)'
   80 |   _GLIBCXX_GTHRW(rwlock_rdlock)
      |   ^~
/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/x86_64-unknown-freebsd14.0/bits/gthr-default.h:140:1: note: candidate: 'int __gthrw_pthread_rwlock_rdlock(pthread_rwlock**)'
  140 | __gthrw(pthread_rwlock_rdlock)
      | ^~~
/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/shared_mutex: In function 'int std::__glibcxx_rwlock_tryrdlock(pthread_rwlock**)':
/home/kargl/gcc/obj/x86_64-unknown-freebsd14.0/libstdc++-v3/include/shared_mutex:81:3: error: call of overloaded '__gthrw_pt
Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector
On Tue, 6 Dec 2022 at 07:01, Prathamesh Kulkarni wrote:
>
> On Mon, 5 Dec 2022 at 16:50, Richard Sandiford
> wrote:
> >
> > Richard Sandiford via Gcc-patches writes:
> > > Prathamesh Kulkarni writes:
> > >> Hi,
> > >> For the following test-case:
> > >>
> > >> int16x8_t foo(int16_t x, int16_t y)
> > >> {
> > >>   return (int16x8_t) { x, y, x, y, x, y, x, y };
> > >> }
> > >>
> > >> Code gen at -O3:
> > >> foo:
> > >>      dup    v0.8h, w0
> > >>      ins    v0.h[1], w1
> > >>      ins    v0.h[3], w1
> > >>      ins    v0.h[5], w1
> > >>      ins    v0.h[7], w1
> > >>      ret
> > >>
> > >> For 16 elements, it results in 8 ins instructions which might not be
> > >> optimal perhaps.
> > >> I guess, the above code-gen would be equivalent to the following ?
> > >>      dup  v0.8h, w0
> > >>      dup  v1.8h, w1
> > >>      zip1 v0.8h, v0.8h, v1.8h
> > >>
> > >> I have attached patch to do the same, if number of elements >= 8,
> > >> which should be possibly better compared to current code-gen ?
> > >> Patch passes bootstrap+test on aarch64-linux-gnu.
> > >> Does the patch look OK ?
> > >>
> > >> Thanks,
> > >> Prathamesh
> > >>
> > >> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > >> index c91df6f5006..e5dea70e363 100644
> > >> --- a/gcc/config/aarch64/aarch64.cc
> > >> +++ b/gcc/config/aarch64/aarch64.cc
> > >> @@ -22028,6 +22028,39 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> > >>        return;
> > >>      }
> > >>
> > >> +  /* Check for interleaving case.
> > >> +     For eg if initializer is (int16x8_t) {x, y, x, y, x, y, x, y}.
> > >> +     Generate following code:
> > >> +     dup v0.h, x
> > >> +     dup v1.h, y
> > >> +     zip1 v0.h, v0.h, v1.h
> > >> +     for "large enough" initializer.  */
> > >> +
> > >> +  if (n_elts >= 8)
> > >> +    {
> > >> +      int i;
> > >> +      for (i = 2; i < n_elts; i++)
> > >> +        if (!rtx_equal_p (XVECEXP (vals, 0, i), XVECEXP (vals, 0, i % 2)))
> > >> +          break;
> > >> +
> > >> +      if (i == n_elts)
> > >> +        {
> > >> +          machine_mode mode = GET_MODE (target);
> > >> +          rtx dest[2];
> > >> +
> > >> +          for (int i = 0; i < 2; i++)
> > >> +            {
> > >> +              rtx x = copy_to_mode_reg (GET_MODE_INNER (mode), XVECEXP (vals, 0, i));
> > >
> > > Formatting nit: long line.
> > >
> > >> +              dest[i] = gen_reg_rtx (mode);
> > >> +              aarch64_emit_move (dest[i], gen_vec_duplicate (mode, x));
> > >> +            }
> > >
> > > This could probably be written:
> > >
> > >     for (int i = 0; i < 2; i++)
> > >       {
> > >         rtx x = expand_vector_broadcast (mode, XVECEXP (vals, 0, i));
> > >         dest[i] = force_reg (GET_MODE_INNER (mode), x);
> >
> > Oops, I meant "mode" rather than "GET_MODE_INNER (mode)", sorry.
>
> Thanks, I have pushed the change in
> 769370f3e2e04823c8a621d8ffa756dd83ebf21e after running
> bootstrap+test on aarch64-linux-gnu.

Hi Richard,
I have attached a patch that extends the transform when one half is a
dup and the other is a set of constants.
For eg:
int8x16_t f(int8_t x)
{
  return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, x, 5, x, 6, x, 7, x, 8 };
}

code-gen trunk:
f:
        adrp    x1, .LC0
        ldr     q0, [x1, #:lo12:.LC0]
        ins     v0.b[0], w0
        ins     v0.b[2], w0
        ins     v0.b[4], w0
        ins     v0.b[6], w0
        ins     v0.b[8], w0
        ins     v0.b[10], w0
        ins     v0.b[12], w0
        ins     v0.b[14], w0
        ret

code-gen with patch:
f:
        dup     v0.16b, w0
        adrp    x0, .LC0
        ldr     q1, [x0, #:lo12:.LC0]
        zip1    v0.16b, v0.16b, v1.16b
        ret

Bootstrapped+tested on aarch64-linux-gnu.
Does it look OK ?

Thanks,
Prathamesh

> Thanks,
> Prathamesh
>
> > >     }
> > >
> > > which avoids forcing constant elements into a register before the
> > > duplication.
> > > OK with that change if it works.
> > >
> > > Thanks,
> > > Richard
> > >
> > >> +
> > >> +          rtvec v = gen_rtvec (2, dest[0], dest[1]);
> > >> +          emit_set_insn (target, gen_rtx_UNSPEC (mode, v, UNSPEC_ZIP1));
> > >> +          return;
> > >> +        }
> > >> +    }
> > >> +
> > >>    enum insn_code icode = optab_handler (vec_set_optab, mode);
> > >>    gcc_assert (icode != CODE_FOR_nothing);
> > >>
> > >> diff --git a/gcc/testsuite/gcc.target/aarch64/interleave-init-1.c b/gcc/testsuite/gcc.target/aarch64/interleave-init-1.c
> > >> new file mode 100644
> > >> index 000..ee775048589
> > >> --- /dev/null
> > >> +++ b/gcc/testsuite/gcc.target/aarch64/interleave-init-1.c
> > >> @@ -0,0 +1,37 @@
> > >> +/* { dg-do compile } */
> > >> +/* { dg-options "-O3" } */
> > >> +/* { dg-final { check-function-bodies "**" "" "" } } */
> > >> +
> > >> +#include
> > >> +
> > >> +/*
> > >> +** foo:
> > >> +**  ...
> > >> +**  dup v[0-9]+\.8h, w[0-9]+
> > >> +**  dup v[0-9]+\.8h, w[0-9]+
> > >> +**  zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h
> > >> +**  ...
> > >> +**  ret
>
Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner
On Tue, 13 Dec 2022 at 11:35, Prathamesh Kulkarni wrote:
>
> On Tue, 6 Dec 2022 at 21:00, Richard Sandiford
> wrote:
> >
> > Prathamesh Kulkarni via Gcc-patches writes:
> > > On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
> > > wrote:
> > >>
> > >> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> > >> wrote:
> > >> >
> > >> > Prathamesh Kulkarni writes:
> > >> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > >> > > wrote:
> > >> > >>
> > >> > >> Sorry for the slow response.  I wanted to find some time to think
> > >> > >> about this a bit more.
> > >> > >>
> > >> > >> Prathamesh Kulkarni writes:
> > >> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > >> > >> > wrote:
> > >> > >> >>
> > >> > >> >> Richard Sandiford via Gcc-patches writes:
> > >> > >> >> > Prathamesh Kulkarni writes:
> > >> > >> >> >> Sorry to ask a silly question but in which case shall we
> > >> > >> >> >> select 2nd vector ?
> > >> > >> >> >> For num_poly_int_coeffs == 2,
> > >> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > >> > >> >> >> If a1/trunc n1 succeeds,
> > >> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > >> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> > >> > >> >> >
> > >> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a
> > >> > >> >> > constant.
> > >> > >> >> >
> > >> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the
> > >> > >> >> > selector:
> > >> > >> >> >
> > >> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > >> > >> >>
> > >> > >> >> Sorry, should have been:
> > >> > >> >>
> > >> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > >> > >> >
> > >> > >> > Hi Richard,
> > >> > >> > Thanks for the clarifications, and sorry for late reply.
> > >> > >> > I have attached POC patch that tries to implement the above
> > >> > >> > approach.
> > >> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu
> > >> > >> > for VLS vectors.
> > >> > >> >
> > >> > >> > For VLA vectors, I have only done limited testing so far.
> > >> > >> > It seems to pass couple of tests written in the patch for
> > >> > >> > nelts_per_pattern == 3,
> > >> > >> > and folds the following svld1rq test:
> > >> > >> > int32x4_t v = {1, 2, 3, 4};
> > >> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > >> > >> > into:
> > >> > >> > return {1, 2, 3, 4, ...};
> > >> > >> > I will try to bootstrap+test it on SVE machine to test further
> > >> > >> > for VLA folding.
> > >> > >> >
> > >> > >> > I have a couple of questions:
> > >> > >> > 1] When mask selects elements from same vector but from different
> > >> > >> > patterns:
> > >> > >> > For eg:
> > >> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > >> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > >> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> > >> > >> > All have npatterns = 2, nelts_per_pattern = 3.
> > >> > >> >
> > >> > >> > With above mask,
> > >> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > >> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2],
> > >> > >> > ie {1, 11, 2, ...}
> > >> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs
> > >> > >> > to different pattern in arg0.
> > >> > >> > The result is:
> > >> > >> > res = {1, 1, 1, 11, 1, 2, ...}
> > >> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > >> > >> > with a0 = 1, a1 = 11, S = -9.
> > >> > >> > Is that expected tho ? It seems to create a new encoding which
> > >> > >> > wasn't present in the input vector. For instance, the next elem
> > >> > >> > in sequence would be -7,
> > >> > >> > which is not present originally in arg0.
> > >> > >>
> > >> > >> Yeah, you're right, sorry.  Going back to:
> > >> > >>
> > >> > >> (2) The explicit encoding can be used to produce a sequence of
> > >> > >> N*Ex*Px elements for any integer N.  This extended sequence can be
> > >> > >> reencoded as having N*Px patterns, with Ex staying the same.
> > >> > >>
> > >> > >> I guess we need to pick an N for the selector such that each new
> > >> > >> selector pattern (each one out of the N*Px patterns) selects from
> > >> > >> the *same pattern* of the same data input.
> > >> > >>
> > >> > >> So if a particular pattern in the selector has a step S, and the
> > >> > >> data input it selects from has Pi patterns, N*S must be a multiple
> > >> > >> of Pi.  N must be a multiple of least_common_multiple(S,Pi)/S.
> > >> > >>
> > >> > >> I think that means that the total number of patterns in the result
> > >> > >> (Pr from previous messages) can safely be:
> > >> > >>
> > >> > >>   Ps * least_common_multiple(
> > >> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> > >> > >>     ...
> > >> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> > >> > >>   )
> > >> > >>
> > >> > >> where:
> > >> > >>
> > >> > >>   Ps = the number of patterns in the selector