Re: [PATCH] (not just) AArch64: Fold unsigned ADD + LSR by 1 to UHADD

2025-05-02 Thread Pengfei Li
> Heh. This is a bit of a hobby-horse of mine. IMO we should be trying > to make the generic, target-independent vector operations as useful > as possible, so that people only need to resort to target-specific > intrinsics if they're doing something genuinely target-specific. > At the moment, we

Re: [PATCH] (not just) AArch64: Fold unsigned ADD + LSR by 1 to UHADD

2025-05-01 Thread Pengfei Li
Thank you for the comments. > I don't think we can use an unbounded recursive walk, since that > would become quadratic if we ever used it when optimising one > AND in a chain of ANDs. (And using this function for ANDs > seems plausible.) Maybe we should be handling the information > in a simila

[PATCH] simplify-rtx: Combine bitwise operations in more cases

2025-04-23 Thread Pengfei Li
This patch transforms RTL expressions of the form (subreg (not X) off) into (not (subreg X off)) when the subreg is an operand of a bitwise AND or OR. This transformation can expose opportunities to combine a NOT operation with the bitwise AND/OR. For example, it improves the codegen of the follow
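Illustration (not part of the archived message): the rewrite is safe because bitwise NOT acts on each bit independently, so taking a piece of a value and inverting it commute. The C sketch below models this on a 32-bit value with 8-bit pieces. The helper names are hypothetical, and treating the subreg offset as a byte index counted from the least significant byte is an assumption made only for this example.

```c
#include <stdint.h>

/* Models (subreg (not X) off): invert the full 32-bit value,
   then extract the byte at index `off` (0 = least significant). */
static uint8_t byte_of_not(uint32_t x, unsigned off)
{
    return (uint8_t)(~x >> (8u * off));
}

/* Models (not (subreg X off)): extract the byte first,
   then invert it in the narrow width. */
static uint8_t not_of_byte(uint32_t x, unsigned off)
{
    return (uint8_t)~(uint8_t)(x >> (8u * off));
}
```

Both helpers return the same byte for every input and every offset from 0 to 3, which is the property that lets the NOT be moved next to a neighbouring AND or OR.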

[PATCH v2] simplify-rtx: Combine bitwise operations in more cases

2025-04-28 Thread Pengfei Li
This patch transforms RTL expressions of the form (subreg (not X)) into (not (subreg X)) if the subreg is an operand of another binary logical operation. This transformation can expose opportunities to combine more logical operations. For example, it improves the codegen of the following AArch64 N

Re: [PATCH] simplify-rtx: Combine bitwise operations in more cases

2025-04-28 Thread Pengfei Li
Thanks Richard for all review comments. I have addressed the comments and sent a v2 patch in a new email thread. -- Thanks, Pengfei

[PATCH] AArch64: Fold unsigned ADD + LSR by 1 to UHADD

2025-04-28 Thread Pengfei Li
This patch implements the folding of a vector addition followed by a logical shift right by 1 (add + lsr #1) on AArch64 into an unsigned halving add, allowing GCC to emit NEON or SVE2 UHADD instructions. For example, this patch helps improve the codegen from: add v0.4s, v0.4s, v31.4s
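Illustration (not part of the archived message): a scalar C sketch of what one lane of an unsigned halving add computes, next to the add-then-shift pattern it replaces. The function names are hypothetical. The two forms agree whenever the addition does not wrap in the lane width; the exact condition the patch checks before folding is not visible in the truncated excerpt above and is not assumed here.

```c
#include <stdint.h>

/* Floor of the true average, computed without a wider intermediate.
   This matches the per-lane result of an unsigned halving add. */
static uint32_t avg_floor_u32(uint32_t x, uint32_t y)
{
    return (x & y) + ((x ^ y) >> 1);
}

/* The source pattern: a wrapping 32-bit add followed by a
   logical shift right by 1. */
static uint32_t add_then_lsr1(uint32_t x, uint32_t y)
{
    return (x + y) >> 1;
}
```

For inputs such as 6 and 9 both functions return 7. For 0x80000000 and 0x80000000 the add wraps to 0, so add_then_lsr1 returns 0 while avg_floor_u32 returns 0x80000000, which shows why a fold of this kind needs a guarantee that the sum cannot overflow.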

[PATCH v2] match.pd: Fold (x + y) >> 1 into IFN_AVG_FLOOR (x, y) for vectors

2025-05-08 Thread Pengfei Li
This patch folds vector expressions of the form (x + y) >> 1 into IFN_AVG_FLOOR (x, y), reducing instruction count on platforms that support averaging operations. For example, it can help improve the codegen on AArch64 from: add v0.4s, v0.4s, v31.4s ushr v0.4s, v0.4s, 1 to:

[PATCH] vect: Improve vectorization for small-trip-count loops using subvectors

2025-05-08 Thread Pengfei Li
This patch improves the auto-vectorization for loops with known small trip counts by enabling the use of subvectors - bit fields of original wider vectors. A subvector must have the same vector element type as the original vector and enough bits for all vector elements to be processed in the loop.

Re: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors

2025-05-09 Thread Pengfei Li
Hi Richard Biener, As Richard Sandiford has already addressed your questions in another email, I just wanted to add a few below. > That said, we already have unmasked ABS in the IL: > > vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, ... }, { 0, ...

[PATCH v3] match.pd: Fold (x + y) >> 1 into IFN_AVG_FLOOR (x, y) for vectors

2025-05-12 Thread Pengfei Li
This patch folds vector expressions of the form (x + y) >> 1 into IFN_AVG_FLOOR (x, y), reducing instruction count on platforms that support averaging operations. For example, it can help improve the codegen on AArch64 from: add v0.4s, v0.4s, v31.4s ushr v0.4s, v0.4s, 1 to:

[PING][PATCH v3] match.pd: Fold (x + y) >> 1 into IFN_AVG_FLOOR (x, y) for vectors

2025-05-22 Thread Pengfei Li
Hi, Just a gentle ping for the v3 patch below. I’ve made minor changes from v2 to v3, as listed below: - Added a check that IFN_AVG_FLOOR is supported. - Wrapped the new code in match.pd with the macro "#ifdef GIMPLE". > This patch folds vector expressions of the form (x + y) >> 1 into > IFN_AVG_FLOOR (x, y), r

Re: [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors

2025-06-04 Thread Pengfei Li
Thank you for all suggestions above. > > I see. So this clearly is a feature on instructions then, not modes. > > In fact it might be profitable to use unpredicated add to avoid > > computing the loop mask for a specific element width completely even > > when that would require more operation for

[PATCH v2] vect: Use combined peeling and versioning for mutually aligned DRs

2025-06-11 Thread Pengfei Li
Current GCC uses either peeling or versioning, but not in combination, to handle unaligned data references (DRs) during vectorization. This limitation causes some loops with early break to fall back to scalar code at runtime. Consider the following loop with DRs in its early break condition:

Re: [PATCH] vect: Use combined peeling and versioning for mutually aligned DRs

2025-06-10 Thread Pengfei Li
Hi Alex, > It might be nice to at least experiment with supporting DRs with > different steps as follow-on work. I agree that we should leave it out > for the first version to keep things simple. > FWIW, in case it's of interest, I wrote a script to calculate the > possible combinations of align

Re: [PING^2][PATCH v3] match.pd: Fold (x + y) >> 1 into IFN_AVG_FLOOR (x, y) for vectors

2025-06-02 Thread Pengfei Li
PING^2 From: Pengfei Li Sent: 22 May 2025 9:51 To: gcc-patches@gcc.gnu.org Cc: rguent...@suse.de; jeffreya...@gmail.com; pins...@gmail.com Subject: [PING][PATCH v3] match.pd: Fold (x + y) >> 1 into IFN_AVG_FLOOR (x, y) for vectors Hi, Just a

[PING] [PATCH] vect: Improve vectorization for small-trip-count loops using subvectors

2025-06-02 Thread Pengfei Li
k. Thanks again for your time and all your inputs. -- Thanks, Pengfei From: Tamar Christina Sent: 09 May 2025 15:03 To: Richard Biener Cc: Richard Sandiford; Pengfei Li; gcc-patches@gcc.gnu.org; ktkac...@nvidia.com Subject: RE: [PATCH] vect: Improve vect

[PATCH] vect: Use combined peeling and versioning for mutually aligned DRs

2025-06-06 Thread Pengfei Li
Current GCC uses either peeling or versioning, but not in combination, to handle unaligned data references (DRs) during vectorization. This limitation causes some loops with early break to fall back to scalar code at runtime. Consider the following loop with DRs in its early break condition: