Hi,
As mentioned in the PR, for the following test-case:

typedef unsigned char uint8_t;

static inline uint8_t
x264_clip_uint8(uint8_t x)
{
  uint8_t t = -x;
  uint8_t t1 = x & ~63;
  return (t1 != 0) ? t : x;
}

void
mc_weight(uint8_t *restrict dst, uint8_t *restrict src, int n)
{
  for (int x = 0; x < n*16; x++)
    dst[x] = x264_clip_uint8(src[x]);
}

With -O3 -mcpu=generic+sve, GCC generates the following code for the inner loop:
.L3:
ld1b z0.b, p0/z, [x1, x2]
movprfx z2, z0
and z2.b, z2.b, #0xc0
movprfx z1, z0
neg z1.b, p1/m, z0.b
cmpeq p2.b, p1/z, z2.b, #0
sel z0.b, p2, z0.b, z1.b
st1b z0.b, p0, [x0, x2]
add x2, x2, x4
whilelo p0.b, w2, w3
b.any .L3
The sel is redundant, since we could instead conditionally negate z0
under the predicate produced by comparing z2 with 0.
As suggested in the PR, the attached patch introduces a new
conditional internal function, .COND_NEG, and in gimple-isel replaces
the following sequence:
op2 = -op1
op0 = A cmp B
lhs = op0 ? op1 : op2
with:
op0 = A inverted_cmp B
lhs = .COND_NEG (op0, op1, op1)
Here .COND_NEG (op0, op1, op1) means:
lhs = neg (op1) where op0 is true, and lhs falls back to op1 where op0 is false.
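To make the intended equivalence concrete, here is a minimal scalar sketch in C that models one lane of each form. The helper names (cond_neg_lane, select_form, cond_neg_form, count_mismatches) are hypothetical and are not part of the patch. It assumes the comparison is == (so the inverted comparison is !=) and checks every uint8_t input from the test-case; it says nothing about how gimple-isel actually matches the pattern.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one lane of .COND_NEG (cond, op1, fallback):
   negate op1 where cond holds, otherwise return the fallback value.  */
static uint8_t cond_neg_lane(int cond, uint8_t op1, uint8_t fallback)
{
  return cond ? (uint8_t)-op1 : fallback;
}

/* Original sequence, assuming cmp is ==:
     op2 = -op1;  op0 = A == B;  lhs = op0 ? op1 : op2  */
static uint8_t select_form(uint8_t a, uint8_t b, uint8_t op1)
{
  uint8_t op2 = (uint8_t)-op1;
  int op0 = (a == b);
  return op0 ? op1 : op2;
}

/* Replacement sequence, with the comparison inverted to !=:
     op0 = A != B;  lhs = .COND_NEG (op0, op1, op1)  */
static uint8_t cond_neg_form(uint8_t a, uint8_t b, uint8_t op1)
{
  int op0 = (a != b);
  return cond_neg_lane(op0, op1, op1);
}

/* Count inputs on which the two forms disagree, using the test-case's
   operands: A = x & ~63, B = 0, op1 = x, for every 8-bit x.  */
static int count_mismatches(void)
{
  int mismatches = 0;
  for (int x = 0; x < 256; x++)
    {
      uint8_t v = (uint8_t)x;
      uint8_t t1 = v & (uint8_t)~63;
      if (select_form(t1, 0, v) != cond_neg_form(t1, 0, v))
        mismatches++;
    }
  return mismatches;
}

/* Convenience wrapper: aborts if the two forms ever differ.  */
static void verify_equivalence(void)
{
  assert(count_mismatches() == 0);
}
```

Calling verify_equivalence() from a small driver is enough to exercise the model. It only covers the scalar semantics of a single lane, not predication by the loop mask.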
With the patch, the following code is generated:
.L3:
ld1b z0.b, p0/z, [x1, x2]
movprfx z1, z0
and z1.b, z1.b, #0xc0
cmpne p1.b, p2/z, z1.b, #0
neg z0.b, p1/m, z0.b
st1b z0.b, p0, [x0, x2]
add x2, x2, x4
whilelo p0.b, w2, w3
b.any .L3
While it seems to work for this test-case, I am not entirely sure that
the patch is correct. Does it look like the right direction?
Thanks,
Prathamesh
Attachment: pr93183-1.diff
